Let me introduce you to what I consider one of the worst hacks we have in current and modern internet protocols: the Public Suffix List (PSL). This is a list (maintained by Mozilla) of domains that have some kind of administrative setup or arrangement that makes their sub-domains independent of each other. For example, you must not be allowed to set cookies for “*.com” because .com is a TLD whose sub-domains are independent domains. But the same thing goes for “*.co.uk” and there’s no hint anywhere about this – except for the Public Suffix List. Then, take that simple little example and extrapolate to a domain system that grows by several new TLDs or more every month. The PSL is now several thousand entries long.
And cookies aren’t the only thing this is used for. Another really common and perhaps even more important use case is wildcard matching in TLS server certificates. You should not be allowed to buy and use a cert for “*.co.uk” but you can for “*.yourcompany.co.uk”…
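To make the two checks above concrete, here is a minimal sketch in Python. It assumes a tiny hand-picked rule set (SAMPLE_RULES) instead of the real list, and the function names are made up for illustration. It also ignores the wildcard and exception rules that the real PSL contains, so treat it as an illustration of the idea and not as a usable implementation.

```python
# Hypothetical three-entry excerpt; the real PSL has thousands of entries
# and also contains wildcard ("*.ck") and exception ("!www.ck") rules,
# which this sketch does not handle.
SAMPLE_RULES = {"com", "uk", "co.uk"}


def public_suffix(host, rules=SAMPLE_RULES):
    """Return the longest suffix of host that is listed in rules."""
    labels = host.lower().split(".")
    # Try the longest candidate first, then drop one leading label at a time.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in rules:
            return candidate
    # Default rule of the PSL: an unlisted top-level label counts as a suffix.
    return labels[-1]


def is_public_suffix(host, rules=SAMPLE_RULES):
    """True if host is itself a public suffix, such as "co.uk"."""
    return public_suffix(host, rules) == host.lower()


def cookie_domain_allowed(domain):
    """A cookie must not be scoped to an entire public suffix."""
    return not is_public_suffix(domain.lstrip("."))


def wildcard_allowed(pattern):
    """Reject "*.co.uk" but accept "*.yourcompany.co.uk"."""
    if not pattern.startswith("*."):
        return True
    return not is_public_suffix(pattern[2:])
```

With this rule set, wildcard_allowed("*.co.uk") is False while wildcard_allowed("*.yourcompany.co.uk") is True, and the same split applies to cookie domains.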
Not really official but still…
If you read the cookie RFC or the spec for how to do TLS wildcard certificate matching, you won’t find a single line stating crystal clear that the Public Suffix List is what you must use, and I’m sure different browsers solve this slightly differently. But in practice, and most unfortunately (if you ask me), you must either use the list or make your own to be fully compliant with how the web works in 2014.
curl, wget and the PSL
In curl and libcurl, we have so far not taken the PSL into account. That is by choice, since I’ve not had any decent way to handle it and there are lots of embedded and other use cases that simply won’t be able to cope with that large PSL chunk.
Wget hasn’t had any PSL awareness either, but in recent weeks this has been brought up on the wget list and more attention has been given to it. Work has been initiated to do something about it, which has led to…
libpsl
Tim Rühsen took the baton and started the libpsl project and its associated mailing list, as a foundation for something for Wget to use to get PSL awareness.
I’ve mostly cheered the effort so far and said that I wouldn’t mind building on this to enhance curl in the future if it just gets a suitable (liberal enough) license, and it seems to be going in that direction. For curl’s sake, I would like to get a conditional dependency on this so that people without particular size restrictions can use it, and people in more embedded and special-purpose situations can continue to build without PSL support.
If you’re interested in helping out in curl and libcurl in this area, feel most welcome!
dbound
Meanwhile, the IETF has set up a new mailing list called dbound for discussions around PSL and similar issues and it seems very timely!
Yep, this stuff looks fairly simple but it is not – you have to keep trailing dots in mind, recognize raw IP addresses and support punycode. We implemented all of it in JavaScript a while back. Our implementation is available at https://hg.adblockplus.org/adblockpluschrome/file/4731701d573e/lib/basedomain.js#l65 (plus https://hg.adblockplus.org/adblockpluschrome/file/4731701d573e/lib/publicSuffixList.js, which is generated automatically from the original PSL, plus https://hg.adblockplus.org/adblockpluschrome/file/4731701d573e/lib/punycode.js for punycode support).
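As a rough illustration of those three pitfalls, here is a minimal sketch in Python of the normalization a host name needs before any PSL lookup. It is not the JavaScript code linked above; the function name normalize_host is made up for this example, and it relies on Python’s built-in "idna" codec, which implements the older IDNA 2003 rules.

```python
import ipaddress


def normalize_host(host):
    """Return ("ip", address) or ("name", ascii_name) for a raw host string."""
    host = host.strip()
    # A single trailing dot marks a fully qualified name; drop it.
    if host.endswith("."):
        host = host[:-1]
    # IPv6 literals arrive in brackets when taken from a URL.
    candidate = host[1:-1] if host.startswith("[") and host.endswith("]") else host
    try:
        # Raw IP addresses have no public suffix at all.
        return ("ip", str(ipaddress.ip_address(candidate)))
    except ValueError:
        pass
    # Convert international names to their punycode ("xn--") form.
    ascii_name = host.lower().encode("idna").decode("ascii")
    return ("name", ascii_name)
```

Only results tagged "name" should be passed on to a suffix lookup; an "ip" result should be compared as a whole, since neither cookie domain matching nor wildcards make sense for it.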
I actually wrote faup https://github.com/stricaud/faup/ with a fairly permissive license to be able to embed this kind of feature easily (even with a very permissive license).
I do use the Public Suffix List by default. I wrote Python bindings, and you can play around with modules using Lua.
It is as simple as typing faup http://www.example.co.uk to get CSV output with the fields separated, or faup -o json http://www.example.co.uk to get JSON output. If you only want to retrieve the TLD, you can add the -f tld option.
I am more than open to criticism. The intent was to have a simple command line tool to do the job, as well as a library that could be embedded easily. Every bug has a ticket, and every ticket that comes with a test case is tested, to avoid reintroducing those bugs in the future.
@Sebastien: oh, nice. I didn’t know about that and it seems the wget team didn’t really either.
It’d be great if you’d join up on the libpsl list and help out there. Wget and (lib)curl are only really interested in public suffix support so everything that isn’t about PSL would be considered extra “cruft” which I think could be a reason to have it as a stand-alone PSL-specific library.
I’m mostly cheering the efforts so far, not actually participating very much myself!