Some interesting Unicode URLs have recently been spotted in the wild, such as in this billboard ad campaign from Coca-Cola. A friend of mine asked me about curl in reference to these and how it deals with such URLs.
(Picture by stevencoleuk)
I ran some tests and decided to blog my observations since they are a bit curious. The exact URL I tried was ‘www.😃.ws’ (not the same smiley as shown on this billboard: 😂) – it is really hard to enter by hand so now is the time to appreciate your ability to cut and paste! It appears they registered several domains for a set of different smileys.
These smileys are not really allowed IDN (Internationalized Domain Name) symbols, which makes these domains a bit different. They should not (see below for details) be converted to punycode before getting resolved; instead I assume that the pure UTF-8 sequence should, or at least will, be fed into the name resolver function. Either way, the resolver ends up being passed either the punycode form or the raw UTF-8 string.
If curl was built to use libidn, it still won’t convert this to punycode and the verbose output says “Failed to convert www.😃.ws to ACE; String preparation failed”.
curl (exact version doesn’t matter) using the stock threaded resolver:
- Debian Linux (glibc 2.19) – FAIL
- Windows 7 – FAIL
- Mac OS X 10.9 – SUCCESS
Perhaps unsurprisingly, I get the exact same results if I try to ping those host names on these systems: it works on the Mac, and it fails on Linux and Windows. Wget 1.16 also fails on my Debian systems (just as a reference; I didn’t try it on any of the other platforms).
My curl build on Linux that uses c-ares for name resolving instead of glibc succeeds perfectly. host, nslookup and dig all work fine with it on Linux too (as well as nslookup on Windows):
$ host www.😃.ws
www.\240\159\152\131.ws has address 22.214.171.124
$ ping www.😃.ws
ping: unknown host www.😃.ws
While the same command sequence on the mac shows:
$ host www.😃.ws
www.\240\159\152\131.ws has address 126.96.36.199
$ ping www.😃.ws
PING www.😃.ws (188.8.131.52): 56 data bytes
64 bytes from 184.108.40.206: icmp_seq=0 ttl=44 time=191.689 ms
64 bytes from 220.127.116.11: icmp_seq=1 ttl=44 time=191.124 ms
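The \240\159\152\131 sequence in the host output is simply the smiley's four UTF-8 bytes written in decimal. Here is a minimal Python sketch that reproduces that rendering. The function name escape_like_host is my own invention, and it only mimics what is visible in the output above; it is not a reimplementation of how host actually escapes names.

```python
def escape_like_host(name: str) -> str:
    """Show each non-ASCII byte of the UTF-8 encoding as a \\DDD decimal escape."""
    out = []
    for byte in name.encode("utf-8"):
        if byte < 0x80:
            # Plain ASCII bytes are shown as-is.
            out.append(chr(byte))
        else:
            # Bytes of a multi-byte UTF-8 sequence become three decimal digits.
            out.append("\\%03d" % byte)
    return "".join(out)


# \U0001F603 is the smiley used in the host name tested in this post.
print(escape_like_host("www.\U0001F603.ws"))  # www.\240\159\152\131.ws
```

In other words, host sent the raw UTF-8 bytes in the DNS query and the name server answered for them.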
Slightly interesting additional tidbit: if I rebuild curl to use gethostbyname_r() instead of getaddrinfo() it works just like on the mac, so clearly this is glibc having an opinion on how this should work when given this UTF-8 hostname.
Pasting the URL into Firefox and Chrome works just fine. They both convert the name to punycode and use “www.xn--h28h.ws” which then resolves to the same IPv4 address.
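To see where “xn--h28h” comes from, here is a small Python sketch using the standard library's raw punycode codec. The helper name to_ace is hypothetical, and this deliberately skips the string preparation and validation steps that a real IDNA implementation performs (the very step libidn refused above), so treat it as an illustration of the encoding only, not of what the browsers do internally.

```python
def to_ace(hostname: str) -> str:
    """Punycode-encode every non-ASCII label and prefix it with 'xn--'.

    Assumption: no nameprep/IDNA validation is done here at all.
    """
    labels = []
    for label in hostname.split("."):
        if label.isascii():
            # ASCII-only labels are passed through untouched.
            labels.append(label)
        else:
            # Raw punycode of the label, with the ACE prefix added in front.
            labels.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(labels)


# \U0001F603 is the smiley used in the host name tested in this post.
print(to_ace("www.\U0001F603.ws"))  # www.xn--h28h.ws
```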
Update: as was pointed out in a comment below, the “18.104.22.168” IP address is not the correct IP for the site. It is just the registrar’s landing page so it sends back that response to any host or domain name in the .ws domain that doesn’t exist!
What do the IDN specs say?
This is not my area of expertise. I had to consult Patrik Fältström here to get this straightened out (but if I got something wrong here, the mistake is still all mine). Apparently this smiley is allowed in RFC 3490 (IDNA2003), but that has been replaced by RFC 5890-5892 (IDNA2008) where this is DISALLOWED. If you read the spec, this is 263A.
So, depending on which spec you follow, it is either a valid IDN character or it isn’t anymore.
What do the libc docs say?
The POSIX docs for getaddrinfo don’t contain enough info to tell who’s right, but they don’t forbid UTF-8 encoded strings. The regular glibc docs for getaddrinfo also don’t say anything and, interestingly, the Apple Mac OS X version of the docs says just as little.
With this complete lack of guidance, it is hardly surprising that the glibc gethostbyname docs also don’t mention what it does in this case, but clearly it doesn’t do the same as getaddrinfo, in the glibc case at least.
What’s on the actual site?
A redirect to www.emoticoke.com which shows a rather boring page.
I don’t know. What do you think?