Some interesting Unicode URLs have recently been seen in the wild – like in this billboard ad campaign from Coca-Cola – and a friend of mine asked me how curl deals with such URLs.
(Picture by stevencoleuk)
I ran some tests and decided to blog my observations, since they are a bit curious. The exact URL I tried was ‘www.O.ws’ (not the same smiley as shown on this billboard – note that I’ve replaced the actual smiley with “O” throughout this post since wordpress craps on it) – it is really hard to enter by hand, so now is the time to appreciate your ability to cut and paste! It appears they registered several domains for a set of different smileys.
These smileys are not really allowed IDN (Internationalized Domain Name) symbols, which makes these domains a bit different. They should not (see below for details) be converted to punycode before getting resolved; instead I assume that the pure UTF-8 sequence should, or at least will, be fed into the name resolver function. Either way, what gets passed on is either the punycode version or the raw UTF-8 string.
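To make the two candidate forms concrete, here is a quick illustration of my own using Python’s standard-library codecs (not anything curl does; Python’s stdlib “idna” codec implements the older IDNA2003 rules, and I’m writing the smiley as the escape \u263a since wordpress mangles the literal character):

```python
# The two forms an application could hand to a resolver for a smiley
# hostname, shown with Python's stdlib codecs. U+263A stands in for the
# smiley here; Python's "idna" codec implements the older IDNA2003 rules.
host = "www.\u263a.ws"

utf8_form = host.encode("utf-8")  # the raw UTF-8 byte sequence
ace_form = host.encode("idna")    # nameprep + punycode (ACE form)

print(utf8_form)  # b'www.\xe2\x98\xba.ws'
print(ace_form)   # b'www.xn--74h.ws'
```

The ACE form is what an “IDN-unaware” resolver expects; the UTF-8 bytes are what glibc apparently receives in my tests below.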
If curl was built to use libidn, it still won’t convert this to punycode and the verbose output says “Failed to convert www.O.ws to ACE; String preparation failed”
curl (exact version doesn’t matter) using the stock threaded resolver:
- Debian Linux (glibc 2.19) – FAIL
- Windows 7 – FAIL
- Mac OS X 10.9 – SUCCESS
Perhaps to no surprise, the exact same results show up if I try to ping those host names on these systems: it works on the Mac, it fails on Linux and Windows. Wget 1.16 also fails on my Debian system (just as a reference; I didn’t try it on any of the other platforms).
My curl build on Linux that uses c-ares for name resolving instead of glibc succeeds perfectly. host, nslookup and dig all resolve the name fine on Linux too (as does nslookup on Windows):
$ host www.O.ws
www.O.ws has address 126.96.36.199
$ ping www.O.ws
ping: unknown host www.O.ws
While the same command sequence on the mac shows:
$ host www.O.ws
www.O.ws has address 188.8.131.52
$ ping www.O.ws
PING www.O.ws (184.108.40.206): 56 data bytes
64 bytes from 220.127.116.11: icmp_seq=0 ttl=44 time=191.689 ms
64 bytes from 18.104.22.168: icmp_seq=1 ttl=44 time=191.124 ms
Slightly interesting additional tidbit: if I rebuild curl to use gethostbyname_r() instead of getaddrinfo() it works just like on the mac, so clearly this is glibc having an opinion on how this should work when given this UTF-8 hostname.
Pasting the URL into Firefox and Chrome works just fine. They both convert the name to punycode and use “www.xn--h28h.ws”, which then resolves to the same IPv4 address.
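As a side note, you can decode the ACE label the browsers produced to see exactly which code point they encoded – a quick check of my own with Python’s stdlib punycode codec:

```python
# Strip the "xn--" ACE prefix and decode the remaining label with the
# stdlib punycode codec to recover the original code point.
label = "xn--h28h"
char = label[len("xn--"):].encode("ascii").decode("punycode")
print(f"U+{ord(char):04X}")  # U+1F603
```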
Update: as was pointed out in a comment below, the “22.214.171.124” IP address is not the correct IP for the site. It is just the registrar’s landing page so it sends back that response to any host or domain name in the .ws domain that doesn’t exist!
What do the IDN specs say?
This is not my area of expertise. I had to consult Patrik Fältström here to get this straightened out (but if I got something wrong here, the mistake is still all mine). Apparently this smiley is allowed by RFC 3490 (IDNA2003), but that has been replaced by RFCs 5890–5892 (IDNA2008), where it is DISALLOWED. In the specs, this character is U+263A.
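The reason it ends up DISALLOWED is that IDNA2008 (RFC 5892) derives each code point’s status from its Unicode properties instead of keeping a fixed inclusion table, and symbols don’t qualify as PVALID. A very rough stdlib illustration of that property difference (only peeking at the general category – not a real RFC 5892 implementation):

```python
import unicodedata

# IDNA2008 derives code point status from Unicode properties; symbols
# (general category "So") come out DISALLOWED, while letters are PVALID.
# This only checks the general category - a real check needs RFC 5892.
for ch in ("\u263a", "\U0001f603", "a"):
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)}")
# U+263A So   <- a symbol: DISALLOWED in IDNA2008
# U+1F603 So  <- also a symbol: DISALLOWED
# U+0061 Ll   <- a letter: PVALID
```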
So, depending on which spec you follow, it either is a valid IDN character or it isn’t anymore.
What do the libc docs say?
The POSIX docs for getaddrinfo don’t contain enough info to tell who’s right, but they don’t forbid UTF-8 encoded strings. The regular glibc docs for getaddrinfo also don’t say anything, and interestingly, the Apple Mac OS X version of the docs says just as little.
With this complete lack of guidance, it is hardly any additional surprise that the glibc gethostbyname docs also don’t mention what it does in this case – but clearly it doesn’t do the same as getaddrinfo, in the glibc case at least.
What’s on the actual site?
A redirect to www.emoticoke.com which shows a rather boring page.
I don’t know. What do you think?
6 thoughts on “curl, smiley-URLs and libc”
See https://annevankesteren.nl/2014/06/url-unicode and its pointers for how I sorted this out for browsers, and for clients such as curl that want to visit the same sites as browsers do.
Just a detail: you say the character is U+263A, but in your URL the character actually used is U+1F603.
Thanks, you’re helping me prove that I should just stay away from this Unicode thing! =)
Note that 126.96.36.199 is not part of the coke campaign. All unregistered domains under .ws resolve to 188.8.131.52, serving a registrar landing page. That landing page is also served for non-punycode mis-encodings of emojis. Domains part of the coke campaign resolve to 184.108.40.206 instead. Also note that http://www.xn-h28h.ws is not the correct IDN of that emoji, it should be http://www.xn--h28h.ws (two dashes) which does not resolve to the same IP address.
Thanks Daniel, that changed things a bit! (the single-dash was just my typo – I fixed it now.)
So a registrar landing page! That actually makes the whole thing slightly different, since then they (probably) never registered the UTF-8 versions of those names!
But it also makes the failing getaddrinfo() more curious, as it then probably doesn’t even try to resolve the name and just rejects it based on the input!
On any platform, it should work for the application to apply ToASCII to a domain name and then send it to a traditional name-resolution API such as getaddrinfo (POSIX / RFC2553). That is, if you do not know that your resolver has an “IDN-aware” mode, then it is “IDN-unaware” and you MUST pass only ASCII characters.
On the other hand, some resolvers like glibc’s getaddrinfo supply a flag which can be used to switch the resolver into IDN mode (AI_IDN). In that case, though, the input string has to be in the character encoding selected by setlocale(), not necessarily UTF-8. If AI_IDN is available, and the domain name can be encoded in the user’s locale, then you could pass that encoded string instead of the ASCII string.
But once you have to write those tests (one at compile-time and one at runtime) and see that you might have to fall back to the ToASCII conversion anyway, it seems that the only sane thing is to do it all the time.
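The always-convert approach I’m describing can be sketched like this (my own illustration, using Python’s stdlib IDNA2003 codec; an IDNA2008 library would reject the smiley outright):

```python
def to_ascii(hostname: str) -> str:
    # Unconditionally convert to the ACE form before handing the name
    # to any resolver API; plain ASCII names pass through unchanged.
    return hostname.encode("idna").decode("ascii")

print(to_ascii("example.com"))    # example.com
print(to_ascii("www.\u263a.ws"))  # www.xn--74h.ws
```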
Comments are closed.