When I started the precursor to the curl project, httpget, back in 1996, I wrote my first URL parser. Back then, the universal address was still called URL: Uniform Resource Locator. That spec was published by the IETF in 1994. The term “URL” was then used as a source of inspiration when naming the tool and project curl.
The term URL was later effectively replaced by URI, Uniform Resource Identifier (published in 2005), but the basic point remained: a syntax for a string that specifies a resource online and which protocol to use to get it. We claim curl accepts “URLs” as defined by this spec, RFC 3986. I’ll explain below why that isn’t strictly true.
There was also a companion RFC published for IRIs: Internationalized Resource Identifiers (RFC 3987). They are basically URIs that also allow non-ASCII characters.
The WHATWG consortium later produced their own URL spec, basically mixing formats and ideas from URIs and IRIs with a (not surprisingly) strong focus on browsers. One of their expressed goals is to “Align RFC 3986 and RFC 3987 with contemporary implementations and obsolete them in the process”. They want to go back to using the term “URL” and, as they rightfully state, the terms URI and IRI are just confusing and no humans ever really understood them (or often even knew they existed).
The WHATWG spec follows the good old browser mantra of being very liberal in what it accepts, trying to guess what users mean and bending over backwards to fulfill that. (Even though we all know by now that Postel’s Law is the wrong way to go about this.) It means the parser will handle too many slashes, embedded white space as well as non-ASCII characters.
From my point of view, the spec is also very hard to read and follow since it doesn’t describe the syntax or format much but instead focuses far too much on mandating a parsing algorithm. To test my claim: try to figure out what their spec says about a trailing dot after the host name in a URL.
On top of all these standards and specs, browsers offer an “address bar” (a piece of UI that often goes under other names) that allows users to enter all sorts of fun strings and they get converted over to a URL. If you enter “http://localhost/%41” in the address bar, it’ll convert the percent encoded part to an ‘A’ there for you (since 41 in hex is a capital A in ASCII) but if you type “http://localhost/A A” it’ll actually send “/A%20A” (with a percent encoded space) in the outgoing HTTP GET request. I’m mentioning this since people will often think of what you can enter there as a “URL”.
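The two conversions are easy to illustrate. Here is a small sketch using Python’s urllib.parse, purely as a stand-in for whatever the browser does internally (the exact address-bar behavior of course differs per browser):

```python
from urllib.parse import quote, unquote

# "%41" decodes to "A" since 0x41 is a capital A in ASCII
print(unquote("/%41"))   # -> /A

# a literal space is not a legal URL character, so it goes out as %20
print(quote("/A A"))     # -> /A%20A
```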
The above is basically my (skewed) perspective of what specs and standards we have so far to work with. Now let’s add reality and take a look at what sort of problems we get when my URL isn’t your URL.
So what is a URL?
Or more specifically: how do we write them? What syntax do we use?
I think one of the biggest mistakes the WHATWG spec has made (and why you will find me arguing against their spec in its current form with fierce conviction that they are wrong) is that they seem to believe that URLs are theirs to define and work with, and they limit their view of URLs to browsers, HTML and their address bars. Sure, they are the big companies behind the browsers almost everyone uses and URLs are widely used by browsers, but URLs are still much bigger than that.
The WHATWG view of a URL is not widely adopted outside of browsers.
colon-slash-slash
If we ask users, ordinary people with no particular protocol or web expertise, what a URL is, what would they answer? While it was probably more notable years ago when browsers displayed it more prominently, the :// (colon-slash-slash) sequence will be high on the list. Seeing that marks the string as a URL.
Heck, going beyond users, there are email clients, terminal emulators, text editors, perl scripts and a bazillion other things out there in the world that already detect URLs for us and allow operations on them. It could be to open that URL in a browser, to convert it to a clickable link in generated HTML and more. A vast number of those scripts and programs use the colon-slash-slash sequence as a trigger.
The WHATWG spec says one slash is enough and that a parser must accept an indefinite number of slashes. “http:/example.com” and “http:////////////////////////////////////example.com” are both equally fine. RFC 3986 and many others would disagree. Heck, most people I’ve confronted in the last few days, even people working with the web, seem to say, think and believe that a URL has two slashes. Just look closer at the google picture search screen shot at the top of this article, which shows the top images for “URL” google gave me.
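As a quick illustration of the disagreement, this is what one non-WHATWG parser makes of different slash counts. I’m using Python’s urllib.parse here purely as an example of a parser that roughly follows the older RFCs; other libraries will differ in the details:

```python
from urllib.parse import urlparse

print(urlparse("http://example.com/path").netloc)    # 'example.com'
print(urlparse("http:/example.com/path").netloc)     # ''  - one slash: no authority found
print(urlparse("http:////example.com/path").netloc)  # ''  - the host is not recognized at all...
print(urlparse("http:////example.com/path").path)    # '//example.com/path'  - ...it ends up in the path
```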
We just know a URL has two slashes (and yeah, file: URLs mostly have three, but let’s ignore that for now). Not one. Not three. Two. But the WHATWG doesn’t agree.
“Is there really any reason for accepting more than two slashes for non-file: URLs?” (my annoyed question to the WHATWG)
“The fact that all browsers do.”
The spec says so because browsers have implemented the spec.
No better explanation has been provided, not even after I pointed out that the statement is wrong and far from all browsers do. You may find reading that thread educational.
In the curl project, we’ve just recently started debating how to deal with “URLs” that have a different number of slashes than two, because it turns out there are servers sending back such URLs in Location: headers, and some browsers are happy to oblige. curl is not, and neither are a lot of other libraries and command line tools. Who do we stand up for?
Spaces
A space character (ASCII code 32, 0x20 in hex) cannot be part of a URL. If you want it sent, you percent encode it like you do with any other illegal character you want to be part of the URL. Percent encoding is the byte value in hexadecimal with a percent sign in front of it: %20 thus means space. It also means that a parser that, for example, scans a text for URLs knows that it has reached the end of the URL when it encounters a character that isn’t allowed. Like space.
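Such a scanner can be as simple as this sketch (not any particular tool’s actual code):

```python
import re

text = "Compare http://example.com/A%20A with http://example.com/A A in a browser."

# stop at the first character that cannot be part of a URL, such as a space
print(re.findall(r"https?://[^\s]+", text))
# ['http://example.com/A%20A', 'http://example.com/A']
```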
Browsers typically show the address in their address bars with all %20 instances converted to space for appearance. If you copy the address there into your clipboard and then paste it again in your text editor you still normally get the spaces as %20 like you want them.
I’m not sure if that is the reason, but browsers also accept spaces as part of URLs when, for example, receiving a redirect in an HTTP response. That’s passed from a server to a client using a Location: header with the URL in it. Browsers happily allow spaces in that URL, encode them as %20 and send out the next request. This forced curl into accepting spaces in redirected “URLs”.
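A client that wants to oblige in the same way ends up doing roughly this kind of cleanup before reusing the URL. This is a simplified Python sketch, not curl’s actual code:

```python
from urllib.parse import quote

# a redirect target as it might arrive in a Location: header
location = "http://example.com/A A"

# percent encode the space but leave reserved characters and
# already-encoded %xx sequences alone
fixed = quote(location, safe=":/?#[]@!$&'()*+,;=%")
print(fixed)   # http://example.com/A%20A
```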
Non-ASCII
Making URLs support non-ASCII languages is of course important, especially for non-western societies, and I’ve understood that the IRI spec was never good enough. I’m personally far from an expert on these internationalization (i18n) issues so I just go by what I’ve heard from others. But of course users of non-latin alphabets and writing systems need to be able to write their “internet addresses” to resources and use them as links as well.
In an ideal world, we would have the i18n version shown to users and there would be the encoded ASCII based version below, to get sent over the wire.
For international domain names, the name gets converted to “punycode” so that it can be resolved using the normal system name resolvers that know nothing about non-ASCII names. URIs have no IDN names; IRIs and WHATWG URLs do. curl supports IDN host names.
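The conversion itself looks something like this. I’m using Python’s built-in idna codec just to show the principle; the exact IDNA version and behavior differs between implementations:

```python
# IDNA/punycode conversion of a host name with non-ASCII characters
print("www.bücher.example".encode("idna"))
# b'www.xn--bcher-kva.example'
```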
WHATWG states that URLs are specified as UTF-8 while URIs are just ASCII. curl gets confused by non-ASCII letters in the path part but percent encodes such byte values in the outgoing requests – which causes “interesting” side-effects when the non-ASCII characters are provided in encodings other than UTF-8, as is for example standard on Windows…
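That side-effect is easy to demonstrate: the same character produces different percent encoded bytes depending on the source encoding. A small Python sketch:

```python
from urllib.parse import quote

path = "/motör"
print(quote(path.encode("utf-8")))     # /mot%C3%B6r
print(quote(path.encode("latin-1")))   # /mot%F6r  - same path, different bytes on the wire
```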
Similar to what I’ve written above, this leads to servers passing back non-ASCII byte codes in HTTP headers that browsers gladly accept, and non-browsers need to deal with…
No URL standard
I’ve not tried to write a conclusive list of problems or differences, just a bunch of things I’ve fallen over recently. A “URL” given in one place is by no means certain to be accepted or understood as a “URL” in another place.
Not even curl follows any published spec very closely these days, as we’re slowly drifting away from them for the sake of “web compatibility”.
There’s no unified URL standard and there’s no work in progress towards that. I don’t count WHATWG’s spec as a real effort either, as it is written by a closed group with no real attempts to get the wider community involved.
My affiliation
I’m employed by Mozilla, Mozilla is a member of the WHATWG, and I have colleagues working on the WHATWG URL spec and other work items of theirs, but it makes absolutely no difference to what I’ve written here. I also participate in the IETF and I consider myself friends with authors of RFC 1738, RFC 3986 and others, but that doesn’t matter here either. My opinions are my own and this is my personal blog.
Maybe you can convince Mozilla to do some telemetry on this, so we can see how rare ///+ really is.
I could probably, yes. It would show that a very minimal X% of all URLs/Location: headers use absolute URLs with something other than two slashes. We already know this (just based on the fact that we don’t get this problem reported more often in the curl project). I honestly don’t think the WHATWG cares much if that X is 0.05% or 0.2%…
The WHATWG will follow if browsers manage to simplify their URL parsing changes, and Blink does care about the difference between 0.01% and 0.1%. See http://www.chromium.org/blink#TOC-Launch-Process:-Deprecation, in particular the “usage percentages” link.
Just a nitpick: The “two slashes” mark just a subset of URLs, the hierarchical ones. Examples of no-slash are `mailto:` and `data:`.
You’re of course entirely correct, Paul. My focus on the two-slash point was entirely on FTP, HTTP and HTTPS URLs, which all (used to?) have two slashes. I could probably have expressed that a bit more clearly, but I think the point got through nonetheless.
According to the generic syntax spec (RFC 3986) the “//” indicates the authority, when there is one. “mailto:” doesn’t have an authority, “http:” does, “file:” …uh, is special in how it treats the authority. Having a hierarchical path or not is orthogonal to having an authority.
File URLs with three slashes have a blank server name, which is implicitly the local machine. In theory file:///path/to/file should be the same as file:/path/to/file (and file:path/to/file ought to be allowed for cwd-relative addressing). Furthermore, you should be able to specify file://servername/path/to/file to access a remote file, but that rarely works properly.
This is never handled correctly, mostly because most developers don’t understand what URLs are and think the :// is some kind of magic sequence that demarks a URL – it isn’t, only the colon is, with the most fundamental form of a URL being scheme:path. This leads to monstrosities like “file://////server/path/to/share” which I see on Windows machines all the time. *sigh*
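For what it’s worth, this is how a generic non-browser parser (Python’s urllib.parse, as one example) splits the authority out of different file: URLs:

```python
from urllib.parse import urlparse

for url in ("file:///path/to/file",             # empty authority: the local machine
            "file://servername/path/to/file",   # authority names a (remote) host
            "file:/path/to/file"):              # no authority at all
    p = urlparse(url)
    print(repr(p.netloc), p.path)
# '' /path/to/file
# 'servername' /path/to/file
# '' /path/to/file
```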
“The spec says so because browsers have implemented the spec.”
You have the causality reversed. The WHATWG spec is able to deal with a number of slashes that Web devs shouldn’t write because there were browsers that already dealt with it.
“This forced curl into accepting[…]”
It seems that curl is a non-browser piece of software that tries to work with the real Web, which is mainly tested to work with browsers. I think it’s sad that you “don’t count WHATWG’s spec as a real effort either” when curl could benefit from the WHATWG efforts by implementing a URL parser according to the WHATWG spec instead of patching compatibility problems piecemeal as they are reported as curl bugs.
Henri, “my causality” was the conclusion from the answer I got when I asked about the reasoning for the slashes. And yes, I agree it was slightly disingenuously phrased, but the response to my question was dismissive and rather silly if I might say.
Whether curl will support the WHATWG URL spec is of course going to continue being debated and discussed, and maybe one day we will support most of what browsers do. But we have a world full of other command line tools, libraries and scripts that parse “URLs” too. I think a real and proper URL spec should consider the entire ecosystem. I don’t think you whatwg people have proven yourself very good at that.
“But we have a world full of other command line tools, libraries and scripts that parse “URLs” too.”
Yes, but random Web devs who make the sort of mistakes like having the wrong number of slashes in a URL or not escaping spaces are more likely to have tested what they made in the browser they were using at the time than in non-browser tools (even super-popular ones like curl).
This phenomenon is why in the WHATWG world view, when writing a spec with the goal of precise implementations of the spec being able to process the Web as it exists, it’s more important to be able to consume what IE6 was able to consume than to match what curl does. (I think this world view is empirically correct.)
And then non-browser tools that wish to process the Web as it exists can benefit from the WHATWG specs, too.
“I think a real and proper URL spec should consider the entire ecosystem. I don’t think you whatwg people have proven yourself very good at that.”
It’s more that enough Web developers haven’t considered the ecosystem beyond the browser they tested with, which is why, in order to achieve the goal of making Web-compatible specs, the WHATWG needs to pay more attention to the quirks of browsers that have been popular at some point as opposed to codifying the quirks of fringe browsers or non-browser software.
I had to argue once on Stack Overflow that unencoded spaces in a URL path weren’t actually legal even though some browser decided to convert the %20’s back into spaces in the address bar. They felt the illegal spaces give you better SEO. I don’t think they were convinced by my argument of correctness. Then there’s Microsoft’s Uri class, which is mostly great except it has this huge glaring (imo) defect where ToString() provides a human readable thing that will happily give you illegal URLs. That wouldn’t be so bad except that every example on the planet uses ToString() to get a string out of the URL object. My anecdotes are to illustrate the problem with over-permissiveness. Sure, a user interface should be able to accept anything typed into the address bar, but the actual RFCs should be strict.
TBH I think WHATWG should decide whether they want to describe existing behaviour or whether they want to make a specification; the two can overlap, but are vastly different. Basing the second on the first is impossible.
For instance, w3m and links (yes, real browsers) also consider ‘http:///foo.bar’ either a bad URL or one using the domain name ‘/foo.bar’ (which, incidentally, IS a valid domain name, just not a valid host name). So the reflecting-the-real-world-spec should say this is not allowed. Obviously some browsers do accept it, so the reflecting-the-real-world-spec should say it is allowed.
And if it wants to obsolete any IETF RFC I would expect much more detail about the format, with the actual algorithms moved to an appendix or so. Tell me what to implement, not how.
As they say in philosophy, “You can’t get an ‘ought’ from an ‘is’.”
There is a more important place than the address bar where browsers will escape spaces and encode non-ASCII characters: in HTML, e.g. for the value of an href attribute. I expect that browsers agree about this much more than they might agree about the address bar (or at least in more important ways), which is after all a piece of UI that various browsers have innovated on (or tried to…) at different times.
It seems like the underlying question no one wants to answer definitively is whether URLs are meant for machines to consume strictly, or for humans to consume fuzzily. cURL is (from my perspective) on the side of machines, since cURL is used by so many automated systems without human intervention. That the browser people will always favor less strict conventions is no surprise, if I’m correct in this breakdown. Before DNS this could have been decided easily in favor of machines, but the waters are a little muddier now. The best way to fix this might require browsers to come up with a completely different human-favoring system of IP resolution that doesn’t care about protocols and has a more lax syntax, but isn’t confusable with a machine-readable URL at all.
so basically, what tjeb said. 😉
The whole WHATWG URL effort’s sole purpose is to keep the people involved busy and paid. We already have the relevant RFCs; they’re not perfect, so what? Are those few issues worth all this re-standardization trouble?
And then occasionally you see some app or device that still wants to replace spaces with + instead of %20, and then somewhere else in the chain of redirects those get converted to %2B, and then there’s no way to figure out original intent.
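The two encodings come from different conventions (form encoding versus plain URL percent encoding), which in Python terms looks roughly like this:

```python
from urllib.parse import quote, quote_plus

print(quote(" "))        # %20 - plain URL percent encoding of a space
print(quote_plus(" "))   # +   - the form-encoding style some apps use instead
print(quote("+"))        # %2B - a later hop re-encoding that plus, losing the original intent
```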
This could be solved by standardizing things as two separate items:
1. The URL syntax itself, the thing that appears on the wire in the protocol.
2. A “browser acceptance transformation” or some similar pretentious term, that defines what is legal input to a browser, and how to translate that into a proper URL syntax.
This gets even more complicated when you consider non-“web” URLs like LDAP, Databases (my fav is sqlite:////path/to/file which is 3 slashes followed by the full path), etc.
One day I dream of a real RFC where we mandate UTF-8 and get over the ASCII business except where it works, deal with the perennial space problem in a conclusive manner, and can establish a proper algorithm for parsing these things that can be implemented and re-implemented in whatever language du jour.
The reason file:/// has three slashes is that there is actually a domain part, but if it’s empty it’s just localhost. Like, file:///home/me is equivalent to file://localhost/home/me. Just as well, it could be file://foreigncomputer/home/me, too. I don’t think that design is particularly weird, personally. It’d be interesting to see http:///index.html actually mean http://localhost/index.html — but then again this whole syntax just seems to confuse people.
Citation: https://tools.ietf.org/html/rfc1738#section-3.10
@luiji, I purposely didn’t get into file:// because it has “always” had three slashes and can easily have more. See also Matthew Kerwin’s hard work trying to refresh the FILE URI spec over at https://tools.ietf.org/html/draft-ietf-appsawg-file-scheme-09
Specs are supposed to define what is the ‘strict’ portion of Postel’s law. Writing a spec to purposefully degrade what is considered strict strikes me as goofy. Once you degrade the standard, people will just find new and inventive ways to continue butchering their output.
Of course browsers, cURL, etc should make a best effort to interpret bad input, but success is not and should not be guaranteed. cURL should certainly issue warnings when it receives a nonconforming URL.
Maybe it would be helpful if browsers were able to return warnings to the server that feeds them bad input.
As long as specifications don’t clearly define error-handling behavior, and the organizations writing those specifications don’t publish tests for that behavior, implementations will be somewhat lax in handling of errors. Historically, every time that browser market share shifts significantly (i.e., the market leader changes), browsers other than the leader have to become more lax by reverse-engineering cases that the market leader allows but they do not, because otherwise users refuse to use those browsers because some content doesn’t work. This inevitably leads to the followers being *more* lax than the leader since reverse-engineering is imprecise, and when one of the followers becomes the leader, this cycle leads to an increase in laxness.
(One of those areas, by the way, is that browsers have accepted non-ASCII characters in URLs since long before the IRI spec came into existence. If memory serves, the IRI spec was also somewhat incompatible with that legacy.)
The way that we currently believe is most effective at stopping this cycle is to fully specify error handling behavior. This is my understanding of what the WHATWG URL spec is trying to do. This reduces the amount of energy that browser developers have to spend on reverse-engineering each others’ behavior, and thus increases the amount of time we can spend making improvements to Web technology that help end-users and developers.