I talked with Ed Hoover on the between screens podcast a while ago and that episode has now been published. It is a dense 12 minutes as the good Ed edited it massively.
I’m employed by Mozilla. The same Mozilla that recently has announced that it is looking around for feedback on how to revamp its logo and graphical image.
It was with amusement I saw one of the existing suggestions for a new logo using “://” (colon slash slash) in the name:
… compared with the recently announced new curl logo:
Me being in both teams and being a general Internet protocol enthusiast I couldn’t be more happy if Mozilla would end up using a design so clearly based on the same underlying thoughts. After all,
Imitation is the sincerest of flattery as Charles Caleb Colton once so eloquently expressed it.
When I started the precursor to the curl project, httpget, back in 1996, I wrote my first URL parser. Back then, the universal address was still called URL: Uniform Resource Locators. That spec was published by the IETF in 1994. The term “URL” was then used as source for inspiration when naming the tool and project curl.
The term URL was later effectively changed to become URI, Uniform Resource Identifiers (published in 2005) but the basic point remained: a syntax for a string to specify a resource online and which protocol to use to get it. We claim curl accepts “URLs” as defined by this spec, the RFC 3986. I’ll explain below why it isn’t strictly true.
There was also a companion RFC posted for IRI: Internationalized Resource Identifiers. They are basically URIs but allowing non-ascii characters to be used.
The WHATWG consortium later produced their own URL spec, basically mixing formats and ideas from URIs and IRIs with a (not surprisingly) strong focus on browsers. One of their expressed goals is to “Align RFC 3986 and RFC 3987 with contemporary implementations and obsolete them in the process“. They want to go back and use the term “URL” as they rightfully state, the terms URI and IRI are just confusing and no humans ever really understood them (or often even knew they exist).
The WHATWG spec follows the good old browser mantra of being very liberal in what it accepts, trying to guess what the users mean and bending over backwards trying to fulfill it. (Even though we all know by now that Postel’s Law is the wrong way to go about this.) It means it’ll handle too many slashes, embedded white space as well as non-ASCII characters.
From my point of view, the spec is also very hard to read and follow because it doesn’t describe the syntax or format very much but focuses far too much on mandating a parsing algorithm. To test my claim: figure out what their spec says about a trailing dot after the host name in a URL.
On top of all these standards and specs, browsers offer an “address bar” (a piece of UI that often goes under other names) that allows users to enter all sorts of fun strings and they get converted over to a URL. If you enter “http://localhost/%41” in the address bar, it’ll convert the percent encoded part to an ‘A’ there for you (since 41 in hex is a capital A in ASCII) but if you type “http://localhost/A A” it’ll actually send “/A%20A” (with a percent encoded space) in the outgoing HTTP GET request. I’m mentioning this since people will often think of what you can enter there as a “URL”.
The above is basically my (skewed) perspective of what specs and standards we have so far to work with. Now we add reality and let’s take a look at what sort of problems we get when my URL isn’t your URL.
So what is a URL?
Or more specifically, how do we write them? What syntax do we use?
I think one of the biggest mistakes the WHATWG spec has made (and why you will find me arguing against their spec in its current form with fierce conviction that they are wrong), is that they seem to believe that URLs are theirs to define and work with and they limit their view of URLs to browsers, HTML and their address bars. Sure, they are the big companies behind the browsers almost everyone uses and URLs are widely used by browsers, but URLs are still much bigger than that.
The WHATWG view of a URL is not widely adopted outside of browsers.
If we ask users, ordinary people with no particular protocol or web expertise, what a URL is what would they answer? While it was probably more notable years ago when the browsers displayed it more prominently, the :// (colon-slash-slash) sequence will be high on the list. Seeing that marks the string as a URL.
Heck, going beyond users, there are email clients, terminal emulators, text editors, perl scripts and a bazillion other things out there in the world already that detect URLs for us and allow operations on them. It could be to open that URL in a browser, to convert it to a clickable link in generated HTML and more. A vast amount of said scripts and programs will use the colon-slash-slash sequence as a trigger.
The WHATWG spec says it has to be one slash and that a parser must accept any number of slashes. “http:/example.com” and “http:////////////////////////////////////example.com” are both equally fine. RFC 3986 and many others would disagree. Heck, most people I’ve confronted the last few days, even people working with the web, seem to say, think and believe that a URL has two slashes. Just look closer at the Google picture search screen shot at the top of this article, which shows the top images for “URL” Google gave me.
We just know a URL has two slashes there (and yeah, file: URLs mostly have three but let’s ignore that for now). Not one. Not three. Two. But the WHATWG doesn’t agree.
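As a rough illustration of that kind of detection, here is a minimal sketch of a scanner that uses colon-slash-slash as its trigger. The pattern and the find_urls name are made up for this example and are not taken from any particular program; real detectors are usually fussier about trailing punctuation.

```python
import re

# Hypothetical pattern: a scheme name, then "://", then everything up to
# the next white space character.
URL_TRIGGER = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*://\S+")

def find_urls(text):
    """Return every substring that looks like a URL to this simple scanner."""
    return URL_TRIGGER.findall(text)

sample = "See http://example.com/docs/ or http:/example.com for details"
print(find_urls(sample))  # ['http://example.com/docs/']
```

Note that the single-slash variant in the sample is not picked up at all, which is exactly what happens in tools that rely on this trigger.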
“Is there really any reason for accepting more than two slashes for non-file: URLs?” (my annoyed question to the WHATWG)
The spec says so because browsers have implemented the spec.
No better explanation has been provided, not even after I pointed out that the statement is wrong and far from all browsers do. You may find reading that thread educational.
In the curl project, we’ve just recently started debating how to deal with “URLs” having a number of slashes other than two because it turns out there are servers sending back such URLs in Location: headers, and some browsers are happy to oblige. curl is not and neither are a lot of other libraries and command line tools. Who do we stand up for?
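To make the two positions concrete, here is a small sketch that contrasts a strict "exactly two slashes" check with a lenient normalizer in the browser spirit. This is not how curl or any browser actually does it; the function names are invented for the example, and file: URLs are deliberately left alone.

```python
import re

# A scheme name, a colon, and then however many slashes follow it.
SCHEME_AND_SLASHES = re.compile(r"^([A-Za-z][A-Za-z0-9+.-]*):(/*)")

def has_two_slashes(url):
    """Strict view: accept only scheme + exactly two slashes."""
    match = SCHEME_AND_SLASHES.match(url)
    return bool(match) and len(match.group(2)) == 2

def normalize_slashes(url):
    """Lenient view: turn one or more slashes after the scheme into two."""
    match = SCHEME_AND_SLASHES.match(url)
    # Leave file: URLs and schemes without slashes (like mailto:) untouched.
    if not match or match.group(1).lower() == "file" or not match.group(2):
        return url
    return match.group(1) + "://" + url[match.end():]

print(has_two_slashes("http:/example.com"))       # False
print(normalize_slashes("http:////example.com"))  # http://example.com
```

A strict client would reject the redirect, a lenient one would silently repair it, and the server that sent it never learns that it was wrong.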
A space character (the ASCII code 32, 0x20 in hex) cannot be part of a URL. If you want it sent, you percent encode it like you do with any other illegal character you want to be part of the URL. Percent encoding is the byte value in hexadecimal with a percent sign in front of it. %20 thus means space. It also means that a parser that for example scans for URLs in a text knows that it reaches the end of the URL when the parser encounters a character that isn’t allowed. Like space.
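The encoding rule is simple enough to show in a few lines. This sketch uses Python's standard urllib.parse module for the round trip; the percent_encode_byte helper is a made-up name that just spells out the rule by hand.

```python
from urllib.parse import quote, unquote

def percent_encode_byte(value):
    """A percent sign followed by the byte value as two hex digits."""
    return "%{:02X}".format(value)

print(percent_encode_byte(0x20))  # %20
print(quote("/A A"))              # /A%20A
print(unquote("/%41"))            # /A
```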
Browsers typically show the address in their address bars with all %20 instances converted to space for appearance. If you copy the address there into your clipboard and then paste it again in your text editor you still normally get the spaces as %20 like you want them.
I’m not sure if that is the reason, but browsers also accept spaces as part of URLs when for example receiving a redirect in an HTTP response. That’s passed from a server to a client using a Location: header with the URL in it. The browsers happily allow spaces in that URL, encode them as %20 and send out the next request. This forced curl into accepting spaces in redirected “URLs”.
Making URLs support non-ASCII languages is of course important, especially for non-western societies and I’ve understood that the IRI spec was never good enough. I personally am far from an expert on these internationalization (i18n) issues so I just go by what I’ve heard from others. But of course users of non-latin alphabets and typing systems need to be able to write their “internet addresses” to resources and use them as links as well.
In an ideal world, we would have the i18n version shown to users and there would be the encoded ASCII based version below, to get sent over the wire.
For international domain names, the name gets converted over to “punycode” so that it can be resolved using the normal system name resolvers that know nothing about non-ascii names. URIs have no IDN names, IRIs do and WHATWG URLs do. curl supports IDN host names.
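For a feel of what that conversion does, here is a small sketch using the "idna" codec that ships with Python. That codec implements the older IDNA 2003 rules, so treat it as an illustration rather than what any given browser or curl build does; the host name and the to_ascii_host helper are made up for the example.

```python
def to_ascii_host(host):
    """Convert each label of a host name to its ASCII ("xn--") form."""
    return host.encode("idna").decode("ascii")

unicode_host = "räksmörgås.example"
ascii_host = to_ascii_host(unicode_host)

# The ASCII form is what gets handed to the name resolver.
print(ascii_host)

# Converting back gives the name a user would expect to see.
print(ascii_host.encode("ascii").decode("idna"))
```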
WHATWG states that URLs are specified as UTF-8 while URIs are just ASCII. curl gets confused by non-ASCII letters in the path part but percent encodes such byte values in the outgoing requests – which causes “interesting” side-effects when the non-ASCII characters are provided in other encodings than UTF-8 which for example is standard on Windows…
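The side effect is easy to reproduce. In this sketch the same path is percent encoded twice with Python's urllib.parse.quote, once as UTF-8 and once as ISO-8859-1, which here stands in for a legacy single-byte encoding (an assumption made for the example). The path itself is made up.

```python
from urllib.parse import quote

path = "/smörgås"

# The same characters become different bytes, and so different URLs.
as_utf8 = quote(path, encoding="utf-8")
as_latin1 = quote(path, encoding="latin-1")

print(as_utf8)    # /sm%C3%B6rg%C3%A5s
print(as_latin1)  # /sm%F6rg%E5s
```

A server that expects one of these will typically not recognize the other, even though the user typed the very same string.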
Similar to what I’ve written above, this leads to servers passing back non-ASCII byte codes in HTTP headers that browsers gladly accept, and non-browsers need to deal with…
No URL standard
I’ve not tried to write a conclusive list of problems or differences, just a bunch of things I’ve fallen over recently. A “URL” given in one place is certainly not certain to be accepted or understood as a “URL” in another place.
Not even curl follows any published spec very closely these days, as we’re slowly diverging from them for the sake of “web compatibility”.
There’s no unified URL standard and there’s no work in progress towards that. I don’t count WHATWG’s spec as a real effort either, as it is written by a closed group with no real attempts to get the wider community involved.
I’m employed by Mozilla and Mozilla is a member of WHATWG and I have colleagues working on the WHATWG URL spec and other work items of theirs but it makes absolutely no difference to what I’ve written here. I also participate in the IETF and I consider myself friends with authors of RFC 1738, RFC 3986 and others but that doesn’t matter here either. My opinions are my own and this is my personal blog.
On April 12 I had the pleasure of doing another talk in the Google Tech Talk series arranged in the Google Stockholm offices. I had given it the title “HTTP/2 is upon us, and here’s what you need to know about it.” in the invitation.
The room seated 70 persons but we had the amazing amount of over 300 people in the waiting line who unfortunately didn’t manage to get a seat. To those, and to anyone else who cares, here’s the video recording of the event.
If you’ve seen me talk about HTTP/2 before, you might notice that I’ve refreshed the material somewhat since before.
I get to work with open source all day, every day. I get to work for a company that isn’t driven by handing over profits to its owners for some sort of return on investment. I get to work on curl as part of my job. I get to work with internetworking, which is awesomely fun, hard, thrilling and hair-tearing all at once. I get to work with protocol standards like within the IETF and my employer can let me go to meetings. In the struggle for good, against evil and for the users of the world, I think I’m on the right side. For users, for privacy, for openness, for inclusiveness. I feel I’m a mozillian now.
So what did I achieve during my first two years with the dinosaur logo company? Not nearly enough of what I’ve wanted or possibly initially thought I would. I’ve faced a lot of tough bugs and hard challenges and I’ve landed and backed out changes all throughout this period. But I like to think that it is a net gain and even when running head first into a wall, that can be educational and we can learn from it and then when we take a few steps back and race forwards again we can use that knowledge and make better decisions for the future.
Future you say? Yeah, I’m heading on in the same style, without raising my focus point very much and continuously looking for my next thing very close in time. I grab issues to work on with as little foresight as possible but I completely assume they will continue to be tough nuts to crack and there will be new networking issues to conquer going forward as well. I’ll keep working on open source, open standards and a better internet for users. I really enjoy working for Mozilla!
At times I post blog articles that make the view counter go up to and beyond 50,000 views. This puts me in a position where I get offers from companies to mention them or to “cooperate” on further blog posts that would somehow push their agenda or businesses.
I also get the more simple offers of adding random ads or “text only information” on specific individual pages on my sites that some SEO person out there figured out could potentially attract audience that search for specific terms.
I’ve even gotten offers from a company to sell off my server logs. Allegedly to help them work on anti-fraud so possibly for a good cause, but still…
This is by no counts a “big” blog or site, yet I get a steady stream of individuals and companies offering me money to give up a piece of my soul. I can only imagine what more popular sites get and it is clear that someone with a less strict standpoint than mine could easily make an extra income that way.
I turn down all those examples of “easy money”.
I want to be able to look you, my dear readers, straight in the eyes when I say that what’s written here are my own words and the opinions revealed are my own – even if of course you may not agree with me and I may make mistakes and be completely wrong at times or even many times. You can rest assured that I made the mistakes on my own and I was not paid by anyone to make them.
I’ve also removed ads from most of my sites and I don’t run external analytic scripts, minimizing the privacy intrusions and optimizing the contents: the stuff downloaded from my sites is what your browser needs to render the page. Not heaps of useless crap to show ads or to help anyone track you (in order to show more targeted ads).
I don’t judge others’ actions based on how I decide to run my blog. I’m in a fortunate position to take this stand, I realize that.
Still biased of course
This all said, I’m still employed by a company (Mozilla) that pays my salary and I work on several projects that are dear to me so of course I will show bias to some subjects. I don’t claim to have an objective view on things and I don’t even try to have that. When I write posts here, they come colored by my background and by what I am.
I’ve met a bunch of new faces and friends here at the HTTP Workshop in Münster. Several who I’ve only seen or chatted with online before and some that I never interacted with until now. Pretty awesome really.
Out of the almost forty HTTP fanatics present at this workshop, five persons are from Google, four from Mozilla (including myself) and Akamai has three employees here. Those are the top-3 companies. There are a few others with 2 representatives but most people here are the only guys from their company. Yes they are all guys. We are all guys. The male dominance at this event is really extreme and we’ve discussed this sad circumstance during breaks and it hasn’t gone unnoticed.
This particular day started out grand with Eric Rescorla (of Mozilla) talking about HTTP Security in his marvelous high-speed style. Lots of talk about how the HTTPS usage is right now on the web, HTTPS trends, TLS 1.3 details and when it is coming, and we got into a lot of talk about HTTP deprecation and what can and cannot be done etc.
Next up was a presentation about HTTP Privacy and Anonymity by Mike Perry (from the Tor project) about lots of aspects of what the Tor guys consider regarding fingerprinting, correlation, network side-channels and similar things that can be used to attempt to track users or usage over the Tor network. We got into details about what recent protocols like HTTP/2 and QUIC “leak” or open up for fingerprinting and what (if anything) can or could be done to mitigate the effects.
Evolving HTTP Header Fields by Julian Reschke (of Green Bytes) then followed, discussing all the variations of header syntax that we have in HTTP and how it really is not possible to write a generic parser that can handle them, with a suggestion on how to unify this and introduce a common format for future new headers. Julian’s suggestion to use JSON for this ignited a discussion about header formats in general and what should or could be done for HTTP/3 and if keeping support for the old formats is necessary or not going forward. No real consensus was reached.
Willy Tarreau (from HAProxy) then took us into the world of HTTP Infrastructure scaling and Load balancing, and showed us on the microsecond level how fast a load balancer can be, how much extra work adding HTTPS can mean and then ending with a couple suggestions of what he thinks could’ve helped his scenario. That then turned into a general discussion and network architecture brainstorm on what can be done, how it could be improved and what TLS and other protocols could possibly do to aid. Cramming out every possible gigabit out of load balancers certainly is a challenge.
Talking about cramming bits, Kazuho Oku got to show the final slides when he showed how he’s managed to get his picohttpparser to parse HTTP/1 headers at a speed that is only slightly slower than strlen() – including a raw dump of the x86 assembler the code is turned into by a compiler. What could possibly be a better way to end a day full of protocol geekery?
Google graciously sponsored the team dinner in the evening at a Peruvian place in the town! Yet another fully packed day has ended.
I’ll top off today’s summary with a picture of the gift Mark Nottingham (who’s herding us through these days) was handing out today to make us stay keen and alert (Mark pointed out to me that this was a gift from one of our Japanese friends here):
My series of weekly videos, for lack of a better name called daniel weekly, reached episode 35 today. I’m celebrating this fact by also adding an RSS-feed for those of you who prefer to listen to me in an audio-only version.
As an avid podcast listener myself, I can certainly see how this will be a better fit to some. Most of these videos are just me talking anyway so losing the visual shouldn’t be much of a problem.
A typical episode
I talk about what I work on in my open source projects, which means a lot of curl stuff and occasional stuff from my work on Firefox for Mozilla. I also tend to mention events I attend and HTTP/networking developments that I find interesting and that grab my attention. Lots of HTTP/2 talk for example. I only ever express my own personal opinions.
It is generally an extremely geeky and technical video series.
Every week I mention a (curl) “bug of the week” that allows me to joke or rant about the bug in question or just mention what it is about. In episode 31 I started my “command line options of the week” series in which I explain one or a few curl command line options with some amount of detail. There are over 170 options so the series is bound to continue for a while. I’ve explained ten options so far.
I’ve set a limit for myself and I make an effort to keep the episodes shorter than 20 minutes. I’ve not succeeded every time.
The 35 episodes have been viewed over 17,000 times in total. Episode two is the most watched individual one with almost 1,500 views.
Right now, my channel has 190 subscribers.
The top-3 countries that watch my videos: USA, Sweden and UK.
Share of viewers that are female: 3.7%
I talked in the Mozilla devroom at FOSDEM 2015. Here are the slides from it. It was recorded on video and I will post a suitable link to that once it becomes available. The talk was meant to be 20 minutes, I think I did it in 22 or so.
Sunday 13:00, embedded room (Lameere)
Embedded devices are very often network connected these days. Network connected embedded devices often need to transfer data to and from them as clients, using one or more of the popular internet protocols.
libcurl is the world’s most used and most popular internet transfer library, already used in every imaginable sort of embedded device out there. How did this happen and how do you use libcurl to transfer data to or from your device?
Note that this talk was originally scheduled to be at a different time!
Sunday, 09:00 Mozilla room (UD2.218A)
Title: HTTP/2 right now
HTTP/2 is the new version of the web’s most important and used protocol. Version 2 is due to be out very soon after FOSDEM and I want to inform the audience about what’s going on with the protocol, why it matters to most web developers and users and, not least, what its status is at the time of FOSDEM.