Tag Archives: HTTP

Post FOSDEM 2017

I attended FOSDEM again in 2017 and it was as intense, chaotic and wonderful as ever. I met old friends, got new friends and I got to test a whole range of Belgian beers. Oh, and there was also a set of great open source related talks to enjoy!

On Saturday at 2pm I delivered my talk on curl in the main track in the almost frighteningly large room Janson. I estimate that it was almost half full, which would mean upwards 700 people in the audience. The talk itself went well. I got audible responses from the audience several times and I kept well within my given time with time over for questions. The trickiest problem was the audio from the people who asked questions because it wasn’t at all very easy to hear, while the audio is great for the audience and in the video recording. Slightly annoying because as everyone else heard, it made me appear half deaf. Oh well. I got great questions both then and from people approaching me after the talk. The questions and the feedback I get from a talk is really one of the things that makes me appreciate talking the most.

The video of the talk is available, and the slides can also be viewed.

So after I had spent some time discussing curl things and handing out many stickers after my talk, I managed to land in the cafeteria for a while until it was time for me to once again go and perform.

We’re usually a team of friends that hang out during FOSDEM and we all went over to the Mozilla room to be there perhaps 20 minutes before my talk was scheduled and wow, there was a huge crowd outside of that room already waiting by the time we arrived. When the doors then finally opened (about 10 minutes before my talk started), I had to zigzag my way through to get in, and there was a large amount of people who didn’t get in. None of my friends from the cafeteria made it in!

The Mozilla devroom had 363 seats, not a single one was unoccupied and there was people standing along the sides and the back wall. So, an estimated nearly 400 persons in that room saw me speak about HTTP/2 deployments numbers right now, how HTTP/2 doesn’t really work well under 2% packet loss situations and then a bit about how QUIC can solve some of that and what QUIC is and when we might see the first experiments coming with IETF-QUIC – which really isn’t the same as Google-QUIC was.

To be honest, it is hard to deliver a talk in twenty minutes and I  was only 30 seconds over my time. I got questions and after the talk I spent a long time talking with people about HTTP, HTTP/2, QUIC, curl and the future of Internet protocols and transports. Very interesting.

The video of my talk can be seen, and the slides are online too.

I’m not sure if I was just unusually unlucky in my choices, or if there really was more people this year, but I experienced that “FULL” sign more than usual this year.

I fully intend to return again next year. Who knows, maybe I’ll figure out something to talk about then too. See you there?

QUIC is h2 over UDP

The third day of the QUIC interim passed and now that meeting has ended. It continued to work very well to attend from remote and the group manged to plow through an extensive set of issues. A lot of consensus was achieved and I personally now have a much better feel for the protocol and many of its details thanks to the many discussions.

The drafts are still a bit too early for us to start discussing inter-op for real. But there were mentions and hopes expressed that maybe maybe we might start to see some of that by mid 2017. When we did HTTP/2, we had about 10 different implementations by the time draft-04 was out. I suspect we will see a smaller set for QUIC simply because of it being much more complex.

The next interim is planned to occur in the beginning of June in Europe.

There is an official QUIC logo being designed, but it is not done yet so you still need to imagine one placed here.

QUIC needs HTTP/2 needs HTTP/1

QUIC is primarily designed to send and receive HTTP/2 frames and entire streams over UDP (not only, but this is where the bulk of the work has been put in so far). Sure, TLS encrypted and everything, but my point here is that it is being designed to transfer HTTP/2 frames. You remember how HTTP/2 is “just a new framing” layer that changes how HTTP is sent over the wire, but when “decoded” again in the receiving end it is in most important aspects still HTTP/1 there. You have to implement most of a HTTP/1 stack in order to support HTTP/2. Now QUIC adds another layer to that. QUIC is a new way to send HTTP/2 frames over the network.

A QUIC stack needs to handle most aspects of HTTP/2!

Of course, there are notable differences and changes to some underlying principles that makes QUIC a bit different. It isn’t exactly HTTP/2 over secure UDP. Let me give you a few examples…

Streams are more independent

Packets sent over the wire with UDP are independent from each other to a very large degree. In order to avoid Head-of-Line blocking (HoL), packets that are lost and re-transmitted will only block the particular streams to which the lost packets belong. The other streams can keep flowing, unaware and uncaring.

Thanks to the nature of the Internet and how packets are handled, it is not unusual for network packets to arrive in a slightly different order than they were sent, even when they aren’t exactly “lost”.

So, streams in HTTP/2 were entirely synced and the order the sender of frames use, will be the exact same order in which the frames arrive in the other end. Packet loss or not.

In QUIC, individual frames and entire streams may arrive in the receiver in a different order than what was used in the sender.

Stream ID gaps means open

When receiving a QUIC packet, there’s basically no way to know if there are packets missing that were intended to arrive but got lost and haven’t yet been re-transmitted.

If a frame is received that uses the new stream ID N (a stream not previously seen), the receiver is then forced to assume that all the other streams ID from our previously highest ID to N are all just missing and will arrive soon. They are then presumed to exist!

In HTTP/2, we could handle gaps in stream IDs much differently because of TCP. Then a gap is known to be deliberate.

Some h2 frames are done by QUIC

Since QUIC is designed with streams, flow control and more and is used to send HTTP/2 frames over them, some of the h2 frames aren’t needed but are instead handled by the transport layer within QUIC and won’t show up in the HTTP/2 layer.

HPACK goes QPACK?

HPACK is the header compression system used in HTTP/2. Among other things it features a dictionary that you manipulate with instructions and then subsequent header frames can refer to those dictionary indexes instead of sending the full header. Header frame one says “insert my user-agent string” and then header frame two can refer back to the index in the dictionary for where that identical user-agent string is stored.

Due to the out of order streams in QUIC, this dictionary treatment is harder. The second header frame could arrive before the first, so if it would refer to an index set in the first header frame, it would have to block the entire stream until that first header arrives.

HPACK also has a concept of just adding things to the dictionary without specifying the index, and since both sides are in perfect sync it works just fine. In QUIC, if we want to maintain the independence of streams and avoid blocking to the highest degree, we need to instead specify exact indexes to use and not assume perfect sync.

This (and more) are reasons why QPACK is being suggested as a replacement for HPACK when HTTP/2 header frames are sent over QUIC.

My first 20 years of HTTP

During the autumn 1996 I took my first swim in the ocean known as HTTP. Twenty years ago now.

I had previously worked with writing an IRC bot in C, and IRC is a pretty simple text based protocol over TCP so I could use some experiences from that when I started to look into HTTP. That IRC bot was my first real application distributed to the world that was using TCP/IP. It was portable to most unixes and Amiga and it was open source.

1996 was the year the movie Independence Day premiered and the single hit song that plagued the world more than others that year was called Macarena. AOL, Webcrawler and Netscape were the most popular websites on the Internet. There were less than 300,000 web sites on the Internet (compared to some 900 million today).

I decided I should spice up the bot and make it offer a currency exchange rate service so that people who were chatting could ask the bot what 200 SEK is when converted to USD or what 50 AUD might be in DEM. – Right, there was no Euro currency yet back then!

I simply had to fetch the currency rates at a regular interval and keep them in the same server that ran the bot. I just needed a little tool to download the rates over HTTP. How hard can that be? I googled around (this was before Google existed so that was not the search engine I could use!) and found a tool named ‘httpget’ that made pretty much what I wanted. It truly was tiny – a few hundred nokia-1610lines of code.

I don’t have an exact date saved or recorded for when this happened, only the general time frame. You know, we had no smart phones, no Google calendar and no digital cameras. I sported my first mobile phone back then, the sexy Nokia 1610 – viewed in the picture on the right here.

The HTTP/1.0 RFC had just recently came out – which was the first ever real spec published for HTTP. RFC 1945 was published in May 1996, but I was blissfully unaware of the youth of the standard and I plunged into my little project. This was the first published HTTP spec and it says:

HTTP has been in use by the World-Wide Web global information initiative since 1990. This specification reflects common usage of the protocol referred too as "HTTP/1.0". This specification describes the features that seem to be consistently implemented in most HTTP/1.0 clients and servers.

Many years after that point in time, I have learned that already at this time when I first searched for a HTTP tool to use, wget already existed. I can’t recall that I found that in my searches, and if I had found it maybe history would’ve made a different turn for me. Or maybe I found it and discarded for a reason I can’t remember now.

I wasn’t the original author of httpget; Rafael Sagula was. But I started contributing fixes and changes and soon I was the maintainer of it. Unfortunately I’ve lost my emails and source code history from those earliest years so I cannot easily show my first steps. Even the oldest changelogs show that we very soon got help and contributions from users.

The earliest saved code archive I have from those days, is from after we had added support for Gopher and FTP and renamed the tool ‘urlget’. urlget-3.5.zip was released on January 20 1998 which thus was more than a year later my involvement in httpget started.

The original httpget/urlget/curl code was stored in CVS and it was licensed under the GPL. I did most of the early development on SunOS and Solaris machines as my first experiments with Linux didn’t start until 97/98 something.

sparcstation-ipc

The first web page I know we have saved on archive.org is from December 1998 and by then the project had been renamed to curl already. Roughly two years after the start of the journey.

RFC 2068 was the first HTTP/1.1 spec. It was released already in January 1997, so not that long after the 1.0 spec shipped. In our project however we stuck with doing HTTP 1.0 for a few years longer and it wasn’t until February 2001 we first started doing HTTP/1.1 requests. First shipped in curl 7.7. By then the follow-up spec to HTTP/1.1, RFC 2616, had already been published as well.

The IETF working group called HTTPbis was started in 2007 to once again refresh the HTTP/1.1 spec, but it took me a while until someone pointed out this to me and I realized that I too could join in there and do my part. Up until this point, I had not really considered that little me could actually participate in the protocol doings and bring my views and ideas to the table. At this point, I learned about IETF and how it works.

I posted my first emails on that list in the spring 2008. The 75th IETF meeting in the summer of 2009 was held in Stockholm, so for me still working  on HTTP only as a spare time project it was very fortunate and good timing. I could meet a lot of my HTTP heroes and HTTPbis participants in real life for the first time.

I have participated in the HTTPbis group ever since then, trying to uphold the views and standpoints of a command line tool and HTTP library – which often is not the same as the web browsers representatives’ way of looking at things. Since I was employed by Mozilla in 2014, I am of course now also in the “web browser camp” to some extent, but I remain a protocol puritan as curl remains my first “child”.

HTTP/2 connection coalescing

Section 9.1.1 in RFC7540 explains how HTTP/2 clients can reuse connections. This is my lengthy way of explaining how this works in reality.

Many connections in HTTP/1

With HTTP/1.1, browsers are typically using 6 connections per origin (host name + port). They do this to overcome the problems in HTTP/1 and how it uses TCP – as each connection will do a fair amount of waiting. Plus each connection is slow at start and therefore limited to how much data you can get and send quickly, you multiply that data amount with each additional connection. This makes the browser get more data faster (than just using one connection).

6 connections

Add sharding

Web sites with many objects also regularly invent new host names to trigger browsers to use even more connections. A practice known as “sharding”. 6 connections for each name. So if you instead make your site use 4 host names you suddenly get 4 x 6 = 24 connections instead. Mostly all those host names resolve to the same IP address in the end anyway, or the same set of IP addresses. In reality, some sites use many more than just 4 host names.

24 connections

The sad reality is that a very large percentage of connections used for HTTP/1.1 are only ever used for a single HTTP request, and a very large share of the connections made for HTTP/1 are so short-lived they actually never leave the slow start period before they’re killed off again. Not really ideal.

One connection in HTTP/2

With the introduction of HTTP/2, the HTTP clients of the world are going toward using a single TCP connection for each origin. The idea being that one connection is better in packet loss scenarios, it makes priorities/dependencies work and reusing that single connections for many more requests will be a net gain. And as you remember, HTTP/2 allows many logical streams in parallel over that single connection so the single connection doesn’t limit what the browsers can ask for.

Unsharding

The sites that created all those additional host names to make the HTTP/1 browsers use many connections now work against the HTTP/2 browsers’ desire to decrease the number of connections to a single one. Sites don’t want to switch back to using a single host name because that would be a significant architectural change and there are still a fair number of HTTP/1-only browsers still in use.

Enter “connection coalescing”, or “unsharding” as we sometimes like to call it. You won’t find either term used in RFC7540, as it merely describes this concept in terms of connection reuse.

Connection coalescing means that the browser tries to determine which of the remote hosts that it can reach over the same TCP connection. The different browsers have slightly different heuristics here and some don’t do it at all, but let me try to explain how they work – as far as I know and at this point in time.

Coalescing by example

Let’s say that this cool imaginary site “example.com” has two name entries in DNS: A.example.com and B.example.com. When resolving those names over DNS, the client gets a list of IP address back for each name. A list that very well may contain a mix of IPv4 and IPv6 addresses. One list for each name.

You must also remember that HTTP/2 is also only ever used over HTTPS by browsers, so for each origin speaking HTTP/2 there’s also a corresponding server certificate with a list of names or a wildcard pattern for which that server is authorized to respond for.

In our example we start out by connecting the browser to A. Let’s say resolving A returns the IPs 192.168.0.1 and 192.168.0.2 from DNS, so the browser goes on and connects to the first of those addresses, the one ending with “1”. The browser gets the server cert back in the TLS handshake and as a result of that, it also gets a list of host names the server can deal with: A.example.com and B.example.com. (it could also be a wildcard like “*.example.com”)

If the browser then wants to connect to B, it’ll resolve that host name too to a list of IPs. Let’s say 192.168.0.2 and 192.168.0.3 here.

Host A: 192.168.0.1 and 192.168.0.2
Host B: 192.168.0.2 and 192.168.0.3

Now hold it. Here it comes.

The Firefox way

Host A has two addresses, host B has two addresses. The lists of addresses are not the same, but there is an overlap – both lists contain 192.168.0.2. And the host A has already stated that it is authoritative for B as well. In this situation, Firefox will not make a second connect to host B. It will reuse the connection to host A and ask for host B’s content over that single shared connection. This is the most aggressive coalescing method in use.

one connection

The Chrome way

Chrome features a slightly less aggressive coalescing. In the example above, when the browser has connected to 192.168.0.1 for the first host name, Chrome will require that the IPs for host B contains that specific IP for it to reuse that connection.  If the returned IPs for host B really are 192.168.0.2 and 192.168.0.3, it clearly doesn’t contain 192.168.0.1 and so Chrome will create a new connection to host B.

Chrome will reuse the connection to host A if resolving host B returns a list that contains the specific IP of the connection host A is already using.

The Edge and Safari ways

They don’t do coalescing at all, so each host name will get its own single connection. Better than the 6 connections from HTTP/1 but for very sharded sites that means a lot of connections even in the HTTP/2 case.

curl also doesn’t coalesce anything (yet).

Surprises and a way to mitigate them

Given some comments in the Firefox bugzilla, the aggressive coalescing sometimes causes some surprises. Especially when you have for example one IPv6-only host A and a second host B with both IPv4 and IPv4 addresses. Asking for data on host A can then still use IPv4 when it reuses a connection to B (assuming that host A covers host B in its cert).

In the rare case where a server gets a resource request for an authority (or scheme) it can’t serve, there’s a dedicated error code 421 in HTTP/2 that it can respond with and the browser can then  go back and retry that request on another connection.

Starts out with 6 anyway

Before the browser knows that the server speaks HTTP/2, it may fire up 6 connection attempts so that it is prepared to get the remote site at full speed. Once it figures out that it doesn’t need all those connections, it will kill off the unnecessary unused ones and over time trickle down to one. Of course, on subsequent connections to the same origin the client may have the version information cached so that it doesn’t have to start off presuming HTTP/1.

A third day of deep HTTP inspection

The workshop roomThis fine morning started off with some news: Patrick is now our brand new official co-chair of the IETF HTTPbis working group!

Subodh then sat down and took us off on a presentation that really triggered a long and lively discussion. “Retry safety extensions” was his name of it but it involved everything from what browsers and HTTP clients do for retrying with no response and went on to also include replaying problems for 0-RTT protocols such as TLS 1.3.

Julian did a short presentation on http headers and his draft for JSON in new headers and we quickly fell down a deep hole of discussions around various formats with ups and downs on them all. The general feeling seems to be that JSON will not be a good idea for headers in spite of a couple of good characteristics, partly because of its handling of duplicate field entries and how it handles or doesn’t handle numerical precision (ie you can send “100” as a monstrously large floating point number).

Mike did a presentation he called “H2 Regrets” in which he covered his work on a draft for support of client certs which was basically forbidden due to h2’s ban of TLS renegotiation, he brought up the idea of extended settings and discussed the lack of special handling dates in HTTP headers (why we send 29 bytes instead of 4). Shows there are improvements to be had in the future too!

Martin talked to us about Blind caching and how the concept of this works. Put very simply: it is a way to make it possible to offer cached content for clients using HTTPS, by storing the data in a 3rd host and pointing out that data to the client. There was a lengthy discussion around this and I think one of the outstanding questions is if this feature is really giving as much value to motivate the rather high cost in complexity…

The list of remaining Lightning Talks had grown to 10 talks and we fired them all off at a five minutes per topic pace. I brought up my intention and hope that we’ll do a QUIC library soon to experiment with. I personally particularly enjoyed EKR’s TLS 1.3 status summary. I heard appreciation from others and I agree with this that the idea to feature lightning talks was really good.

With this, the HTTP Workshop 2016 was officially ended. There will be a survey sent out about this edition and what people want to do for the next/future ones, and there will be some sort of  report posted about this event from the organizers, summarizing things.

Attendees numbers

http workshopThe companies with most attendees present here were: Mozilla 5, Google 4, Facebook, Akamai and Apple 3.

The attendees were from the following regions of the world: North America 19, Europe 15, Asia/pacific 6.

38 participants were male and 2 female.

23 of us were also at the 2015 workshop, 17 were newcomers.

15 people did lightning talks.

I believe 40 is about as many as you can put in a single room and still have discussions. Going larger will make it harder to make yourself heard as easily and would probably force us to have to switch to smaller groups more and thus not get this sort of great dynamic flow. I’m not saying that we can’t do this smaller or larger, just that it would have to make the event different.

Some final words

I had an awesome few days and I loved all of it. It was a pleasure organizing this and I’m happy that Stockholm showed its best face weather wise during these days. I was also happy to hear that so many people enjoyed their time here in Sweden. The hotel and its facilities, including food and coffee etc worked out smoothly I think with no complaints at all.

Hope to see again on the next HTTP Workshop!

No websockets over HTTP/2

There is no websockets for HTTP/2.

By this, I mean that there’s no way to negotiate or upgrade a connection to websockets over HTTP/2 like there is for HTTP/1.1 as expressed by RFC 6455. That spec details how a client can use Upgrade: in a HTTP/1.1 request to switch that connection into a websockets connection.

Note that websockets is not part of the HTTP/1 spec, it just uses a HTTP/1 protocol detail to switch an HTTP connection into a websockets connection. Websockets over HTTP/2 would similarly not be a part of the HTTP/2 specification but would be separate.

(As a side-note, that Upgrade: mechanism is the same mechanism a HTTP/1.1 connection can get upgraded to HTTP/2 if the server supports it – when not using HTTPS.)

chinese-socket

Draft

There’s was once a draft submitted that describes how websockets over HTTP/2 could’ve been done. It didn’t get any particular interest in the IETF HTTP working group back then and as far as I’ve seen, there has been very little general interest in any group to pick up this dropped ball and continue running. It just didn’t go any further.

This is important: the lack of websockets over HTTP/2 is because nobody has produced a spec (and implementations) to do websockets over HTTP/2. Those things don’t happen by themselves, they actually require a bunch of people and implementers to believe in the cause and work for it.

Websockets over HTTP/2 could of course have the benefit that it would only be one stream over the connection that could serve regular non-websockets traffic at the same time in many other streams, while websockets upgraded on a HTTP/1 connection uses the entire connection exclusively.

Instead

So what do users do instead of using websockets over HTTP/2? Well, there are several options. You probably either stick to HTTP/2, upgrade from HTTP/1, use Web push or go the WebRTC route!

If you really need to stick to websockets, then you simply have to upgrade to that from a HTTP/1 connection – just like before. Most people I’ve talked to that are stuck really hard on using websockets are app developers that basically only use a single connection anyway so doing that HTTP/1 or HTTP/2 makes no meaningful difference.

Sticking to HTTP/2 pretty much allows you to go back and use the long-polling tricks of the past before websockets was created. They were once rather bad since they would waste a connection and be error-prone since you’d have a connection that would sit idle most of the time. Doing this over HTTP/2 is much less of a problem since it’ll just be a single stream that won’t be used that much so it isn’t that much of a waste. Plus, the connection may very well be used by other streams so it will be less of a problem with idle connections getting killed by NATs or firewalls.

The Web Push API was brought by W3C during 2015 and is in many ways a more “webby” way of doing push than the much more manual and “raw” method that websockets is. If you use websockets mostly for push notifications, then this might be a more convenient choice.

Also introduced after websockets, is WebRTC. This is a technique introduced for communication between browsers, but it certainly provides an alternative to some of the things websockets were once used for.

Future

Websockets over HTTP/2 could still be done. The fact that it isn’t done just shows that there isn’t enough interest.

Non-TLS

Recall how browsers only speak HTTP/2 over TLS, while websockets can also be done over plain TCP. In fact, the only way to upgrade a HTTP connection to websockets is using the HTTP/1 Upgrade: header trick, and not the ALPN method for TLS that HTTP/2 uses to reduce the number of round-trips required.

If anyone would introduce websockets over HTTP/2, they would then probably only be possible to be made over TLS from within browsers.

My URL isn’t your URL

URLs

When I started the precursor to the curl project, httpget, back in 1996, I wrote my first URL parser. Back then, the universal address was still called URL: Uniform Resource Locators. That spec was published by the IETF in 1994. The term “URL” was then used as source for inspiration when naming the tool and project curl.

The term URL was later effectively changed to become URI, Uniform Resource Identifiers (published in 2005) but the basic point remained: a syntax for a string to specify a resource online and which protocol to use to get it. We claim curl accepts “URLs” as defined by this spec, the RFC 3986. I’ll explain below why it isn’t strictly true.

There was also a companion RFC posted for IRI: Internationalized Resource Identifiers. They are basically URIs but allowing non-ascii characters to be used.

The WHATWG consortium later produced their own URL spec, basically mixing formats and ideas from URIs and IRIs with a (not surprisingly) strong focus on browsers. One of their expressed goals is to “Align RFC 3986 and RFC 3987 with contemporary implementations and obsolete them in the process“. They want to go back and use the term “URL” as they rightfully state, the terms URI and IRI are just confusing and no humans ever really understood them (or often even knew they exist).

The WHATWG spec follows the good old browser mantra of being very liberal in what it accepts and trying to guess what the users mean and bending backwards trying to fulfill. (Even though we all know by now that Postel’s Law is the wrong way to go about this.) It means it’ll handle too many slashes, embedded white space as well as non-ASCII characters.

From my point of view, the spec is also very hard to read and follow due to it not describing the syntax or format very much but focuses far too much on mandating a parsing algorithm. To test my claim: figure out what their spec says about a trailing dot after the host name in a URL.

On top of all these standards and specs, browsers offer an “address bar” (a piece of UI that often goes under other names) that allows users to enter all sorts of fun strings and they get converted over to a URL. If you enter “http://localhost/%41” in the address bar, it’ll convert the percent encoded part to an ‘A’ there for you (since 41 in hex is a capital A in ASCII) but if you type “http://localhost/A A” it’ll actually send “/A%20A” (with a percent encoded space) in the outgoing HTTP GET request. I’m mentioning this since people will often think of what you can enter there as a “URL”.

The above is basically my (skewed) perspective of what specs and standards we have so far to work with. Now we add reality and let’s take a look at what sort of problems we get when my URL isn’t your URL.

So what  is a URL?

Or more specifically, how do we write them. What syntax do we use.

I think one of the biggest mistakes the WHATWG spec has made (and why you will find me argue against their spec in its current form with fierce conviction that they are wrong), is that they seem to believe that URLs are theirs to define and work with and they limit their view of URLs for browsers, HTML and their address bars. Sure, they are the big companies behind the browsers almost everyone uses and URLs are widely used by browsers, but URLs are still much bigger than so.

The WHATWG view of a URL is not widely adopted outside of browsers.

colon-slash-slash

If we ask users, ordinary people with no particular protocol or web expertise, what a URL is what would they answer? While it was probably more notable years ago when the browsers displayed it more prominently, the :// (colon-slash-slash) sequence will be high on the list. Seeing that marks the string as a URL.

Heck, going beyond users, there are email clients, terminal emulators, text editors, perl scripts and a bazillion other things out there in the world already that detects URLs for us and allows operations on that. It could be to open that URL in a browser, to convert it to a clickable link in generated HTML and more. A vast amount of said scripts and programs will use the colon-slash-slash sequence as a trigger.

The WHATWG spec says it has to be one slash and that a parser must accept an indefinite amount of slashes. “http:/example.com” and “http:////////////////////////////////////example.com” are both equally fine. RFC 3986 and many others would disagree. Heck, most people I’ve confronted the last few days, even people working with the web, seem to say, think and believe that a URL has two slashes. Just look closer at the google picture search screen shot at the top of this article, which shows the top images for “URL” google gave me.

We just know a URL has two slashes there (and yeah, file: URLs most have three but lets ignore that for now). Not one. Not three. Two. But the WHATWG doesn’t agree.

“Is there really any reason for accepting more than two slashes for non-file: URLs?” (my annoyed question to the WHATWG)

“The fact that all browsers do.”

The spec says so because browsers have implemented the spec.

No better explanation has been provided, not even after I pointed out that the statement is wrong and far from all browsers do. You may find reading that thread educational.

In the curl project, we’ve just recently started debating how to deal with “URLs” having another amount of slashes than two because it turns out there are servers sending back such URLs in Location: headers, and some browsers are happy to oblige. curl is not and neither is a lot of other libraries and command line tools. Who do we stand up for?

Spaces

A space character (the ASCII code 32, 0x20 in hex) cannot be part of a URL. If you want it sent, you percent encode it like you do with any other illegal character you want to be part of the URL. Percent encoding is the byte value in hexadecimal with a percent sign in front of it. %20 thus means space. It also means that a parser that for example scans for URLs in a text knows that it reaches the end of the URL when the parser encounters a character that isn’t allowed. Like space.

Browsers typically show the address in their address bars with all %20 instances converted to space for appearance. If you copy the address there into your clipboard and then paste it again in your text editor you still normally get the spaces as %20 like you want them.

I’m not sure if that is the reason, but browsers also accept spaces as part of URLs when for example receiving a redirect in a HTTP response. That’s passed from a server to a client using a Location: header with the URL in it. The browsers happily allow spaces in that URL, encode them as %20 and send out the next request. This forced curl into accepting spaces in redirected “URLs”.

Non-ASCII

Making URLs support non-ASCII languages is of course important, especially for non-western societies and I’ve understood that the IRI spec was never good enough. I personally am far from an expert on these internationalization (i18n) issues so I just go by what I’ve heard from others. But of course users of non-latin alphabets and typing systems need to be able to write their “internet addresses” to resources and use as links as well.

In an ideal world, we would have the i18n version shown to users and there would be the encoded ASCII based version below, to get sent over the wire.

For international domain names, the name gets converted over to “punycode” so that it can be resolved using the normal system name resolvers that know nothing about non-ascii names. URIs have no IDN names, IRIs do and WHATWG URLs do. curl supports IDN host names.

WHATWG states that URLs are specified as UTF-8 while URIs are just ASCII. curl gets confused by non-ASCII letters in the path part but percent encodes such byte values in the outgoing requests – which causes “interesting” side-effects when the non-ASCII characters are provided in other encodings than UTF-8 which for example is standard on Windows…

Similar to what I’ve written above, this leads to servers passing back non-ASCII byte codes in HTTP headers that browsers gladly accept, and non-browsers need to deal with…

No URL standard

I’ve not tried to write a conclusive list of problems or differences, just a bunch of things I’ve fallen over recently. A “URL” given in one place is certainly not certain to be accepted or understood as a “URL” in another place.

Not even curl follows any published spec very closely these days, as we’re slowly digressing for the sake of “web compatibility”.

There’s no unified URL standard and there’s no work in progress towards that. I don’t count WHATWG’s spec as a real effort either, as it is written by a closed group with no real attempts to get the wider community involved.

My affiliation

I’m employed by Mozilla and Mozilla is a member of WHATWG and I have colleagues working on the WHATWG URL spec and other work items of theirs but it makes absolutely no difference to what I’ve written here. I also participate in the IETF and I consider myself friends with authors of RFC 1738, RFC 3986 and others but that doesn’t matter here either. My opinions are my own and this is my personal blog.

HTTP/2 in April 2016

On April 12 I had the pleasure of doing another talk in the Google Tech Talk series arranged in the Google Stockholm offices. I had given it the title “HTTP/2 is upon us, and here’s what you need to know about it.” in the invitation.

The room seated 70 persons but we had the amazing amount of over 300 people in the waiting line who unfortunately didn’t manage to get a seat. To those, and to anyone else who cares, here’s the video recording of the event.

If you’ve seen me talk about HTTP/2 before, you might notice that I’ve refreshed the material somewhat since before.

Summers are for HTTP

stockholm castle and ship
Stockholm City, as photographed by Michael Caven

In July 2015, 40-something HTTP implementers and experts of the world gathered in the city of Münster, Germany, to discuss nitty gritty details about the HTTP protocol during four intense days. Representatives for major browsers, other well used HTTP tools and the most popular HTTP servers were present. We discussed topics like how HTTP/2 had done so far, what we thought we should fix going forward and even some early blue sky talk about what people could potentially see being subjects to address in a future HTTP/3 protocol.

You can relive the 2015 version somewhat from my daily blog entries from then that include a bunch of details of what we discussed: day one, two, three and four.

http workshopThe HTTP Workshop was much appreciated by the attendees and it is now about to be repeated. In the summer of 2016, the HTTP Workshop is again taking place in Europe, but this time as a three-day event slightly further up north: in the capital of Sweden and my home town: Stockholm. During 25-27 July 2016, we intend to again dig in deep.

If you feel this is something for you, then please head over to the workshop site and submit your proposal and show your willingness to attend. This year, I’m also joining the Program Committee and I’ve signed up for arranging some of the local stuff required for this to work out logistically.

The HTTP Workshop 2015 was one of my favorite events of last year. I’m now eagerly looking forward to this year’s version. It’ll be great to meet you here!

Stockholm
The city of Stockholm in summer sunshine