Category Archives: Web

web stuff

HTTP Workshop 2016, day -1

The HTTP Workshop 2016 will take place in Stockholm starting tomorrow, Monday, as I’ve mentioned before. Today we’ll start off slowly by having a few pre-workshop drinks and saying hello to old and new friends.

I did a casual count, and out of the 40 attendees coming, I believe slightly fewer than half are newcomers who didn’t attend the workshop last year. We’ll see browser people come, more independent HTTP implementers, CDN representatives, server and intermediary developers as well as some friends from large HTTP operators/sites. I personally view my attendance to be primarily with my curl hat on rather than my Firefox one. Firmly standing in the client side trenches anyway.

Visitors to Stockholm these days are also lucky enough to arrive when the weather is possibly as good as it can get here: this is the warmest period of the summer so far, with lots of sun and really long, bright summer days.

News this year includes the @http_workshop twitter account. If you have questions or concerns for HTTP workshoppers, do send them that way and they might get addressed or at least noticed.

I’ll try to take notes and post summaries of each workshop day here. Of course I will fully respect our conference rules about what to reveal or not.


curl wants to QUIC

The interesting Google transfer protocol that is known as QUIC is being passed through the IETF grinding machines to hopefully end up with a proper “spec” that has been reviewed and agreed to by many peers and that will end up being a protocol that is thoroughly documented with a lot of protocol people’s consensus. Follow the IETF QUIC mailing list for all the action.

I’d like us to join the fun

Similarly to how we implemented HTTP/2 support early on for curl, I would like us to get “on the bandwagon” early for QUIC to be able to both aid the protocol development and serve as a testing tool for both the protocol and the server implementations but then also of course to get us a solid implementation for users who’d like a proper QUIC capable client for data transfers.

implementations

The current version (made entirely by Google and not the output of the work they’re now doing on it within the IETF) of the QUIC protocol is already being widely used as Chrome speaks it with Google’s services in preference to HTTP/2 and other protocol options. There exist only a few other implementations of QUIC outside of the official ones Google offers as open source. Caddy offers a separate server implementation for example.

the Google code base

For curl’s sake, it can’t use the Google code as a basis for a QUIC implementation since it is C++ and code used within the Chrome browser is really too entangled with the browser and its particular environment to become very good when converted into a library. There’s a libquic project doing exactly this.

for curl and others

The ideal way to implement QUIC for curl would be to create an “nghttp2” alternative that does QUIC. An ngquic if you will! A library that handles the low level protocol fiddling, the binary framing etc. Done that way, a QUIC library could be used by more projects who’d like QUIC support and all people who’d like to see this protocol supported in those tools and libraries could join in and make it happen. Such a library would need to be written in plain C and be suitably licensed for it to be really interesting for curl use.

a needed QUIC library

I’m hoping my post here will inspire someone to get such a project going. I will not hesitate to join in and help it get somewhere! I haven’t started such a project myself because I think I already have enough projects on my plate so I fear I wouldn’t be a good leader or maintainer of a project like this. But of course, if nobody else will do it I will do it myself eventually. If I can think of a good name for it.

some wishes for such a library

  • Written in C, to offer the same level of portability as curl itself and to allow it to get used as extensions by other languages etc
  • FOSS-licensed suitably
  • It should preferably not “own” the socket but also work in-memory and to allow applications to do many parallel connections etc.
  • Non-blocking. It shouldn’t wait for things on its own but let the application do that.
  • Should probably offer both client and server functionality for maximum use.
  • What else?

No websockets over HTTP/2

There is no websockets for HTTP/2.

By this, I mean that there’s no way to negotiate or upgrade a connection to websockets over HTTP/2 like there is for HTTP/1.1 as expressed by RFC 6455. That spec details how a client can use Upgrade: in a HTTP/1.1 request to switch that connection into a websockets connection.
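To make the HTTP/1.1 side concrete, here is a minimal sketch of the request a client sends to ask for that switch. The header names and the version number 13 come from RFC 6455; the function name ws_upgrade_request() and its parameters are made up for this illustration, and the sketch assumes the caller has already produced the base64 encoded 16-byte nonce for Sec-WebSocket-Key.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper, not part of any real library: write an RFC 6455
   style opening handshake request into 'buf'. 'key_b64' is the base64
   encoded 16-byte nonce the client picked.
   Returns 0 on success, -1 if 'buf' is too small. */
int ws_upgrade_request(char *buf, size_t len, const char *host,
                       const char *path, const char *key_b64)
{
  int n = snprintf(buf, len,
                   "GET %s HTTP/1.1\r\n"
                   "Host: %s\r\n"
                   "Upgrade: websocket\r\n"
                   "Connection: Upgrade\r\n"
                   "Sec-WebSocket-Key: %s\r\n"
                   "Sec-WebSocket-Version: 13\r\n"
                   "\r\n",
                   path, host, key_b64);
  /* snprintf returns the length it wanted to write; anything that does
     not fit in 'buf' means the request got truncated */
  if(n < 0 || (size_t)n >= len)
    return -1;
  return 0;
}
```

A server that agrees answers with a 101 response and from then on the whole connection speaks websockets. It is exactly this request/response dance that has no counterpart defined for a single HTTP/2 stream.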

Note that websockets is not part of the HTTP/1 spec, it just uses a HTTP/1 protocol detail to switch an HTTP connection into a websockets connection. Websockets over HTTP/2 would similarly not be a part of the HTTP/2 specification but would be separate.

(As a side-note, that Upgrade: mechanism is the same mechanism a HTTP/1.1 connection can get upgraded to HTTP/2 if the server supports it – when not using HTTPS.)


Draft

There was once a draft submitted that describes how websockets over HTTP/2 could’ve been done. It didn’t get any particular interest in the IETF HTTP working group back then and as far as I’ve seen, there has been very little general interest in any group to pick up this dropped ball and continue running. It just didn’t go any further.

This is important: the lack of websockets over HTTP/2 is because nobody has produced a spec (and implementations) to do websockets over HTTP/2. Those things don’t happen by themselves, they actually require a bunch of people and implementers to believe in the cause and work for it.

Websockets over HTTP/2 could of course have the benefit that it would only be one stream over the connection that could serve regular non-websockets traffic at the same time in many other streams, while websockets upgraded on a HTTP/1 connection uses the entire connection exclusively.

Instead

So what do users do instead of using websockets over HTTP/2? Well, there are several options. You probably either stick to HTTP/2, upgrade from HTTP/1, use Web push or go the WebRTC route!

If you really need to stick to websockets, then you simply have to upgrade to that from a HTTP/1 connection – just like before. Most people I’ve talked to that are stuck really hard on using websockets are app developers that basically only use a single connection anyway so doing that HTTP/1 or HTTP/2 makes no meaningful difference.

Sticking to HTTP/2 pretty much allows you to go back and use the long-polling tricks of the past before websockets was created. They were once rather bad since they would waste a connection and be error-prone since you’d have a connection that would sit idle most of the time. Doing this over HTTP/2 is much less of a problem since it’ll just be a single stream that won’t be used that much so it isn’t that much of a waste. Plus, the connection may very well be used by other streams so it will be less of a problem with idle connections getting killed by NATs or firewalls.

The Web Push API was brought by W3C during 2015 and is in many ways a more “webby” way of doing push than the much more manual and “raw” method that websockets is. If you use websockets mostly for push notifications, then this might be a more convenient choice.

Also introduced after websockets, is WebRTC. This is a technique introduced for communication between browsers, but it certainly provides an alternative to some of the things websockets were once used for.

Future

Websockets over HTTP/2 could still be done. The fact that it isn’t done just shows that there isn’t enough interest.

Non-TLS

Recall how browsers only speak HTTP/2 over TLS, while websockets can also be done over plain TCP. In fact, the only way to upgrade a HTTP connection to websockets is using the HTTP/1 Upgrade: header trick, and not the ALPN method for TLS that HTTP/2 uses to reduce the number of round-trips required.

If anyone would introduce websockets over HTTP/2, they would then probably only be possible to be made over TLS from within browsers.

curl on windows versions

I had to ask. Just to get a notion of which Windows versions our users are actually using, so that we could get an indication which versions we still should make an effort to keep working on. As people download and run libcurl on their own, we just have no other ways to figure this out.

As always when asking our audience a question, we can’t really know which part of our users responded, and it is probably safer to assume that it is not a representative distribution of our actual user base. It is simply as good as it gets. A hint.

I posted about this poll on the libcurl mailing list and over twitter. I had it open for about 48 hours. We received 86 responses.

So, Windows 10, 8 and 7 are very well used and even Vista and XP clocked in fairly high on 14% and 23%. Clearly those are Windows versions we should strive to keep supported.

For Windows versions older than XP I was sort of hoping we’d get a zero, but as you can see in the graph we have users claiming to use curl on as old versions as Windows NT 4. I even checked, and it wasn’t the same two users that checked all those three oldest versions.

The “Other” marks were for Windows 2008 and 2012, and bonus points for the user who added “Other: debian 7”. It is interesting that I specifically asked for users running curl on Windows to answer this survey and yet 26% responded that they don’t use Windows at all.

everybody runs this code all the time

I was invited to talk about curl at the recent FOSS North conference in Gothenburg on May 26th. It was the first time the conference ran, but I think it went smoothly and the ~110 visitors seemed to have a good time. It was a single track and there was a fairly good and interesting mix of speakers and subjects I think. They’re already planning to make it return again in spring 2017, so if you’re into FOSS and you’re in the Nordic region, consider this event next year…

I took on the subject of talking about my hacker ring^W^Wcurl project insights. Here’s my slide set:

At the event I sat down and had a chat with Simon Campanello, a reporter at IDG Techworld here in Sweden who subsequently posted this article about curl (in Swedish) and how our code has ended up getting used so widely.


My URL isn’t your URL

URLs

When I started the precursor to the curl project, httpget, back in 1996, I wrote my first URL parser. Back then, the universal address was still called URL: Uniform Resource Locators. That spec was published by the IETF in 1994. The term “URL” was then used as source for inspiration when naming the tool and project curl.

The term URL was later effectively changed to become URI, Uniform Resource Identifiers (published in 2005) but the basic point remained: a syntax for a string to specify a resource online and which protocol to use to get it. We claim curl accepts “URLs” as defined by this spec, the RFC 3986. I’ll explain below why it isn’t strictly true.

There was also a companion RFC posted for IRI: Internationalized Resource Identifiers. They are basically URIs but allowing non-ascii characters to be used.

The WHATWG consortium later produced their own URL spec, basically mixing formats and ideas from URIs and IRIs with a (not surprisingly) strong focus on browsers. One of their expressed goals is to “Align RFC 3986 and RFC 3987 with contemporary implementations and obsolete them in the process“. They want to go back and use the term “URL” as they rightfully state, the terms URI and IRI are just confusing and no humans ever really understood them (or often even knew they exist).

The WHATWG spec follows the good old browser mantra of being very liberal in what it accepts, trying to guess what the users mean and bending over backwards trying to fulfill it. (Even though we all know by now that Postel’s Law is the wrong way to go about this.) It means it’ll handle too many slashes, embedded white space as well as non-ASCII characters.

From my point of view, the spec is also very hard to read and follow due to it not describing the syntax or format very much but focuses far too much on mandating a parsing algorithm. To test my claim: figure out what their spec says about a trailing dot after the host name in a URL.

On top of all these standards and specs, browsers offer an “address bar” (a piece of UI that often goes under other names) that allows users to enter all sorts of fun strings and they get converted over to a URL. If you enter “http://localhost/%41” in the address bar, it’ll convert the percent encoded part to an ‘A’ there for you (since 41 in hex is a capital A in ASCII) but if you type “http://localhost/A A” it’ll actually send “/A%20A” (with a percent encoded space) in the outgoing HTTP GET request. I’m mentioning this since people will often think of what you can enter there as a “URL”.

The above is basically my (skewed) perspective of what specs and standards we have so far to work with. Now we add reality and let’s take a look at what sort of problems we get when my URL isn’t your URL.

So what is a URL?

Or more specifically, how do we write them. What syntax do we use.

I think one of the biggest mistakes the WHATWG spec has made (and why you will find me argue against their spec in its current form with fierce conviction that they are wrong), is that they seem to believe that URLs are theirs to define and work with and they limit their view of URLs to browsers, HTML and their address bars. Sure, they are the big companies behind the browsers almost everyone uses and URLs are widely used by browsers, but URLs are still much bigger than that.

The WHATWG view of a URL is not widely adopted outside of browsers.

colon-slash-slash

If we ask users, ordinary people with no particular protocol or web expertise, what a URL is what would they answer? While it was probably more notable years ago when the browsers displayed it more prominently, the :// (colon-slash-slash) sequence will be high on the list. Seeing that marks the string as a URL.

Heck, going beyond users, there are email clients, terminal emulators, text editors, perl scripts and a bazillion other things out there in the world already that detect URLs for us and allow operations on them. It could be to open that URL in a browser, to convert it to a clickable link in generated HTML and more. A vast amount of said scripts and programs will use the colon-slash-slash sequence as a trigger.

The WHATWG spec says it has to be one slash and that a parser must accept an indefinite amount of slashes. “http:/example.com” and “http:////////////////////////////////////example.com” are both equally fine. RFC 3986 and many others would disagree. Heck, most people I’ve confronted the last few days, even people working with the web, seem to say, think and believe that a URL has two slashes. Just look closer at the google picture search screen shot at the top of this article, which shows the top images for “URL” google gave me.

We just know a URL has two slashes there (and yeah, file: URLs mostly have three but let’s ignore that for now). Not one. Not three. Two. But the WHATWG doesn’t agree.

“Is there really any reason for accepting more than two slashes for non-file: URLs?” (my annoyed question to the WHATWG)

“The fact that all browsers do.”

The spec says so because browsers have implemented the spec.

No better explanation has been provided, not even after I pointed out that the statement is wrong and far from all browsers do. You may find reading that thread educational.

In the curl project, we’ve just recently started debating how to deal with “URLs” having another amount of slashes than two because it turns out there are servers sending back such URLs in Location: headers, and some browsers are happy to oblige. curl is not and neither are a lot of other libraries and command line tools. Who do we stand up for?
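To show what “another amount of slashes than two” means for a parser, here is a small sketch that counts the slashes following the scheme. It is only an illustration of the check and not how curl does it; the function name slashes_after_scheme() is invented here, and the scheme syntax it accepts is the one from RFC 3986.

```c
#include <ctype.h>

/* Hypothetical helper: returns the number of slashes that directly follow
   "scheme:" in 'url', or -1 if the string does not start with something
   that looks like a scheme followed by a colon. */
int slashes_after_scheme(const char *url)
{
  const char *p = url;
  int count = 0;

  /* RFC 3986: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) */
  if(!isalpha((unsigned char)*p))
    return -1;
  while(isalnum((unsigned char)*p) || *p == '+' || *p == '-' || *p == '.')
    p++;
  if(*p != ':')
    return -1;
  p++; /* step over the colon */

  while(*p == '/') {
    count++;
    p++;
  }
  return count;
}
```

A strict client would only accept a return value of 2 for http: URLs, while a client that follows the browsers would treat anything from 1 and up as if it were 2.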

Spaces

A space character (the ASCII code 32, 0x20 in hex) cannot be part of a URL. If you want it sent, you percent encode it like you do with any other illegal character you want to be part of the URL. Percent encoding is the byte value in hexadecimal with a percent sign in front of it. %20 thus means space. It also means that a parser that for example scans for URLs in a text knows that it reaches the end of the URL when the parser encounters a character that isn’t allowed. Like space.
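As a rough sketch of that rule, the function below percent encodes every byte that is not in RFC 3986’s “unreserved” set. That choice is an assumption made for the example: it suits a single path segment or query value, since it also encodes slashes, and the name percent_encode() is made up here.

```c
#include <stddef.h>
#include <ctype.h>

/* Unreserved characters per RFC 3986 are passed through as-is */
static int is_unreserved(unsigned char c)
{
  return isalnum(c) || c == '-' || c == '.' || c == '_' || c == '~';
}

/* Hypothetical helper: percent-encode 'in' into 'out', which holds
   'outlen' bytes including room for the terminating zero.
   Returns 0 on success, -1 if 'out' is too small. */
int percent_encode(const char *in, char *out, size_t outlen)
{
  static const char hex[] = "0123456789ABCDEF";
  size_t o = 0;

  for(; *in; in++) {
    unsigned char c = (unsigned char)*in;
    if(is_unreserved(c)) {
      if(o + 1 >= outlen)
        return -1;
      out[o++] = (char)c;
    }
    else {
      /* a percent sign followed by the byte value as two hex digits */
      if(o + 3 >= outlen)
        return -1;
      out[o++] = '%';
      out[o++] = hex[c >> 4];
      out[o++] = hex[c & 0x0F];
    }
  }
  if(o >= outlen)
    return -1;
  out[o] = '\0';
  return 0;
}
```

Note that it works on bytes, not characters, so what comes out for a non-ASCII letter depends entirely on which encoding the input string used.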

Browsers typically show the address in their address bars with all %20 instances converted to space for appearance. If you copy the address there into your clipboard and then paste it again in your text editor you still normally get the spaces as %20 like you want them.

I’m not sure if that is the reason, but browsers also accept spaces as part of URLs when for example receiving a redirect in a HTTP response. That’s passed from a server to a client using a Location: header with the URL in it. The browsers happily allow spaces in that URL, encode them as %20 and send out the next request. This forced curl into accepting spaces in redirected “URLs”.

Non-ASCII

Making URLs support non-ASCII languages is of course important, especially for non-western societies and I’ve understood that the IRI spec was never good enough. I personally am far from an expert on these internationalization (i18n) issues so I just go by what I’ve heard from others. But of course users of non-latin alphabets and typing systems need to be able to write their “internet addresses” to resources and use as links as well.

In an ideal world, we would have the i18n version shown to users and there would be the encoded ASCII based version below, to get sent over the wire.

For international domain names, the name gets converted over to “punycode” so that it can be resolved using the normal system name resolvers that know nothing about non-ascii names. URIs have no IDN names, IRIs do and WHATWG URLs do. curl supports IDN host names.

WHATWG states that URLs are specified as UTF-8 while URIs are just ASCII. curl gets confused by non-ASCII letters in the path part but percent encodes such byte values in the outgoing requests – which causes “interesting” side-effects when the non-ASCII characters are provided in other encodings than UTF-8 which for example is standard on Windows…

Similar to what I’ve written above, this leads to servers passing back non-ASCII byte codes in HTTP headers that browsers gladly accept, and non-browsers need to deal with…

No URL standard

I’ve not tried to write a conclusive list of problems or differences, just a bunch of things I’ve fallen over recently. A “URL” given in one place is certainly not certain to be accepted or understood as a “URL” in another place.

Not even curl follows any published spec very closely these days, as we’re slowly digressing for the sake of “web compatibility”.

There’s no unified URL standard and there’s no work in progress towards that. I don’t count WHATWG’s spec as a real effort either, as it is written by a closed group with no real attempts to get the wider community involved.

My affiliation

I’m employed by Mozilla and Mozilla is a member of WHATWG and I have colleagues working on the WHATWG URL spec and other work items of theirs but it makes absolutely no difference to what I’ve written here. I also participate in the IETF and I consider myself friends with authors of RFC 1738, RFC 3986 and others but that doesn’t matter here either. My opinions are my own and this is my personal blog.

fcurl is fread and friends for URLs

This whole family of functions, fopen, fread, fwrite, fgets, fclose and more are defined in the C standard since C89. You can’t really call yourself a C programmer without knowing them and probably even using them in at least a few places.

The charm with these is that they’re standard, they’re easy to use and they’re available everywhere where there’s a C compiler.

A basic example that just reads a file from disk and writes it to stdout could look like this:

FILE *file;

file = fopen("hello.txt", "r");
if(file) {
  char buffer [256];
  while(1) {
    /* ask for sizeof(buffer) items of one byte each, so that rc
       is the number of bytes actually read */
    size_t rc = fread(buffer, 1, sizeof(buffer),
                file);
    if(rc > 0)
      fwrite(buffer, 1, rc, stdout);
    else
      break;
  }
  fclose(file);
}

Imagine you’d like to switch this example, or one of your actual real world programs that use the fopen() family of functions to read or write files, to instead read and write files from and to the Internet using your favorite Internet protocols. How would you do that without having to change your code a lot and do a major refactoring job?

Enter fcurl

I’ve started to work on a library that provides a look-alike API with matching functions and behaviors, but where the fopen() counterpart takes a URL instead of a file name. I call it fcurl. (Much inspired by the libcurl example fopen.c, which I wrote the first version of already back in 2002!)

It is of course open source and is powered by libcurl.

The project is in its early infancy. I think it would be interesting to try it out and I’ve mentioned the idea to a few people that have shown interest. I really can’t make this happen all on my own anyway so while I’ve created a first embryo, it will take some time before it gets truly useful. Help from others would be greatly appreciated of course.

Using this API, a version of the above example that reads data from a HTTPS site instead of a local file could look like:

FCURL *file;

file = fcurl_open("https://daniel.haxx.se/",
                  "r");
if(file) {
  char buffer [256];
  while(1) {
    /* same argument order as fread(): one-byte items, so that rc
       is the number of bytes actually read */
    size_t rc = fcurl_read(buffer, 1,
                           sizeof(buffer),
                           file);
    if(rc > 0)
      fwrite(buffer, 1, rc, stdout);
    else
      break;
  }
  fcurl_close(file);
}

And it could even actually also read a local file using the file:// scheme.

Drop-in replacement

The idea here is to make the alternative functions have new names but as far as possible accept the same input arguments, return the same return codes and so on.

If we do it right, you could possibly even convert an existing program with just a set of #defines at the top without even having to change the code!

Something like this:

#define FILE FCURL
#define fopen(x,y) fcurl_open(x, y)
#define fclose(x) fcurl_close(x)

I think it is worth considering a way to provide an official macro set like that for those who’d like to switch easily (?) and quickly.

Fun things to consider

1. for non-scheme input, use normal fopen?

An interesting take is probably to make fcurl_open() treat input specified without a “scheme://” to be a local file, and then passed to fopen() instead under the hood. That would then enable even more code to switch to fcurl since all the existing use cases with local file names would just continue to work.
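The decision itself could be as small as the sketch below. This is not code from fcurl, just an illustration of the idea under the assumption that “has a scheme” means the name starts with an RFC 3986 style scheme followed by “://”; the name has_scheme() is invented for the example.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: returns 1 if 'name' starts with "scheme://",
   0 if it should be treated as a plain local file name. */
int has_scheme(const char *name)
{
  const char *sep = strstr(name, "://");
  const char *p;

  if(!sep || sep == name)
    return 0; /* no separator, or nothing in front of it */
  if(!isalpha((unsigned char)name[0]))
    return 0; /* a scheme must start with a letter */
  for(p = name + 1; p < sep; p++) {
    if(!isalnum((unsigned char)*p) && *p != '+' && *p != '-' && *p != '.')
      return 0; /* for example a slash, as in "dir/a://b" */
  }
  return 1;
}
```

An open function could then hand the name to fopen() when this returns 0 and to the URL machinery otherwise. Requiring the full “://” also keeps Windows style names such as C:\temp\a.txt on the local file side.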

2. LD_PRELOAD

An interesting area of deeper research around this could be to provide a way to LD_PRELOAD replacements for the functions so that no source code would need to be changed at all and already built existing binaries could be given this functionality.

3. fopencookie

There’s also the GNU libc’s fopencookie concept to figure out if that is something for fcurl to support/use. BSD and OS X have something similar called funopen.

4. merge in official libcurl

If this turns out useful, appreciated and good, we could consider moving the API in under the curl project’s umbrella and possibly eventually even making it part of the actual libcurl. But hey, we’re far away from that and I’m not saying that is even the best idea…

Your input is valuable

Please file issues or pull-requests. Let’s see where we can take this!

HTTP/2 in April 2016

On April 12 I had the pleasure of doing another talk in the Google Tech Talk series arranged in the Google Stockholm offices. I had given it the title “HTTP/2 is upon us, and here’s what you need to know about it.” in the invitation.

The room seated 70 persons but we had the amazing amount of over 300 people in the waiting line who unfortunately didn’t manage to get a seat. To those, and to anyone else who cares, here’s the video recording of the event.

If you’ve seen me talk about HTTP/2 before, you might notice that I’ve refreshed the material somewhat since before.

Summers are for HTTP

Stockholm City, as photographed by Michael Caven

In July 2015, 40-something HTTP implementers and experts of the world gathered in the city of Münster, Germany, to discuss nitty gritty details about the HTTP protocol during four intense days. Representatives for major browsers, other well used HTTP tools and the most popular HTTP servers were present. We discussed topics like how HTTP/2 had done so far, what we thought we should fix going forward and even some early blue sky talk about what people could potentially see being subjects to address in a future HTTP/3 protocol.

You can relive the 2015 version somewhat from my daily blog entries from then that include a bunch of details of what we discussed: day one, two, three and four.

The HTTP Workshop was much appreciated by the attendees and it is now about to be repeated. In the summer of 2016, the HTTP Workshop is again taking place in Europe, but this time as a three-day event slightly further up north: in the capital of Sweden and my home town: Stockholm. During 25-27 July 2016, we intend to again dig in deep.

If you feel this is something for you, then please head over to the workshop site and submit your proposal and show your willingness to attend. This year, I’m also joining the Program Committee and I’ve signed up for arranging some of the local stuff required for this to work out logistically.

The HTTP Workshop 2015 was one of my favorite events of last year. I’m now eagerly looking forward to this year’s version. It’ll be great to meet you here!

The city of Stockholm in summer sunshine

HTTP redirects

I find that many web minded people working client-side or even server-side have neglected to learn the subtle details of the redirects of today. Here’s my attempt at writing another text about it that the ones who should read it still won’t.

Nothing here, go there!

The “redirect” is a fundamental part of the HTTP protocol. The concept was present and is documented already in the first spec (RFC 1945), published in 1996, and it has remained well used ever since.

A redirect is exactly what it sounds like. It is the server sending back an instruction to the client – instead of giving back the contents the client wanted. The server basically says “go look over [here] instead for that thing you asked for“.

But not all redirects are alike. How permanent is the redirect? What request method should the client use in the next request?

All redirects also need to send back a Location: header with the new URI to ask for, which can be absolute or relative.

Permanent or Temporary

Is the redirect meant to last or just remain for now? If you want a GET to resource A permanently redirect users to resource B with another GET, send back a 301. It also means that the user-agent (browser) is meant to cache this and keep going to the new URI from now on when the original URI is requested.

The temporary alternative is 302. Right now the server wants the client to send a GET request to B, but it shouldn’t cache this but keep trying the original URI when directed to it.

Note that both 301 and 302 will make browsers do a GET in the next request, which possibly means changing method if it started with a POST (and only if POST). This changing of the HTTP method to GET for 301 and 302 responses is said to be “for historical reasons”, but that’s still what browsers do so most of the public web will behave this way.

In practice, the 303 code is very similar to 302. It will not be cached and it will make the client issue a GET in the next request. The differences between a 302 and 303 are subtle, but 303 seems to be more designed for an “indirect response” to the original request rather than just a redirect.

These three codes were the only redirect codes in the HTTP/1.0 spec.

GET or POST?

All three of these response codes, 301 and 302/303, will assume that the client sends a GET to get the new URI, even if the client might’ve sent a POST in the first request. This is very important, at least if you do something that doesn’t use GET.

If the server instead wants to redirect the client to a new URI and wants it to send the same method in the second request as it did in the first, like if it first sent POST it’d like it to send POST again in the next request, the server would use different response codes.

To tell the client “the URI you sent a POST to, is permanently redirected to B where you should instead send your POST now and in the future”, the server responds with a 308. And to complicate matters, the 308 code is only recently defined (the spec was published in June 2014) so older clients may not treat it correctly! If so, then the only response code left for you is…

The (older) response code to tell a client to send a POST also in the next request but temporarily is 307. This redirect will not be cached by the client though so it’ll again post to A if requested to. The 307 code was introduced in HTTP/1.1.

Oh, and redirects work the exact same way in HTTP/2 as they do in HTTP/1.1.

The helpful table version

                      Permanent  Temporary
Switch to GET         301        302 and 303
Keep original method  308        307
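The same table can be written down as a tiny decision function. This is a sketch of the simplified rules in the table only, not of any particular client’s actual logic; the struct and the name follow_redirect() are invented for the example.

```c
/* Hypothetical result type for the sketch below */
struct redirect_action {
  int is_redirect;          /* 0 if the code is not one of the five in the table */
  int permanent;            /* 1 for 301 and 308, 0 otherwise */
  const char *next_method;  /* method to use in the follow-up request */
};

/* Map a response code and the method of the original request to what
   the table says the client should do next. */
struct redirect_action follow_redirect(int status, const char *method)
{
  struct redirect_action a;
  a.is_redirect = 1;
  a.permanent = 0;
  a.next_method = method; /* keep the original method unless told otherwise */

  switch(status) {
  case 301:
    a.permanent = 1;
    a.next_method = "GET";
    break;
  case 302:
  case 303:
    a.next_method = "GET";
    break;
  case 307:
    break;
  case 308:
    a.permanent = 1;
    break;
  default:
    a.is_redirect = 0; /* 304, 305, 306 and everything else */
    break;
  }
  return a;
}
```

So a POST answered with a 301 becomes a GET that may be remembered, while the same POST answered with a 307 is sent again as a POST and the original URI is tried again next time.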

It’s a gap!

Yes. The 304, 305, and 306 codes are not used for redirects at all.

What about other HTTP methods?

I decided to simplify the explanation above. In all places where it says POST above, you can replace it with any non-GET method. They’re just slightly less common on the browser centric web.

curl and redirects

I couldn’t write a text like this without spicing it up with some curl details!

First, curl and libcurl don’t follow redirects by default. You need to ask curl to do it with -L (or --location) or libcurl with CURLOPT_FOLLOWLOCATION.

It turns out that there are web services out there in the world that want a POST sent, are responding with HTTP redirects that use a 301, 302 or 303 response code and still want the HTTP client to send the next request as a POST. As explained above, browsers won’t do that and neither will curl – by default.

Since these setups exist, and they’re actually not terribly rare, curl offers options to alter its behavior.

You can tell curl to not change the non-GET request method to GET after a 30x response by using the dedicated options for that: --post301, --post302 and --post303. If you are instead writing a libcurl based application, you control that behavior with the CURLOPT_POSTREDIR option.

Here’s what a simple HTTP/1.1 redirect can look like. Note the 301, this is “permanent”:
[screenshot: curl showing a redirect]