dotdot removal in libcurl 7.32.0

Tuesday, July 30th, 2013

Allow as much as possible and only sanitize what’s absolutely necessary.

That has basically been the rule for the URL parser in curl and libcurl since the project was started in the 90s. The upside with this is that you can use curl to torture your web servers with tests and you can handicraft really imaginary stuff to send and thus subsequently to receive. It kind of assumes that the user truly gives curl a URL the user wants to use.

Why would you give curl a broken URL?

But of course life and internet protocols, and perhaps in particular HTTP, is more involved than that. It soon becomes more complicated.

Redirects

Everyone who’s writing a web user-agent based on RFC 2616 soon faces the fact that redirects based on the Location: header is a source of fun and head-scratching. It is defined in the spec as only allowing “absolute URLs” but the reality is that they were also provided as relative ones by web servers already from the start so the browsers of course support that (and the pending HTTPbis document is already making this clear). curl thus also adopted support for relative URLs, meaning the ability to “merge” or “add” a relative URL onto a previously used absolute one had to be implemented. And even illegally constructed URLs are done this way and in the grand tradition of web browsers, they have not tried to stop users from doing bad things, they have instead adapted and now instead try to convert it to what the user could’ve meant. Like for example using a white space within the URL you send in a Location: header. Even curl has to sanitize that so that it works more like the browsers.

Relative path segments

The path part of URLs are truly to be seen as a path, in that it is a hierarchical scheme where each slash-separated part adds a piece. Like “/first/second/third.html”

As it turns out, you can also include modifiers in the path that have special meanings. Like the “..” (two dots or periods next to each other) known from shells and command lines to mean “one directory level up” can also be used in the path part of a URL like “/one/three/../two/three.html” which equals “/one/two/three.html” when the dotdot sequence is handled. This dot removal procedure is documented in the generic URL specification RFC 3986 (published January 2005) and is completely protocol agnostic. It works like this for HTTP, FTP and every other protocol you provide a path part for.

In its traditional spirit of just accepting and passing along, curl didn’t use to treat “dotdots” in any particular way but handed it over to the server to deal with. There probably aren’t that terribly many such occurrences either so it never really caused any problems or made any users hit any particular walls (or they were too shy to report it); until one day back in February this year… so we finally had to do something about this. Some 8 years after the spec saying it must be done was released.

dotdot removal

Alas, libcurl 7.32.0 now features (once it gets released around August 12th) full traversal and handling of such sequences in the path part of URLs. It also includes single dot sequences like in “/one/./two”. libcurl will detect such uses and convert the path to a sequence without them and continue on. This of course will cause a limited altered behavior for the possible small portion of users out there in the world who would use dotdot sequences and actually want them to get sent as-is the way libcurl has been doing it. I decided against adding an option for disabling this behavior, but of course if someone would experience terrible pain and can reported about it convincingly to us we could possible reconsider that decision in the future.

I suspect (and hope) this will just be another little change along the way that will make libcurl act more standard and more like the browsers and thus cause less problems to users but without people much having to care about how or why.

Further reading: the dotdot.c file from the libcurl source tree!

Bonus kit

A dot to dot surprise drawing for you and your kids (click for higher resolution)

curl dot-to-dot

curl and new technologies

Wednesday, March 6th, 2013

At March 5 2013 we had another foss-sthlm meetup. The 12th one in fact, and out of the five talks during the event I spoke about curl and new technologies. Here are the slides from my talk:

HTTP2, SPDY and spindly right now

Saturday, December 1st, 2012

SPDYOn November 28, the HTTPbis group within the IETF published the first draft for the upcoming HTTP2 protocol. What is being posted now is a start and a foundation for further discussions and changes. It is basically an import of the SPDY version 3 protocol draft.

There’s been a lot of resistance within the HTTPbis to the mandated TLS that SPDY has been promoting so far and it seems unlikely to reach a consensus as-is. There’s also been a lot of discussion and debate over the compression SPDY uses. Not only because of the pre-populated dictionary that might already be a little out of date or the fact that gzip compression consumes a notable amount of memory per stream, but also recently the security aspect to compression thanks to the CRIME attack.

Meanwhile, the discussions on the spdy development list have brought up several changes to the version 3 that are suggested and planned to become part of the version 4 that is work in progress. Including a new compression algorithm, shorter length fields (now 16bit) and more. Recently discussions have brought up a need for better flexibility when it comes to prioritization and especially changing prio run-time. For like when browser users switch tabs or simply scroll down the page and you rather have the images you have in sight to load before the images you no longer have in view…

I started my work on Spindly a little over a year ago to build a stand-alone library, primarily intended for libcurl so that we could soon offer SPDY downloads for it. We’re still only on SPDY protocol 2 there and I’ve failed to attract any fellow developers to the project and my own lack of time has basically made the project not evolve the way I wanted it to. I haven’t given up on it though. I hope to be able to get back to it eventually, very much also depending on how the HTTPbis talk goes. I certainly am determined to have libcurl be part of the upcoming HTTP2 experiments (even if that is not happening very soon) and spindly might very well be the infrastructure that powers libcurl then.

We’ll see…

HTTP2 Expression of Interest: curl

Friday, July 13th, 2012

For the readers of my blog, this is a copy of what I posted to the httpbis mailing list on July 12th 2012.

Hi,

This is a response to the httpis call for expressions of interest

BACKGROUND

I am the project leader and maintainer of the curl project. We are the open source project that makes libcurl, the transfer library and curl the command line tool. It is among many things a client-side implementation of HTTP and HTTPS (and some dozen other application layer protocols). libcurl is very portable and there exist around 40 different bindings to libcurl for virtually all languages/enviornments imaginable. We estimate we might have upwards 500
million users
or so. We’re entirely voluntary driven without any paid developers or particular company backing.

HTTP/1.1 problems I’d like to see adressed

Pipelining – I can see how something that better deals with increasing bandwidths with stagnated RTT can improve the end users’ experience. It is not easy to implement in a nice manner and provide in a library like ours.

Many connections – to avoid problems with pipelining and queueing on the connections, many connections are used and and it seems like a general waste that can be improved.

HTTP/2.0

We’ve implemented HTTP/1.1 and we intend to continue to implement any and all widely deployed transport layer protocols for data transfers that appear on the Internet. This includes HTTP/2.0 and similar related protocols.

curl has not yet implemented SPDY support, but fully intends to do so. The current plan is to provide SPDY support with the help of spindly, a separate SPDY library project that I lead.

We’ve selected to support SPDY due to the momentum it has and the multiple existing implementaions that A) have multi-company backing and B) prove it to be a truly working concept. SPDY seems to address HTTP’s pipelining and many-connections problems in a decent way that appears to work in reality too. I believe SPDY keeps enough HTTP paradigms to be easily upgraded to for most parties, and yet the ones who can’t or won’t can remain with HTTP/1.1 without too much pain. Also, while Spindly is not production-ready, it has still given me the sense that implementing a SPDY protocol engine is not rocket science and that the existing protocol specs are good.

By relying on external libs for protocol and implementation details, my hopes is that we should be able to add support for other potentially coming HTTP/2.0-ish protocols that gets deployed and used in the wild. In the curl project we’re unfortunately rarely able to be very pro-active due to the nature of our contributors, which tends to make us follow the rest and implement and go with what others have already decided to go with.

I’m not aware of any competitors to SPDY that is deployed or used to any particular and notable extent on the public internet so therefore no other ”HTTP/2.0 protocol” has been considered by us. The two biggest protocol details people will keep mention that speak against SPDY is SSL and the compression requirements, yet I like both of them. I intend to continue to participate in dicussions and technical arguments on the ietf-http-wg mailing list on HTTP details for as long as I have time and energy.

HTTP AUTH

curl currently supports Basic, Digest, NTLM and Negotiate for both host and proxy.

Similar to the HTTP protocol, we intend to support any widely adopted authentication protocols. The HOBA, SCRAM and Mutual auth suggestions all seem perfectly doable and fine in my perspective.

However, if there’s no proper logout mechanism provided for HTTP auth I don’t forsee any particular desire from browser vendor or web site creators to use any of these just like they don’t use the older ones either to any significant extent. And for automatic (non-browser) uses only, I’m not sure there’s motivation enough to add new auth protocols to HTTP as at least historically we seem to rarely be able to pull anything through that isn’t pushed for by at least one of the major browsers.

The “updated HTTP auth” work should be kept outside of the HTTP/2.0 work as far as possible and similar to how RFC2617 is separate from RFC2616 it should be this time around too. The auth mechnism should not be too tightly knit to the HTTP protocol.

The day we can clense connection-oriented authentications like NTLM from the HTTP world will be a happy day, as it’s state awareness is a pain to deal with in a generic HTTP library and code.

Travel for fun or profit

Thursday, March 15th, 2012

As a protocol geek I love working in my open source projects curl, libssh2, c-ares and spindly. I also participate in a few related IETF working groups around these protocols, and perhaps primarily I enjoy the HTTPbis crowd.

Meanwhile, I’m a consultant during the day and most of my projects and assignments involve embedded systems and primarily embedded Linux. The protocol part of my life tends to be left to get practiced during my “copious” amount of spare time – you know that time after your work, after you’ve spent time with your family and played with your kids and done the things you need to do at home to keep the household in a decent shape. That time when the rest of the family has gone to bed and you should too but if you did when would you ever get time to do that fun things you really want to do?

IETF has these great gatherings every now and then and they’re awesome places to just drown in protocol mumbo jumbo for several days. They’re being hosted by various cities all over the world so often I deem them too far away or too awkward to go to, also a lot because I rarely have any direct monetary gain or compensation for going but rather I’d have to do it as a vacation and pay for it myself.

IETF 83 is going to be held in Paris during March 25-30 and it is close enough for me to want to go and HTTPbis and a few other interesting work groups are having scheduled meetings. I really considered going, at least to meet up with HTTP friends.

Something very rare instead happened that prevents me from going there! My customer (for whom I work full-time since about six months and shall remain nameless for now) asked me to join their team and go visit the large embedded conference ESC in San Jose, California in the exact same week! It really wasn’ t a hard choice for me, since this is my job and being asked to do something because I’m wanted is a nice feeling and position – and they’re paying me to go there. It will also be my first time in California even though I guess I won’t get time to actually see much of it.

I hope to write a follow-up post later on about what I’m currently working with, once it has gone public.

Pointless respecifying FTP URI

Tuesday, May 24th, 2011

There’s this person wiIETFthin IETF who seems to possess endless energy and a never-ending wish to clean up tiny details within the IETF processes. He continuously digs up specifications that need to be registered or submitted again somewhere due to some process. Often under loud protests from fellow IETFers since it steals time and energy from people on the lists for discussions and reviews – only to satisfy some administrative detail. Time and energy perhaps better spent on things like new protocols and interesting new technologies.

This time, he has deemed that the FTP a FILE URI specs need to be registered properly, and alas he has submitted his first suggested update of the FTP URI.

From my work with curl I have of course figured out a few problems with RFC1738 that I don’t think we should just repeat in a new version of the spec. It turns out I’m not alone in thinking this work isn’t really good like this, and I posted a second mail to clarify my points.

We’re not working on fixing the problems with FTP URIs that are present in RFC1738 so just rephrasing those into a new spec is a bad idea.

We could possibly start the work on fixing the problems, but so far I’ve seen no such will or efforts and I don’t plan on pushing for that myself either.

Please tell me or the ftpext2 group where I or the others are wrong or right or whatever!

The cookie RFC 6265

Thursday, April 28th, 2011

http://www.rfc-editor.org/rfc/rfc6265.txt is out!

Back when I was a HTTP rookie in the late 90s, I once expected that there was this fine RFC document somewhere describing how to do HTTP cookies. I was wrong. A lot of others have missed that document too, both before and after my initial search.

I was wrong in the sense that sure there were RFCs for cookies. There were even two of them (RFC2109 and RFC2965)! The only sad thing was however that both of them were totally pointless as in effect nobody (servers nor clients) implemented cookies like that so they documented idealistic protocols that didn’t exist in the real world. This sad state has made people fall into cookie problems all the way into modern days when they’ve implemented services according to those RFCs and then blame their browser for failing.

cookie

It turned out that the only document that existed that were being used, was the original Netscape cookie document. It can’t even be called a specification because it is so short and is so lacking in details that it leaves large holes open and forces implementors to guess about the missing pieces. A sweet irony in itself is the fact that even Netscape removed the document from their site so the only place to find this document is at archive.org or copies like the one I link to above at the curl.haxx.se site. (For some further and more detailed reading about the history of cookies and a bunch of the flaws in the protocol/design, I recommend Michal Zalewski’s excellent blog post HTTP cookies, or how not to design protocols.)

While HTTP was increasing in popularity as a protocol during the 00s and still is, and more and more stuff get done in browsers and everything and everyone are using cookies, the protocol was still not documented anywhere as it was actually used.

Somewhat modeled after the httpbis working group (which is working on updating and bugfixing the HTTP 1.1 spec), IETF setup a mailing list named httpstate in the early 2009 to start discussing what problems there are with cookies and all related matters. After lively discussions throughout the year, the working group with the same name as the mailinglist was founded at December 11th 2009.

One of the initial sparks to get the httpstate group going came from Bill Corry who said this about the start:

In late 2008, Jim Manico and I connected to create a specification for
HTTPOnly — we saw the security issues arising from how the browser vendors
were implementing HTTPOnly in varying ways[1] due to a lack of a specification
and formed an ad-hoc working group to tackle the issue[2].
When I approached the IETF about forming a charter for an official working
group, I was told that I was <quote> “wasting my time” because cookies itself
did not have a proper specification, so it didn’t make sense to work on a spec
for HTTPOnly.  Soon after, we pursued reopening the IETF httpstate Working
Group to tackle the entire cookie spec, not just HTTPOnly.  Eventually Adam
Barth would become editor and Jeff Hodges our chair.

In late 2008, Jim Manico and I connected to create a specification for HTTPOnly — we saw the security issues arising from how the browser vendors were implementing HTTPOnly in varying ways[1] due to a lack of a specification and formed an ad-hoc working group to tackle the issue[2].

When I approached the IETF about forming a charter for an official working group, I was told that I was <quote> “wasting my time” because cookies itself did not have a proper specification, so it didn’t make sense to work on a spec for HTTPOnly.  Soon after, we pursued reopening the IETF httpstate Working Group to tackle the entire cookie spec, not just HTTPOnly.  Eventually Adam Barth would become editor and Jeff Hodges our chair.

Since then Adam Barth has worked fiercely as author of the specification and lots of people have joined in and contributed their views, comments and experiences, and we have over time really nailed down how cookies work in the wild today. The current spec now actually describes how to send and receive cookies, the way it is done by existing browsers and clients. Of course, parts of this new spec say things I don’t think it should, like how it deals with the order of cookies in headers, but as everything in life we needed to compromise and I seemed to be rather lonely on my side of that “fence”.

I must stress that the work has only involved to document how things work today and not to invent or create anything new. We don’t fix any of the many known problems with cookies, but we describe how you write your protocol implementation if you want to interact fine with existing infrastructure.

The new spec explicitly obsoletes the older RFC2965, but doesn’t obsolete RFC2109. That was done already by RFC2965. (I updated this paragraph after my initial post.)

Oh, and yours truly is mentioned in the ending “acknowledgements” section. It’s actually the second RFC I get to be mentioned in, the first being RFC5854.

Future

I am convinced that I will get reason to get back to the cookie topic soon and describe what is being worked on for the future. Once the existing cookies have been documented, there’s a desire among people to design something that overcomes the problems with the existing protocol. Adam’s CAKE proposal being one of the attempts and ideas in the pipe.

Another parallel IETF effort is the http-auth mailing list in which lots of discussions around HTTP authentication is being held, and as they often today involve cookies there’s a lot of talk about them there as well. See for example Timothy D. Morgan’s document Weaning the Web off of Session Cookies.

I’ll certainly track the development. And possibly even participate in shaping how this will go. We’ll see.

(cookie image source)

Cookies and Websockets and HTTP headers

Tuesday, February 1st, 2011

So yesterday we held a little HTTP-related event in Stockholm, arranged by OWASP Sweden. We talked a bit about cookies, websockets and recent HTTP headers.

Below are all the slides from the presentations I, Martin Holst Swende and John Wilanders did. (The entire event was done in Swedish.)

Martin Holst Swende’s talk:

John Wilander’s slides from his talk are here:

Byte ranges for FTP

Monday, December 20th, 2010

In the IETF ftpext2 working group there have been some talks around clients’ and servers’ ability to do and support “ranged” file transfers, that is transferring only a piece of any given file. FTP supports the REST command and has done so since the dawn of man (RFC765 – June 1980), and using that command, a client can set the starting point for a transfer but there is no way to set the end point. HTTP has supported the Range: header since the first HTTP 1.1 spec back in January 1997, and that supports both a start and an end point. The HTTP header does in fact support multiple ranges within the same header, but let’s not overdo it here!

Currently, to avoid getting an entire file a client would simply close the data connection when it has got all the data it wants. The unfortunate reality is that some servers don’t notice clients doing this, so in order for this to work reliably a client also has to send ABOR, and after this command has been sent there is no way for the client to reliably figure out the state of the control connection so it has to get closed as well (which is crap in case more files are to be transferred to or from the same host). It primarily becomes unreliable because when ABOR is sent, the client gets one or two responses back due to a race condition between the closing and the actual end of transfer etc, and it isn’t possible to tell exactly how to continue.

A solution for the future is being worked on. I’ve joined up the effort to write a spec that will suggest a new FTP command that sets the end point for a transfer in the same vein REST sets the start point. For the moment, we’ve named our suggested command RANG (as short for range). “We” in this context means Tatsuhiro Tsujikawa, Anthony Bryan and myself but we of course hope to get further valuable feedback by the great ftpext2 people.

There already are use cases that want range request for FTP. The people behind metalinks for example want to download the same file from many servers, and then it makes sense to be able to download little pieces from different sources.

The people who found the libcurl bugs I linked to above use libcurl as part of the Fedora/Redhat installer Anaconda, and if I understand things right they use this feature to just get the beginning of some files to check them out and avoid having to download the full file before it knows it truly wants it. Thus it saves lots of bandwidth.

In short, the use-cases for ranged FTP retrievals are quite likely pretty much the same ones as they are for HTTP!

The first RANG draft is now available.

Making SFTP transfers fast

Wednesday, December 8th, 2010

SFTP, the SSH File Transfer Protocol, is a misleading name. It gives you the impression that it might be something like a secure version of FTP, perhaps something like FTPS but modeled over SSH instead of SSL. But it isn’t!The OpenSSH fish

I think a more suitable name would’ve been SNFS or FSSSH. That is: networked file system operations over SSH, as that is in fact what SFTP is. The SFTP protocol is closer to NFS in nature than FTP. It is a protocol for sending and receiving binary packets over a (secure) SSH channel to read files, write files, and so on. But not on the basis of entire files, like FTP, but by sending OPEN file as FILEHANDLE, “WRITE this piece of data at OFFSET using FILEHANDLE” etc.

SFTP was being defined by a working group with IETF but the effort died before any specification was finalized. I wasn’t around then so I don’t know how this happened. During the course of their work, they released several drafts of the protocol using different protocol versions. Version 3, 4, 5 and 6 are the ones most used these days. Lots of SFTP implementations today still only implement the version 3 draft. (like libssh2 does for example)

Each packet in the SFTP protocol gets a response from the server to acknowledge it was received. It also includes an error code etc. So, the basic concept to write a file over SFTP is:

[client] OPEN <filehandle>
[server] OPEN OK
[client] WRITE <data> <filehandle> <offset 0> <size N>
[server] WRITE OK
[client] WRITE <data> <filehandle> <offset N> <size N>
[server] WRITE OK
[client] WRITE <data> <filehandle> <offset N*2> <size N>
[server] WRITE OK
[client] CLOSE <filehandle>
[server] CLOSE OK

This example obviously assumes the whole file was written in three WRITE packets. A single SFTP packet cannot be larger than 32768 bytes so if your client could read the entire file into memory, it can only send it away using very many small chunks. I don’t know the rationale for selecting such a very small maximum packet size, especially since the SSH channel layer over which SFTP packets are transferred over doesn’t have the same limitation but allows much larger ones! Interestingly, if you send a READ of N bytes from the server, you apparently imply that you can deal with packets of that size as then the server can send packets back that are N bytes (plus header)…

Enter network latency.

More traditional transfer protocols like FTP, HTTP and even SCP work on entire files. Roughly like “send me that file and keep sending until the entire thing is sent”. The use of windowing in the transfer layer (TCP for FTP and HTTP and within the SSH channels for SCP) allows flow control to work without having to ACK every single little packet. This is a great concept to keep the flow going at high speed and still allow the receiver to not get drowned. Even if there’s a high network latency involved.

The nature of SFTP and its ACK for every small data chunk it sends, makes an initial naive SFTP implementation suffer badly when sending data over high latency networks. If you have to wait a few hundred milliseconds for each 32KB of data then there will never be fast SFTP transfers. This sort of naive implementation is what libssh2 has offered up until and including libssh2 1.2.7.

To achieve speedy transfers with SFTP, we need to “pipeline” the packets. We need to send out several packets before we expect the answers to previous ones, to make the sending of an SFTP packet and the checking of the corresponding ACKs asynchronous. Like in the above example, we would send all WRITE commands before we wait for/expect the ACKs to come back from the server. Then the round-trip time essentially becomes a non-factor (or at least a very small one).

libssh2

We’ve worked on implementing this kind of pipelining for SFTP uploads in libssh2 and it seems to have paid off. In some measurements libssh2 is now one of the faster SFTP clients.

In tests I did over a high-latency connection, I could boost libssh2’s SFTP upload performance 8 (eight) times compared to the former behavior. In fact, that’s compared to earlier git behavior, comparing to the latest libssh2 release version (1.2.7) would most likely show an even greater difference.

My plan is now to implement this same concept for SFTP downloads in libssh2, and then look over if we shouldn’t offer a slightly modified API to allow applications to use pipelined transfers better and easier.