
Byte ranges for FTP

In the IETF ftpext2 working group there have been some talks around clients’ and servers’ ability to do and support “ranged” file transfers, that is transferring only a piece of any given file. FTP supports the REST command and has done so since the dawn of man (RFC765 – June 1980), and using that command, a client can set the starting point for a transfer but there is no way to set the end point. HTTP has supported the Range: header since the first HTTP 1.1 spec back in January 1997, and that supports both a start and an end point. The HTTP header does in fact support multiple ranges within the same header, but let’s not overdo it here!

Currently, to avoid getting an entire file a client simply closes the data connection when it has received all the data it wants. The unfortunate reality is that some servers don’t notice clients doing this, so to make this work reliably a client also has to send ABOR. After that command has been sent there is no way for the client to reliably figure out the state of the control connection, so it has to be closed as well (which is crap in case more files are to be transferred to or from the same host). It becomes unreliable primarily because when ABOR is sent, the client gets one or two responses back due to a race between the closing and the actual end of transfer, and it isn’t possible to tell exactly how to continue.
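That early-close workaround is also what a libcurl user gets today when asking for only a piece of a file over FTP, if I remember our implementation correctly. A minimal sketch of how a ranged transfer is requested with libcurl (error handling left out): for HTTP the range becomes a Range: header, while for FTP only the start point goes on the wire via REST and the end point is enforced by stopping the transfer as described above.

#include <curl/curl.h>

int main(void)
{
  CURL *curl = curl_easy_init();
  if(curl) {
    /* works the same way for http:// and ftp:// URLs */
    curl_easy_setopt(curl, CURLOPT_URL, "ftp://example.com/big.iso");

    /* ask for bytes 0-999999; for HTTP this becomes a Range: header,
       for FTP the start maps to REST and the end point has to be
       enforced by the client since there is no protocol command for
       it (yet) */
    curl_easy_setopt(curl, CURLOPT_RANGE, "0-999999");

    curl_easy_perform(curl);
    curl_easy_cleanup(curl);
  }
  return 0;
}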

A solution for the future is being worked on. I’ve joined the effort to write a spec that will suggest a new FTP command that sets the end point for a transfer, in the same vein as REST sets the start point. For the moment, we’ve named our suggested command RANG (short for range). “We” in this context means Tatsuhiro Tsujikawa, Anthony Bryan and myself, but we of course hope to get further valuable feedback from the great ftpext2 people.

There already are use cases that want range requests for FTP. The people behind metalinks, for example, want to download the same file from many servers, and then it makes sense to be able to download little pieces from different sources.

The people who found the libcurl bugs I linked to above use libcurl as part of the Fedora/Red Hat installer Anaconda, and if I understand things right they use this feature to fetch just the beginning of some files to check them out, avoiding having to download a full file before knowing it is truly wanted. Thus it saves lots of bandwidth.

In short, the use-cases for ranged FTP retrievals are quite likely pretty much the same ones as they are for HTTP!

The first RANG draft is now available.

Websockets right now

Lots of sites today have JavaScript running that connects to the site and keeps the connection open for a long time (or just makes very frequent checks for updated info to get), to such a wide extent that people have started working on a better way: a way for scripts in browsers to connect to a site in a TCP-like manner to exchange messages back and forth, something that doesn’t suffer from the problems HTTP causes when (ab)used for this purpose.

Websockets is the name of the technology that has been lifted out of the HTML5 spec and taken to the IETF, where it is being worked on to produce a network protocol: basically a message-based, low-level protocol over TCP, designed to allow browsers to keep long-lived connections to servers instead of using “long-polling HTTP”, Ajax polling or other more or less ugly tricks.

Unfortunately, the name is used for both the on-the-wire protocol as well as the JavaScript API so there’s room for confusion when you read or hear this term, but I’m completely clueless about JavaScript so you can rest assured that I will only refer to the actual network protocol when I speak of Websockets!

I’m tracking the Hybi mailing list closely, and in this post I’m summarizing some of the issues that have been debated lately. It is not an attempt to cover it all. There is a lot more to be said, both now and in the future.

The first version of Websockets implemented by a browser (Chrome) was the -75 version, written by Ian Hickson as editor and the main man behind it, as it came directly from the WHATWG team (the group of browser people who work on the HTML5 spec).

He also published a -76 version (that was implemented by Firefox) before it was taken over by the IETF’s Hybi working group, who published that same document as its -00 draft.

There are representatives of a lot of server software and of browsers like Opera, Firefox, Safari and Chrome on the list (no Internet Explorer people seem to be around).

Some Problems

The 76/00 version broke compatibility with the previous version, and it also isn’t properly HTTP compatible, which many want and need.

The way Ian and WHATWG have previously worked on this (and the HTML5 spec?) is mostly by Ian leading and being the benevolent dictator of what’s good, what’s right and what goes into the spec. The collision with the IETF’s way of working, which includes “everyone can have a say” and “rough consensus”, has been harsh and a reason for a lot of arguments and long threads on the hybi mailing list, and I will throw out a guess that it will continue to be a source of “flames” in the future too.

Ian maintains – for example – that a hard requirement on the protocol is that it needs to be “trivial to implement for amateurs”, while basically everyone else has come to the conclusion that this requirement is mostly silly and should be reworded to “be simple” or similar, something that avoids the word “amateur” completely and the implication that amateurs wouldn’t be able to follow specs.

Right now

(This is early August 2010, things and facts in this protocol are likely to change, potentially a lot, over time so if you read this later on you need to take the date of this writing into account.)

There was a Hybi meeting at the recent IETF 78 meeting in Maastricht where a range of issues got discussed and some of the controversial subjects did actually get quite convincing consensus – and some of those didn’t at all match the previous requirements or the existing -00 spec.

Almost in parallel, Ian suggested a schedule for the work ahead that most people expressed a liking for: provide a version of the protocol that can be done in 4 weeks for browsers to adjust to right now, and then work on an updated version to be shipped in 6 months with more of the unknowns detailed.

There’s a very lively debate going on about the need for chunking, the need for multiplexing multiple streams over a single TCP connection, and there was a very, very long debate about whether the protocol should send data length-prefixed or use a sentinel that marks the end of the data. The length-prefix approach (favored by yours truly) seems to have finally won. Exactly how the framing is to be done, what parts are going to be core protocol and what to leave for extensions, and how extensions should work are not yet settled. I think everyone wants the protocol to be “as simple as possible”, but everyone has their own idea of what set of features the “simplest” possible form of the protocol has.
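To make it clear what the length-prefix debate is about, here is a sketch of the general idea and not the actual Websockets framing (which is exactly what has not been settled): a length-prefixed frame tells the reader up front how much payload to expect, while a sentinel-terminated frame has to be scanned for (and escaped of) the terminator.

#include <stddef.h>
#include <stdint.h>

/* sketch of length-prefixed framing: a 4-byte big-endian length
   followed by that many bytes of payload. The reader knows exactly
   how much data to wait for and never has to scan or escape the
   payload, which is the main argument for this approach. */
size_t parse_frame(const unsigned char *buf, size_t len,
                   const unsigned char **payload, uint32_t *plen)
{
  if(len < 4)
    return 0;                      /* need more data for the header */
  *plen = (uint32_t)buf[0] << 24 | (uint32_t)buf[1] << 16 |
          (uint32_t)buf[2] << 8  | (uint32_t)buf[3];
  if(len < 4 + (size_t)*plen)
    return 0;                      /* payload not complete yet */
  *payload = buf + 4;
  return 4 + (size_t)*plen;        /* bytes consumed */
}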

The subject of exactly how the handshake is going to be done isn’t quite agreed on either, by my reading, even if there’s a way defined in the -00 spec. The primary connection method will be using an HTTP Upgrade: header, thus connecting to the server’s port 80 and then upgrading the protocol from HTTP over to Websockets. That procedure is documented and supported by HTTP, but there’s been a discussion on how exactly it should be done so that cross-protocol requests can be avoided to the furthest possible extent.
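For reference, the generic HTTP Upgrade exchange that the handshake builds on looks roughly like this. The Websockets-specific handshake fields are left out here on purpose, since their exact shape is part of what is being debated:

GET /demo HTTP/1.1
Host: example.com
Upgrade: WebSocket
Connection: Upgrade
[websocket-specific handshake fields]

HTTP/1.1 101 Switching Protocols
Upgrade: WebSocket
Connection: Upgrade
[websocket-specific handshake fields]

After a successful exchange like this, the TCP connection stops being HTTP and switches to carrying Websockets frames.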

We’ve seen upwards of 50 mails per day on the hybi mailing list during some of the busiest days the last couple of weeks. I don’t see any signs of it cooling down short-term, so I think the 4 week goal seems a bit too ambitious at this point, but I’m not ruling it out. Especially since the working procedure with Ian at the wheel is not “IETF-ish”, but it’s not controlled entirely by Ian either.

Future

We are likely to see a future world with a lot of client-side applications connecting back to the server. The number of TCP connections for a single browser is likely to increase, a lot. In fact, even a single web application within a single tab is likely to do multiple websocket connections when it reuses components and widgets from several others.

It is also likely that libcurl will get a websocket implementation in the future to do raw websocket transfers with, as it most likely will be needed to properly emulate a browser/client.

I hope to keep tracking the development, occasionally express my own opinion on the matter (on the list and here), but mostly stay on top of what’s going on so that I can feed that knowledge to my friends and make sure that the curl project keeps up with the time.


(The pictured socket might not be a websocket, but it is a pretty remarkably designed power socket that is commonly seen in China, and as you can see it accepts and works with many of the world’s different power plug designs.)

curl performance

Benchmarks, speed, comparisons, performance. I get a lot of questions on how curl and libcurl compare against other tools or libraries, and I rarely have any specific answers as I personally basically never use or test any other tools or libraries!

This text will instead elaborate on how we work on libcurl and why I believe libcurl will remain the fastest alternative.


libcurl is low-level

libcurl is written in C and uses the native function calls of the operating system to perform its network operations. It offers a lot of features, but when it comes to plain sending and receiving of data the code paths are very short and the loops can’t be shortened or sped up by any significant amount. Based on these facts, I am confident that for simple single-stream transfers you really cannot write a file transfer library that runs faster. (But yes, I believe other similarly low-level libraries can reach the same speeds.)

When adding more complicated test cases, like doing SSL or perhaps many connections that need to be kept persistent between transfers, then of course libraries can start to differ. libcurl uses SSL libraries natively so if they are fast, so will libcurl’s SSL handling be and vice versa. Of course we also strive to provide features such as connection pooling, SSL session id reuse, DNS caching and more to make the normal and frequent use cases as fast as we possibly can. What takes time when using libcurl should be the underlying network operations, not the tricks libcurl adds to them.
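One concrete effect of the connection pooling: when the same easy handle is used for a follow-up request to the same server, libcurl can skip the DNS lookup, the TCP handshake and the SSL handshake entirely when the previous connection can be reused. A trivial sketch:

#include <curl/curl.h>

int main(void)
{
  CURL *curl = curl_easy_init();
  if(curl) {
    curl_easy_setopt(curl, CURLOPT_URL, "https://example.com/one");
    curl_easy_perform(curl);   /* sets up DNS + TCP + SSL, then caches
                                  the connection in the pool */

    curl_easy_setopt(curl, CURLOPT_URL, "https://example.com/two");
    curl_easy_perform(curl);   /* reuses the existing connection, the
                                  DNS cache and the SSL session when
                                  possible */
    curl_easy_cleanup(curl);
  }
  return 0;
}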

event-based is the way to grow

If you plan on making an app that uses more than just a few connections, libcurl can of course still do the heavy lifting for you. You should take precautions already at design time and make sure that you can use an event-based concept and avoid relying on select or poll for the socket handling. Using libcurl’s multi_socket API, you can go up and beyond tens of thousands of connections and still reach maximum performance. And this goes for basically all of the protocols libcurl supports; it is not limited to a small subset. (There are unfortunate exceptions, like for example “file://” URLs, but there are purely technical reasons for those.)
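A bare-bones outline of what using the multi_socket API looks like; the actual event library integration (libevent, libev or similar), error handling and the callback bodies are only hinted at:

#include <curl/curl.h>

/* called by libcurl to tell us which sockets to watch for what */
static int sock_cb(CURL *easy, curl_socket_t s, int what,
                   void *userp, void *socketp)
{
  /* add/modify/remove 's' in the event loop depending on 'what'
     (CURL_POLL_IN, CURL_POLL_OUT, CURL_POLL_REMOVE, ...) */
  return 0;
}

/* called by libcurl to tell us when it next wants to be called */
static int timer_cb(CURLM *multi, long timeout_ms, void *userp)
{
  /* (re)arm a single timer in the event loop to fire in timeout_ms */
  return 0;
}

int main(void)
{
  CURLM *multi = curl_multi_init();
  int running;

  curl_multi_setopt(multi, CURLMOPT_SOCKETFUNCTION, sock_cb);
  curl_multi_setopt(multi, CURLMOPT_TIMERFUNCTION, timer_cb);

  /* add one easy handle per transfer here, possibly tens of thousands */

  /* kick things off; from now on the event loop calls
     curl_multi_socket_action() whenever a watched socket becomes
     readable/writable or the timer expires */
  curl_multi_socket_action(multi, CURL_SOCKET_TIMEOUT, 0, &running);

  /* ... run the event loop here ... */

  curl_multi_cleanup(multi);
  return 0;
}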

Very few file transfer libraries have this direct support for event-based operations. I’ve read reports of apps that have gone up to and beyond 70,000 connections on the same host using libcurl like this. The fact that TCP only has a 16 bit field in the protocol header for “source port” (at most 65536 ports per address) of course forces users who want to try this stunt to use more than one interface as source address.
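In libcurl terms, spreading connections over several local source addresses is done per easy handle. A fragment of a sketch; the addresses are of course made up:

/* bind this transfer to a specific local address, so its source port
   comes out of that address' own 16 bit port range */
curl_easy_setopt(curl, CURLOPT_INTERFACE, "192.168.0.2");

/* another easy handle can use another local address */
curl_easy_setopt(curl2, CURLOPT_INTERFACE, "192.168.0.3");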

And before you ask: you cannot grow a client to that amount using any technique other than an event-based one with many connections in the same thread, as basically no other approach scales as well.

When handling very many connections, the mere “juggling” of the connections takes time and can be done in good or bad ways. It would be interesting to one day measure exactly how good libcurl is at this.

Binding Benchmarks and Comparisons

We are aware of something like 40 different bindings for libcurl that make it possible to use it from just about any language you like. Most languages, if not every one, also tend to have their own native transfer or at least HTTP library. For many languages, the native version is the one that is most preferred and most used in books and articles and promoted on the internet. Rarely can the native versions compete with the libcurl based ones in actual transfer performance, because of what I mentioned above: libcurl does little extra beyond the actual raw transfer. Of course there still needs to be additional glue and logic to make libcurl work fine with the different languages’ own unique environments, but it has often been shown that this doesn’t make the speed gain get lost or become invisible. I’ll illustrate below with some sample environments.

Ruby comparisons

Paul Dix is a Ruby guy and he’s done a lot of work with HTTP libraries and Ruby, and he’s also done some benchmarks on libcurl-based Ruby libraries. They show that the tools built on top of libcurl run significantly faster than the native versions.

Perl comparisons

“Ivan” wrote up a benchmarking script that performs a number of transfers using three different mechanisms available to perl hackers, one of them being the official libcurl perl binding (WWW::Curl) and another being the perl standard one called LWP. The results leave no room for doubt: the libcurl-based version is significantly faster than the “native” alternatives.

PHP comparisons

The PHP binding for libcurl, PHP/CURL, is a popular one. In PHP the situation is possibly a bit different, as they don’t have a native library that is nearly as feature complete as the libcurl binding, but they do have native functionality for doing things like getting HTTP data. That functionality has been compared against PHP/CURL many times, for example in Ricky’s comparisons and Alix Axel’s comparisons. They all show that the libcurl-based alternative is faster. Exactly how much faster of course depends on a lot of factors, but I’m not going into such specific details here and now.

We want more benchmarks!

I wish I knew about more benchmarks and comparisons of speed. If you know of others, or if you get inspired enough to write up and publish any after reading my rant here, please let me know! Not only is it fun and ego-boosting to see our project win, but I also want to learn from them and see where we’re lacking, and if anyone beats us in a test it’d be great to see what we could do to improve.

I’ll talk at FSCONS 2010

Recently I was informed that I got two talks accepted to the FSCONS 2010 conference, to be held in the beginning of November 2010.

My talks will be about the Future and current state of internet transport protocols (TCP, HTTP, SPDY, WebSockets, SCTP and more) and on High performance multi-protocol applications with libcurl, which will educate the audience on how to use libcurl when doing high performance clients with potentially a very large number of simultaneous transfers. A somewhat clueful reader will of course spot that these two talks have a lot in common, and yeah, they do reveal a lot of what I do, what I like and what I poke on these days. I hope I’ll be able to shed light on some things not everyone is already perfectly aware of.

The talks will be held in English, and if past FSCONS conferences are anything to go by, my talks will be filmed and made available online afterward for the world to see, in case you have a funeral or something to attend that prevents you from actually being there in person.

If you have thoughts, questions or anything on these topics that you would like to get answered in my talk, feel free to bring them up and I’ll see what I can do.

(If those fine guys and gals at FSCONS ever settled for a logo, or had one I could link to, I would’ve shown one of them right here.)

Webscrapers unite!

The term web scraping has come to stay. It refers to getting stuff off web sites in an automated fashion. Other terms occasionally used for it include spidering, web harvesting and more. I’ve used the term HTTP scripting for it.

The art is in many cases to act more or less like a browser, with the single purpose of making the server not detect that it is a program on the other end.
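With libcurl, the browser-mimicking part often boils down to setting a handful of options. A minimal sketch; the user-agent string and URLs are of course just examples:

#include <curl/curl.h>

int main(void)
{
  CURL *curl = curl_easy_init();
  if(curl) {
    curl_easy_setopt(curl, CURLOPT_URL, "http://example.com/");
    /* pretend to be a common browser */
    curl_easy_setopt(curl, CURLOPT_USERAGENT,
                     "Mozilla/5.0 (Windows; U; Windows NT 6.1) Firefox/3.6");
    /* follow redirects like a browser would */
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
    /* enable the cookie engine so cookies are accepted and sent back */
    curl_easy_setopt(curl, CURLOPT_COOKIEFILE, "");
    /* send a referer like a browser clicking a link would */
    curl_easy_setopt(curl, CURLOPT_REFERER, "http://example.com/start");

    curl_easy_perform(curl);
    curl_easy_cleanup(curl);
  }
  return 0;
}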

This is something I’ve done a lot over the years while developing curl, and I’ve answered countless questions on the matter. There are several good books on the subject (all of them covering curl at least in part).

Today we’ve set up a little web site over at webscrapers.haxx.se and an associated mailing list, in an attempt to gather people who work in this field. Whether you do it for fun or profit doesn’t matter, and while we can certainly discuss what tools you should or shouldn’t use for scraping, we don’t limit which ones can be discussed on the list as long as the subject is somewhat relevant.

So, if you’re into scraping or you want to learn more on how to automate your web bots, come on over and join the list and let’s see where this might take us!

My talk Optimera Sthlm

30 minutes is a tricky period to fill with content when you do a talk, and yesterday I did my best at confusing/informing the audience at the OPTIMERA STHLM conference about transport layer performance: where time is spent or lost today in TCP, what to think about to get things to behave faster, the fact that RTT is not getting better even though bandwidth is growing really fast these days, and a little about some future technologies like WebSockets, SPDY, SCTP and MPTCP.

Note: this talk is entirely in Swedish.

My slides for this are also viewable on slideshare.net.

OPTIMERA STHLM

Our friends at .SE are once again putting together an interesting conference-style day with talks, and this time the title of it is “OPTIMERA STHLM” (yes they use all caps) and it is all about optimizing web-related things.

I’ve been invited and I will do a 30 minute talk during that day about the transport layer and stuff on top of the transport layer. In other words it’ll include things to consider for TCP, DNS, HTTP, handling sockets, libcurl and a quick look at things such as Websockets, SPDY, MPTCP and SCTP.

The full day’s program is now available on the linked page. Enjoy!

curl and speced cookie order

I just posted this on the curl-library list, but I feel it deserves to be mentioned here separately.

As I’ve mentioned before, I’m involved in the IETF http-state working group which is working to document how cookies are used in the wild today. The idea is to create a spec that new implementations can follow and that existing implementations can use to become more interoperable.

(If you’re interested in these matters, I can only urge you to join the http-state mailing list and participate in the discussions.)

The subject of how to order cookies in outgoing HTTP Cookie: headers has been much debated over the recent months and I’ve also blogged about it. Now the issue has been closed, and the decision went quite opposite to my standpoint: the spec will say that while servers SHOULD not rely on the order (yeah right, some obviously already do, and with it specified like this even more will soon do the same), it will recommend that clients sort the cookies in a given way that is close to the way current Firefox does it[*].

This has the unfortunate side-effect that to become fully compatible with how the browsers do cookies, we will need to sort our cookies a bit more than what we just recently introduced. That in itself really isn’t very hard since once we introduced qsort() it is easy to sort on more/other keys.
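Just to illustrate what that kind of sorting looks like in code, here is a sketch of a qsort() comparator along these lines. The struct is hypothetical and the exact ordering rules are whatever the -06 draft ends up saying (see the footnote below):

#include <stdlib.h>
#include <string.h>

/* hypothetical cookie struct; libcurl's real one looks different */
struct cookie {
  char *path;
  char *domain;
  long creationtime;   /* some monotonically increasing creation order */
};

/* comparator roughly matching the proposed ordering: longer paths
   first, and for equal path lengths the cookie created first goes
   first */
static int cookie_cmp(const void *p1, const void *p2)
{
  const struct cookie *a = p1, *b = p2;
  size_t alen = strlen(a->path), blen = strlen(b->path);
  if(alen != blen)
    return (alen < blen) ? 1 : -1;     /* longer path sorts first */
  if(a->creationtime != b->creationtime)
    return (a->creationtime < b->creationtime) ? -1 : 1;
  return 0;
}

/* usage: qsort(cookies, ncookies, sizeof(struct cookie), cookie_cmp); */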

The biggest problem we get with this is that the sorting uses the creation time of the cookies. libcurl and curl and others mostly use the Netscape cookie file format to store cookies and keep state between invocations, and that file format doesn’t include creation time info! It is a simple text-based file format with TAB-separated columns, and the last (7th) column is the cookie’s content.

In order to support the correct sorting between sessions, we need to invent a way to pass on the creation time. My thinking is that we do this in a way that allows older libcurls to still understand the file but just not see/understand the creation time, while newer versions will be able to get it. This would be possible by extending the expires field (the 6th), as it is a numerical value that the existing code parses as a number, stopping at the first non-digit character. We could easily add a separator character and store the creation time after it. Like:

Old expire time:

2345678

New expire+creation time:

2345678/1234567
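The backwards compatibility relies on how the existing parsers read that column. Something along these lines (a simplification, not the actual libcurl code) reads 2345678 and simply never looks at the part after the slash, while a new parser can:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  /* old-style parsing of the 6th column: read a number and stop at
     the first non-digit. Given "2345678/1234567" an old parser gets
     2345678 and never looks further, while a new one can check for
     the slash and pick up the creation time after it. */
  const char *column = "2345678/1234567";
  char *end;
  long expires = strtol(column, &end, 10);
  long creation = (*end == '/') ? strtol(end + 1, NULL, 10) : 0;

  printf("expires=%ld creation=%ld\n", expires, creation);
  return 0;
}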

This format might even work with other readers of this file format if they make similar assumptions about the data, but the truth is that while we picked the format in the first place to be able to exchange cookies with a well known and well used browser, no current browser uses that format anymore. I assume there are still a bunch of other tools that do, like wget and friends.

Update: as my friend Micah Cowan explains: we can in fact use the order of the cookie file as “creation time” hint and use that as basis for sorting. Then we don’t need to modify the file format. We just need to make sure to store them in time-order… Internally we will need to keep a line number or something per cookie so that we can use that for sorting.

[*] – I believe it sorts on path length, domain length and time of creation, but as soon as the -06 draft goes online it will be easy to read the exact phrasing. The existing -05 draft exists at: http://tools.ietf.org/html/draft-ietf-httpstate-cookie-05

httpstate cookie domain pains

Back in 2008, I wrote about when Mozilla started to publish their “public suffixes list” effort, and I was a bit skeptical.

Well, the problem with domains in cookies of course didn’t suddenly go away and today when we’re working in the httpstate working group in the IETF with documenting how cookies work, we need to somehow describe how user agents tend to work with these and how they should work with them. It’s really painful.

The problem, in short, is that a site called ‘www.example.com’ is allowed to set a cookie for ‘example.com’ but not for ‘com’ alone. That’s somewhat easy to understand. It gets more complicated at once when we consider the UK, where ‘www.example.co.uk’ is fine but must not be allowed to set a cookie for ‘co.uk’.
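The reason this is so hard to get right without extra knowledge is that the obvious check, asking whether the requested cookie domain is a dot-separated suffix of the host name with at least one dot in it, accepts exactly the co.uk case. A naive sketch of such a check:

#include <string.h>

/* naive check: allow the cookie if 'domain' is a (dot-separated)
   suffix of 'host' and contains at least one embedded dot. This lets
   www.example.com set a cookie for example.com and refuses plain
   "com", but it happily accepts "co.uk" for www.example.co.uk, which
   is exactly the problem the public suffix list tries to solve. */
static int naive_domain_ok(const char *host, const char *domain)
{
  size_t hlen = strlen(host), dlen = strlen(domain);
  if(dlen > hlen || !strchr(domain, '.'))
    return 0;
  if(strcmp(host + (hlen - dlen), domain))
    return 0;
  return (hlen == dlen) || (host[hlen - dlen - 1] == '.');
}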

So how does a user agent know that co.uk is magic?

Firefox (and Chrome, I believe) uses the suffix list mentioned above, IE is claimed to have its own (unpublished) version, and Opera uses a trick where it tries to check whether the domain name resolves to an IP address or not. (Although Yngve says Opera will soon do online lookups against the suffix list as well.) But if you want to avoid those tricks, Adam Barth’s description on the http-state mailing list really is creepy.

For me, being a libcurl hacker, I of course can’t help but think about Adam’s words in his last sentence: “If a user agent doesn’t care about security, then it can skip the public suffix check”.

Well. Ehm. We do care about security in the curl project, but we still (currently) skip the public suffix check. I know, it is a bit of a contradiction, but I guess I’m just too stuck in my opinion that the public suffixes list is a terrible solution, and yet I can’t figure out anything that would work better and offer the same level of “safety”.

I’m thinking perhaps we should give in. Or what should we do? And how should we in the httpstate group document that existing and new user agents should behave to be optimally compliant and secure?

(in case you do take notice of details: yes the mailing list is named http-state while the working group is called httpstate without a dash!)

My Nordic Free Software Awards 2009 nominees

Hey, it’s really about time to nominate your favourite Free Software persons and projects from the nordic region for the 2009 awards before the time runs out.

This year, I decided to nominate the following “nordic” heroes:

Simon Josefsson

For his excellent work in GnuTLS, libssh2 and a bunch of other projects.

Henrik Nordström

For his work in the Squid project, and his efforts within IETF and its HTTP related struggles and more.

Björn Stenberg

As the primary founder of the Rockbox project. He started something special back in 2001 that now is a huge, thriving and successful Free Software project.

As you might spot, I favor “doers”. I don’t believe in the concept of “nordic projects” when it comes to free or open software – the entire concept of open and free should mean that projects cross borders and regions.

In fact, it feels so out of the ordinary to think about open source people in a geographical context that I find it hard to come up with a lot of names. It would be cool if ohloh had some way to list people and projects based on where people live.

Then again, if a person from a nordic country moves somewhere else, is he or she still a nordic person? Does it depend on where the person lived during the actual act? Is Linus Torvalds a nordic person since he was born, lived many years and started his big project in Finland?

(yeah I already blogged about this subject but hey, it can’t hurt can it?)