Category Archives: Web

web stuff

Testing 2-digit year numbers in cookies

In the current work of the IETF http-state working group, we’re documenting how cookies work. The question came up how browsers and clients treat years in ‘expires’ strings if the year is only specified with two digits. And more precisely, is 69 in the future or in the past?

I decided to figure that out. I setup a little CGI that can be used to check what your browser thinks:

http://daniel.haxx.se/cookie.cgi

It sends a single cookie header that looks like:

Set-Cookie: testme=yesyes; expires=Wed Sep  1 22:01:55 69;

The CGI script looks like this:

print "Content-Type: text/plain\n";
print "Set-Cookie: testme=yesyes; expires=Wed Sep  1 22:01:55 69;\n";
print "\nempty?\n";
print $ENV{'HTTP_COOKIE'};

You see that it prints the Cookie: header, so if you reload that URL you should see “testme=yesyes” being output if the cookie is still there. If the cookie is still there, your browser of choice treats the date above as a date in the future.

So, what browsers think 69 is in the future and what think 69 is in the past? Feel free to try out more browsers and tell me the results, this is the list we have so far:

Future:

Firefox v3 and v4 (year 2069)
curl (year 2038)
IE 7 (year 2069)
Opera (year 2036)
Konqueror 4.5
Android

Past:

Chrome (both v4 and v5)
Gnome Epiphany-Webkit

Thanks to my friends in #rockbox-community that helped me out!

(this info was originally posted to the httpstate mailing list)

Beyond just “69”

(this section was added after my first post)

After having done the above basic tests, I proceeded and wrote a slightly more involved test that sets 100 cookies in this format:

Set-Cookie: test$yy=set; expires=Wed Oct  1 22:01:55 $yy;

When the user reloads this page, the page prints all “test$yy” cookies that get sent to the server. The results with the various browsers is very interesting. These are the ranges different browsers think are future:

  • Firefox: 21 – 69 (Safari and Fennec and MicroB on n900) [*]
  • Chrome: 10 – 68
  • Konqueror: 00 – 99 (and IE3, Links, Netsurf, Voyager)
  • curl: 10 – 70
  • Opera: 41 – 69 (and Opera Mobile) [*]
  • IE8: 31 – 79 (and slimbrowser)
  • IE4: 61 – 79 (and IE5, IE6)
  • Midori: 10 – 69 (and IBrowse)
  • w3m: 10 – 37
  • AWeb: 10 – 77
  • Nokia 6300: [none]

[*] = Firefox has a default limit of 50 cookies per host which is the explanation to this funny range. When I changed the config ‘network.cookie.maxPerHost’ to 200 instead (thanks to Dan Witte), I got the more sensible and expected range 10 – 69. Opera has the similar thing, it has a limit of 30 cookies by default which explains the 41-69 limit in this case. It would otherwise get 10-69 as well. (thanks to Stanislaw Adrabinski). I guess that the IE8 range is similarly restricted due to it using a limit of 50 cookies per host and an epoch at 1980.

I couldn’t help myself from trying to parse what this means. The ranges can roughly be summarized like this:

0-9: mostly in the past
10-20: almost always in the future except Firefox
21-30: even more likely to be in the future except IE8
31-37: everyone but opera thinks this is the future
38-40: now w3m and opera think this is the past
41-68: everyone but w3m thinks this is the future
69: Chrome and w3m say past
70: curl, IE8, Konqueror say future
71-79: IE8 and Konqueror say future, everyone else say past
80-99: Konqueror say future, everyone else say past

How to test a browser near you:

  1. goto http://daniel.haxx.se/cookie2.cgi
  2. reload once
  3. the numbers shown on the screen is the year numbers the browsers consider
    to be the future as described above

Websockets right now

Lots of sites today have JavaScript running that connects to the site and keeps the connection open for a long time (or just doing very frequent checks if there is updated info to get). This to such an wide extent that people have started working on a better way. A way for scripts in browsers to connect to a site in a TCP-like manner to exchange messages back and forth, something that doesn’t suffer from the problems HTTP provides when (ab)used for this purpose.

Websockets is the name of the technology that has been lifted out from the HTML5 spec, taken to the IETF and is being worked on there to produce a network protocol that is basically a message-based low level protocol over TCP, designed to allow browsers to do long-living connections to servers instead of using “long-polling HTTP”, Ajax polling or other more or less ugly tricks.

Unfortunately, the name is used for both the on-the-wire protocol as well as the JavaScript API so there’s room for confusion when you read or hear this term, but I’m completely clueless about JavaScript so you can rest assured that I will only refer to the actual network protocol when I speak of Websockets!

I’m tracking the Hybi mailing list closely, and I’m summarizing this post on some of the issues that’s been debated lately. It is not an attempt to cover it all. There is a lot more to be said, both now and in the future.

The first versions of Websockets that were implemented by browsers (Chrome) was the -75 version, as written by Ian Hickson as editor and the main man behind it as it came directly from the WHATWG team (the group of browser-people that work on the HTML5 spec).

He also published a -76 version (that was implemented by Firefox) before it was taken over by the IETF’s Hybi working group, who published that same document as its -00 draft.

There are representatives for a lot of server software and from browsers like Opera, Firefox, Safari and Chrome on the list (no Internet Explorer people seem to be around).

Some Problems

The 76/00 version broke compatibility with the previous version and it also isn’t properly HTTP compatible that is wanted and needed by many.

The way Ian and WHATWG have previously worked on this (and the html5 spec?) is mostly by Ian leading and being the benevolent dictator of what’s good, what’s right and what goes into the spec. The collision with IETF’s way of working that include “everyone can have a say” and “rough consensus” has been harsh and a reason for a lot of arguments and long threads on the hybi mailing list, and I will throw out a guess that it will continue to be the source of “flames” even in the future.

Ian maintains – for example – that a hard requirement on the protocol is that it needs to be “trivial to implement for amateurs”, while basically everyone else have come to the conclusion that this requirement is mostly silly and should be reworded to “be simple” or similar, that avoids the word “amateur” completely or the implying that amateurs wouldn’t be able to follow specs etc.

Right now

(This is early August 2010, things and facts in this protocol are likely to change, potentially a lot, over time so if you read this later on you need to take the date of this writing into account.)

There was a Hybi meeting at the recent IETF 78 meeting in Maastricht where a range of issues got discussed and some of the controversial subjects did actually get quite convincing consensus – and some of those didn’t at all match the previous requirements or the existing -00 spec.

Almost in parallel, Ian suggested a schedule for work ahead that most people expressed a liking in: provide a version of the protocol that can be done in 4 weeks for browsers to adjust to right now, and then work on an updated version to be shipped in 6 months with more of the unknowns detailed.

There’s a very lively debate going on about the need for chunking, the need for multiplexing multiple streams over a single TCP connection and there was a very very long debate going on regarding if the protocol should send data length-prefixed or if data should use a sentinel that marks the end of it. The length-prefix approach (favored by yours truly) seem to have finally won. Exactly how the framing is to be done, what parts that are going to be core protocol and what to leave for extensions and how extensions should work are not yet settled. I think everyone wants the protocol to be “as simple as possible” but everyone has their own mind setup on what amount of features that “the simplest” possible form of the protocol has.

The subject on exactly how the handshake is going to be done isn’t quite agreed to either by my reading, even if there’s a way defined in the -00 spec. The primary connection method will be using a HTTP Upgrade: header and thus connect to the server’s port 80 and then upgrade the protocol from HTTP over to Websockets. A procedure that is documented and supported by HTTP, but there’s been a discussion on how exactly that should be done so that cross-protocol requests can be avoided to the furthest possible extent.

We’ve seen up and beyond 50 mails per day on the hybi mailing list during some of the busiest days the last couple of weeks. I don’t see any signs of it cooling down short-term, so I think the 4 weeks goal seems a bit too ambitious at this point but I’m not ruling it out. Especially since the working procedure with Ian at the wheel is not “IETF-ish” but it’s not controlled entirely by Ian either.

Future

We are likely to see a future world with a lot of client-side applications connecting back to the server. The amount of TCP connections for a single browser is likely to increase, a lot. In fact, even a single web application within a single tab is likely to do multiple websocket connections when they re-use components and widgets from several others.

It is also likely that libcurl will get a websocket implementation in the future to do raw websocket transfers with, as it most likely will be needed in a future to properly emulate a browser/client.

I hope to keep tracking the development, occasionally express my own opinion on the matter (on the list and here), but mostly stay on top of what’s going on so that I can feed that knowledge to my friends and make sure that the curl project keeps up with the time.

chinese-socket

(The pictured socket might not be a websocket, but it is a pretty remarkably designed power socket that is commonly seen in China, and as you can see it accepts and works with many of the world’s different power plug designs.)

Webscrapers unite!

The term web scraping has come to stay. It refers to getting stuff off web sites in an automated fashion. Other terms occasionally used for include spidering, web harvesting and more. I’ve used the term HTTP scripting for it.

The art is in many cases to act more or less like a browser with the single purpose of making the server not detect that it is a program in the other end.

This is something I’ve done a lot over the years while developing curl, and I’ve answered countless of questions on the matter. There are several good books (all of them having at least parts of them covering curl).

Today, we’ve setup a little web site over at webscrapers.haxx.se and an associated mailing list, in an attempt to gather people who are working in this field. Whether you do it for fun or profit doesn’t matter and while we can certainly discuss what tools you should or shouldn’t use for scraping, we certainly don’t limit which to discuss on the list as long as the subject is somewhat relevant.

So, if you’re into scraping or you want to learn more on how to automate your web bots, come on over and join the list and let’s see where this might take us!

Java apps froze my Iceweasel

I’ve noticed for quite some time that java apps haven’t seemed to work on my computer when I try to use them with my browser. Whenever I’ve started most java apps, my entire browser has just frozen and gone completely unresponsive and I’ve been forced to kill and restart it.

I run Debian Linux on just about all my machines so of course I do that on my primary desktop as well. And Iceweasel is my browser.

I’ve not really bothered much about the problem as java applets are a bit of yesterday’s technology and I rarely face anything in java that I need. Until today, when I had to login to a customer’s site and use an applet for some work related to my job. There was no decent way to avoid it (apart from perhaps logging in using another machine/browser or similar), so I decided to bite the bullet and finally fix my issue.

I searched around and I tried uninstalling all openjdk stuff and more. I restarted Iceweasel countless times to no avail.

Finally I stumbled over this post by user “almatic” and voila, it fixed my problem. As I just wasted like an hour on this, I’ll help out to make the world a little better by providing the answer here as well:

open file /etc/sysctl.d/bindv6only.conf and set net.ipv6.bindv6only=0, then restart the procfs with invoke-rc.d procps restart

here are the corresponding bugs

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=560238
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=560056

roffit lives!

Many moons ago I created a little tool I named roffit. It is just a tiny perl script that converts a man page written in the nroff format to good-looking HTML. I should perhaps also add that I didn’t find any decent alternatives then so I wrote up my own version. I’ve been using it since in projects such as curl, c-ares and libssh2 to produce web versions of the docs.

It has just done its job and I haven’t had any needs to fiddle with it. The project page lists it as last modified in 2004, even though I actually moved it to a sourceforge CVS repo back in 2007.

Just a few days ago, I got emailed and was notified that Debian has it included as a package in the distribution and someone was annoyed on some particular flaws.

This resulted in a bunch of bugs getting submitted to the Debian bug tracker, I started up the brand new roffit-devel mailing list to easier host roffit discussions and I switched over the CVS repo to a git one on github.

If you like seeing man pages turned into web pages, consider joining up and help us improve this thing!

My talk Optimera Sthlm

30 minutes is a tricky period to fill with contents when you do a talk, and yesterday I did my best at confusing/informing the audience at the OPTIMERA STHLM conference in transport layer performance. Where time is spent or lost today in TCP, what to think about to get things to behave faster, that RTT is not getting better even though brandwidth is growing really fast these days and a little about some future technologies like WebSockets, SPDY, SCTP and MPTCP.

Note: this talk is entirely in Swedish.

My slides for this is also viewable with slideshare.net like this:

OPTIMERA STHLM

Our friends at .SE are once again putting together an interesting conference-style day with talks, and this time the title of it is “OPTIMERA STHLM” (yes they use all caps) and it is all about optimizing web-related things.

I’ve been invited and I will do a 30 minute talk during that day about the transport layer and stuff on top of the transport layer. In other words it’ll include things to consider for TCP, DNS, HTTP, handling sockets, libcurl and a quick look at things such as Websockets, SPDY, MPTCP and SCTP.

The full day’s program is now available on the linked page. Enjoy!

cookie order

I’ve previously mentioned the IETF httpstate working group several times, and here’s some insights into one topic we’re currently discussing:

HTTP Cookie: sort order

The current httpstate draft for the updated cookie spec says that cookies that are sent to a server with the Cookie: header in a HTTP request need to be sorted. They must be sorted primarily on path length and secondary on creation-time.

According to others on the http-state mailinglist, that’s exactly what the major browsers do already and they thus think we should document that as that’s how cookies work.

During these discussions I’ve learned that cookies with the same name that have been specified with different paths, will all be sent in the request and thus they need to be sent in a particular order and in fact even the original Netscape “spec” said so. I agree with this and I’ve also modified libcurl to act accordingly.

I hold the position that only cookies that have identical names need this treatment and that’s what we must specify. Implementers however will most likely find that sorting all cookies at once will be easier.

The secondary sort key, the creation-time, is much more questionable. Why would the server care about that? How can they even rely on that?

Previously in this discussion (back in August 2009) I checked other open source cookie implementations to see how they deal with cookie sorting. I learned that perl’s LWP does sort them based on path length but on nothing else. The following tools did no sorting at all: wget, curl, libsoup, pavuk, lftp and aria2. I have no doubts that we can find more if we would search for more implementations in more languages and environments.

Specifying that sorting is a MUST on path length will still keep LWP and the next curl version compliant. Specifying that they MUST sort on creation-date as well will then make all of these projects non-conforming.

One problem I have in the cookie discussion in general is that the (5?) major browsers pretty much all behave the same and while the 5 major browsers have almost 100% of the web browser market for users, we cannot then automatically assume that they have 100% of the HTTP client or cookie-using client market. We just don’t know how many applications, tools and frameworks exist out there that aren’t actually browsers but still are using cookies.

Of course, I want to say that creation-time sorting is pointless but I have nothing to back that up with. No numbers. The other side of the discussion has a bunch of browsers that sort like this already, but no numbers or evidence if servers rely on this or how many that do.

Can any reader of this find a site out there that depends on the cookie order being sorted on creation-date?

(Reminder: the charter for the HTTPSTATE working group is to document existing widely used practices. It is not to solve problems or to fix problems in the existing cookie protocol. We all know and acknowledge that cookies as they currently work are quite flawed and painful.)

httpstate cookie domain pains

Back in 2008, I wrote about when Mozilla started to publish their effort the “public suffixes list” and I was a bit skeptic.

Well, the problem with domains in cookies of course didn’t suddenly go away and today when we’re working in the httpstate working group in the IETF with documenting how cookies work, we need to somehow describe how user agents tend to work with these and how they should work with them. It’s really painful.

The problem in short, is that a site that is called ‘www.example.com’ is allowed to set a cookie for ‘example.com’ but not for ‘com’ alone. That’s somewhat easy to understand. It gets more complicated at once when we consider the UK where ‘www.example.co.uk’ is fine but it cannot be allowed to set a cookie for ‘co.uk’.

So how does a user agent know that co.uk is magic?

Firefox (and Chrome I believe) uses the suffixes list mentioned above and IE is claimed to have its own (not published) version and Opera is using a trick where it tries to check if the domain name resolves to an IP address or not. (Although Yngve says Opera will soon do online lookups against the suffix list as well.) But if you want to avoid those tricks, Adam Barth’s description on the http-state mailing list really is creepy.

For me, being a libcurl hacker, I can’t of course but to think about Adam’s words in his last sentence: “If a user agent doesn’t care about security, then it can skip the public suffix check”.

Well. Ehm. We do care about security in the curl project but we still (currently) skip the public suffix check. I know, it is a bit of a contradiction but I guess I’m just too stuck in my opinion that the public suffixes list is a terrible solution but then I can’t figure anything that will work better and offer the same level of “safety”.

I’m thinking perhaps we should give in. Or what should we do? And how should we in the httpstate group document that existing and new user agents should behave to be optimally compliant and secure?

(in case you do take notice of details: yes the mailing list is named http-state while the working group is called httpstate without a dash!)