Archive for the ‘Open Source’ Category

Why curl defaults to stdout

Monday, November 17th, 2014

(Recap: I founded the curl project, I am still the lead developer and maintainer)

When you ask curl to get a URL it sends the output to stdout by default. You can of course easily change this behavior with options or just by using your shell’s redirect feature, but without any option it spews the response out to stdout. If you invoke the command at a shell prompt you immediately get to see the response as soon as it arrives.

I decided curl should work like this, and it was a natural decision I made already back in 1997 or so, when I worked on the predecessors that would later turn into curl.

On Unix systems there’s a common mantra that “everything is a file”, but in practice also that “everything is a pipe”. You accomplish things on Unix by piping the output of one program into the input of another program. Of course I wanted curl to work as well as the other components and I wanted it to blend in with the rest. I wanted curl to feel like cat, but for a network resource. And cat is certainly not the only pre-curl command that writes to stdout by default; they are plentiful.

And then: once I had made that decision and released curl for the first time on March 20, 1998, the call was made. The default was set. I will not change a default and hurt millions of users. I’d rather keep being questioned by newcomers, but now at least I can point to this blog post! :-)

About the wget rivalry

As I mention in my curl vs wget document, a very common comment I get about curl as compared to wget is that wget is “easier to use” because it needs no extra argument in order to download a single URL to a file on disk. I get that: if you type the full commands by hand you’ll press about three fewer keys to write “wget” instead of “curl -O”, but on the other hand, if this is an operation you do often and you care that much about saving key presses, I would suggest you make an alias anyway that is even shorter, and then the number of options for the command really doesn’t matter at all anymore.

I put that argument in the same category as the people who argue that wget is easier to use because you can type it with your left hand only on a qwerty keyboard. Sure, that is indeed true, but I read it more as someone trying to come up with a reason when in reality there’s another one underneath. Sometimes that other reason is a philosophical one about preferring GNU software (which curl isn’t) or software licensed under the GPL (which wget is), or simply that wget is what they’re used to and they know its options and recognize or like its progress meter better.

I enjoy our friendly competition with wget and I seriously and honestly think it has made both our projects better. I like that users can throw arguments in our faces like “but X can do Y”, where X alternates between curl and wget depending on which camp you talk to. I also really like wget as a tool and I am an occasional user of it, just like most Unix users. I contribute to the wget project as well, both with code and with general feedback. I consider myself a friend of the current wget maintainer as well as former ones.

Keyboard key frequency

Wednesday, November 12th, 2014

A while ago I wrote about my hunt for a new keyboard, and in my follow-up conversations with friends around that subject I quickly came to the conclusion I should get myself better analysis and data on how I actually use a keyboard and the individual keys on it. And if you know me, you know I like (useless) statistics.

So, I tried out the popular and widely used Linux key-logger software ‘logkeys’ and immediately figured out that it doesn’t really support the precision and detail level I wanted, so I forked the project and modified the code to work the way I want it: keyfreq was born. Code is on github. (I forked it only because I couldn’t find any way to send my modifications back to the upstream project; I don’t really feel a need for another project.)

Then I fired up the logging process and it has been running in the background for a while now, logging every key stroke with a time stamp.

Counting key frequency and how it is distributed very quickly turns into basically seeing when I’m active in front of the computer, and it also gave me thoughts about what a high key frequency actually means in terms of activity and productivity. Does a really high key frequency mean that I was working intensely, or is it rather a sign of mail-writing time? When I debug problems or research details, won’t those periods result in slower key activity?

In the end I guess that over time, the key frequency chart basically says that if I have pressed a lot of keys during a period, I was working on something then. Hours or days with a very low average key frequency are probably times when I don’t work as much.

The weekend key frequency is bound to be slightly off, since I sometimes do weekend hacking on other computers where I don’t log the keys; my results are recorded from one specific keyboard only.

Conclusions

So what did I learn? Here are some conclusions and results from 1,276,614 keystrokes made during the most recent 52 calendar days.

I have a 105-key keyboard, but during this period I only pressed 90 unique keys. Out of those 90 keys, three were each pressed more than 5% of the time; together those three account for more than 20% of all keystrokes. They are: <Space>, <Backspace> and the letter ‘e’.

<Space> stands out from all the rest, as it was used in more than 10% of all presses.

Only 29 keys were used in more than 1% of the presses, giving the distribution a really long tail with lots of keys hardly ever used.

Over this logged time, I registered key strokes during 46% of all hours. Counting only the hours in which I actually used the keyboard, the average was 2185 key strokes/hour, or about 36 keys/minute.

On an average week day (excluding weekend days) I registered 32486 key presses. In the most active single minute during this logging period, I hit 405 keys. In the most active single hour I managed 7937 key presses. During weekends my activity is much lower, averaging 5778 keys/day (7.2% of all activity happened on weekends).

When counting the most active hours of the day, 14 hours have more than 1% of the activity each and 5 have less than 1%, leaving 5 hours with no keyboard activity at all (02:00-06:59). Interestingly, the hour between 23:00 and 24:00 at night is my single most busy hour, with 12.5% of all keypresses during the period.

Random “anecdotes”

Longest contiguous time without keys: 26.4 hours

Longest key sequence without backspace: 946

There are seven keys I only pressed once during this period; four of them are on the numerical keypad and the other three are F10, F3 and <Pause>.

More

I’ll try to keep the logging going and see if things change over time, or if patterns show up later that can only be seen in data collected over a longer period.

Changing networks on Mac with Firefox

Thursday, October 30th, 2014

Not too long ago I blogged about my work to better deal with changing networks while Firefox is running. That job basically had two parts:

A) generic code to handle receiving such a network-changed event, and

B) a platform-specific part, at the time Windows only, that detected such a network change and sent the event

Today I landed yet another fix for part B, bug 1079385, which detects network changes for Firefox on Mac OS X.

I’ve never programmed anything on the Mac before, so this was sort of my christening in this environment. I mean, I’ve written countless POSIX compliant programs, including curl and friends, that certainly build and run on Mac OS just fine, but I had never before used the Mac-specific APIs to do things.

I got a mac mini just two weeks ago to work on this. Getting it set up, prepared and my first Firefox built from source took less than three hours all-in-all. Learning the details of the Mac API world was much more trouble, and I can’t say that I’m mastering it now either, but I did at least figure out how to detect when the IP addresses on the interfaces change, and a changed address is a pretty good signal that the network changed somehow.
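
To give an idea of the concept, here is a rough sketch and not the actual Firefox patch (which uses Mac notification APIs rather than polling): enumerating the local addresses with getifaddrs() and folding them into a checksum is enough to notice that something changed. The helper name snapshot_interfaces() is made up for this illustration.

#include <arpa/inet.h>
#include <ifaddrs.h>
#include <netinet/in.h>
#include <stdio.h>

/* Fold all current interface addresses into a simple checksum. If the
   checksum differs from an earlier run, an address was added, removed
   or changed - a decent "network changed" signal. */
static unsigned long snapshot_interfaces(void)
{
  struct ifaddrs *list, *ifa;
  char buf[INET6_ADDRSTRLEN];
  unsigned long hash = 5381;

  if(getifaddrs(&list) != 0)
    return 0;

  for(ifa = list; ifa; ifa = ifa->ifa_next) {
    const char *addr = NULL;
    if(!ifa->ifa_addr)
      continue;
    if(ifa->ifa_addr->sa_family == AF_INET)
      addr = inet_ntop(AF_INET,
                       &((struct sockaddr_in *)ifa->ifa_addr)->sin_addr,
                       buf, sizeof(buf));
    else if(ifa->ifa_addr->sa_family == AF_INET6)
      addr = inet_ntop(AF_INET6,
                       &((struct sockaddr_in6 *)ifa->ifa_addr)->sin6_addr,
                       buf, sizeof(buf));
    if(addr) {
      const char *p;
      for(p = addr; *p; p++)
        hash = hash * 33 + (unsigned char)*p;
    }
  }
  freeifaddrs(list);
  return hash;
}

int main(void)
{
  unsigned long previous = snapshot_interfaces();
  /* ... time passes, the user switches from wifi to wired ... */
  unsigned long current = snapshot_interfaces();
  if(current != previous)
    printf("network changed\n");
  return 0;
}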

daniel.haxx.se episode 8

Monday, October 27th, 2014

Today I hesitated about making my new weekly video episode. I looked at the viewer numbers and how they have basically dwindled over the last few weeks. I’m not making this video series interesting enough for a very large crowd of people. I’m re-evaluating whether I should do them at all, or if I can do something to spice them up…

… or perhaps just not look at the viewer numbers at all and just do what I think is fun?

I decided I’ll go with the latter for now. After all, I enjoy making these and they usually give me some interesting feedback and discussions even if the numbers are really low. What good is a number anyway?

This week’s episode:

Personal

Firefox

Fun

HTTP/2

Talks

  • I’m offering two talks for FOSDEM

curl

  • release next Wednesday
  • bug fixing period
  • security advisory is pending

wget

Stricter HTTP 1.1 framing good bye

Sunday, October 26th, 2014

I worked on a patch for Firefox bug 237623 to make sure Firefox would use a stricter check for “HTTP 1.1 framing”, checking that Content-Length is correct and that there are no broken chunked-encoding pieces. I was happy to close an over 10-year-old bug when the fix landed in June 2014.

The fix landed and caused no grief at all from June through to the actual live release (Nightlies, Aurora, Beta etc). The change finally shipped in Firefox 33, I had more or less already started to forget about it, and then things went south really fast.

The number of broken servers turned out to be too massive for us and we had to backpedal. The bulk of the problems can be split into these two categories:

  1. Servers that deliver gzipped content but send a Content-Length: for the uncompressed data. This seems to be commonly done with old mod_deflate and mod_fastcgi versions on Apache, but we also saw people using IIS report this symptom.
  2. Servers that deliver chunked encoding but skip the final zero-size chunk, so that the stream never actually ends (see the wire-format example below).
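
To make that second category concrete, here is roughly what a correct chunked-encoded response looks like on the wire (the payload is made up for illustration); the body must end with a zero-size chunk and a final empty line:

HTTP/1.1 200 OK
Transfer-Encoding: chunked

7\r\n          <- chunk size in hex, then CRLF
Mozilla\r\n    <- chunk data, then CRLF
0\r\n          <- the final zero-size chunk ...
\r\n           <- ... and the terminating empty line

Servers in category 2 simply stop after the last data chunk and never send that terminating zero chunk, so a strict reader cannot tell whether the response actually completed or was cut off.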

We recognize that not everyone can have their servers fixed – even if all these servers should still be fixed! We now detect these HTTP 1.1 framing problems, but they only cause an error if a certain pref variable is set (network.http.enforce-framing.http1), and since that pref is disabled by default the problems are silently ignored, much like before. The Internet is a more broken and sadder place than I want to accept at times.

We haven’t fully worked out how to also make the download manager (ie the thing that downloads things directly to disk, without showing it in the browser) happy, which was the original reason for bug 237623…

Although the code may no longer alert about HTTP 1.1 framing problems, it will now at least mark the connection as not eligible for re-use, which is still a big improvement compared to before, since these broken framing cases really hurt the use of persistent connections. The partial-transfer return codes for broken SPDY and HTTP/2 transfers remain though, and I hope to be able to stay stricter with these newer protocols.

This partial reversion will land ASAP and get merged into patch releases of Firefox 33 and later.

Finally, to top this off. Here’s a picture of an old HTTP 1.1 frame so that you know what we’re talking about.

An old http1.1 frame

Pretending port zero is a normal one

Saturday, October 25th, 2014

When speaking the TCP protocol, we communicate between “ports” on the local and remote ends. Each of these port fields is 16 bits in the protocol header, so it can hold values between 0 and 65535. (This is the same for IPv4 and IPv6.) We usually do HTTP on port 80, HTTPS on port 443, and so on. We can even play around and use them on various other custom ports when we feel like it.

But what about port 0 (zero)? Sure, IANA lists the port as “reserved” for TCP and UDP, but that’s just a rule in a list of ports, not a filter actually implemented by anyone.

In the actual TCP protocol, port 0 is nothing special, just another number. Several people have told me “it is not supposed to be used” or that it is otherwise somehow considered bad to use this port over the internet. I don’t really know where this notion comes from, other than that IANA listing.

Frank Gevaerts helped me perform some experiments with TCP port zero on Linux.

In the Berkeley sockets API widely used for TCP communication, port zero has a slightly harder time. Most of the functions and structs treat zero as just another number, so there’s virtually no problem connecting to this port as a client, using for example curl. See below for a printout from a test shot.

Running a TCP server on port 0, however, is tricky, since bind() takes a zero port number to mean “pick a random one” (I can only assume this was a mistake made eons ago that can’t be changed). For this test, a little iptables trickery was used so that incoming traffic on TCP port 0 was redirected to port 80 on the server machine, so that we didn’t have to patch any server code.
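
For reference, here is a minimal sketch (not the code used in this experiment) of what the two sides look like with the plain sockets API: a client just puts htons(0) in the address struct, while bind() treats a zero port as “pick a free port for me”. The 10.0.0.1 address is the same test server as in the printout further down.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
  /* client side: nothing stops us from asking for port 0 */
  struct sockaddr_in dest;
  int cfd = socket(AF_INET, SOCK_STREAM, 0);
  memset(&dest, 0, sizeof(dest));
  dest.sin_family = AF_INET;
  dest.sin_port = htons(0);            /* port zero, on purpose */
  inet_pton(AF_INET, "10.0.0.1", &dest.sin_addr);
  if(connect(cfd, (struct sockaddr *)&dest, sizeof(dest)) == 0)
    printf("connected to port 0\n");

  /* server side: bind() takes port 0 to mean "pick a random one" */
  struct sockaddr_in local;
  socklen_t len = sizeof(local);
  int sfd = socket(AF_INET, SOCK_STREAM, 0);
  memset(&local, 0, sizeof(local));
  local.sin_family = AF_INET;
  local.sin_addr.s_addr = htonl(INADDR_ANY);
  local.sin_port = htons(0);           /* we asked for zero ... */
  bind(sfd, (struct sockaddr *)&local, sizeof(local));
  getsockname(sfd, (struct sockaddr *)&local, &len);
  printf("... but the kernel gave us port %d\n", ntohs(local.sin_port));

  close(cfd);
  close(sfd);
  return 0;
}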

Entering a URL with port number zero to Firefox gets this message displayed:

This address uses a network port which is normally used for purposes other than Web browsing. Firefox has canceled the request for your protection.

… but Chrome accepts it and tries to use it as given.

The only little nit that remains when using curl against port 0 is that glibc’s getpeername() seems to assume this is an illegal port number and refuses to work. You can see it on the getpeername() line in curl’s output below. The actual source code with this check is here. This failure is not lethal for libcurl; it just gets slightly less info but still continues to work. I claim this is a glibc bug.

$ curl -v http://10.0.0.1:0 -H "Host: 10.0.0.1"
* Rebuilt URL to: http://10.0.0.1:0/
* Hostname was NOT found in DNS cache
* Trying 10.0.0.1...
* getpeername() failed with errno 107: Transport endpoint is not connected
* Connected to 10.0.0.1 () port 0 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.38.1-DEV
> Accept: */*
> Host: 10.0.0.1
>
< HTTP/1.1 200 OK
< Date: Fri, 24 Oct 2014 09:08:02 GMT
< Server: Apache/2.4.10 (Debian)
< Last-Modified: Fri, 24 Oct 2014 08:48:34 GMT
< Content-Length: 22
< Content-Type: text/html

<html>testpage</html>

Why did I do this experiment? Just for fun, to see if it worked.

(Discussion and comments on this post is also found at Reddit.)

curl is no POODLE

Friday, October 17th, 2014

Once again the internet was flooded with reports and alerts about a vulnerability with a funny name: POODLE. If you have even the slightest interest in this sort of stuff you’ve already grown tired and bored of everything that’s been written about it, so why on earth do I have to pile on and add to the pain?

This is my way of explaining how POODLE affects or doesn’t affect curl, libcurl and the huge amount of existing applications using libcurl.

Is my application using HTTPS with libcurl or curl vulnerable to POODLE?

No. POODLE really is a browser-attack.

Motivation

The POODLE attack is a combination of several separate pieces that only become exploitable when put together. The individual pieces are not enough on their own.

SSLv3 is getting a lot of heat now since POODLE requires the ability to downgrade a connection from TLS to SSLv3 in order to work – and downgrade in a fairly crude way at that. Among libcurl builds, only those using NSS as the TLS backend support this way of downgrading the protocol level.

Then, if an attacker manages to downgrade to SSLv3 (both the client and the server must allow this) and gets to use the vulnerable block cipher mode of that protocol, the attacker must maintain a connection to the server and retry many similar requests in order to work out details of the request and figure out secrets it shouldn’t have access to. This would typically be done using javascript in a browser, and really only HTTPS allows that, so no other SSL-using protocol can be exploited like this.

For the typical curl user or a libcurl user, there’s A) no javascript and B) the application already knows the request it is doing and normally doesn’t inject random stuff from 3rd party sources that could be allowed to steal secrets. There’s really no room for any outsider here to steal secrets or cookies or whatever.

How will curl change

There’s no immediate need to do anything as curl and libcurl are not vulnerable to POODLE.

Still, SSLv3 is long overdue for retirement and is not really a modern protocol (TLS 1.0, its successor, had its RFC published in 1999), so in order to really avoid the risk that this protocol can be exploited one way or another, now or later, using curl/libcurl, we will disable SSLv3 by default in the next curl release. For all TLS backends.

Why? Just to be extra super cautious, and because this attack helped us remember that SSLv3 is old and should be allowed to die.

If possible, explicitly requesting SSLv3 will remain possible, so that users can still work with legacy systems in dire need of an upgrade but placed in corners of the world that every sensible human has long since forgotten or just ignored.
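
For an application using libcurl, explicitly opting in to SSLv3 for such a legacy server would then look something like this (the URL here is of course just a placeholder); on the command line, the corresponding option is -3 / --sslv3:

#include <curl/curl.h>

int main(void)
{
  CURL *curl;

  curl_global_init(CURL_GLOBAL_DEFAULT);
  curl = curl_easy_init();
  if(curl) {
    curl_easy_setopt(curl, CURLOPT_URL, "https://legacy.example.com/");
    /* explicitly ask for SSLv3, overriding the (new) default */
    curl_easy_setopt(curl, CURLOPT_SSLVERSION, CURL_SSLVERSION_SSLv3);
    curl_easy_perform(curl);
    curl_easy_cleanup(curl);
  }
  curl_global_cleanup();
  return 0;
}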

In-depth explanations of POODLE

I especially like the ones provided by PolarSSL and GnuTLS, possibly due to their clear “distance” from browsers.

FOSS them students

Thursday, October 16th, 2014

On October 16th, I visited DSV at Stockholm University, where I had the pleasure of holding a talk and discussion with students (and a few teachers) on the topic Contribute to Open Source. Around 30 people attended.

Here are the slides I used. As usual they probably don’t tell the whole story stand-alone without the talk, but no recording was made and I talked in Swedish anyway…

Contribute to Open Source from Daniel Stenberg

What a removed search from Google looks like

Sunday, October 12th, 2014

Back in the days when I participated in starting the Subversion project, I found the mailing list archive we had really dysfunctional and hard to use, so I set up a separate archive for the benefit of everyone who wanted an alternative way to find Subversion-related posts.

This archive is still alive and recently surpassed 370,000 archived emails, all related to Subversion, across seven different mailing lists.

Today I received a notice from Google (shown in its entirety below) that one of the mails received in 2009 is now apparently removed from searches for a particular name – at least when the search is done within the European Union. It is hard to take this seriously when you look at the page in question, and since there aren’t very many names on that page, the candidates for which name it is are few. As there are several different mail archives for Subversion mails, I can only assume that the corresponding search results from those have also been removed.

This is the first removal I’ve got for any of the sites and contents I host.


Notice of removal from Google Search

Hello,

Due to a request under data protection law in Europe, we are no longer able to show one or more pages from your site in our search results in response to some search queries for names or other personal identifiers. Only results on European versions of Google are affected. No action is required from you.

These pages have not been blocked entirely from our search results, and will continue to appear for queries other than those specified by individuals in the European data protection law requests we have honored. Unfortunately, due to individual privacy concerns, we are not able to disclose which queries have been affected.

Please note that in many cases, the affected queries do not relate to the name of any person mentioned prominently on the page. For example, in some cases, the name may appear only in a comment section.

If you believe Google should be aware of additional information regarding this content that might result in a reversal or other change to this removal action, you can use our form at https://www.google.com/webmasters/tools/eu-privacy-webmaster. Please note that we can’t guarantee responses to submissions to that form.

The following URLs have been affected by this action:

http://svn.haxx.se/users/archive-2009-08/0808.shtml

Regards,

The Google Team

internal timers and timeouts of libcurl

Friday, October 10th, 2014

Bear with me. It is time to take a deep dive into the libcurl internals and see how it handles timeouts and timers. This is meant as useful information for libcurl users, but even more as insight for people who’d like to fiddle with libcurl internals and work on its source code and architecture.

socket activity or timeout

Everything internal in libcurl uses the multi, asynchronous, interface. We avoid blocking calls as far as we can. This means that libcurl always either waits for activity on a socket/file descriptor or for the time to come to do something. If there’s no socket activity and no timeout has expired, there’s nothing to do and it just returns back out.

It is important to remember here that the libcurl API doesn’t force the user to call it again within, or exactly at, the specified time, and it also allows users to call it again “too soon” if they like. Some users will even busy-loop like crazy and keep hammering the API like a machine gun, and we must deal with that. So, the timeouts are mostly to be considered advisory.
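
Seen from the application side, the contract looks like this: ask libcurl how long you may wait at most, wait for socket activity or for that time to pass (or less, if you are impatient), then call it again. A minimal sketch of such a loop using the public multi API, with error checking omitted:

#include <curl/curl.h>

static void transfer_loop(CURLM *multi)
{
  int running = 1;

  while(running) {
    long timeout_ms;
    int numfds;

    /* ask libcurl how long we may wait at most before calling it again */
    curl_multi_timeout(multi, &timeout_ms);
    if(timeout_ms < 0)
      timeout_ms = 1000;  /* no timeout is set, pick something sane */

    /* wait for socket activity or for the timeout to expire */
    curl_multi_wait(multi, NULL, 0, (int)timeout_ms, &numfds);

    /* drive the transfers; calling "too soon" is allowed, just wasteful */
    curl_multi_perform(multi, &running);
  }
}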

many timeouts

A single transfer can have multiple timeouts: for example one maximum time for the entire transfer, one for the connection phase, and perhaps further timers that handle things like speed caps (which keep libcurl from transferring data faster than a set limit) or that detect transfer speeds below a certain threshold within a given time period.

A single transfer is done with a single easy handle, which holds all of its timeouts in a sorted list. This allows libcurl to return a single time-left value, until the nearest timeout expires, without having to bother with the rest of the timeouts (yet).

Curl_expire()

… is the internal function to set a timeout that expires a certain number of milliseconds into the future. It adds a timeout entry to the list of timeouts. Expiring a timeout just means that it will signal the application to call libcurl again. Internally we don’t have any identifiers for the timeouts; they’re just times in the future at which we ask to be called again. If the code needs that specific time to really have passed before doing something, the code itself has to verify that the time has elapsed.

Curl_expire_latest()

A newcomer to the timeout team. I figured out we need this function for the cases where we are in a state where we must be called again no later than a certain specific future time. It will not add a new timeout entry to the list if there is already a timeout that expires earlier than the specified time limit.

This function is useful for example when there’s a state in libcurl that varies over time but has no specific time limit to check for, like transfer speed limits and the like. If Curl_expire() were used in such situations instead of Curl_expire_latest(), it would mean adding a new timeout entry every time, and for the busy-loop API usage cases it could mean adding an excessive number of timeout entries. (And there was a scary bug report about “tens of thousands of entries” which motivated this function to get added.)
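
Conceptually, the difference between the two could be sketched like this. This is a simplified, self-contained illustration and not the actual libcurl source; the real code keeps the timeouts in a linked list on the easy handle and uses its own time types.

#include <stdio.h>
#include <string.h>

#define MAXTIMERS 32

/* a pretend easy handle with its sorted list of absolute expiry times
   (in milliseconds), soonest first */
struct sketch_handle {
  long timers[MAXTIMERS];
  int ntimers;
};

/* insert an expiry time, keeping the array sorted */
static void add_sorted(struct sketch_handle *h, long expiretime)
{
  int i = h->ntimers;
  if(i == MAXTIMERS)
    return;
  while(i > 0 && h->timers[i - 1] > expiretime) {
    h->timers[i] = h->timers[i - 1];
    i--;
  }
  h->timers[i] = expiretime;
  h->ntimers++;
}

/* like Curl_expire(): always add a new timeout entry */
static void expire(struct sketch_handle *h, long now, long milli)
{
  add_sorted(h, now + milli);
}

/* like Curl_expire_latest(): "call me no later than this". Only add an
   entry if nothing in the list already expires earlier than that. */
static void expire_latest(struct sketch_handle *h, long now, long milli)
{
  long when = now + milli;
  if(h->ntimers == 0 || h->timers[0] > when)
    add_sorted(h, when);
  /* else: an earlier timeout already guarantees we get called in time */
}

int main(void)
{
  struct sketch_handle h;
  memset(&h, 0, sizeof(h));

  expire(&h, 0, 100);         /* adds an entry at t=100 */
  expire_latest(&h, 0, 200);  /* ignored, t=100 is already earlier */
  expire_latest(&h, 0, 50);   /* added, nothing expires before t=50 */

  printf("%d timeout entries, nearest at t=%ld\n", h.ntimers, h.timers[0]);
  return 0;
}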

timeout removals

We don’t remove timeouts from the list until they expire. If we for example have a condition that is timing dependent, we set a timeout with Curl_expire() and then we know we will be called again at the end of that time.

If we didn’t add the timeout and there was no socket activity on the socket, we might never be called again – ever.

When the internal state transitions into something else and we therefore no longer need a previously set timeout, we have no handle or identifier for that timeout, so it cannot be removed. Instead, we simply get called again when the timeout triggers even though we no longer really need it. As the API allows this anyway, it is already handled by the logic, and getting called an extra time is usually very cheap and not considered a problem worth addressing.

Timeouts are removed automatically from the list of timers when they expire. Timeouts whose time has passed are removed from the list, and the timers that follow then move to the front of the queue and are used to calculate the next single timeout.

The only internal API to remove timeouts that we have removes all timeouts, used when cleaning up a handle.

many easy handles

I’ve described above how each easy handle treats its timeouts. With the multi interface, we can have any number of easy handles added to a single multi handle. That means one list of timeouts per easy handle.

To handle many thousands of easy handles added to the same multi handle, each exposing only its closest timeout, libcurl builds a splay tree of easy handles sorted on the timeout time. It is a splay tree rather than a sorted list to allow really fast insertions and removals.

As soon as a timeout expires for one of the easy handles and that handle moves on to the next timeout in its list, it means removing one node (easy handle) from the splay tree and inserting it again with the new timeout time.