Category Archives: cURL and libcurl

curl and/or libcurl related

Three static code analyzers compared

I’m a fan of static code analyzing. With the use of fancy scanner tools we can get detailed reports about source code mishaps and quite decently pinpoint what source code that is suspicious and may contain bugs. In the old days we used different lint versions but they were all annoying and very often just puked out far too many warnings and errors to be really useful.

Out of coincidence I ended up getting analyses done (by helpful volunteers) on the curl 7.26.0 source base with three different tools. An excellent opportunity for me to compare them all and to share the outcome and my insights of this with you, my friends. Perhaps I should add that the analyzed code base is 100% pure C89 compatible C code.

Some general observations

First out, each of the three tools detected several issues the other two didn’t spot. I would say this indicates that these tools still have a lot to improve and also that it actually is worth it to run multiple tools against the same source code for extra precaution.

Secondly, the libcurl source code has some known peculiarities that admittedly is hard for static analyzers to figure out and not alert with false positives. For example we have several macros that look like functions and on several platforms and build combinations they evaluate as nothing, which causes dead code to be generated. Another example is that we have several cases of vararg-style functions and these functions are documented to work in ways that the analyzers don’t always figure out (both clang-analyzer and Coverity show problems with these).

Thirdly, the same lesson we knew from the lint days is still true. Tools that generate too many false positives are really hard to work with since going through hundreds of issues that after analyses turn out to be nothing makes your eyes sore and your head hurt.

Fortify

The first report I got was done with Fortify. I had heard about this commercial tool before but I had never seen any results from a run but now I did. The report I got was a PDF containing 629 pages listing 1924 possible issues among the 130,000 lines of code in the project.

fortify-curl

Fortify claimed 843 possible buffer overflows. I quickly got bored trying to find even one that could lead to a problem. It turns out Fortify has a very short attention span and warns very easily on lots of places where a very quick glance by a human tells us there’s nothing to be worried about. Having hundreds and hundreds of these is really tedious and hard to work with.

If we’re kind we call them all false positives. But sometimes it is more than so, some of the alerts are plain bugs like when it warns on a buffer overflow on this line, warning that it may write beyond the buffer. All variables are ‘int’ and as we know sscanf() writes an integer to the passed in variable for each %d instance.

sscanf(ptr, "%d.%d.%d.%d", &int1, &int2, &int3, &int4);

I ended up finding and correcting two flaws detected with Fortify, both were cases where memory allocation failures weren’t handled properly.

LLVM, clang-analyzer

Given the exact same code base, clang-analyzer reported 62 potential issues. clang is an awesome and free tool. It really stands out in the way it clearly and very descriptive explains exactly how the code is executed and which code paths that are selected when it reaches the passage is thinks might be problematic.

clang-analyzer report - click for larger version

The reports from clang-analyzer are in HTML and there’s a single file for each issue and it generates a nice looking source code with embedded comments about which flow that was followed all the way down to the problem. A little snippet from a genuine issue in the curl code is shown in the screenshot I include above.

Coverity

Given the exact same code base, Coverity detected and reported 118 issues. In this case I got the report from a friend as a text file, which I’m sure is just one output version. Similar to Fortify, this is a proprietary tool.

coverity curl report - click for larger version

As you can see in the example screenshot, it does provide a rather fancy and descriptive analysis of the exact the code flow that leads to the problem it suggests exist in the code. The function referenced in this shot is a very large function with a state-machine featuring many states.

Out of the 118 issues, many of them were actually the same error but with different code paths leading to them. The report made me fix at least 4 accurate problems but they will probably silence over 20 warnings.

Coverity runs scans on open source code regularly, as I’ve mentioned before, including curl so I’ve appreciated their tool before as well.

Conclusion

From this test of a single source base, I rank them in this order:

  1. Coverity – very accurate reports and few false positives
  2. clang-analyzer – awesome reports, missed slightly too many issues and reported slightly too many false positives
  3. Fortify – the good parts drown in all those hundreds of false positives

curl’s new HTTP cookies docs

Whenever libcurl saves cookies to a file, it saves them in the old “Netscape cookie format”.cookie

Since ages ago libcurl has written a comment at the top of that cookie file (the file we refer to as the cookie jar in curl lingo) that explains that it is a cookie netscape cookie file format and then a URL pointing to the original Netscape document describing what cookies are and how they work. To be perfectly honest, we pointed to a local copy of that document since the original is since long removed and gone and now only available through archive.org.

The web page pointed to were never a perfect match since the page never explained the file format or its use, only the cookie protocol – as it was originally described to work and not even like cookies works today.

Starting with curl 7.27.0, it will write the header to point to another URL: http://curl.haxx.se/docs/http-cookies.html

cURL

Is there a case for a unified SSL front?

There are many, sorry very many, different SSL libraries today that various programs may want to use. In the open source world at least, it is more and more common that SSL (and other crypto) using programs offer build options to build with either at least OpenSSL or GnuTLS and very often they also offer optinal build with NSS and possibly a few other SSL libraries.

In the curl project we just added support for library number nine. In the libcurl source code we have an internal API that each SSL library backend must provide, and all the libcurl source code is internally using only that single and fixed API to do SSL and crypto operations without even knowing which backend library that is actually providing the functionality. I talked about libcurl’s internal SSL API before, and I asked about this on the libcurl list back in Feb 2011.

So, a common problem should be able to find a common solution. What if we fixed this in a way that would be possible for many projects to re-use? What if one project’s ability to select from 9 different provider libraries could be leveraged by others. A single SSL API with a simplified API but that still provides the functionality most “simple” SSL-using applications need?

Marc Hörsken and I have discussed this a bit, and pidgin/libpurble came up as a possible contender that could use such a single SSL. I’ve also talked about it with Claes Jakobsson and Magnus Hagander for postgresql and I know since before that wget certainly could use it. When I’ve performed my talks on the seven SSL libraries of libcurl I’ve been approached by several people who have expressed a desire in seeing such an externalized API, and I remember the guys from cyassl among those. I’m sure there will be a few other interested parities as well if this takes off.

What remains to be answered is if it is possible to make it reality in a decent way.

Upsides:

  1. will reduce code from the libcurl code base – in the amount of 10K or more lines of C code, and it should be a decreased amount of “own” code for all projects that would decide to use thins single SSL library
  2. will allow other projects to use one out of many SSL libraries, hopefully benefiting end users as a result
  3. should get more people involved in the code as more projects would use it, hopefully ending up in better tested and polished source code

There are several downsides with a unified library, from the angle of curl/libcurl and myself:

  1. it will undoubtedly lead to more code being added and implemented that curl/libcurl won’t need or use, thus grow the code base and the final binary
  2. the API of the unified library won’t be possible to be as tightly integrated with the libcurl internals as it is today, so there will be some added code and logic needed
  3. it will require that we produce a lot of documentation for all the functions and structs in the API which takes time and effort
  4. the above mention points will deduct time from my other projects (hopefully to benefit others, but still)

Some problems that would have to dealt with:

  1. how to deal with libraries that don’t provide functionality that the single SSL API does
  2. how to draw the line of what functionality to offer in the API, as the individual libraries will always provide richer and more complete APIs
  3. What is actually needed to get the work on this started for real? And if we do, what would we call the project…

PS, I know the term is more accurately “TLS” these days but somehow people (me included) seem to have gotten stuck with the word SSL to cover both SSL and TLS…

darwin native SSL for curl

I recently mentioned the new schannel support for libcurl that allows libcurl to do SSL natively without the use of any external libraries on Windows.

This “getting native support” obviously triggered Nick Zitzmann who stepped up and sent in Secure Transport support – the native API for doing SSL on Mac OS X and iOS. This ninth supported SSL library is now called ‘darwinssl’ in the curl code base. There have been some follow-up commits too to cleanup things and make use of that API for providing the necessary function calls when doing NTLM too etc.

This functionality is merged in to curl’s master git repository and will be part of the upcoming curl 7.27.0 release, planned to hit the public at the end of July 2012.

It could be noted that if your for example build curl/libcurl to also support SCP and SFTP, you’d be linking with libssh2 for that and libssh2 is still relying on a crypto library that is either OpenSSL or gcrypt so you may in fact still end up linking with a 3rd party crypto library… Nick mentioned in a separate mail how he has looked into making libssh2 use the Secure Transport API, but that he faced some issues regarding big numbers which made him hesitate and consider how to move forward.

schannel support in libcurl

schannel is the API Microsoft provides to allow applications to for example implement SSL natively, without needing any third part library.

On Monday June 11th we merged the 30+ commits Marc Hörsken brought us. This is now the 8th SSL variation supported by libcurl, and I figure this is going to become fairly popular now in the Windows camp coming the next release: curl 7.27.0.

So now my old talk about the seven SSL libraries libcurl supported has become outdated…

It can be worth noting that as long as you build (lib)curl to also support SCP and SFTP, powered by libssh2, that library will still require a separate crypto library and libssh2 supports to get built with either OpenSSL or gcrypt. Marc mentioned that he might work on making that one use schannel as well.

cURL

curling the metalink

metalink_logo

Back in 2005 Anthony Bryan started to work with his metalink idea, as can be read in this early 2006 article. Very simplified, Metalink is a way to tell a client how to download the same identical file from many places potentially in parallel. Anthony tells me he had the idea much earlier than so, going back to a bad experience trying to download a Fedora ISO from a download mirror…

Anthony’s and my discussions about metalink started in September 2006 and we’ve bounced countless of mails and ideas back and forth since then. Even more, we’ve become friends and we’ve worked together on several related subjects as well, including several Internet Drafts within the IETF.

We had a metalink discussion on the libcurl mailing list back in April 2008 about whether to have libcurl support it natively or not, but we (I) ended up with the conclusion that it wasn’t fit for libcurl. Basically because metalink is a layer on top of the application protocols that libcurl supports.

I wasn’t quite prepared at that time to accept the patches for the curl tool since I didn’t like all the XML stuff it would bring in and as I recall it I felt that I wasn’t prepared to deal with that extra work load at the time. I think I told the guys I wanted to wait and see and try it more at a later point.

In September that same year I blogged about Anthony’s work on getting an internet draft done for metalink. That would later in 2010 get released as RFC5854 and a year later RFC6249 came out with a way to provide all the info in HTTP headers instead of XML as the previous document was for. (Both RFCs contain acknowledgements to yours truly as contributor.)

Today

While I said metalink wasn’t really fit for libcurl, it was always fit for curl – the command line client that uses libcurl but is more of a transfer tool. During the spring 2012 Anthony and super-hacker Tatsuhiro Tsujikawa approached me and asked if perhaps we were ready for metalink in curl this time?

Yes!

Since the last time, metalink has developed as a standard and there’s now a libmetalink project to use and I felt it was a good time development wise as well. Tatsuhiro whipped up a refreshed patch in no time and soon we were polishing off the last little edges around the corners and the metalink patch set was merged into curl 7.27.0! Anthony’s and Tatsuhiro’s persistence and patience over the years are impressive. Thanks a lot my friends! That’s a little over five and a half years since the first approach until it got merged into the mainline sources. That’s nothing but pure dedication.

Usage

So, starting with curl 7.27.0 and assuming you built curl with the correct set of prereqs installed, this is how you use it:

curl --metalink [URL]

Where the URL is a URL that points to a metalink file, and then curl will download the file from one of the URLs mentioned. curl will at this point try them serially if there are multiple ones specified and not in parallel. Room for future improvements.

curl 7.27.0 will probably be released in the end of July 2012, but you can already get an early test version as a daily snapshot. We’ll appreciate all feedback you can give us!

550M users

(This text has been updated since first post. It used to say 300 million but then I missed all iOS devices…)

Ok, so here’s a little ego game. The rules are very simple: try to figure out all things I’ve written code in (to any noticeable degree) and count how many users the products that use such code might have. Then estimate the total amount of humans that may in fact use my code from time to time.

I’ve been doing software both for fun and professionally for over 20 years (my first code I made available to others was written in 1986 on the C64). But as I look back on what I’ve done at my day job for all this time, most of my labor have been hidden into some sort of devices or equipment that never really were distributed to many customers. I don’t think I’ve ever done software professionally for consumer stuff. My open source code however has found its way into all sorts of things so I decided I could limit this count to open source code I’ve done. It is also slightly easier. Or perhaps less hard. And when it comes to open source, none of my other projects is as popular and widely used as curl. Counting curl users will drown all others.

First some basic stats: the curl.haxx.se web site gets more than 12000 unique visitors every weekday. curl packages are downloaded from there at a rate of roughly 1 million times/year. The site sends over 200GB of data every month. We have no idea how large share of users who get curl from the main site, but a guess is that it is far less than half of the user base. But of course the number of downloads says nothing about how many users there are.

Mac OS X ships with curl (and libcurl?) by default. There are perhaps 86 million macs in the world.

libcurl is used in television sets and Bluray players made by at least five major brands (LG, Panasonic, Philips, Sony and Toshiba). I’m convinced they don’t use it in all models but probably just a few of their higher end internet-connected ones. 10% of the total? It seems in 2009 there were 35 million flat panel TVs sold in the US with a forecast of the sales growing slightly over the years. I figure that would mean perhaps 100 million ones sold in the US the three last years possibly made by these brands (and lets assume that includes some Blu-rays too), and lets say that is half the world market for them, it would make libcurl shipping in 20 million something TVs.

curl and libcurl are installed by default in some Linux distributions but not in all. In Debian it is an optional extra and the popcon overview shows perhaps 70% of Debian users install libcurl (and 56% use libssh2). Lets assume that’s a suitable average for all desktop Linux users. How many are we? Let’s for the sake of the argument say that 3% of all computers using the internet run Linux. Some numbers say there are 2.3 billion internet users. It would make 70 million Linux computers and thus 49 million libcurl installations. Roughly.

Open Office and the recent spin-off LibreOffice are both using libcurl. Open Office said they have 100 million users now in May 2012.

Games: Second Life, Warhammer 40000, Ghost Recon, Need for speed world, Game Face and “Saints Row: The third” all use libcurl. The first game alone boasts over 20 million registered users. I couldn’t find any numbers for any other game I know uses libcurl.

Other embedded uses: libcurl and libssh2 are both announced as supported packages of Wind River Linux, the perhaps most dominant provider of embedded Linux and another leading provider is Montavista which also offers curl and libcurl. How many users? I have absolutely no idea. I’d say more than just a few, but how many? Impossible to tell so let’s ignore that possibly huge install base. Spotify uses or at least used libcurl, and early 2012 they had 15 million users.

Phones. libcurl is shipped in iOS and WebOS and it is used by RIM and Apple for some (to me) unknown purposes. Lots of applications on Android still build and use libcurl, c-ares and libssh2 for their apps but it is just impossible to estimate how many users they get. Apple has sold 250 million iOS devices, at least. (This little number was missed by me in the calculation I first posted.)

ios-credits

Infrastructure. libcurl is used in the Tornado web server made by Friendfeed/Facebook and it is used by significant services at Yahoo.com. How many users of said services? Surely many millions. But really, that would be users of just 2 libcurl users so let’s not rush ahead and count those as direct users!

libcurl powers the very popular PHP/CURL extension that a large amount of PHP-running sites have enabled and use. How many? In 2008, 33% of all internet sites run PHP. Let’s say the share has decreased to 30% since then and the total amount of active sites is now 200M. That makes 60M PHP sites, and if there’s 10% of them using PHP/CURL we’re talking 6 million users.

Development. git, darcs, bazaar and Mercurial are all children of the distribution version control systems (some of them very popular) and they all use libcurl. How many users do they have? Since they’re all working on multiple platforms I would estimate the number of users of them collectively to be in the tens of millions range. Let’s say 10 million.

86 + 20 + 49 + 100 + 20 + 15 + 250 + 6 + 10 = 556 million users

550-million

And yes, of course a lot of these users will be the same actual human. But I may also just have counted all the numbers completely wrong to start with. I would say I’m probably within the correct magnitude!

550 million users out of the world’s 2.3 billion internet users. 1 out of 4 are using something that runs code I wrote. Kind of cool!

Sweden has a population of less than 10 million. 550 million is almost twice the entire USA, four times the population of Russia or almost eight times the population of Germany… As a comparison to some big browsers, a recent article claims Google Chrome has 200 million users in April 2012 which may be around 25% of the browser market and showing that basically none of the individual browsers have a lot more users than 300 million…

Of course I know that every single person who reads this is a knowing or unknowing user… Can you think of any other major users?

shorter HTTP requests for curl

Starting in curl 7.26.0 (due to be released at the end of May 2012), we will shrink the User-agent: header that curl sends by default in HTTP(S) requests to something much shorter! I suspect that this will raise some eyebrows out there so even though I’ve emailed about it to the curl-users list before I thought I’d better write it up and elaborate.

A default ‘curl localhost’ on Debian Linux makes 170 bytes get sent in that single request:

GET / HTTP/1.1
User-Agent: curl/7.24.0 (i486-pc-linux-gnu) libcurl/7.24.0 OpenSSL/1.0.0g zlib/1.2.6 libidn/1.23 libssh2/1.2.8 librtmp/2.3
Host: localhost
Accept: */*

As you can see, the user-agent description takes up a large portion of that request, and this for really no good reason at all. Without sacrificing any functionality I shrunk the same request down to 71 bytes:

GET / HTTP/1.1
User-Agent: curl/7.24.0
Host: localhost
Accept: */*

That means we shrunk it down to 41% of the original size. I’ll admit the example is a bit extreme and most other normal use cases will use longer host names and longer paths, but even for a URL like “https://daniel.haxx.se/docs/curl-vs-wget.html” we’re down to 50% of the original request size (100 vs 199).

Can we shrink it even more? Sure, we could leave out the version number too. I left it in there now only to allow some kind of statistics to get extracted. We can’t remove the entire header, we need to include a user-agent in requests since there are too many servers who won’t function properly otherwise.

And before anyone asks: this change is only for the curl command line tool and not for libcurl, the library. libcurl does in fact not send any user-agent at all by default…

NFS has many meanings

Today I learned that Need for speed World (I first had to google what “NFS-world” actually means) uses curl when I received this email:

From: [removed]
Subject: NFS-world

I can not go into the game for 4 months my nickname “[removed]”. it writes the error “Login failed, please try again.” Please solve this problem. Support Group does not help.

But no, I don’t know why this guy emailed me…

I then went on to look for other Electronic Arts games using libcurl, and I fell over these forum posts that clearly indicate Game Face uses it, but I found no credits or other information page online.

Can you find any other?

The updated web scraping howto

webbots-spiders-and-screen-scrapers

Web scraping is a practice that is basically as old as the web. The desire to extract contents or to machine- generate things from what perhaps was primarily intended to be presented to a browser and to humans pops up all the time.

When I first created the first tool that would later turn into curl back in 1997, it was for the purpose of scraping. When I added more protocols beyond the initial HTTP support it too was to extend its abilities to “scrape” contents for me.

I’ve not (yet!) met Michael Schrenk in person, although I’ve communicated with him back and forth over the years and back in 2007 I got a copy of his book Webbots, Spiders and Screen Scrapers in its 1st edition. Already then I liked it to the extent that I posted this positive little review on the curl-and-php mailing list saying:

this book is a rare exception and previously unmatched to my knowledge in how it covers PHP/CURL. It explains to great details on how to write web clients using PHP/CURL, what pitfalls there are, how to make your code behave well and much more.

Fast-forward to the year 2011. I was contacted by Mike and his publisher at Nostarch, and I was asked to review the book with special regards to protocol facts and curl usage. I didn’t hesitate but gladly accepted as I liked the first edition already and I believe an updated version could be useful to people.

Now, in the early 2012 Mike’s efforts have turned out into a finished second edition of his book. With updated contents and a couple of new chapters, it is refreshed and extended. The web has changed since 2007 and so has this book! I hope that my contributions didn’t only annoy Mike but possibly I helped a little bit to make it even more accurate than the original version. If you find technical or factual errors in this edition, don’t feel shy to tell me (and Mike of course) about them!