The curl project has been around for a long time by now and we’ve been through several different version control systems. The most recent switch was when we switched to git from CVS back in 2010. We were late switchers but then we’re conservative in several regards.
When we switched to git we also switched to github for the hosting, after having been self-hosted for many years before that. By using github we got a lot of services, goodies and reliable hosting at no cost. We’ve been enjoying that ever since.
However, as we have been a traditional mailing list driving project for a long time, I have previously not properly embraced and appreciated pull requests and issues filed at github since they don’t really follow the old model very good.
Just very recently I decided to stop fighting those methods and instead go with them. A quick poll among my fellow team mates showed no strong opposition and we are now instead going full force ahead in a more github embracing style. I hope that this will lower the barrier and remove friction for newcomers and allow more people to contribute easier.
As an effect of this, I would also like to encourage each and everyone who is interested in this project as a user of libcurl or as a contributor to and hacker of libcurl, to skip over to the curl github home and press the ‘watch’ button to get notified when future requests and issues appear.
We also offer this helpful guide on how to contribute to the curl project!
Some interesting Unicode URLs have recently been seen used in the wild – like in this billboard ad campaign from Coca Cola, and a friend of mine asked me about curl in reference to these and how it deals with such URLs.
(Picture by stevencoleuk)
I ran some tests and decided to blog my observations since they are a bit curious. The exact URL I tried was ‘www.😃.ws’ (not the same smiley as shown on this billboard: 😂) – it is really hard to enter by hand so now is the time to appreciate your ability to cut and paste! It appears they registered several domains for a set of different smileys.
These smileys are not really allowed IDN (where IDN means International Domain Names) symbols which make these domains a bit different. They should not (see below for details) be converted to punycode before getting resolved but instead I assume that the pure UTF-8 sequence should or at least will be fed into the name resolver function. Well, either way it should either pass in punycode or the UTF-8 string.
If curl was built to use libidn, it still won’t convert this to punycode and the verbose output says “Failed to convert www.😃.ws to ACE; String preparation failed”
curl (exact version doesn’t matter) using the stock threaded resolver
- Debian Linux (glibc 2.19) – FAIL
- Windows 7 - FAIL
- Mac OS X 10.9 – SUCCESS
But then also perhaps to no surprise, the exact same results are shown if I try to ping those host names on these systems. It works on the mac, it fails on Linux and Windows. Wget 1.16 also fails on my Debian systems (just as a reference and I didn’t try it on any of the other platforms).
My curl build on Linux that uses c-ares for name resolving instead of glibc succeeds perfectly. host, nslookup and dig all work fine with it on Linux too (as well as nslookup on Windows):
$ host www.😃.ws www.\240\159\152\131.ws has address 18.104.22.168$ ping www.😃.ws ping: unknown host www.😃.ws
While the same command sequence on the mac shows:
$ host www.😃.ws www.\240\159\152\131.ws has address 22.214.171.124$ ping www.😃.ws PING www.😃.ws (126.96.36.199): 56 data bytes 64 bytes from 188.8.131.52: icmp_seq=0 ttl=44 time=191.689 ms 64 bytes from 184.108.40.206: icmp_seq=1 ttl=44 time=191.124 ms
Slightly interesting additional tidbit: if I rebuild curl to use gethostbyname_r() instead of getaddrinfo() it works just like on the mac, so clearly this is glibc having an opinion on how this should work when given this UTF-8 hostname.
Pasting in the URL into Firefox and Chrome works just fine. They both convert the name to punycode and use “www.xn--h28h.ws” which then resolves to the same IPv4 address.
Update: as was pointed out in a comment below, the “220.127.116.11″ IP address is not the correct IP for the site. It is just the registrar’s landing page so it sends back that response to any host or domain name in the .ws domain that doesn’t exist!
What do the IDN specs say?
This is not my area of expertise. I had to consult Patrik Fältström here to get this straightened out (but please if I got something wrong here the mistake is still all mine). Apparently this smiley is allowed in RFC 3940 (IDNA2003), but that has been replaced by RFC 5890-5892 (IDNA2008) where this is DISALLOWED. If you read the spec, this is 263A.
So, depending on which spec you follow it was a valid IDN character or it isn’t anymore.
What does the libc docs say?
The POSIX docs for getaddrinfo doesn’t contain enough info to tell who’s right but it doesn’t forbid UTF-8 encoded strings. The regular glibc docs for getaddrinfo also doesn’t say anything and interestingly, the Apple Mac OS X version of the docs says just as little.
With this complete lack of guidance, it is hardly any additional surprise that the glibc gethostbyname docs also doesn’t mention what it does in this case but clearly it doesn’t do the same as getaddrinfo in the glibc case at least.
What’s on the actual site?
A redirect to www.emoticoke.com which shows a rather boring page.
I don’t know. What do you think?
“given enough eyeballs, all bugs are shallow”
The saying (also known as Linus’ law) doesn’t say that the bugs are found fast and neither does it say who finds them. My version of the law would be much more cynical, something like: “eventually, bugs are found“, emphasizing the ‘eventually’ part.
(Jim Zemlin apparently said the other day that it can work the Linus way, if we just fund the eyeballs to watch. I don’t think that’s the way the saying originally intended.)
Because in reality, many many bugs are never really found by all those given “eyeballs” in the first place. They are found when someone trips over a problem and is annoyed enough to go searching for the culprit, the reason for the malfunction. Even if the code is open and has been around for years it doesn’t necessarily mean that any of all the people who casually read the code or single-stepped over it will actually ever discover the flaws in the logic. The last few years several world-shaking bugs turned out to have existed for decades until discovered. In code that had been read by lots of people – over and over.
So sure, in the end the bugs were found and fixed. I would argue though that it wasn’t because the projects or problems were given enough eyeballs. Some of those problems were found in extremely popular and widely used projects. They were found because eventually someone accidentally ran into a problem and started digging for the reason.
Time until discovery in the curl project
I decided to see how it looks in the curl project. A project near and dear to me. To take it up a notch, we’ll look only at security flaws. Not only because they are the probably most important bugs we’ve had but also because those are the ones we have the most carefully noted meta-data for. Like when they were reported, when they were introduced and when they were fixed.
We have no less than 30 logged vulnerabilities for curl and libcurl so far through-out our history, spread out over the past 16 years. I’ve spent some time going through them to see if there’s a pattern or something that sticks out that we should put some extra attention to in order to improve our processes and code. While doing this I gathered some random info about what we’ve found so far.
On average, each security problem had been present in the code for 2100 days when fixed – that’s more than five and a half years. On average! That means they survived about 30 releases each. If bugs truly are shallow, it is still certainly not a fast processes.
Perhaps you think these 30 bugs are really tricky, deeply hidden and complicated logic monsters that would explain the time they took to get found? Nope, I would say that every single one of them are pretty obvious once you spot them and none of them take a very long time for a reviewer to understand.
This first graph (click it for the large version) shows the period each problem remained in the code for the 30 different problems, in number of days. The leftmost bar is the most recent flaw and the bar on the right the oldest vulnerability. The red line shows the trend and the green is the average.
The trend is clearly that the bugs are around longer before they are found, but since the project is also growing older all the time it sort of comes naturally and isn’t necessarily a sign of us getting worse at finding them. The average age of flaws is aging slower than the project itself.
Reports per year
How have the reports been distributed over the years? We have a fairly linear increase in number of lines of code but yet the reports were submitted like this (now it goes from oldest to the left and most recent on the right – click for the large version):
Compare that to this chart below over lines of code added in the project (chart from openhub and shows blanks in green, comments in grey and code in blue, click it for the large version):
We received twice as many security reports in 2014 as in 2013 and we got half of all our reports during the last two years. Clearly we have gotten more eyes on the code or perhaps users pay more attention to problems or are generally more likely to see the security angle of problems? It is hard to say but clearly the frequency of security reports has increased a lot lately. (Note that I here count the report year, not the year we announced the particular problems, as they sometimes were done on the following year if the report happened late in the year.)
On average, we publish information about a found flaw 19 days after it was reported to us. We seem to have became slightly worse at this over time, the last two years the average has been 25 days.
Did people find the problems by reading code?
In general, no. Sure people read code but the typical pattern seems to be that people run into some sort of problem first, then dive in to investigate the root of it and then eventually they spot or learn about the security problem.
(This conclusion is based on my understanding from how people have reported the problems, I have not explicitly asked them about these details.)
Common patterns among the problems?
I went over the bugs and marked them with a bunch of descriptive keywords for each flaw, and then I wrote up a script to see how the frequent the keywords are used. This turned out to describe the flaws more than how they ended up in the code. Out of the 30 flaws, the 10 most used keywords ended up like this, showing number of flaws and the keyword:
I don’t think it is surprising that TLS, HTTP or certificate checking are common areas of security problems. TLS and certs are complicated, HTTP is huge and not easy to get right. curl is mostly C so buffer overflows is a mistake that sneaks in, and I don’t think 27% of the problems tells us that this is a problem we need to handle better. Also, only 2 of the last 15 flaws (13%) were buffer overflows.
Sunday 13:00, embedded room (Lameere)
Embedded devices are very often network connected these days. Network connected embedded devices often need to transfer data to and from them as clients, using one or more of the popular internet protocols.
libcurl is the world’s most used and most popular internet transfer library, already used in every imaginable sort of embedded device out there. How did this happen and how do you use libcurl to transfer data to or from your device?
Note that this talk was originally scheduled to be at a different time!
Sunday, 09:00 Mozilla room (UD2.218A)
Title: HTTP/2 right now
HTTP/2 is the new version of the web’s most important and used protocol. Version 2 is due to be out very soon after FOSDEM and I want to inform the audience about what’s going on with the protocol, why it matters to most web developers and users and not the last what its status is at the time of FOSDEM.
curl and libcurl 7.40.0 was just released this morning. There’s a closer look at some of the perhaps more noteworthy changes. As usual, you can find the entire changelog on the curl web site.
HTTP over unix domain sockets
So just before the feature window closed for the pending 7.40.0 release of curl, Peter Wu’s patch series was merged that brings the ability to curl and libcurl to do HTTP over unix domain sockets. This is a feature that’s been mentioned many times through the history of curl but never previously truly implemented. Peter also very nicely adjusted the test server and made two test cases that verify the functionality.
To use this with the curl command line, you specify the socket path to the new –unix-domain option and assuming your local HTTP server listens on that socket, you’ll get the response back just as with an ordinary TCP connection.
Doing the operation from libcurl means using the new CURLOPT_UNIX_SOCKET_PATH option.
This feature is actually not limited to HTTP, you can do all the TCP-based protocols except FTP over the unix domain socket, but it is to my knowledge only HTTP that is regularly used this way. The reason FTP isn’t supported is of course its use of two connections which would be even weirder to do like this.
SMB is also known as CIFS and is an old network protocol from the Microsoft world access files. curl and libcurl now support this protocol with SMB:// URLs thanks to work by Bill Nagel and Steve Holme.
Last year we had a large amount of security advisories published (eight to be precise), and this year we start out with two fresh ones already on the 8th day… The ones this time were of course discovered and researched already last year.
CVE-2014-8151 is a way we accidentally allowed an application to bypass the TLS server certificate check if a TLS Session-ID was already cached for a non-checked session – when using the Mac OS SecureTransport SSL backend.
CVE-2014-8150 is a URL request injection. When letting curl or libcurl speak over a HTTP proxy, it would copy the URL verbatim into the HTTP request going to the proxy, which means that if you craft the URL and insert CRLFs (carriage returns and linefeed characters) you can insert your own second request or even custom headers into the request that goes to the proxy.
You may enjoy taking a look at the curl vulnerabilities table.
Bugs bugs bugs
The release notes mention no less than 120 specific bug fixes, which in comparison to other releases is more than average.
- Popular corner stones of open source stacks and internet servers
- Mostly run and maintained by volunteers
- Mature projects that have been around since “forever”
- Projects believed to be fairly stable and relatively trustworthy by now
- A myriad of features, switches and code that build on many platforms, with some parts of code only running on a rare few
- Written in C in a portable style
Does it sound like the curl project to you too? It does to me. Sure, this description also matches a slew of other projects but I lead the curl development so let me stay here and focus on this project.
Are we in jeopardy? I honestly don’t know, but I want to explain what we do in our project in order to minimize the risk and maximize our ability to find problems on our own before they become serious attack vectors somewhere!
There’s no secret that we have let security problems slip through at times. We’re right now working toward our 143rd release during our around 16 years of life-time. We have found and announced 28 security problems over the years. Looking at these found problems, it is clear that very few security problems are discovered quickly after introduction. Most of them linger around for several years until found and fixed. So, realistically speaking based on history: there are security bugs still in the code, and they have probably been present for a while already.
code reviews and code standards
We try to review all patches from people without push rights in the project. It would probably be a good idea to review all patches before they go in for real, but that just wouldn’t work with the (lack of) man power we have in the project while we at the same time want to develop curl, move it forward and introduce new things and features.
We maintain code standards and formatting to keep code easy to understand and follow. We keep individual commits smallish for easier review now or in the future.
As simple as it is, we test that the basic stuff works. We don’t and can’t test everything but having test cases for most things give us the confidence to change code when we see problems as we then remain fairly sure things keep working the same way as long as the test go through. In projects with much less test coverage, you become much more conservative with what you dare to change and that also makes you more vulnerable.
We always want more test cases and we want to improve on how we always add test cases when we add new features and ideally we should also add new test cases when we fix bugs so that we know that we don’t introduce any such bug again in the future.
static code analyzes
We regularly scan our code base using static code analyzers. Both clang-analyzer and coverity are good tools, and they help us by pointing out code that look wrong or suspicious. By making sure we have very few or no such flaws left in the code, we minimize the risk. A static code analyzer is better than run-time tools for cases where they can check code flows that are hard to repeat in my local environment.
Valgrind is an awesome tool to detect memory problems in run-time. Leaks or just stupid uses of memory or related functions. We have our test suite automatically use valgrind when it runs tests in case it is present and it helps us make sure that all situations we test for are also error-free from valgrind’s point of view.
Building and testing curl on a plethora of platforms non-stop is also useful to make sure we don’t depend on behaviors of particular library implementations or non-standard features and more. Testing it all is basically the only way to make sure everything keeps working over the years while we continue to develop and fix bugs. We would course be even better off with more platforms that would test automatically and with more developers keeping an eye on problems that show up there…
Arguably, one of the best ways to avoid security flaws and bugs in general, is to keep the source code as simple as possible. Complex functions need to be broken down into smaller functions that are possible to read and understand. A good way to identify functions suitable for fixing is pmccabe,
essential third parties
curl and libcurl are usually built to use a whole bunch of third party libraries in order to perform all the functionality. In order to not have any of those uses turn into a source for trouble we must of course also participate in those projects and help them stay strong and make sure that we use them the proper way that doesn’t lead to any bad side-effects.
You can help!
All this takes time, energy and system resources. Your contributions and help will be appreciated where ever among these tasks that you can insert any. We could do more of all this, more often and more thorough if we only were more people involved!
On October 25, 2005 I sent out the announcement about “libcurl funding from the Swedish IIS Foundation“. It was the beginning of what would eventually become the curl_multi_socket_action() function and its related API features. The API we provide for event-driven applications. This API is the most suitable one in libcurl if you intend to scale up your client up to and beyond hundreds or thousands of simultaneous transfers.
Thanks to this funding from IIS, I could spend a couple of months working full-time on implementing the ideas I had. They paid me the equivalent of 19,000 USD back then. IIS is the non-profit foundation that runs the .se TLD and they fund projects that help internet and internet usage, in particular in Sweden. IIS usually just call themselves “.se” (dot ess ee) these days.
Event-based programming isn’t generally the easiest approach so most people don’t easily take this route without careful consideration, and also if you want your event-based application to be portable among multiple platforms you also need to use an event-based library that abstracts the underlying function calls. These are all reasons why this remains a niche API in libcurl, used only by a small portion of users. Still, there are users and they seem to be able to use this API fine. A success in my eyes.
Part of that improvement project to make libcurl scale and perform better, was also to introduce HTTP pipelining support. I didn’t quite manage that part with in the scope of that project but the pipelining support in libcurl was born in that period (autumn 2006) but had to be improved several times over the years until it became decently good just a few years ago – and we’re just now (still) fixing more pipelining problems.
On December 10, 2014 there are exactly 3333 days since that initial announcement of mine. I’d like to highlight this occasion by thanking IIS again. Thanks IIS!
These days I’m spending a part of my daytime job working on curl with my employer’s blessing and that’s the funding I have – most of my personal time spent is still spare time. I certainly wouldn’t mind seeing others help out, but the best funding is provided as pure man power that can help out and not by trying to buy my time to add your features. Also, I will decline all (friendly) offers to host the web site on your servers since we already have a fairly stable and reliable infrastructure sponsored.
I’m not aware of anyone else that are spending (much) paid work time on curl code, although I’m know there are quite a few who do it every now and then – especially to fix problems that occur in commercial products or services or to add features to such.
IIS still donates money to internet related projects in Sweden but I never applied for any funding from them again. Mostly because it has been hard to sync with my normal life and job situation. If you’re a Swede or just live in Sweden, do consider checking this out for your next internet adventure!
(Recap: I founded the curl project, I am still the lead developer and maintainer)
When asking curl to get a URL it’ll send the output to stdout by default. You can of course easily change this behavior with options or just using your shell’s redirect feature, but without any option it’ll spew it out to stdout. If you’re invoking the command line on a shell prompt you’ll immediately get to see the response as soon as it arrives.
I decided curl should work like this, and it was a natural decision I made already when I worked on the predecessors during 1997 or so that later would turn into curl.
On Unix systems there’s a common mantra that “everything is a file” but also in fact that “everything is a pipe”. You accomplish things on Unix by piping the output of one program into the input of another program. Of course I wanted curl to work as good as the other components and I wanted it to blend in with the rest. I wanted curl to feel like cat but for a network resource. And cat is certainly not the only pre-curl command that writes to stdout by default; they are plentiful.
And then: once I had made that decision and I released curl for the first time on March 20, 1998: the call was made. The default was set. I will not change a default and hurt millions of users. I rather continue to be questioned by newcomers, but now at least I can point to this blog post! :-)
About the wget rivalry
As I mention in my curl vs wget document, a very common comment to me about curl as compared to wget is that wget is “easier to use” because it needs no extra argument in order to download a single URL to a file on disk. I get that, if you type the full commands by hand you’ll use about three keys less to write “wget” instead of “curl -O”, but on the other hand if this is an operation you do often and you care so much about saving key presses I would suggest you make an alias anyway that is even shorter and then the amount of options for the command really doesn’t matter at all anymore.
I put that argument in the same category as the people who argue that wget is easier to use because you can type it with your left hand only on a qwerty keyboard. Sure, that is indeed true but I read it more like someone trying to come up with a reason when in reality there’s actually another one underneath. Sometimes that other reason is a philosophical one about preferring GNU software (which curl isn’t) or one that is licensed under the GPL (which wget is) or simply that wget is what they’re used to and they know its options and recognize or like its progress meter better.
I enjoy our friendly competition with wget and I seriously and honestly think it has made both our projects better and I like that users can throw arguments in our face like “but X can do Y”and X can alter between curl and wget depending on which camp you talk to. I also really like wget as a tool and I am the occasional user of it, just like most Unix users. I contribute to the wget project well, both with code and with general feedback. I consider myself a friend of the current wget maintainer as well as former ones.