Category Archives: Open Source

Open Source, Free Software, and similar

The curl user survey 2017

May 11, 2017 Daniel Stenberg 2 Comments

The annual survey for curl and libcurl users is open. The 2017 edition has some minor edits since last year but is mostly the same set of questions used before. To help us detect changes and trends over time.

If you use curl or libcurl, in any way, shape or form, please consider spending a few minutes of your precious time on this. Your input helps us understand where we are and in which direction we should go next.

Fill in the form!

The poll is open fourteen days from Friday May 12th until midnight (CEST) May 27th 2017. All data we collect is non-personal and anonymous.

To get some idea of what sort of information we extract and collect from the results, have a look at the analysis of last year’s survey.

The subsequent analysis of the 2017 user survey.

Everything curl – printed!

May 10, 2017 Daniel Stenberg 1 Comment

TLDR: fill in your info in this form if you want to buy a print copy!

Long time curl friend and contributor Dan Fandrich printed a (very limited) first edition of Everything curl on real actual dead-tree paper a while ago. Getting this rather heavy thing in your hand is actually an awesome feeling and quite different to just reading it on a screen!

However, those few initial copies were quickly given away to interested readers and there are none left now.

We are now investigating if there is still interest from people in getting one of these physical, hard copy versions, of the book. The price is likely to be about 20 Euros including International shipping. The first edition of the book is a 232 page professionally-printed and bound softcover book. The second edition is planned to be very similar.

The content of the first edition book was picked from the book’s git repository in March 2017 and is not the intended final version of the book. Who knows if there will ever be a final version. There are ‘tbd’ markers on many places in the book where additional content is meant to be added in a future.

To sign up for your own copy of the book and you are willing to pay around 20 Euros for one, please fill in your contact information in this Google form, and we if we get enough proof of interest we might get a second edition printed.

You buy this book because you want a physical version of it. All the contents is already available for free online, in PDF version and in two e-book formats. The money charged for the book will not go to the curl project but is for printing and shipping.

cURL and libcurl

Improving timers in libcurl

May 10, 2017 Daniel Stenberg

A few years ago I explained the timer and timeout concepts of the libcurl internals. A decent prerequisite for this post really.

Today, I introduced “timer IDs” internally in libcurl so that all expiring timers we use has to specify which timer it is with an ID, and we only have a set number of IDs to select from. Turns out there are only about 10 of them. 10 different ones. Per easy handle.

With this change, we now only allow one running timer for each ID, which then makes it possible for us to change timers during execution so that they never “fire” in vain like they used to do (since we couldn’t stop or remove them them before they expired previously). This change makes event loops slightly more efficient since now they will end up getting much fewer “spurious” timeouts that happen only because we had no infrastructure to prevent them.

Another benefit from only keeping one single timer for each ID in the list of timers, is that the dynamic list of running timers immediately become much shorter. This, because many times the same timer ID was used again and we would then add a new node to the list so the timer that had one purpose would expire twice (or more). But now it no longer works like that. In some typical uses I’ve tested, I’ve seen the list shrink from a maximum of 7-8 nodes down to a maximum of 1 or 2.

Finally, since we now have a finite number of timers that can be set at any given time and we know the maximum amount is fairly small, I could make the timer code completely skip using dynamic memory. Allocating 10 tiny structs as part of the main handle allocation is more efficient than doing tiny mallocs() for them one by one. In a basic comparison test I’ve run, this reduced the total number of allocations from 82 to 72 for “curl localhost”.

This change will be included in the pending curl release targeted to ship on June 14th 2017. Possibly called version 7.54.1.

Those are all in a tree

As explained previously: the above explanation of timers goes for the set of timers kept for each individual easy handle, and with libcurl you can add an unlimited amount of easy handles to a multi handle (to perform lots of transfers in parallel) and then the multi handle has a self-balanced splay tree with the nearest-in-time timer for each individual easy handle as nodes in the tree, so that it can quickly and easily figure out which handle that needs attention next and when in time that is.

The illustration below shows a captured imaginary moment in time when there are five easy handles in different colors, all doing their own separate transfers, Each easy handle has three private timers set. The tree contains five nodes and the root of the tree is the node representing the the easy handle that needs to be taken care of next (in time). It also means we immediately know exactly how long time there is left until libcurl needs to act next.

Expiry

As soon “time N” occurs or expires, libcurl takes care of what the yellow handle needs to do and then removes that timer node from the tree. The yellow handle then advances the next timer first in line and the tree gets re-adjusted accordingly so that the new yellow first-node gets re-inserted positioned at the right place in the tree.

In our imaginary case here, the new yellow time N (formerly known as N + 1) is now later in time than L, but before M. L is now nearest in time and the tree has now adjusted to look something like this:

Since the tree only really cares about the root timer for each handle, you also see how adding a new timeout to single easy handle that isn’t the next in time is a really quick operation. It just adds a node in a linked list – per that specific handle. The linked list which now has a maximum length that is capped to the total amount of different timers: 10.

Straight-forward!

Blogging, cURL and libcurl, Haxx, Network, Open Source

A curl delivery network

May 2, 2017 Daniel Stenberg 1 Comment

I’ve run my own public web sites on hardware I’ve administered myself for over twenty years now. I’ve hosted the curl web site myself since it’s inception.

The curl web site at curl.haxx.se has recently been delivering roughly 1.5 terabyte of data to the world per month. The CA bundle we convert to PEM from the Mozilla source code, is alone downloaded more than 100,000 times per day. Occasional blog entries I’ve posted here on my blog have climbed very fast on popular sites such as Hacker news and Reddit, and have resulted in intense visitor storms hitting this same server – sometimes reaching visitor counts above 200,000 “uniques” – most of them within the first few hours of the publication. At times, those visitor spikes have effectively brought the server to its knees.

Yes, my personal web site and the curl web site are both sharing the same physical server. It also hosts more than a dozen other sites and numerous services for our own pleasures and fun, providing services for a handful of different open source projects. So when the server has to cease doing work because it runs out of memory or hits other resource restraints, that causes interruptions all over. Oh yes, and my email doesn’t reach me.

Inconvenient and annoying.

The server

Haxx owns and runs this co-located server that we have a busload of web servers on – for the good of the projects and people that run things on it. This machine’s worst bottle neck is available RAM memory and perhaps I/O performance. Every time the server goes down to a crawl due to network traffic overload we discuss how we should upgrade it. Installing a new machine and transferring over all the sites and services is work. Work that none of us at Haxx are very happy to volunteer to do. So it hasn’t been done yet, and frankly the server handles the daily load just fine and without even a blink. Which is ninety nine point something percent of the time…

Haxx pays for a certain amount of network traffic so as long as we’re below some threshold we remain paying the same monthly fee. We don’t want to increase the traffic by magnitudes as that would cost more.

The specific machine, that sits deep inside a server room in Stockholm Sweden, is a five(?) years old Dell Poweredge E310, Intel Xeon X3440 2.53GHz with 8GB ram, This model is shown on the image at the top.

Alternatives that hasn’t helped

Why not a mirror system? We had a fair amount of curl site mirrors a few years ago, but it never worked well because they were always less reliable than the main site and they often turned stale and out of sync with the master site which eventually just hurt users.

They also trick visitors into bookmark or otherwise go back to the mirror site instead of the real one and there were always the annoying people who couldn’t resist but to fill the mirror with ads and stuff. Plus, they didn’t help much with with the storms to the main site.

Why not a cloud server? Because with the amount of services, servers and various things we do on our server, it would be inconvenient and expensive. But perhaps even more because we started out like this so we have invested time and energy into the infrastructure as it works right now. And I enjoy rowing my own boat!

The CDN

Fastly reached out and graciously offered to help us handle the load. Both on the account of traffic amounts but also to save our machine from struggling this hard the next time I’ll write something that tickles people’s curiosity (or rage) to that level when several thousands of visitors want to read the same article at the same time.

Starting now, the curl.haxx.se and the daniel.haxx.se web sites are fronted by Fastly. It should give web site visitors from all over the world faster response times and it will make the site more reliable and less likely to have problems due to traffic load going forward.

In case you’re not familiar with what a CDN is, a simplified explanation would say it is a globally distributed network of reverse proxy servers deployed in multiple data centers. These CDN servers front the Internet and will to the largest extent possible serve the visitors with the right content directly from their own caches instead of them reaching the actual lowly backend server I run that hosts the original content. Fastly has lots of servers across the globe for this purpose. Users who are a long way away from Sweden will probably be the ones who will notice this change the most, as you may suddenly find haxx.se content much closer (network round-trip wise) than before.

Standards

These new servers will host the sites over HTTPS just like before, and they will require TLS 1.2 and SNI. They will work over IPv6 and support HTTP/2. Network standard wise, there shouldn’t be any step down – and honestly, I haven’t exactly been on the cutting edge of these technologies myself for these sites in the past.

Editing the site

We will keep editing and maintaining the site like before. It is made up of an old system with templates and include files that generate mostly static web pages. The site is mostly available on github and using that, you can build a local version for development and trying out changes before they land.

Hopefully, this move to Fastly will only make the site faster and more reliable. If you notice any glitches or experience any problems with the site, please let us know!

cURL and libcurl, Network

Fewer mallocs in curl

April 22, 2017 Daniel Stenberg 16 Comments

Today I landed yet another small change to libcurl internals that further reduces the number of small mallocs we do. This time the generic linked list functions got converted to become malloc-less (the way linked list functions should behave, really).

Instrument mallocs

I started out my quest a few weeks ago by instrumenting our memory allocations. This is easy since we have our own memory debug and logging system in curl since many years. Using a debug build of curl I run this script in my build dir:

#!/bin/sh
export CURL_MEMDEBUG=$HOME/tmp/curlmem.log
./src/curl http://localhost
./tests/memanalyze.pl -v $HOME/tmp/curlmem.log

For curl 7.53.1, this counted about 115 memory allocations. Is that many or a few?

The memory log is very basic. To give you an idea what it looks like, here’s an example snippet:

MEM getinfo.c:70 free((nil))
MEM getinfo.c:73 free((nil))
MEM url.c:294 free((nil))
MEM url.c:297 strdup(0x559e7150d616) (24) = 0x559e73760f98
MEM url.c:294 free((nil))
MEM url.c:297 strdup(0x559e7150d62e) (22) = 0x559e73760fc8
MEM multi.c:302 calloc(1,480) = 0x559e73760ff8
MEM hash.c:75 malloc(224) = 0x559e737611f8
MEM hash.c:75 malloc(29152) = 0x559e737a2bc8
MEM hash.c:75 malloc(3104) = 0x559e737a9dc8

Check the log

I then studied the log closer and I realized that there were many small memory allocations done from the same code lines. We clearly had some rather silly code patterns where we would allocate a struct and then add that struct to a linked list or a hash and that code would then subsequently add yet another small struct and similar – and then often do that in a loop. (I say we here to avoid blaming anyone, but of course I myself am to blame for most of this…)

Those two allocations would always happen in pairs and they would be freed at the same time. I decided to address those. Doing very small (less than say 32 bytes) allocations is also wasteful just due to the very large amount of data in proportion that will be used just to keep track of that tiny little memory area (within the malloc system). Not to mention fragmentation of the heap.

So, fixing the hash code and the linked list code to not use mallocs were immediate and easy ways to remove over 20% of the mallocs for a plain and simple ‘curl http://localhost’ transfer.

At this point I sorted all allocations based on size and checked all the smallest ones. One that stood out was one we made in curl_multi_wait(), a function that is called over and over in a typical curl transfer main loop. I converted it over to use the stack for most typical use cases. Avoiding mallocs in very repeatedly called functions is a good thing.

Recount

Today, the script from above shows that the same “curl localhost” command is down to 80 allocations from the 115 curl 7.53.1 used. Without sacrificing anything really. An easy 26% improvement. Not bad at all!

But okay, since I modified curl_multi_wait() I wanted to also see how it actually improves things for a slightly more advanced transfer. I took the multi-double.c example code, added the call to initiate the memory logging, made it uses curl_multi_wait() and had it download these two URLs in parallel:

http://www.example.com/
http://localhost/512M

The second one being just 512 megabytes of zeroes and the first being a 600 bytes something public html page. Here’s the count-malloc.c code.

First, I brought out 7.53.1 and built the example against that and had the memanalyze script check it:

Mallocs: 33901
Reallocs: 5
Callocs: 24
Strdups: 31
Wcsdups: 0
Frees: 33956
Allocations: 33961
Maximum allocated: 160385

Okay, so it used 160KB of memory totally and it did over 33,900 allocations. But ok, it downloaded over 512 megabytes of data so it makes one malloc per 15KB of data. Good or bad?

Back to git master, the version we call 7.54.1-DEV right now – since we’re not quite sure which version number it’ll become when we release the next release. It can become 7.54.1 or 7.55.0, it has not been determined yet. But I digress, I ran the same modified multi-double.c example again, ran memanalyze on the memory log again and it now reported…

Mallocs: 69
Reallocs: 5
Callocs: 24
Strdups: 31
Wcsdups: 0
Frees: 124
Allocations: 129
Maximum allocated: 153247

I had to look twice. Did I do something wrong? I better run it again just to double-check. The results are the same no matter how many times I run it…

33,961 vs 129

curl_multi_wait() is called a lot of times in a typical transfer, and it had at least one of the memory allocations we normally did during a transfer so removing that single tiny allocation had a pretty dramatic impact on the counter. A normal transfer also moves things in and out of linked lists and hashes a bit, but they too are mostly malloc-less now. Simply put: the remaining allocations are not done in the transfer loop so they’re way less important.

The old curl did 263 times the number of allocations the current does for this example. Or the other way around: the new one does 0.37% the number of allocations the old one did…

As an added bonus, the new one also allocates less memory in total as it decreased that amount by 7KB (4.3%).

Are mallocs important?

In the day and age with many gigabytes of RAM and all, does a few mallocs in a transfer really make a notable difference for mere mortals? What is the impact of 33,832 extra mallocs done for 512MB of data?

To measure what impact these changes have, I decided to compare HTTP transfers from localhost and see if we can see any speed difference. localhost is fine for this test since there’s no network speed limit, but the faster curl is the faster the download will be. The server side will be equally fast/slow since I’ll use the same set for both tests.

I built curl 7.53.1 and curl 7.54.1-DEV identically and ran this command line:

curl http://localhost/80GB -o /dev/null

80 gigabytes downloaded as fast as possible written into the void.

The exact numbers I got for this may not be totally interesting, as it will depend on CPU in the machine, which HTTP server that serves the file and optimization level when I build curl etc. But the relative numbers should still be highly relevant. The old code vs the new.

7.54.1-DEV repeatedly performed 30% faster! The 2200MB/sec in my build of the earlier release increased to over 2900 MB/sec with the current version.

The point here is of course not that it easily can transfer HTTP over 20 Gigabit/sec using a single core on my machine – since there are very few users who actually do that speedy transfers with curl. The point is rather that curl now uses less CPU per byte transferred, which leaves more CPU over to the rest of the system to perform whatever it needs to do. Or to save battery if the device is a portable one.

On the cost of malloc: The 512MB test I did resulted in 33832 more allocations using the old code. The old code transferred HTTP at a rate of about 2200MB/sec. That equals 145,827 mallocs/second – that are now removed! A 600 MB/sec improvement means that curl managed to transfer 4300 bytes extra for each malloc it didn’t do, each second.

Was removing these mallocs hard?

Not at all, it was all straight forward. It is however interesting that there’s still room for changes like this in a project this old. I’ve had this idea for some years and I’m glad I finally took the time to make it happen. Thanks to our test suite I could do this level of “drastic” internal change with a fairly high degree of confidence that I don’t introduce too terrible regressions. Thanks to our APIs being good at hiding internals, this change could be done completely without changing anything for old or new applications.

(Yeah I haven’t shipped the entire change in a release yet so there’s of course a risk that I’ll have to regret my “this was easy” statement…)

Caveats on the numbers

There have been 213 commits in the curl git repo from 7.53.1 till today. There’s a chance one or more other commits than just the pure alloc changes have made a performance impact, even if I can’t think of any.

More?

Are there more “low hanging fruits” to pick here in the similar vein?

Perhaps. We don’t do a lot of performance measurements or comparisons so who knows, we might do more silly things that we could stop doing and do even better. One thing I’ve always wanted to do, but never got around to, was to add daily “monitoring” of memory/mallocs used and how fast curl performs in order to better track when we unknowingly regress in these areas.

Addendum, April 23rd

(Follow-up on some comments on this article that I’ve read on hacker news, Reddit and elsewhere.)

Someone asked and I ran the 80GB download again with ‘time’. Three times each with the old and the new code, and the “middle” run of them showed these timings:

Old code:

real    0m36.705s
user    0m20.176s
sys     0m16.072s

New code:

real    0m29.032s
user    0m12.196s
sys     0m12.820s

The server that hosts this 80GB file is a standard Apache 2.4.25, and the 80GB file is stored on an SSD. The CPU in my machine is a core-i7 3770K 3.50GHz.

Someone also mentioned alloca() as a solution for one of the patches, but alloca() is not portable enough to work as the sole solution, meaning we would have to do ugly #ifdef if we would want to use alloca() there.

cURL and libcurl, Security

curl bug bounty

April 19, 2017 Daniel Stenberg 1 Comment

The curl project is a project driven by volunteers with no financing at all except for a few sponsors who pay for the server hosting and for contributors to work on features and bug fixes on work hours. curl and libcurl are used widely by companies and commercial software so a fair amount of work is done by people during paid work hours.

This said, we don’t have any money in the project. Nada. Zilch. We can’t pay bug bounties or hire people to do specific things for us. We can only ask people or companies to volunteer things or services for us.

This is not a complaint – far from it. It works really well and we have a good stream of contributions, bugs reports and more. We are fortunate enough to make widely used software which gives our project a certain impact in the world.

Bug bounty!

Hacker One coordinates a bug bounty program for flaws that affects “the Internet”, and based on previously paid out bounties, serious flaws in libcurl match that description and can be deemed worthy of bounties. For example, 3000 USD was paid for libcurl: URL request injection (the curl advisory for that flaw) and 1000 USD was paid for libcurl duphandle read out of bounds (the corresponding curl advisory).

I think more flaws in libcurl could’ve met the criteria, but I suspect more people than me haven’t been aware of this possibility for bounties.

I was glad to find out that this bounty program pays out money for libcurl issues and I hope it will motivate people to take an extra look into the inner workings of libcurl and help us improve.

What qualifies?

The bounty program is run and administered completely out of control or insight from the curl project itself and I must underscore that while libcurl issues can qualify, their emphasis is on fixing vulnerabilities in Internet software that have a potentially big impact.

To qualify for this bounty, vulnerabilities must meet the following criteria:

Be implementation agnostic: the vulnerability is present in implementations from multiple vendors or a vendor with dominant market share. Do not send vulnerabilities that only impact a single website, product, or project.
Be open source: finding manifests itself in at least one popular open source project.

In addition, vulnerabilities should meet most of the following criteria:

Be widespread: vulnerability manifests itself across a wide range of products, or impacts a large number of end users.
Have critical impact: vulnerability has extreme negative consequences for the general public.
Be novel: vulnerability is new or unusual in an interesting way.

If your libcurl security flaw matches this, go ahead and submit your request for a bounty. If you’re at a company using libcurl at scale, consider joining that program as a bounty sponsor!

Firefox, Network, Technology

Talk: web transport, today and tomorrow

April 13, 2017 Daniel Stenberg

At the Netnod spring meeting 2017 in Stockholm on the 5th of April I did a talk with the title of this post.

Why was HTTP/2 introduced, how well has HTTP/2 been deployed and used, did it deliver on its promises, where doesn’t HTTP/2 perform as well. Then a quick (haha) overview on what QUIC is and how it intends to fix some of the shortcomings of HTTP/2 and TCP. In 28 minutes.

cURL and libcurl, Development, Open Source, Security

Yes C is unsafe, but…

March 30, 2017 Daniel Stenberg 25 Comments

I posted curl is C a few days ago and it raced on hacker news, reddit and elsewhere and got well over a thousand comments in those forums alone. The blog post has been read more than 130,000 times so far.

Addendum a few days later

Many commenters of my curl is C post struck down on my claim that most of our security flaws aren’t due to curl being written in C. It turned out into some sort of CVE counting game in some of the threads.

I think that’s missing the point I was trying to make. Even if 75% of them happened due to us using C, that fact alone would still not be a strong enough reason for me to reconsider our language of choice (at this point in time). We use C for a whole range of reasons as I tried to lay out there in spite of the security challenges the language brings. We know C has tricky corners and we know we are likely to do more mistakes going forward.

curl is currently one of the most distributed and most widely used software components in the universe, be it open or proprietary and there are easily way over three billion instances of it running in appliances, servers, computers and devices across the globe. Right now. In your phone. In your car. In your TV. In your computer. Etc.

If we then have had 40, 50 or even 60 security problems because of us using C, through-out our 19 years of history, it really isn’t a whole lot given the scale and time we’re talking about here.

Using another language would’ve caused at least some problems due to that language, plus I feel a need to underscore the fact that none of the memory safe languages anyone would suggest we should switch to have been around for 19 years. A portion of our security bugs were even created in our project before those alternatives you would suggest were available! Let alone as stable and functional alternatives.

This is of course no guarantee that there isn’t still more ugly things to discover or that we won’t mess up royally in the future, but who will throw the first stone when it comes to that? We will continue to work hard on minimizing risks, detecting problems early by ourselves and work closely together with everyone who reports suspected problems to us.

Number of problems as a measurement

The fact that we have 62 CVEs to date (and more will follow surely) is rather a proof that we work hard on fixing bugs, that we have an open process that deals with the problems in the most transparent way we can think of and that people are on their toes looking for these problems. You should not rate a project in any way purely based on the number of CVEs – you really need to investigate what lies behind the numbers if you want to understand and judge the situation.

Future

Let me clarify this too: I can very well imagine a future where we transition to another language or attempt various others things to enhance the project further – security wise and more. I’m not really ruling anything out as I usually only have very vague ideas of what the future might look like. I just don’t expect it to be happening within the next few years.

These “you should switch language” remarks are strangely enough from the backseat drivers of the Internet. Those who can tell us with confidence how to run our project but who don’t actually show us any code.

Languages

What perhaps made me most sad in the aftermath of said previous post, is everyone who failed to hold more than one thought at a time in their heads. In my post I wrote 800 words on some of the reasoning behind us sticking to the language C in the curl project. I specifically did not say that I dislike certain other languages or that any of those alternative languages are bad or should be avoided. Please friends, I wrote about why curl uses C. There are many fine languages out there and you should all use them as much as you possibly can, and I will too – but not in the curl project (at the moment). So no, I don’t hate language XXXX. I didn’t say so, and I didn’t imply it either. Don’t put that label on me, thanks.

cURL and libcurl, Development, Open Source

curl is C

March 27, 2017 Daniel Stenberg 29 Comments

For some reason, this post got picked up again and is debated today in 2021, almost 4 years since I wrote it. Some things have changed in the mean time and I might’ve phrased a few things differently if I had written this today. But still, what’s here below is what I wrote back then. Enjoy!

Every once in a while someone suggests to me that curl and libcurl would do better if rewritten in a “safe language”. Rust is one such alternative language commonly suggested. This happens especially often when we publish new security vulnerabilities. (Update: I think Rust is a fine language! This post and my stance here has nothing to do with what I think about Rust or other languages, safe or not.)

curl is written in C

The curl code guidelines mandate that we stick to using C89 for any code to be accepted into the repository. C89 (sometimes also called C90) – the oldest possible ANSI C standard. Ancient and conservative.

C is everywhere

This fact has made it possible for projects, companies and people to adopt curl into things using basically any known operating system and whatever CPU architecture you can think of (at least if it was 32bit or larger). No other programming language is as widespread and easily available for everything. This has made curl one of the most portable projects out there and is part of the explanation for curl’s success.

The curl project was also started in the 90s, even long before most of these alternative languages you’d suggest, existed. Heck, for a truly stable project it wouldn’t be responsible to go with a language that isn’t even old enough to start school yet.

Everyone knows C

Perhaps not necessarily true anymore, but at least the knowledge of C is very widespread, where as the current existing alternative languages for sure have more narrow audiences or amount of people that master them.

C is not a safe language

Does writing safe code in C require more carefulness and more “tricks” than writing the same code in a more modern language better designed to be “safe” ? Yes it does. But we’ve done most of that job already and maintaining that level isn’t as hard or troublesome.

We keep scanning the curl code regularly with static code analyzers (we maintain a zero Coverity problems policy) and we run the test suite with valgrind and address sanitizers.

C is not the primary reason for our past vulnerabilities

There. The simple fact is that most of our past vulnerabilities happened because of logical mistakes in the code. Logical mistakes that aren’t really language bound and they would not be fixed simply by changing language.

Of course that leaves a share of problems that could’ve been avoided if we used another language. Buffer overflows, double frees and out of boundary reads etc, but the bulk of our security problems has not happened due to curl being written in C.

C is not a new dependency

It is easy for projects to add a dependency on a library that is written in C since that’s what operating systems and system libraries are written in, still today in 2017. That’s the default. Everyone can build and install such libraries and they’re used and people know how they work.

A library in another language will add that language (and compiler, and debugger and whatever dependencies a libcurl written in that language would need) as a new dependency to a large amount of projects that are themselves written in C or C++ today. Those projects would in many cases downright ignore and reject projects written in “an alternative language”.

curl sits in the boat

In the curl project we’re deliberately conservative and we stick to old standards, to remain a viable and reliable library for everyone. Right now and for the foreseeable future. Things that worked in curl 15 years ago still work like that today. The same way. Users can rely on curl. We stick around. We don’t knee-jerk react to modern trends. We sit still in the boat. We don’t rock it.

Rewriting means adding heaps of bugs

The plain fact, that also isn’t really about languages but is about plain old software engineering: translating or rewriting curl into a new language will introduce a lot of bugs. Bugs that we don’t have today.

Not to mention how rewriting would take a huge effort and a lot of time. That energy can instead today be spent on improving curl further.

What if

If I would start the project today, would I’ve picked another language? Maybe. Maybe not. If memory safety and related issues was the primary concern I had, then sure. But as I’ve mentioned above there are several others concerns too so it would really depend on my priorities.

Finally

At the end of the day the question that remains is: would we gain more than we would pay, and over which time frame? Who would gain and who would lose?

I’m sure that there will be or it may even already exist, curl and libcurl competitors and potent alternatives written in most of these new alternative languages. Some of them are absolutely really good and will get used and reach fame and glory. Some of them will be crap. Just like software always work. Let a thousand curl competitors bloom!

Will curl be rewritten at some point in the future? I won’t rule it out, but I find it unlikely. I find it even more unlikely that it will happen in the short term or within the next few years.

Discuss this post on Hacker news or Reddit!

Followup-post: Yes, C is unsafe, but…

cURL and libcurl

curlup 2017: curl now

March 20, 2017 Daniel Stenberg

At curlup 2017 in Nuremberg, I did a keynote and talked a little about the road to what we are and where we are right now in the curl project. There will hopefully be a recording of this presentation made available soon, but I wanted to entertain you all by also presenting some of the graphs from that presentation in a blog format for easy access and to share the information.

Some stats and numbers from the curl project early 2017. Unless otherwise mentioned, this is based on the availability of data that we have. The git repository has data from December 1999 and we have detailed release information since version 6.0 (September 13, 1999).

Web traffic

First out, web site traffic to curl.haxx.se over the seven last full years that I have stats for. The switch to a HTTPS-only site happened in February 2016. The main explanation to the decrease in spent bandwidth in 2016 is us removing the HTML and PDF versions of all documentation from the release tarballs (October 2016).

My log analyze software also tries to identify “human” traffic so this graph should not include the very large amount of bots and automation that hits our site. In total we serve almost twice the amount of data to “bots” than to human. A large share of those download the cacert.pem file we host.

Since our switch to HTTPS we have a 301 redirect from the HTTP site, and we still suffer from a large number of user-agents hitting us over and over without seemingly following said redirect…

Number of lines in git

Since we also have documentation and related things this isn’t only lines of code. Plain and simply: lines added to files that we have in git, and how the number have increased over time.

There’s one notable dip and one climb and I think they both are related to how we have rearranged documentation and documentation formatting.

Top-4 author’s share

This could also talk about how seriously we suffer from “the bus factor” in this project. Look at how large share of all commits that the top-4 commiters have authored. Not committed; authored. Of course we didn’t have proper separation between authors and committers before git (March 2010).

Interesting to note here is also that the author listed second here is Yang Tse, who hasn’t authored anything since August 2013. Me personally seem to have plateaued at around 57% of all commits during the recent year or two and the top-4 share is slowly decreasing but is still over 80% of the commits.

I hope we can get the top-4 share well below 80% if I rerun this script next year!

Number of authors over time

In comparison to the above graph, I did one that simply counted the total number of unique authors that have contributed a change to git and look at how that number changes over time.

The time before git is, again, somewhat of a lie since we didn’t keep track of authors vs committers properly then so we shouldn’t put too much value into that significant knee we can see on the graph.

To me, the main take away is that in spite of the top-4 graph above, this authors-over-time line is interestingly linear and shows that the vast majority of people who contribute patches only send in one or maybe a couple of changes and then never appear again in the project.

My hope is that this line will continue to climb over the coming years.

Commits per release

We started doing proper git tags for release for curl 6.5. So how many commits have we done between releases ever since? It seems to have gone up and down over time and I added an average number line in this graph which is at about 150 commits per release (and remember that we attempt to do them every 8 weeks since a few years back).

Towards the right we can see the last 20 releases or so showing a pattern of high bar, low bar, and I’ll get to that more in a coming graph.

Of course, counting commits is a rough measurement as they can be big or small, easy or hard, good or bad and this only counts them.

Commits per day

As the release frequency has varied a bit over time I figured I should just check and see how many commits we do in the project per day and see how that has changed (or not) over time. Remember, we are increasing the number of unique authors fairly fast but the top-4 share of “authorship” is fairly stable.

Turns our the number of commits per day has gone up and down a little bit through the git history but I can’t spot any obvious trend here. In recent years we seem to keep up more than 2 commits per day and during intense periods up to 8.

Days per release

Our general plan is since bunch of years back to do releases every 8 weeks like a clock work. 8 weeks is 56 days.

When we run into serious problems, like bugs that are really annoying or tedious to users or if we get a really serious security problem reported, we sometimes decide to go outside of the regular release schedule and ship something before the end of the 8-week cycle.

This graph clearly shows that over the last, say 20, releases we clearly have felt ourselves “forced” to do follow-up releases outside of the regular schedule. The right end of the graph shows a very clear saw-tooth look that proves this.

We’ve also discussed this slightly on the mailing list recently, and I’m certainly willing to go back and listen to people as to what we can do to improve this situation.

Bugfixes per release

We keep close track of all bugfixes done in git and mark them up and mention them in the RELEASE-NOTES document that we ship in every new release.

This makes it possible for us to go back and see how many bug fixes we’ve recorded for each release since curl 6.5. This shows a clear growth over time. It’s interesting since we don’t see this when we count commits, so it may just be attributed to having gotten better at recording the bugs in the files. Or that we now spend fewer commits per bug fix. Hard to tell exactly, but I do enjoy that we fix a lot of bugs…

Days spent per bugfix

Another way to see the data above is to count the number of bug fixes we do over time and just see how many days we need on average to fix bugs.

The last few years we do more bug fixes than there are days so if we keep up the trend this shows for 2017 we might be able to reach down to 0.5 days per bug fix on average. That’d be cool!

Coverity scans

We run coverity scans on the curl cover regularly and this service keeps a little graph for us showing the number of found defects over time. These days we have a policy of never allowing a defect detected by Coverity to linger around. We fix them all and we should have zero detected defects at all times.

The second graph here shows a comparison line with “other projects of comparable size”, indicating that we’re at least not doing badly here.

Vulnerability reports

So in spite of our grand intentions and track record shown above, people keep finding security problems in curl in a higher frequency than every before.

Out of the 24 vulnerabilities reported to the curl project in 2016, 7 was the result of the special security audit that we explicitly asked for, but even if we hadn’t asked for that and they would’ve remained unknown, 17 would still have stood out in this graph.

I do however think that finding – and reporting – security problem is generally more good than bad. The problems these reports have found have generally been around for many years already so this is not a sign of us getting more sloppy in recent years, I take it as a sign that people look for these problems better and report them more often, than before. The industry as a whole looks on security problems and the importance of them differently now than it did years ago.

Those are all in a tree

Expiry

The server

Alternatives that hasn’t helped

The CDN

Standards

Editing the site

Instrument mallocs

Check the log

Recount

33,961 vs 129

Are mallocs important?

Was removing these mallocs hard?

Caveats on the numbers

More?

Addendum, April 23rd

Bug bounty!

What qualifies?

Addendum a few days later

Number of problems as a measurement

Future

Languages

curl is written in C

C is everywhere

Everyone knows C

C is not a safe language

C is not the primary reason for our past vulnerabilities

C is not a new dependency

curl sits in the boat

Rewriting means adding heaps of bugs

What if

Finally

Web traffic

Number of lines in git

Top-4 author’s share

Number of authors over time

Commits per release

Commits per day

Days per release

Bugfixes per release

Days spent per bugfix

Coverity scans

Vulnerability reports

curl, open source and networking