The curl activity of 2023

We have the full commit history of all curl source code since late December 1999. The reason we don’t have the history from before that moment is simply because I did not bother with that when I imported our code into Sourceforge back then. Just me being sloppy.

We know exactly when and who authored every change done to curl since then.

Development activity has gone both up and down over time since then. The number of new commit authors has increased slowly over time.

Commits

In the record year of 2004, when we still used CVS for version control, we had no CI infrastructure and I wrote 91% of all the commits myself. We made 2102 commits in the curl source tree that year. Back then without CI we did many more follow-up “oops” commits when we had to repair breakages we discovered post commit in our “autobuilds”. The quality of every individual commit was much lower then than today.

Looking through the commits and at the official history logs, there is no spectacular or special events that happened that year. It was just a period of intense development and general improvements of the code. The project was still young. We did seven curl releases in 2004, which is below the all time annual average.

The slowest (in terms of number of commits) recorded year so far was 2000, when I alone committed 100% of the 709 commits. The first year we have saved code history from.

The second most active year has been 2014 with 1745 commits. Also a year that does not really stand out when you look at features or other special things in the project. It was probably pushed that high due to an increased activity by others than me. I was just the second most commit author in curl that year. We did six releases in 2014.

In 2023 we celebrated curl’s official 25th birthday.

Today, we have merged more commits into the curl git during 2023 than we did in 2014 and 2023 is now officially the most active curl year commit-wise since 2004. With only a few days left of 2023, we can also conclude that we cannot reach the amount of 2004.

Releases

The commits of course end up in releases. This year, the shipped twelve new versions, which is over the average amount. The increased frequency is due to a modified policy where we are more likely to do follow-up patch releases when we find regressions in a release. In an effort to reduce the amount of known problems existing in the latest release.

Bugfixes

Counting these twelve releases of 2023, we released 1195 bugfixes during the year. Big and small. Code and docs. In build procedures and in the products. The rate of bugfixes has increased tremendously over the last decade. These days we average at over 3 bugfixes per day, compared to less than one ten years ago.

As a comparison: in the summary of the curl year 2012, we counted 199 bugfixes.

This is also partly explained by us keeping better records these days.

Authors

As I am typing this post, we count 187 commit authors of the year 2023 which is the exact same number we had in 2021 and a few more than last year. Out of all the committers this year, 123 authored a merged commit that for the first time. In 2021 that number was 135 so clearly we had a larger amount of return committers this year even if the total amount ended up the same.

Looking at the most active commit authors, this year broke another trend. In 2023, two authors did 50% of all the commits. As the graph below shows, I did more than 50% of the commits for the last few years. This year, Dan Fandrich made 12.6% of the commits and I did less than half of them; together we did >50%.

The number of authors needed to reach 70% and 95% of the commits show a little more varied story:

Or why not, the number of people who authored ten or more commits within the same calendar year:

My own activities

Out of the 1888 commits done in curl so far this year, I personally have done 847 (44.9%) of them.

According to GitHub, I have never had a more active year before than 2023. This increase seems unsustainable. I should probably then also add that a lot of these contributions were not done in the curl source tree repo. I was also fairly active in the repositories for the curl website, everything curl, trurl and a few others.

This year I approach the end of my fifth year working on curl full time. I maintain that I am living the dream: I work from home with my own open source “invention” with a team of awesome people. I can’t think of a better job.

Future

I stick to my old mantra: I cannot and I will not try to tell and foresee the future. I know that there will be more internet transfers, more protocols, more challenges, more bugs, more APIs, faster computers and quite likely more users, more customers and more security problems. But the details and how exactly all that is going to play out, I have no idea.

I hope we maintain a level of paying curl support customers so that I can keep living this life. I wish we get business to a degree to hire a second full time curl developer. I cross my fingers that companies and organizations will continue to pay up to get features added to curl and to sponsor maintenance, security and development in financial ways.

I have no plans to step off the curl train any time soon.

Credits

Image by Thomas Wolter from Pixabay

Making it harder to do wrong

You know I spend all my days working on curl and related matters. I also spend a lot of time thinking on the project; like how we do things and how we should do things.

The security angle of this project is one of the most crucial ones and an area where I spend a lot of time and effort. Dealing with and assessing security reports, handling the verified actual security vulnerabilities and waiving off the imaginary ones.

150 vulnerabilities

The curl project recently announced its 150th published security vulnerability and its associated CVE. 150 security problems through a period of over 25 years in a library that runs in some twenty billion installations? Is that a lot? I don’t know. Of course, the rate of incoming security reports is much higher in modern days than it was decades ago.

Out of the 150 published vulnerabilities, 60 were reported and awarded money through our bug-bounty program. In total, the curl bug-bounty has of today paid 71,400 USD to good hackers and security researchers. The monetary promise is an obvious attraction to researchers. I suppose the fact that curl also over time has grown to run in even more places, on more architectures and in even more systems also increases people’s interest in looking into and scrutinize our code. curl is without doubt one of the world’s most widely installed software components. It requires scrutiny and control. Do we hold up our promises?

curl is a C program running in virtually every internet connect device you can think of.

Trends

Another noticeable trend among the reports the last decade is that we are getting way more vulnerabilities reported with severity level low or medium these days, while historically we got more ones rated high or even critical. I think this is partly because of the promise of money but also because of a generally increased and sharpened mindset about security. Things that in the past would get overlooked and considered “just a bug” are nowadays more likely to get classified as security problems. Because we think about the problems wearing our security hats much more now.

Memory-safety

Every time we publish a new CVE people will ask about when we will rewrite curl in a memory-safe language. Maybe that is good, it means people are aware and educated on these topics.

I will not rewrite curl. That covers all languages. I will however continue to develop it, also in terms of memory-safety. This is what happens:

  1. We add support for more third party libraries written in memory-safe languages. Like the quiche library for QUIC and HTTP/3 and rustls for TLS.
  2. We are open to optionally supporting a separate library instead of native code, where that separate library could be written in a memory-safe language. Like how we work with hyper.
  3. We keep improving the code base with helper functions and style guides to reduce risks in the C code going forward. The C code is likely to remain with us for a long time forward, no matter how much the above mention areas advance. Because it is the mature choice and for many platforms still the only choice. Rust is cool, but the language, its ecosystem and its users are rookies and newbies for system library level use.

Step 1 and 2 above means that over time, the total amount of executable code in curl gradually can become more and more memory-safe. This development is happening already, just not very fast. Which is also why number 3 is important and is going to play a role for many years to come. We move forward in all of these areas at the same time, but with different speeds.

Why no rewrite

Because I’m not an expert on rust. Someone else would be a much more suitable person to lead such a rewrite. In fact, we could suspect that the entire curl maintainer team would need to be replaced since we are all old C developers maybe not the most suitable to lead and take care of a twin project written in rust. Dedicated long-term maintainer internet transfer library teams do not grow on trees.

Because rewriting is an enormous project that will introduce numerous new problems. It would take years until the new thing would be back at a similar level of rock solid functionality as curl is now.

During the initial years of the port’s “beta period”, the existing C project would continue on and we would have two separate branches to maintain and develop, more than doubling the necessary work. Users would stay on the first version until the second is considered stable, which will take a long time since it cannot become stable until it gets a huge amount of users to use it.

There is quite frankly very little (if any) actual demand for such a rewrite among curl users. The rewrite-it-in-rust mantra is mostly repeated by rust fans and people who think this is an easy answer to fixing the share of security problems that are due to C mistakes. Typically, the kind who has no desire or plans to participate in said venture.

C is unsafe and always will be

The C programming language is not memory-safe. Among the 150 reported curl CVEs, we have determined that 61 of them are “C mistakes”. Problems that most likely would not have happened had we used a memory-safe language. 40.6% of the vulnerabilities in curl reported so far could have been avoided by using another language.

Rust is virtually the only memory-safe language that is starting to become viable. C++ is not memory-safe and most other safe languages are not suitable for system/library level use. Often because how they fail to interface well with existing C/C++ code.

By June 2017 we had already made 51 C mistakes that ended up as vulnerabilities and at that time Rust was not a viable alternative yet. Meaning that for a huge portion of our problems, Rust was too late anyway.

40 is not 70

In lots of online sources people repeat that when writing code with C or C++, the share of security problems due to lack of memory-safety is in the range 60-70% of the flaws. In curl, looking per the date of when we introduced the flaws (not reported date), we were never above 50% C mistakes. Looking at the flaw introduction dates, it shows that this was true already back when the project was young so it’s not because of any recent changes.

If we instead count the share per report-date, the share has fluctuated significantly over time, as then it has depended on when people has found which problems. In 2010, the reported problems caused by C mistakes were at over 60%.

Of course, curl is a single project and not a statistical proof of any sort. It’s just a 25 year-old project written in C with more knowledge of and introspection into these details than most other projects.

Additionally, the share of C mistakes is slightly higher among the issues rated with higher severity levels: 51% (22 of 43) of the issues rated high or critical was due to C mistakes.

Help curl authors do better

We need to make it harder to write bad C code and easier to write correct C code. I do not only speak of helping others, I certainly speak of myself to a high degree. Almost every security problem we ever got reported in curl, I wrote. Including most of the issues caused by C mistakes. This means that I too need help to do right.

I have tried to learn from past mistakes and look for patterns. I believe I may have identified a few areas that are more likely than others to cause problems:

  1. strings without length restrictions, because the length might end up very long in edge cases which risks causing integer overflows which leads to issues
  2. reallocs, in particular without length restrictions and 32 bit integer overflows
  3. memory and string copies, following a previous memory allocation, maybe most troublesome when the boundary checks are not immediately next to the actual copy in the source code
  4. perhaps this is just subset of (3), but strncpy() is by itself complicated because of the padding and its not-always-null-terminating functionality

We try to avoid the above mentioned “problem areas” like this:

  1. We have general maximum length restrictions for strings passed to libcurl’s API, and we have set limits on all internally created dynamic buffers and strings.
  2. We avoid reallocs as far as possible and instead provide helper functions for doing dynamic buffers. In fact, avoiding all sorts of direct memory allocations help.
  3. Many memory copies cannot be avoided, but if we can use a pointer and length instead that is much better. If we can snprintf() a target buffer that is better. If not, try do the copy close to the boundary check.
  4. Avoid strncpy(). In most cases, it is better to just return error on too long input anyway, and then instead do plain strcpy or memcpy with the exact amount. Ideally of course, just using a pointer and the length is sufficient.

These helper functions and reduction of “difficult functions” in the code are not silver bullets. They will not magically make us avoid future vulnerabilities, they should just ideally make it harder to do security mistakes. We still need a lot of reviews, tools and testing to verify the code.

Clean code

Already before we created these helpers we have gradually and slowly over time made the code style and the requirements to follow it, stricter. When the source code looks and feels coherent, consistent, as if written by a single human, it becomes easier to read. Easier to read becomes easier to debug and easier to extend. Harder to make mistakes in.

To help us maintain a consistent code style, we have tool and CI job that runs it, so that obvious style mistakes or conformance problems end up as distinct red lines in the pull request.

Source verification

Together with the strict style requirement, we also of course run many compilers with as many picky compiler flags enabled as possible in CI jobs, we run fuzzers, valgrind, address/memory/undefined behavior sanitizers and we throw static code analyzers on the code – in a never-ending fashion. As soon as one of the tools gives a warning or indicates that something could perhaps be wrong, we fix it.

Of course also to verify the correct functionality of the code.

Data for this post

All data and numbers I speak of in this post are publicly available in the curl git repositories: curl and curl-www. The graphs come from the curl web site dashboard. All graph code is available.

curl 8.5.0

Release presentation

Numbers

the 253rd release
2 changes
56 days (total: 9,392)

188 bug-fixes (total: 9,734)
266 commits (total: 31,427)
0 new public libcurl function (total: 93)
0 new curl_easy_setopt() option (total: 303)

0 new curl command line option (total: 258)
78 contributors, 43 new (total: 3,039)
40 authors, 19 new (total: 1,219)
2 security fixes (total: 150)

Security

cookie mixed case PSL bypass

(CVE-2023-46218) This flaw allows a malicious HTTP server to set “super cookies” in curl that are then passed back to more origins than what is otherwise allowed or possible. This allows a site to set cookies that then would get sent to different and unrelated sites and domains.

It could do this by exploiting a mixed case flaw in curl’s function that verifies a given cookie domain against the Public Suffix List (PSL). For example a cookie could be set with domain=co.UK when the URL used a lower case hostname curl.co.uk, even though co.uk is listed as a PSL domain.

HSTS long file name clears contents

(CVE-2023-46219) When saving HSTS data to an excessively long file name, curl could end up removing all contents, making subsequent requests using that file unaware of the HSTS status they should otherwise use.

Changes

We have only logged two changes for this cycle.

gnutls supports CURLSSLOPT_NATIVE_CA

If you use libcurl built to use GnuTLS, you too can now set this bit and get to use the Windows CA store directly from within libcurl instead of having to provide a separate PEM file. When your application runs on Windows. libcurl already previously support this for OpenSSL and wolfSSL.

See the CURLOPT_SSL_OPTIONS documentation for details.

HTTP3 with ngtcp2 no longer experimental

The first HTTP/3 code in curl was merged into git in 2019. Now we ship HTTP/3 support non-experimental for the first time. curl supports HTTP/3 using several different backends but this news is only for HTTP/3 built to use ngtcp2 + nghttp3. The other HTTP/3 backends remain experimental for the time being.

Note that in order to be able to use ngtcp2 you need to build with a TLS library that offers the necessary API. This means you need to use one of the following libraries: wolfSSL, quictls, BoringSSL, libressl, AWS-LC or GnuTLS. Let me stress that OpenSSL is not in that list.

Bugfixes

As usual, here I have collected a few of my favorite fixes from this cycle.

improved IPFS and IPNS URL support

Turned out there were some IPFS URLs that curl did not manage properly.

doh: use PIPEWAIT when HTTP/2 is attempted

This makes curl prefer doing DoH requests multiplexed over a single connection rather than doing them as two separate connections. Most DoH lookups do one request for an IPv4 response and a separate one for IPv6.

duphandle: several OOM cleanups

If libcurl runs out of memory in the middle of the curl_easy_duphandle function, it could previously do several mistakes, including free-twice.

hostip: show the list of IPs when resolving is done

The verbose mode now shows the full list of IP addresses that was resolved from the name, before it continues to try to connect to them in a serial fashion.

aws-sigv4: canonicalize valueless query params

Turns out there was yet another glitch in the sigv4 logic that made curl send the wrong checksum for URLs using “valueless” query parameters.

hyper: temporarily remove HTTP/2 support

The hyper integration for HTTP/2 was incorrect and has therefore been disabled for now. hyper support in curl is still experimental.

drop vc10, vc11 and vc12 projects from dist, add vc14.20

We generate less project files for older Visual Studio versions but we added one for a recent version. Going forward, we might soon drop them completely from the tarballs since they can now be generated with cmake.

openssl: include SIG and KEM algorithms in verbose

curl now presents more details from the TLS handshake in the verbose output when using OpenSSL.

openssl: make CURLSSLOPT_NATIVE_CA import Windows intermediate CAs

The keyword here is intermediate. Previously, curl would ignore those which would lead to handshake errors because that is not what users expect when you use the native CA store.

openssl: when a session-ID is reused, skip OCSP stapling

When trying to do verify status, the more technical name for OCSP stapling, after a connection has been established using a session-ID, it would return error.

make SOCKS5 use the CURLOPT_IPRESOLVE choice

There are just too many combinations, but we find it likely that a user that asks for a specific IP version probably also wants it for the traffic over the SOCKS proxy.

tool: support bold headers in Windows

The feature that has been provided on other systems since 2018 now comes to Windows.

make the carriage return fit wide progress bars

The progress bar set with -#/--progress-bar had its math off by one, which made it not output the carriage return character when the terminal width was wider than 256 columns, making the output wrong.

url: find scheme with a “perfect hash”

The internal function that scans the list of supported URL schemes no longer iterates through the list but instead uses a “perfect hash”. Although much faster, that was hardly a function that caused performance problems before so it is not likely to actually be measurable.

use ALPN “http/1.1” for HTTP/1.x, including HTTP/1.0

To interoperate better with legacy servers, curl now sends http/1.1 in the ALPN field even when it wants to speak HTTP/1.0. This, because http/1.1 and h2 were the original ALPN codes and some of the old servers that support ALPN from those days don’t know about http/1.0 for ALPN.

all libcurl man pages examples compile cleanly

All (almost five hundred) libcurl man pages have EXAMPLE sections showing their use. Starting now, all of those sections are test-compiled as part of the CI builds to catch mistakes.