My upcoming FOSDEM 2024

I attended FOSDEM for the first time back in 2010. I have since been back and attended every single physical version of the conference since then (remember that it skipped a few years in the COVID days).

FOSDEM is my favorite conference no doubt.

I did a presentation in the embedded dev room in 2010. I have in fact talked in front of audiences almost every year and some years I did it more than once.

Stickers and coasters

If you are interested in getting a curl sticker or two, this is a great opportunity. I will bring a senseless amount of curl stickers in different flavors and sizes so that hopefully everyone who wants one can get one. Find me at FOSDEM to get one. Or find the wolfSSL stand (K building level 1), where I will stock up stickers and also spend time every once in a while.

You can find me at the wolfSSL booth following my talks. Saturday 11:30 – 13:00 and Sunday 11:00 – 13:00.

I will also bring fancy PCB-style coasters. You must have a commit merged in curl’s source repository to be eligible to receive one of those beauties from me.

Feel free to email or DM me on Mastodon or something for syncing.

Broom not included: curling the modern way

That’s the title of my talk in the Network devroom. It is scheduled to take place at 10:50 on Saturday February 3rd. In room UB5.230.

The talk is set to take 20 minutes (including questions). My presentation abstract from when I still naively thought I could get 40 minutes describes the talk like this:

Everyone uses curl, the Swiss army knife of Internet transfers. Earlier this year we celebrated curl’s 25th birthday, and while this tool has provided a solid set of command line options for decades, new ones are added over time. This talk is a look at some of the most powerful and interesting additions to curl done in recent years. The perhaps lesser known curl tricks that might enrich your command lines, extend your “tool belt” and make you more productive. Also trurl, the recently created companion tool for URL manipulations you maybe did not realize you want.

I will have to make some tough decisions on what of all this that I can actually include…

You too could have made curl

This talk has been accepted on the main track, but is not yet scheduled.

The talk is 50 minutes. It happens at 10:00 Sunday February 4th in room K1.105 (La Fontaine).

Daniel has taken the curl project to run in some 20 billion installations. He talks about what it takes to succeed with Open Source: patience, time, ups and downs, cooperation, fighting your impostor syndrome – all while having fun. There’s no genius or magic trick behind successful open source. You can do it. The talk will of course be spiced up with anecdotes, experiences and stories from Daniel’s 25 years of leading the curl project.

More

I proposed a talk titled “HTTP/3 – why and where are we” in the Web Performance devroom but it was not accepted.

I will update this post with more info as such becomes available.

The I in LLM stands for intelligence

I have held back on writing anything about AI or how we (not) use AI for development in the curl factory. Now I can’t hold back anymore. Let me show you the most significant effect of AI on curl as of today – with examples.

Bug Bounty

Having a bug bounty means that we offer real money in rewards to hackers who report security problems. The chance of money attracts a certain amount of “luck seekers”. People who basically just grep for patterns in the source code or maybe at best run some basic security scanners, and then report their findings without any further analysis in the hope that they can get a few bucks in reward money.

We have run the bounty for a few years by now, and the rate of rubbish reports has never been a big problem. Also, the rubbish reports have typically also been very easy and quick to detect and discard. They have rarely caused any real problems or wasted our time much. A little like the most stupid spam emails.

Our bug bounty has resulted in over 70,000 USD paid in rewards so far. We have received 415 vulnerability reports. Out of those, 64 were ultimately confirmed security problems. 77 of the reports were informative, meaning they typically were bugs or similar. That leaves 66% of the reports as neither a security issue nor a normal bug.
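The 66% figure follows directly from the three counts above. A minimal sketch of the arithmetic, where the helper names `neither_count` and `share_percent` are made up for this illustration:

```c
#include <assert.h>

/* Reports that were neither confirmed security problems nor informative. */
static unsigned neither_count(unsigned total, unsigned confirmed,
                              unsigned informative)
{
  assert(confirmed + informative <= total);
  return total - confirmed - informative;
}

/* Share of `part` in `total`, expressed in percent. */
static double share_percent(unsigned part, unsigned total)
{
  assert(total > 0);
  return 100.0 * (double)part / (double)total;
}
```

With the numbers from the text, `neither_count(415, 64, 77)` is 274, and `share_percent(274, 415)` lands at roughly 66.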

Better crap is worse

When reports are made to look better and to appear to have a point, it takes a longer time for us to research and eventually discard it. Every security report has to have a human spend time to look at it and assess what it means.

The better the crap, the longer time and the more energy we have to spend on the report until we close it. A crap report does not help the project at all. It instead takes away developer time and energy from something productive. Partly because security work is considered one of the most important areas so it tends to trump almost everything else.

A security report can take away a developer from fixing a really annoying bug, because a security issue is always more important than other bugs. If the report turned out to be crap, we did not improve security and we missed out on time for fixing bugs or developing a new feature. Not to mention how it drains you of energy having to deal with rubbish.

AI generated security reports

I realize AI can do a lot of good things. As any general purpose tool it can also be used for the wrong things. I am also sure AIs can be trained and ultimately get used even for finding and reporting security problems in productive ways, but so far we have yet to find good examples of this.

Right now, users seem keen on using the current set of LLMs, throwing some curl code at them and then passing on the output as a security vulnerability report. What makes it a little harder to detect is of course that users copy and paste and include their own language as well. The entire thing is not exactly what the AI said, but the report is nonetheless crap.

Detecting AI crap

Reporters are often not totally fluent in English and sometimes their exact intentions are hard to understand at once and it might take a few back and forths until things reveal themselves correctly – and that is of course totally fine and acceptable. Language and cultural barriers are real things.

Sometimes reporters use AIs or other tools to help them phrase themselves or translate what they want to say. As an aid to communicate better in a foreign language. I can’t find anything wrong with that. Even reporters who don’t master English can find and report security problems.

So: just the mere existence of a few give-away signs that parts of the text were generated by an AI or a similar tool is not an immediate red flag. It can still contain truths and be a valid issue. This is part of the reason why a well-formed crap report is harder and takes longer to discard.

Exhibit A: code changes are disclosed

In the fall of 2023, I alerted the community about a pending disclosure of CVE-2023-38545. A vulnerability we graded severity high.

The day before that issue was about to be published, a user submitted this report on Hackerone: Curl CVE-2023-38545 vulnerability code changes are disclosed on the internet

That sounds pretty bad and would have been a problem if it actually was true.

The report however reeks of typical AI style hallucinations: it mixes and matches facts and details from old security issues, creating and making up something new that has no connection with reality. The changes had not been disclosed on the Internet. The changes that actually had been disclosed were for previous, older, issues. Like intended.

In this particular report, the user helpfully told us that they used Bard to find this issue. Bard being a Google generative AI thing. It made it easier for us to realize the craziness, close the report and move on. As can be seen in the report log, we did not have to spend much time on researching this.

Exhibit B: Buffer Overflow Vulnerability

A more complicated issue, less obvious, done better but still suffering from hallucinations. Showing how the problem grows worse when the tool is better used and better integrated into the communication.

On the morning of December 28 2023, a user filed this report on Hackerone: Buffer Overflow Vulnerability in WebSocket Handling. It was morning in my time zone anyway.

Again this sounds pretty bad just based on the title. Since our WebSocket code is still experimental, and thus not covered by our bug bounty it helped me to still have a relaxed attitude when I started looking at this report. It was filed by a user I never saw before, but their “reputation” on Hackerone was decent – this was not their first security report.

The report was pretty neatly filed. It included details and was written in proper English. It also contained a proposed fix. It did not stand out as wrong or bad to me. It appeared as if this user had detected something bad and as if the user understood the issue enough to also come up with a solution. As far as security reports go, this looked better than the average first post.

In the report you can see my first template response informing the user their report had been received and that we will investigate the case. When that was posted, I did not yet know how complicated or easy the issue would be.

Nineteen minutes later I had looked at the code, not found any issue, read the code again and then again a third time. Where on earth is the buffer overflow the reporter says exists here? Then I posted the first question asking for clarification on where and how exactly this overflow would happen.

After repeated questions and numerous hallucinations I realized this was not a genuine problem and on the afternoon that same day I closed the issue as not applicable. There was no buffer overflow.

I don’t know for sure that this set of replies from the user was generated by an LLM but it has several signs of it.

Ban these reporters

On Hackerone there is no explicit “ban the reporter from further communication with our project” functionality. I would have used it if it existed. Researchers get their “reputation” lowered when we close an issue as not applicable, but that is a very small nudge when only done once in a single project.

I have requested better support for this from Hackerone. Update: this function exists, I just did not look at the right place for it…

Future

As these kinds of reports will become more common over time, I suspect we might learn how to trigger on generated-by-AI signals better and dismiss reports based on those. That will of course be unfortunate when the AI is used for appropriate tasks, such as translation or just language formulation help.

I am convinced there will pop up tools using AI for this purpose that actually work (better) in the future, at least part of the time, so I cannot and will not say that AI for finding security problems is necessarily always a bad idea.

I do however suspect that if you just add an ever so tiny (intelligent) human check to the mix, the use and outcome of any such tools will become so much better. I suspect that will be true for a long time into the future as well.

I have no doubts that people will keep trying to find shortcuts even in the future. I am sure they will keep trying to earn that quick reward money. Like for the email spammers, the cost of this ends up in the receiving end. The ease of use and wide access to powerful LLMs is just too tempting. I strongly suspect we will get more LLM generated rubbish in our Hackerone inboxes going forward.

Discussion

Hacker news

Credits

Image by TungArt7

The curl activity of 2023

We have the full commit history of all curl source code since late December 1999. The reason we don’t have the history from before that moment is simply because I did not bother with that when I imported our code into Sourceforge back then. Just me being sloppy.

We know exactly when and who authored every change done to curl since then.

Development activity has gone both up and down over time since then. The number of new commit authors has increased slowly over time.

Commits

In the record year of 2004, when we still used CVS for version control, we had no CI infrastructure and I wrote 91% of all the commits myself. We made 2102 commits in the curl source tree that year. Back then without CI we did many more follow-up “oops” commits when we had to repair breakages we discovered post commit in our “autobuilds”. The quality of every individual commit was much lower then than today.

Looking through the commits and at the official history logs, there were no spectacular or special events that happened that year. It was just a period of intense development and general improvements of the code. The project was still young. We did seven curl releases in 2004, which is below the all time annual average.

The slowest (in terms of number of commits) recorded year so far was 2000, when I alone committed 100% of the 709 commits. The first year we have saved code history from.

The second most active year has been 2014 with 1745 commits. Also a year that does not really stand out when you look at features or other special things in the project. It was probably pushed that high due to an increased activity by others than me. I was just the second most active commit author in curl that year. We did six releases in 2014.

In 2023 we celebrated curl’s official 25th birthday.

Today, we have merged more commits into the curl git during 2023 than we did in 2014 and 2023 is now officially the most active curl year commit-wise since 2004. With only a few days left of 2023, we can also conclude that we cannot reach the amount of 2004.

Releases

The commits of course end up in releases. This year, we shipped twelve new versions, which is over the average amount. The increased frequency is due to a modified policy where we are more likely to do follow-up patch releases when we find regressions in a release, in an effort to reduce the amount of known problems existing in the latest release.

Bugfixes

Counting these twelve releases of 2023, we released 1195 bugfixes during the year. Big and small. Code and docs. In build procedures and in the products. The rate of bugfixes has increased tremendously over the last decade. These days we average at over 3 bugfixes per day, compared to less than one ten years ago.

As a comparison: in the summary of the curl year 2012, we counted 199 bugfixes.

This is also partly explained by us keeping better records these days.

Authors

As I am typing this post, we count 187 commit authors of the year 2023 which is the exact same number we had in 2021 and a few more than last year. Out of all the committers this year, 123 authored a merged commit for the first time. In 2021 that number was 135 so clearly we had a larger amount of returning committers this year even if the total amount ended up the same.

Looking at the most active commit authors, this year broke another trend. In 2023, two authors did 50% of all the commits. As the graph below shows, I did more than 50% of the commits for the last few years. This year, Dan Fandrich made 12.6% of the commits and I did less than half of them; together we did >50%.

The number of authors needed to reach 70% and 95% of the commits shows a little more varied story:

Or why not, the number of people who authored ten or more commits within the same calendar year:

My own activities

Out of the 1888 commits done in curl so far this year, I personally have done 847 (44.9%) of them.

According to GitHub, I have never had a more active year before than 2023. This increase seems unsustainable. I should probably then also add that a lot of these contributions were not done in the curl source tree repo. I was also fairly active in the repositories for the curl website, everything curl, trurl and a few others.

This year I approach the end of my fifth year working on curl full time. I maintain that I am living the dream: I work from home with my own open source “invention” with a team of awesome people. I can’t think of a better job.

Future

I stick to my old mantra: I cannot and I will not try to tell and foresee the future. I know that there will be more internet transfers, more protocols, more challenges, more bugs, more APIs, faster computers and quite likely more users, more customers and more security problems. But the details and how exactly all that is going to play out, I have no idea.

I hope we maintain a level of paying curl support customers so that I can keep living this life. I wish we get enough business to hire a second full time curl developer. I cross my fingers that companies and organizations will continue to pay up to get features added to curl and to sponsor maintenance, security and development in financial ways.

I have no plans to step off the curl train any time soon.

Credits

Image by Thomas Wolter from Pixabay

Making it harder to do wrong

You know I spend all my days working on curl and related matters. I also spend a lot of time thinking about the project; like how we do things and how we should do things.

The security angle of this project is one of the most crucial ones and an area where I spend a lot of time and effort. Dealing with and assessing security reports, handling the verified actual security vulnerabilities and waving off the imaginary ones.

150 vulnerabilities

The curl project recently announced its 150th published security vulnerability and its associated CVE. 150 security problems through a period of over 25 years in a library that runs in some twenty billion installations? Is that a lot? I don’t know. Of course, the rate of incoming security reports is much higher in modern days than it was decades ago.

Out of the 150 published vulnerabilities, 60 were reported and awarded money through our bug-bounty program. In total, the curl bug-bounty has as of today paid 71,400 USD to good hackers and security researchers. The monetary promise is an obvious attraction to researchers. I suppose the fact that curl also over time has grown to run in even more places, on more architectures and in even more systems also increases people’s interest in looking into and scrutinizing our code. curl is without doubt one of the world’s most widely installed software components. It requires scrutiny and control. Do we hold up our promises?

curl is a C program running in virtually every internet connected device you can think of.

Trends

Another noticeable trend among the reports the last decade is that we are getting way more vulnerabilities reported with severity level low or medium these days, while historically we got more ones rated high or even critical. I think this is partly because of the promise of money but also because of a generally increased and sharpened mindset about security. Things that in the past would get overlooked and considered “just a bug” are nowadays more likely to get classified as security problems. Because we think about the problems wearing our security hats much more now.

Memory-safety

Every time we publish a new CVE people will ask about when we will rewrite curl in a memory-safe language. Maybe that is good, it means people are aware and educated on these topics.

I will not rewrite curl. That covers all languages. I will however continue to develop it, also in terms of memory-safety. This is what happens:

  1. We add support for more third party libraries written in memory-safe languages. Like the quiche library for QUIC and HTTP/3 and rustls for TLS.
  2. We are open to optionally supporting a separate library instead of native code, where that separate library could be written in a memory-safe language. Like how we work with hyper.
  3. We keep improving the code base with helper functions and style guides to reduce risks in the C code going forward. The C code is likely to remain with us for a long time forward, no matter how much the above mentioned areas advance. Because it is the mature choice and for many platforms still the only choice. Rust is cool, but the language, its ecosystem and its users are rookies and newbies for system library level use.

Steps 1 and 2 above mean that over time, the total amount of executable code in curl gradually can become more and more memory-safe. This development is happening already, just not very fast. Which is also why number 3 is important and is going to play a role for many years to come. We move forward in all of these areas at the same time, but with different speeds.

Why no rewrite

Because I’m not an expert on rust. Someone else would be a much more suitable person to lead such a rewrite. In fact, we could suspect that the entire curl maintainer team would need to be replaced since we are all old C developers who are maybe not the most suitable to lead and take care of a twin project written in rust. Dedicated long-term maintainer internet transfer library teams do not grow on trees.

Because rewriting is an enormous project that will introduce numerous new problems. It would take years until the new thing would be back at a similar level of rock solid functionality as curl is now.

During the initial years of the port’s “beta period”, the existing C project would continue on and we would have two separate branches to maintain and develop, more than doubling the necessary work. Users would stay on the first version until the second is considered stable, which will take a long time since it cannot become stable until it gets a huge amount of users to use it.

There is quite frankly very little (if any) actual demand for such a rewrite among curl users. The rewrite-it-in-rust mantra is mostly repeated by rust fans and people who think this is an easy answer to fixing the share of security problems that are due to C mistakes. Typically, the kind who has no desire or plans to participate in said venture.

C is unsafe and always will be

The C programming language is not memory-safe. Among the 150 reported curl CVEs, we have determined that 61 of them are “C mistakes”. Problems that most likely would not have happened had we used a memory-safe language. 40.7% of the vulnerabilities in curl reported so far could have been avoided by using another language.

Rust is virtually the only memory-safe language that is starting to become viable. C++ is not memory-safe and most other safe languages are not suitable for system/library level use. Often because of how they fail to interface well with existing C/C++ code.

By June 2017 we had already made 51 C mistakes that ended up as vulnerabilities and at that time Rust was not a viable alternative yet. Meaning that for a huge portion of our problems, Rust was too late anyway.

40 is not 70

In lots of online sources people repeat that when writing code with C or C++, the share of security problems due to lack of memory-safety is in the range 60-70% of the flaws. In curl, looking per the date of when we introduced the flaws (not reported date), we were never above 50% C mistakes. Looking at the flaw introduction dates, it shows that this was true already back when the project was young so it’s not because of any recent changes.

If we instead count the share per report-date, the share has fluctuated significantly over time, as then it has depended on when people have found which problems. In 2010, the reported problems caused by C mistakes were at over 60%.

Of course, curl is a single project and not a statistical proof of any sort. It’s just a 25 year-old project written in C with more knowledge of and introspection into these details than most other projects.

Additionally, the share of C mistakes is slightly higher among the issues rated with higher severity levels: 51% (22 of 43) of the issues rated high or critical were due to C mistakes.

Help curl authors do better

We need to make it harder to write bad C code and easier to write correct C code. I do not only speak of helping others, I certainly speak of myself to a high degree. Almost every security problem we ever got reported in curl, I wrote. Including most of the issues caused by C mistakes. This means that I too need help to do right.

I have tried to learn from past mistakes and look for patterns. I believe I may have identified a few areas that are more likely than others to cause problems:

  1. strings without length restrictions, because the length might end up very long in edge cases which risks causing integer overflows which leads to issues
  2. reallocs, in particular without length restrictions and 32 bit integer overflows
  3. memory and string copies, following a previous memory allocation, maybe most troublesome when the boundary checks are not immediately next to the actual copy in the source code
  4. perhaps this is just a subset of (3), but strncpy() is by itself complicated because of the padding and its not-always-null-terminating functionality
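Items 1 and 2 in the list above both come down to a size calculation that wraps around before it reaches the allocator. A minimal sketch of the failure mode and of a checked alternative; the function names are made up for this illustration and this is not code from curl:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* What an unchecked 32-bit size addition yields: the result silently
   wraps, because unsigned arithmetic is done modulo 2^32. */
static uint32_t add_size32_unchecked(uint32_t a, uint32_t b)
{
  return (uint32_t)(a + b);
}

/* Checked variant: returns false instead of wrapping, and only writes
   to *sum when the result fits. */
static bool add_size32(uint32_t a, uint32_t b, uint32_t *sum)
{
  assert(sum != NULL);
  if(a > UINT32_MAX - b)
    return false; /* a + b would not fit in 32 bits */
  *sum = a + b;
  return true;
}
```

A string length of 0xFFFFFFF0 plus 32 bytes of extra room wraps to 16 in the unchecked version, so a following realloc would get a far too small size while the copy still uses the original length. The checked version reports the problem before any allocation happens.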

We try to avoid the above mentioned “problem areas” like this:

  1. We have general maximum length restrictions for strings passed to libcurl’s API, and we have set limits on all internally created dynamic buffers and strings.
  2. We avoid reallocs as far as possible and instead provide helper functions for doing dynamic buffers. In fact, avoiding all sorts of direct memory allocations helps.
  3. Many memory copies cannot be avoided, but if we can use a pointer and length instead that is much better. If we can snprintf() a target buffer that is better. If not, try to do the copy close to the boundary check.
  4. Avoid strncpy(). In most cases, it is better to just return error on too long input anyway, and then instead do plain strcpy or memcpy with the exact amount. Ideally of course, just using a pointer and the length is sufficient.
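To illustrate items 1 and 2 above, here is a minimal sketch of a dynamic buffer helper with a hard size limit. The `dbuf` type and its functions are hypothetical names invented for this example; they are not curl's internal API, and the real helpers may work differently. The point is that the limit check, the realloc and the copy all live in one small function instead of being spread over the callers:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer with a hard upper size limit. */
struct dbuf {
  char *data;   /* null-terminated once allocated */
  size_t len;   /* bytes stored, excluding the terminator */
  size_t alloc; /* bytes allocated */
  size_t max;   /* hard limit for len + 1 */
};

static void dbuf_init(struct dbuf *b, size_t max)
{
  assert(b != NULL && max > 0);
  b->data = NULL;
  b->len = 0;
  b->alloc = 0;
  b->max = max;
}

static void dbuf_free(struct dbuf *b)
{
  assert(b != NULL);
  free(b->data);
  b->data = NULL;
  b->len = b->alloc = 0;
}

/* Appends n bytes. Returns 0 on success, -1 if the limit would be
   exceeded or memory runs out. The contents are unchanged on failure. */
static int dbuf_add(struct dbuf *b, const void *src, size_t n)
{
  size_t need;
  assert(b != NULL && (src != NULL || n == 0));
  /* len < max always holds, so this subtraction cannot wrap */
  if(n > b->max - b->len - 1)
    return -1;
  need = b->len + n + 1; /* cannot exceed max after the check above */
  if(need > b->alloc) {
    size_t newsize = b->alloc ? b->alloc : 32;
    char *p;
    while(newsize < need) {
      if(newsize > b->max / 2) {
        newsize = b->max; /* avoid overflow when doubling */
        break;
      }
      newsize *= 2;
    }
    if(newsize > b->max)
      newsize = b->max;
    p = realloc(b->data, newsize);
    if(!p)
      return -1;
    b->data = p;
    b->alloc = newsize;
  }
  if(n)
    memcpy(b->data + b->len, src, n);
  b->len += n;
  b->data[b->len] = '\0';
  return 0;
}
```

A caller would do `dbuf_init(&b, limit)` once, call `dbuf_add()` as data arrives and treat a -1 return as "input too large or out of memory", without ever computing a size itself.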

These helper functions and reduction of “difficult functions” in the code are not silver bullets. They will not magically make us avoid future vulnerabilities, they should just ideally make it harder to do security mistakes. We still need a lot of reviews, tools and testing to verify the code.
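As an example of what replacing a “difficult function” can look like: strncpy() leaves the target without a null terminator when the source is as long as or longer than the given size, and zero-pads the target when the source is shorter. A sketch of the alternative described in item 4 above, with the length check directly next to the copy. The name `copy_or_fail` is made up for this illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies the string src into dst, which holds dstsize bytes. Returns 0
   on success or -1 if src, including its terminator, does not fit. On
   failure dst is set to the empty string, never to a truncated copy. */
static int copy_or_fail(char *dst, size_t dstsize, const char *src)
{
  size_t len;
  assert(dst != NULL && src != NULL && dstsize > 0);
  len = strlen(src);
  if(len >= dstsize) {
    dst[0] = '\0';
    return -1; /* too long: report an error instead of truncating */
  }
  memcpy(dst, src, len + 1); /* exact amount, terminator included */
  return 0;
}
```

The caller gets a clear error for oversized input and the target is always null-terminated, which removes the two strncpy() properties that tend to cause trouble.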

Clean code

Already before we created these helpers we have gradually and slowly over time made the code style and the requirements to follow it, stricter. When the source code looks and feels coherent, consistent, as if written by a single human, it becomes easier to read. Easier to read becomes easier to debug and easier to extend. Harder to make mistakes in.

To help us maintain a consistent code style, we have a tool and a CI job that runs it, so that obvious style mistakes or conformance problems end up as distinct red lines in the pull request.

Source verification

Together with the strict style requirement, we also of course run many compilers with as many picky compiler flags enabled as possible in CI jobs, we run fuzzers, valgrind, address/memory/undefined behavior sanitizers and we throw static code analyzers on the code – in a never-ending fashion. As soon as one of the tools gives a warning or indicates that something could perhaps be wrong, we fix it.

Of course also to verify the correct functionality of the code.

Data for this post

All data and numbers I speak of in this post are publicly available in the curl git repositories: curl and curl-www. The graphs come from the curl web site dashboard. All graph code is available.

curl 8.5.0

Release presentation

Numbers

the 253rd release
2 changes
56 days (total: 9,392)

188 bug-fixes (total: 9,734)
266 commits (total: 31,427)
0 new public libcurl function (total: 93)
0 new curl_easy_setopt() option (total: 303)

0 new curl command line option (total: 258)
78 contributors, 43 new (total: 3,039)
40 authors, 19 new (total: 1,219)
2 security fixes (total: 150)

Security

cookie mixed case PSL bypass

(CVE-2023-46218) This flaw allows a malicious HTTP server to set “super cookies” in curl that are then passed back to more origins than what is otherwise allowed or possible. This allows a site to set cookies that then would get sent to different and unrelated sites and domains.

It could do this by exploiting a mixed case flaw in curl’s function that verifies a given cookie domain against the Public Suffix List (PSL). For example a cookie could be set with domain=co.UK when the URL used a lower case hostname curl.co.uk, even though co.uk is listed as a PSL domain.
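The class of mistake can be shown with a toy example. This is a sketch only: the three-entry list and all function names are made up for the illustration, and it is not how curl or its PSL support actually performs the lookup. It only shows why a byte-exact comparison lets co.UK through while a case-insensitive one does not:

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Tiny stand-in for the Public Suffix List; the real list is far larger. */
static const char *const toy_psl[] = { "com", "uk", "co.uk" };
#define TOY_PSL_COUNT (sizeof(toy_psl) / sizeof(toy_psl[0]))

/* ASCII case-insensitive string equality. */
static bool equal_nocase(const char *a, const char *b)
{
  assert(a != NULL && b != NULL);
  while(*a && *b) {
    if(tolower((unsigned char)*a) != tolower((unsigned char)*b))
      return false;
    a++;
    b++;
  }
  return *a == *b; /* equal only if both strings ended here */
}

/* Flawed check: byte-exact comparison, so "co.UK" is not recognized. */
static bool is_public_suffix_exact(const char *domain)
{
  size_t i;
  for(i = 0; i < TOY_PSL_COUNT; i++)
    if(strcmp(domain, toy_psl[i]) == 0)
      return true;
  return false;
}

/* Case-insensitive check: "co.UK" is recognized as a public suffix. */
static bool is_public_suffix_nocase(const char *domain)
{
  size_t i;
  for(i = 0; i < TOY_PSL_COUNT; i++)
    if(equal_nocase(domain, toy_psl[i]))
      return true;
  return false;
}
```

With the exact check, a cookie with domain=co.UK would not be identified as targeting a public suffix, which is the kind of gap a “super cookie” relies on.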

HSTS long file name clears contents

(CVE-2023-46219) When saving HSTS data to an excessively long file name, curl could end up removing all contents, making subsequent requests using that file unaware of the HSTS status they should otherwise use.

Changes

We have only logged two changes for this cycle.

gnutls supports CURLSSLOPT_NATIVE_CA

If you use libcurl built to use GnuTLS, you too can now set this bit and get to use the Windows CA store directly from within libcurl instead of having to provide a separate PEM file, when your application runs on Windows. libcurl already supported this for OpenSSL and wolfSSL.

See the CURLOPT_SSL_OPTIONS documentation for details.

HTTP3 with ngtcp2 no longer experimental

The first HTTP/3 code in curl was merged into git in 2019. Now we ship HTTP/3 support non-experimental for the first time. curl supports HTTP/3 using several different backends but this news is only for HTTP/3 built to use ngtcp2 + nghttp3. The other HTTP/3 backends remain experimental for the time being.

Note that in order to be able to use ngtcp2 you need to build with a TLS library that offers the necessary API. This means you need to use one of the following libraries: wolfSSL, quictls, BoringSSL, libressl, AWS-LC or GnuTLS. Let me stress that OpenSSL is not in that list.

Bugfixes

As usual, here I have collected a few of my favorite fixes from this cycle.

improved IPFS and IPNS URL support

Turned out there were some IPFS URLs that curl did not manage properly.

doh: use PIPEWAIT when HTTP/2 is attempted

This makes curl prefer doing DoH requests multiplexed over a single connection rather than doing them as two separate connections. Most DoH lookups do one request for an IPv4 response and a separate one for IPv6.

duphandle: several OOM cleanups

If libcurl runs out of memory in the middle of the curl_easy_duphandle function, it could previously make several mistakes, including freeing memory twice.

hostip: show the list of IPs when resolving is done

The verbose mode now shows the full list of IP addresses that were resolved from the name, before it continues to try to connect to them in a serial fashion.

aws-sigv4: canonicalize valueless query params

Turns out there was yet another glitch in the sigv4 logic that made curl send the wrong checksum for URLs using “valueless” query parameters.

hyper: temporarily remove HTTP/2 support

The hyper integration for HTTP/2 was incorrect and has therefore been disabled for now. hyper support in curl is still experimental.

drop vc10, vc11 and vc12 projects from dist, add vc14.20

We generate fewer project files for older Visual Studio versions but we added one for a recent version. Going forward, we might soon drop them completely from the tarballs since they can now be generated with cmake.

openssl: include SIG and KEM algorithms in verbose

curl now presents more details from the TLS handshake in the verbose output when using OpenSSL.

openssl: make CURLSSLOPT_NATIVE_CA import Windows intermediate CAs

The keyword here is intermediate. Previously, curl would ignore those which would lead to handshake errors because that is not what users expect when you use the native CA store.

openssl: when a session-ID is reused, skip OCSP stapling

When trying to do verify status, the more technical name for OCSP stapling, after a connection has been established using a session-ID, it would return an error.

make SOCKS5 use the CURLOPT_IPRESOLVE choice

There are just too many combinations, but we find it likely that a user that asks for a specific IP version probably also wants it for the traffic over the SOCKS proxy.

tool: support bold headers in Windows

The feature that has been provided on other systems since 2018 now comes to Windows.

make the carriage return fit wide progress bars

The progress bar set with -#/--progress-bar had its math off by one, which made it not output the carriage return character when the terminal width was wider than 256 columns, making the output wrong.

url: find scheme with a “perfect hash”

The internal function that scans the list of supported URL schemes no longer iterates through the list but instead uses a “perfect hash”. Although much faster, that was hardly a function that caused performance problems before so it is not likely to actually be measurable.
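As a rough idea of what a "perfect hash" lookup looks like, here is a small Python sketch. It is not curl's code. The scheme list, the table size and the FNV-1a style hash are all assumptions picked for the example. The build step searches for a seed that gives every scheme its own slot, so a lookup is one hash and one comparison.

```python
# Hypothetical scheme list for illustration; not curl's actual table.
SCHEMES = ["dict", "file", "ftp", "ftps", "http", "https", "imap", "smtp", "ws", "wss"]
TABLE_SIZE = 41  # deliberately sparse and not a power of two, so a seed is easy to find


def scheme_hash(name: str, seed: int) -> int:
    """Seeded FNV-1a style hash of the lowercased name, reduced to a table slot."""
    h = seed
    for ch in name.lower():
        h ^= ord(ch)
        h = (h * 16777619) & 0xFFFFFFFF
    return h % TABLE_SIZE


def build_table(schemes):
    """Search for a seed that puts every scheme in its own slot."""
    for seed in range(1, 1_000_000):
        table = [None] * TABLE_SIZE
        for scheme in schemes:
            slot = scheme_hash(scheme, seed)
            if table[slot] is not None:
                break  # collision: try the next seed
            table[slot] = scheme
        else:
            return seed, table
    raise RuntimeError("no collision-free seed found")


SEED, TABLE = build_table(SCHEMES)


def find_scheme(name: str):
    """One hash and one comparison, with no loop over the scheme list."""
    candidate = TABLE[scheme_hash(name, SEED)]
    return candidate if candidate == name.lower() else None


print(find_scheme("HTTPS"), find_scheme("gopher"))
```

The final string comparison is still needed: an unknown scheme can land in an occupied slot, so the hash alone only narrows the search to a single candidate.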

use ALPN “http/1.1” for HTTP/1.x, including HTTP/1.0

To interoperate better with legacy servers, curl now sends http/1.1 in the ALPN field even when it wants to speak HTTP/1.0. This, because http/1.1 and h2 were the original ALPN codes and some of the old servers that support ALPN from those days don’t know about http/1.0 for ALPN.
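The rule described above fits in a few lines. This Python sketch only illustrates the mapping; the function name alpn_id and the version strings are made up for the example and are not curl internals.

```python
def alpn_id(http_version: str) -> str:
    """Return the ALPN protocol ID to offer for a wanted HTTP version."""
    if http_version in ("1.0", "1.1"):
        # HTTP/1.0 is also announced as http/1.1, for old servers that
        # only know the original ALPN IDs.
        return "http/1.1"
    if http_version == "2":
        return "h2"
    raise ValueError(f"no ALPN ID in this sketch for HTTP/{http_version}")


for version in ("1.0", "1.1", "2"):
    print(version, "->", alpn_id(version))
```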

all libcurl man pages examples compile cleanly

All (almost five hundred) libcurl man pages have EXAMPLE sections showing their use. Starting now, all of those sections are test-compiled as part of the CI builds to catch mistakes.

xCurl

It is often said that Imitation is the Sincerest Form of Flattery.

Also, remember libcrurl? That was the name of the thing Google once planned to do: reimplement the libcurl API on top of their Chrome networking library. Flattery or not, it never went anywhere.

The other day I received an email asking me about details regarding something called xCurl. Not having a clue what that was, a quick search soon had me enlightened.

xCurl is, using their own words, a Microsoft Game Development Kit (GDK) compliant implementation of the libCurl API.

A Frankencurl

The article I link to above describes how xCurl differs from libcurl:

xCurl differs from libCurl in that xCurl is implemented on top of WinHttp and automatically follows all the extra Microsoft Game Development Kit (GDK) requirements and best practices. While libCurl itself doesn’t meet the security requirements for the Microsoft Game Development Kit (GDK) console platform, use xCurl to maintain your same libCurl HTTP implementation across all platforms while changing only one header include and library linkage.

I don’t know anything about WinHttp, but since that is an HTTP API I can only presume that making libcurl use that instead of plain old sockets has to mean a pretty large surgery and code change. I also have no idea what the mentioned “security requirements” might be. I’m not very familiar with Windows internals nor with their game development setups.

The article then goes on to describe in some detail exactly which libcurl options work and which don’t, and which libcurl build options were used when xCurl was built. No DoH, no proxy support, no cookies etc.

The provided functionality is certainly a very stripped down and limited version of the libcurl API. A fun detail is that they quite bluntly just link to the libcurl API documentation to describe how xCurl works. It is easy and convenient of course, and it will certainly make xCurl “forced” to stick to the libcurl behavior.

With large invasive changes of this kind we can certainly hope that the team making it has invested time and spent serious effort on additional testing, both before release and ongoing.

Source code?

I have not been able to figure out how to download xCurl in any form, and since I can’t find the source code I cannot really get a grip of exactly how much and how invasively Microsoft has patched this. They have not been in touch or communicated about this work of theirs to anyone in the curl project.

Therefore, I also cannot say which libcurl version this is based on – as there is no telling of that on the page describing xCurl.

The email that triggered me to crawl down this rabbit hole included a copyright file that seems to originate from an xCurl package, and that includes the curl license. The curl license file has the specific detail that it shows the copyright year range at the top and this file said

Copyright (c) 1996 - 2020, Daniel Stenberg, daniel@haxx.se, and many contributors, see the THANKS file.

It might indicate that they use a libcurl from a few years back. Only might, because it is quite common among users of libcurl to “forget” (sometimes I’m sure on purpose) to update this copyright range even when they otherwise upgrade the source code. This makes the year range rather weak evidence of the actual age of the libcurl code this is based on.

Updates

curl (including libcurl) ships a new version at least once every eight weeks. We merge bugfixes at a rate of around three bugfixes per day. Keeping a heavily modified libcurl version in sync with the latest curl releases is hard work.

Of course, since they deliberately limit the scope of the functionality of their clone, lots of upstream changes in curl will not affect xCurl users.

License

curl is licensed under… the curl license! It is an MIT license that I was unclever enough to slightly modify many years ago. The changes are enough for organizations such as SPDX to consider it a separate one: curl. I normally still say that curl is MIT licensed because the changes are minuscule and do not change the spirit of the license.

The curl license of course allows Microsoft or anyone else to do this kind of stunt and they don’t even have to provide the source code for their changes or the final product and they don’t have to ask or tell anyone:

Permission to use, copy, modify, and distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.

I once picked this license for curl exactly because it allows this. Sure it might sometimes then make people do things in secret that they never contribute back, and we miss out on possible improvements because of that, but I think the more important property is that no company feels scared or restricted as to when and where they can use this code. A license designed for maximum adoption.

I have always had the belief that it is our relentless update scheme and never-ending flood of bugfixes that is what will keep users wanting to use the real thing and avoid maintaining long-running external patches. There will of course always be exceptions to that.

Follow-up

Forensics done by users who installed this indicate that this xCurl is based on libcurl 7.69.x. We removed a define from the headers in 7.70.0 (CURL_VERSION_ESNI) that this package still has. It also has the CURLOPT_MAIL_RCPT_ALLLOWFAILS define, added in 7.69.0.

curl 7.69.1 was released on March 11, 2020. It has 40 known vulnerabilities, and we have logged 3,566 bugfixes since then. Of course, not all of those affect xCurl.
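The header forensics can be expressed as a small piece of logic: every define that is present narrows the range of possible versions. The Python sketch below uses only the two facts mentioned here. The table layout and the function name version_window are made up for the illustration.

```python
def parse(version: str):
    """Turn "7.69.0" into (7, 69, 0) so versions compare numerically."""
    return tuple(int(piece) for piece in version.split("."))


# symbol: (added in, removed in); None means "not stated in the text"
SYMBOLS = {
    "CURL_VERSION_ESNI": (None, "7.70.0"),
    "CURLOPT_MAIL_RCPT_ALLLOWFAILS": ("7.69.0", None),
}


def version_window(present):
    """Return (lowest possible version, first version that is too new)."""
    lowest = None  # inclusive bound
    below = None   # exclusive bound
    for name in present:
        added, removed = SYMBOLS[name]
        if added and (lowest is None or parse(added) > parse(lowest)):
            lowest = added
        if removed and (below is None or parse(removed) < parse(below)):
            below = removed
    return lowest, below


print(version_window(["CURL_VERSION_ESNI", "CURLOPT_MAIL_RCPT_ALLLOWFAILS"]))
# ('7.69.0', '7.70.0')
```

A version that is at least 7.69.0 and older than 7.70.0 leaves only the 7.69.x releases.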

“you have hacked into my devices”

I’ve shown you email examples many times before. Today I received this. I don’t know this person. Clearly a troubled individual. I suspect she found my name and address somewhere and then managed to put me somewhere in the middle of the conspiracy against her.

The entire mail is written in a single paragraph and the typos are kept as they were written. It is a little hard to penetrate, but here it is:

From: Lindsay

Thank you for making it so easy for me to see that you have hacked into 3 of my very own devices throughout the year. I’m going to be holding onto all of my finds that have your name all over it and not by me because I have absolutely no reason to hack my own belongings. I will be adding this to stuff I have already for my attorney. You won’t find anything on my brand new tablet that you all have so kindly broken into and have violated my rights but have put much stress on myself as well. Maybe if you would have came and talked to me instead of hacking everything I own and fallow me to the point of a panic attack because I suffer from PTSD I might have helped you. I cannot help what my boyfriend does and doesn’t do but one ting I was told by the bank is that they would not let me talk for him so I can’t get involved. He has had his car up for pickup for months but I’m guessing that the reason they won’t pick the car up in the street right where it has sat for months waiting is because I’ve probably see every single driver that has or had fallowed me. My stress is so terrible that when I tell him to call the bank over and over again he does and doesn’t get anywhere and because of my stress over this he gets mad and beats me or choaks me. I have no where to go at the moment and I’m not going to sleep on the streets either. So if you can kindly tell the repo truck to pick up the black suv at his dad’s house in the street the bank can give them the number it would be great so I no longer have to deal with people thinking that they know the whole story. But really I am suffering horrificly. 
I’m not a mean person but imagine not knowing anything about what’s going on with your spouse and then finding out they didn’t pay the car payment and so being embarrassed about it try to pay for it yourself and they say no I have it only to find out that he did it for a second time and his dad actually was supposed to pay the entire thing off but instead he went down hill really fast and seeing the same exact people every day everywhere you go and you tell your spouse and they don’t believe you and start calling you wicked names like mine has and then from there every time my ptsd got worse from it happening over and over again and he says you’re a liar and he’s indenial about it and because I don’t agree with him so I get punched I get choked and now an broken with absolutely no one but God on my side.how would you feel if it was being done to you and people following you and your so angry that alls you do is yell at people anymore and come off as a mean person when I am not? I don’t own his car that he surrendered I don’t pay his bills he told me to drive it and that’s it.i trusted a liar and an abuser. I need someone to help other than my mom my attorney and eventually the news if everyone wants to be cruel to me I’m going to the news for people taking my pictures stalking me naibors across the street watching and on each side of the house and the school behind. It isn’t at all what you all think it is I want someone to help get the suv picked up not stalked. How would you feel if 5 cities were watching every single move? I am the victim all the way around and not one nabor has ever really taken the time to get to know me. I’m not at all a mean person but this is not my weight to carry. I have everyone on camera and I will have street footage pulled and from each store or gestation I go to. I don’t go anywhere anymore from this and I’m the one asking for help. 
Their was one guy who was trying to help me get in touch with the tow truck guy and I haven’t seen him since and his name is Antonio. He was going to help me. I have been trying to to the right thing from the start and yet you all took pleasure in doing rotten mean things to me and laughing about it. I want one person to come help me since I can’t talk with the bank to get his suv picked up and I won’t press charges on the person that helps nor onthe tow truck guy either.

I have not replied.

URL parser performance

URLs are a dear subject of mine on this blog, as readers might have noticed.

“URL” is this mythical concept of a string that identifies a resource online and yet there is no established standard for its syntax. There are instead multiple ones out of which one is on purpose “moving” so it never actually makes up its mind but instead keeps changing.

This then leads to there being basically no two URL parsers that treat URLs the same, to the extent that mixing parsers is considered a security risk.

The standards

The browsers have established their WHATWG URL Specification as a “living document”, saying how browsers should parse URLs, gradually taking steps away from the earlier established RFC 3986 and RFC 3987 attempts.

The WHATWG standard keeps changing and the world that tries to stick to RFC 3986 still sometimes needs to adapt to WHATWG influences in order to interoperate with the browser-centric part of the web. It leaves URL parsing and “URL syntax” everywhere in a sorry state.

curl

In the curl project we decided in 2018 to help mitigate the mixed URL parser problem by adding a URL parser API so that applications that use libcurl can use the same parser for all their URL parsing needs and thus avoid the dangerous mixing part.

The libcurl API for this purpose is designed to let users parse URLs, to extract individual components, to set/change individual components and finally to extract a normalized URL if wanted. Including some URL encoding/decoding and IDN support.
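The workflow of parse, extract, change and serialize can be sketched in a few lines. Note that this Python example uses the standard library's urllib.parse, not libcurl, and it is exactly the kind of different parser this post warns about mixing in. It is here only to show the shape of such an API, and the URL is made up.

```python
from urllib.parse import urlsplit

# Parse the URL once.
parts = urlsplit("https://user@example.com:8080/path?q=1#frag")

# Extract individual components.
print(parts.scheme, parts.hostname, parts.port, parts.path, parts.query)

# Change one component and serialize the URL again.
updated = parts._replace(query="q=2").geturl()
print(updated)  # https://user@example.com:8080/path?q=2#frag
```

In libcurl the corresponding steps go through a single URL handle, so every part of an application can share the same parser.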

trurl

Thanks to the availability and functionality of the public libcurl URL API, we could build and ship the separate trurl tool earlier this year.

Ada

Some time ago I was made aware of an effort to (primarily) write a new URL parser for node js – although the parser is stand-alone and can be used by anyone else who wants to: The Ada URL Parser. The two primary developers behind this effort, Yagiz Nizipli and Daniel Lemire figured out that node does a large amount of URL parsing so by speeding up this parser alone it would apparently have a general performance impact.

Ada is a C++ project designed to parse WHATWG URLs and the first time I was in contact with Yagiz he of course mentioned how much faster their parser is compared to curl’s.

You can also see them reproduce and talk about these numbers in this node js conference presentation.

Benchmarks

Everyone who ever tried to write code faster than some other code has found themselves in a position where they need to compare. To benchmark one code set against the other. Benchmarking is an art that is close to statistics and marketing: very hard to do without letting your own biases or presumptions affect the outcome.

Speed vs the rest

After I first spoke with Yagiz, I did go back to the libcurl code to see what obvious mistakes I had made and what low hanging fruit there was to pick in order to speed things up a little. I found a few flaws that maybe made a minor difference, but in my view there are several other properties of the API that are actually more important than sheer speed:

  • non-breaking API and ABI
  • readable and maintainable code
  • sensible and consistent API
  • error codes that help users understand what the problem is

Of course, there is also the thing that if you first figure out how to parse a URL the fastest way, maybe you can work out a smoother API that works better with that parsing approach. That’s not how I went about it when creating the libcurl API.

If we can maintain those properties mentioned above, I still want the parser to run as fast as possible. There is no point in being slower than necessary.

URLs vs URLs

Ada parses WHATWG URLs and libcurl parses RFC 3986 URLs. They parse URLs differently and provide different feature sets. They are not interchangeable.

In Ada’s benchmarks they have ignored the parser differences. Throw the parsers against each other, and according to all their public data since early 2023 their parser is 7-8 times faster.

700% faster really?

So how on earth can you make such a simple thing as URL parsing 700% faster? It never sat right with me when they claimed those numbers but since I had not compared them myself I trusted them. After all, they should be fairly easy to compare and they seemed clueful enough.

Until recently when I decided to reproduce their claims and see how much their numbers depend on their specific choices of URLs to parse. It taught me something.

Reproduce the numbers

In my tests, their parser is fast. It is clearly faster than the libcurl parser, and I too of course ignored the parser results since they would not be comparable anyway.

In my tests on my development machine, Ada is 1.25 – 1.8 times faster than libcurl. There is no doubt Ada is faster, just far away from the enormous difference they claim. How come?

  1. You use the input data that most favorably shows a difference
  2. You run the benchmark on a hardware for which your parser has magic hardware acceleration

I run a decently modern 13th gen Intel Core i7-13700K CPU in my development machine. It’s really fast, especially on single-thread stuff like this. On my machine, the Ada parser can parse more URLs/second than even the Ada people themselves claim, which just tells us they used slower machines to test on. Nothing wrong with that.

The Ada parser has code that uses platform-specific instructions in some environments, and the benchmark they chose to use when boasting about their parser was run on such a platform: an Apple M1 CPU, to be specific. In most aspects except performance per watt, not a speed monster CPU.

In itself this is not wrong, but maybe a little misleading as this is far from clearly communicated.

I have a script, urlgen, that generates URLs in as many combinations as possible so that the parser’s every corner and angle are suitably exercised and verified. Many of those combinations are therefore illegal in subtle ways. This is the set of URLs I have thrown at the curl parser mostly, which then also might explain why this test data is the set that makes Ada least favorable (at 1.26 x the libcurl speed). Again: their parser is faster, no doubt. I have not found a test case that does not show it running faster than libcurl’s parser.

A small part of the explanation of how they are faster is of course that they do not provide the result, the individual components, in their own separately allocated strings.

Here’s a separate detailed document describing how I compared.

More mistakes

They also repeatedly insist curl does not handle International Domain Names (IDN) correctly, which I simply cannot understand and I have not got any explanation for. curl has handled IDN since 2004. I’m guessing a mistake, an old bug or that they used a curl build without IDN support.

Size

I would think a primary argument against using Ada vs libcurl’s parser is its size and code. Not that I believe that there are many situations where users are actually selecting between these two.

Ada header and source files are 22,774 lines of C++.

libcurl URL API header and source files are 2,103 lines of C.

Comparing the code sizes like this is a little unfair since Ada has its own IDN management code included, which libcurl does not, and that part comes with several huge tables and more.

Improving libcurl?

I am sure there is more that can be done to speed up the libcurl URL parser, but there is also the case of diminishing returns. I think it is pretty fast already. On Ada’s test case using 100K URLs from Wikipedia, libcurl parses them at an average of 178 nanoseconds per URL on my machine. More than 5.6 million real world URLs parsed per second per core.

This, while also storing each URL component in a separate allocation after each parse, and also returning an error code that helps identify the problem if the URL fails to parse. With an established and well-documented API that has been working since 2018.
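For anyone who wants to check the throughput figure, converting the 178 nanoseconds per URL into URLs per second is a single division, here written as Python:

```python
NS_PER_URL = 178            # average parse time quoted above
NS_PER_SECOND = 1_000_000_000

urls_per_second = NS_PER_SECOND / NS_PER_URL
print(f"{urls_per_second / 1e6:.1f} million URLs per second")  # 5.6 million URLs per second
```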

The hardware specific magic Ada uses can possibly be used by libcurl too. Maybe someone can try that out one day.

I think we have other areas in libcurl where work and effort are better spent right now.

curl on 100 operating systems

In a recent pull-request for curl, I clarified to the contributor that their change would only be accepted and merged into curl’s git code repository if they made sure that the change was done in a way so that it did not break (testing) for and on legacy platforms.

In that thread, I could almost feel how the contributor squirmed as this requirement made their work harder. Not by much, but harder no less.

I insisted that since curl at that point already supported (and still supports) 32 bit time_t types, changes in this area should maintain that functionality. Even if 32 bit time_t is of limited use already and will be even more limited as we rush toward the year 2038. Quite a large number of legacy platforms are still stuck on the 32 bit version.
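For readers who have not run into the 2038 problem: a signed 32 bit time_t counts seconds since 1970-01-01 UTC and runs out of bits in January 2038. This short Python sketch shows the last representable moment and what a wrapped counter turns into. The helper name wrap32 is made up for the example.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
TIME_T_MAX_32 = 2**31 - 1  # largest value of a signed 32 bit time_t


def wrap32(seconds: int) -> int:
    """Simulate storing a second count in a signed 32 bit integer."""
    return (seconds + 2**31) % 2**32 - 2**31


last = EPOCH + timedelta(seconds=TIME_T_MAX_32)
wrapped = EPOCH + timedelta(seconds=wrap32(TIME_T_MAX_32 + 1))

print(last.isoformat())     # 2038-01-19T03:14:07+00:00
print(wrapped.isoformat())  # 1901-12-13T20:45:52+00:00
```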

Why do I care so much about old legacy crap?

Nobody asked me exactly that using those words. I am paraphrasing what I suspect some contributors think at times when I ask them to do additional changes to pull requests. To make their changes complete.

It is not so much about the legacy systems. It is much more about sticking to our promises and not breaking things if we don’t have to.

Partly stability and promises

In the curl project we work relentlessly to maintain ABI and API stability and compatibility. You can upgrade your libcurl-using application from the mid 2000s to the latest libcurl – without recompiling the application – and it still works the same. You can run your unmodified scripts you wrote in the early 2000s with the latest curl release today – and it is almost guaranteed that it works exactly the same way as it did back then.

This is more than a party trick and a snappy line to use in the sales brochures.

This is the very core of curl and libcurl and a foundational principle of what we ship: you can trust us. You can lean on us. Your application’s Internet transfer needs are in safe hands and you can be sure that even if we occasionally ship bugs, we provide updates that you can switch over to without the normal kinds of upgrade pains software so often comes with. In a never-ending fashion.

Also of course. Why break something that is already working fine?

Partly user numbers don’t matter

Users do matter, but what I mean in this subtitle is that the number of users on a particular platform is rarely a reason or motivator for working on supporting it and making things work there. That is not how things tend to work.

What matters is who is doing the work and if the work is getting done. If we have contributors around that keep making sure curl works on a certain platform, then curl will keep running on that platform even if it is said to have very few users. Those users don’t maintain the curl code. Maintainers do.

A platform does not truly die in curl land until necessary code for it is no longer maintained – and in many cases the unmaintained code can remain functional for years. It might also take a long time until we actually find out that curl no longer works on a particular platform.

On the opposite side it can be hard to maintain a platform even if it has a large number of users if there are not enough maintainers around who are willing and knowledgeable to work on issues specific to that platform.

Partly this is how curl can be everywhere

That we keep this strong focus on building, working and running everywhere, even sometimes with rather funny and weird configurations, is an explanation for how curl and libcurl have ended up in so many different operating systems, run on so many CPU architectures and are installed in so many things. We make sure it builds and runs. And keeps doing so.

And really. Countless users and companies insist on sticking to ancient, niche or legacy platforms and there is nothing we can do about that. If we don’t have to break functionality for them, having them stick to relying on curl for transfers is oftentimes much better security-wise than almost all other (often homegrown) alternatives.

We still deprecate things

In spite of the fancy words I just used above, we do remove support for things every now and then in curl. Mostly in terms of dropping support for specific 3rd party libraries as they dwindle away and fall off like leaves in the fall, but also in other areas.

The key is to deprecate things slowly, with care and with an open communication. This ensures that everyone (who wants to know) is aware that it is happening and can prepare, or object if the proposal seems unreasonable.

If no user can detect a changed behavior, then it is not changed.

curl is made for its users. If users want it to keep doing something, then it shall do so.

The world changes

Internet protocols and versions come and go over time.

If you bring up your curl command lines from 2002, most of them probably fail to work. Not because of curl, but because the host names and the URLs used back then no longer work.

A huge reason why a curl command line written in 2002 will not work today exactly as it was written back then is the transition from HTTP to HTTPS that has happened since then. If the site actually used TLS (or SSL) back in 2002 (which certainly was not the norm), it used a TLS protocol version that nowadays is deemed insecure and modern TLS libraries (and curl) will refuse to connect to it if it has not been updated.

That is also the reason that if you actually have a saved curl executable from 2002 somewhere and manage to run that today, it will fail to connect to modern HTTPS sites. Because of changes in the transport protocol layers, not because of changes in curl.

Credits

Top image by Sepp from Pixabay

Discussion

Hacker news
