The libcurl transfer state machine

I’ve worked hard on the presentation I ended up calling libcurl under the hood. A part of that presentation is spent on explaining the main libcurl transfer state machine, and here I’ll try to document some of that in written form. Understanding the main transfer state machine in libcurl can be valuable and interesting for anyone who wants to work on libcurl internals and maybe improve it.

Background

The state is kept in the easy handle, in the struct field called mstate. The source file for this state machine is called multi.c.

An easy handle is always in exactly one of these states for as long as it exists.
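
For orientation, this is roughly what that set of states looks like. The exact names and the full list have varied between libcurl versions, so treat this as a simplified sketch of what multi.c keeps, not a verbatim copy:

/* a simplified sketch of the states kept in the easy handle's
   mstate field (names approximate, several states omitted) */
typedef enum {
  MSTATE_INIT,         /* transfer is starting */
  MSTATE_PENDING,      /* waiting for a connection slot */
  MSTATE_CONNECT,      /* pick or create a connection */
  MSTATE_RESOLVING,    /* name resolving in progress */
  MSTATE_CONNECTING,   /* connecting to the server */
  MSTATE_PROTOCONNECT, /* protocol connect/login in progress */
  MSTATE_DO,           /* issue the request or commands */
  MSTATE_PERFORMING,   /* data is flowing */
  MSTATE_RATELIMITING, /* waiting: transfer went too fast */
  MSTATE_DONE,         /* wrap up the transfer */
  MSTATE_COMPLETED,    /* transfer is over */
  MSTATE_MSGSENT       /* result message has been queued */
} CURLMstate;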

This transfer state machine is designed to work for all protocols libcurl supports, but basically no protocol will transition through all states. As you can see in the drawing, there are many different possible transitions from a lot of the states.

libcurl transfer state machine


Start

A transfer starts up there above the surface in the INIT state. That’s the yellow box next to the little start button. The boat shows how it goes from INIT over to MSGSENT on the right, with its finish flag, but the real path is all done under the surface.

The yellow boxes (states) are the ones that exist before or while a connection is set up. The striped background marks all states that have a single and specific connectdata struct associated with the transfer.

CONNECT

If there’s a connection limit, either in total or per host etc, the transfer can get sent to the PENDING state to wait for conditions to change. If not, the state probably moves on to one of the blue ones to resolve host name and connect to the server etc. If a connection could be reused, it can shortcut immediately over to the green DO state.

The blue states are all about bringing the connection to a state of fully connected, authenticated and logged in. Ready to send the first request.

DO

The green DO states are all about sending the request with one or more commands so that the file transfer can begin. There are several such states to properly support all protocols, but also for historical reasons. We could probably remove one state there with some clever reorganization if we wanted.

PERFORMING

When a request has been issued and the transfer starts, it transitions over to PERFORMING. In the white states, data is flowing. Potentially a lot of it. Potentially in either or both directions. If curl finds out during the transfer that it is going faster than allowed, it moves into RATELIMITING until things have cooled down a bit.
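
An application opts in to that speed cap itself. A minimal sketch, assuming an already created easy handle named curl and an arbitrary example limit:

/* cap the download speed; going faster than this moves the
   transfer into the RATELIMITING state until it slows down */
curl_easy_setopt(curl, CURLOPT_MAX_RECV_SPEED_LARGE,
                 (curl_off_t)10000); /* bytes per second */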

DONE

All the post-transfer states are red in the picture. DONE is the first of them, and after having done what it needs to wrap up the transfer, the handle disassociates from the connection and moves to COMPLETED. There are no stripes behind that state. Disassociating here means that the connection is returned to the connection pool for later reuse, or, in the worst case, closed – if it is deemed unfit for reuse or if the application has instructed it so.

As you’ll note, there’s no disconnect anywhere in the state machine. This is simply because the disconnect is not really a part of the transfer at all.

COMPLETED

This is the end of the road. In this state a message is created and put in the outgoing queue for the application to read, and then, as a final step, the handle moves over to MSGSENT where nothing more happens.

A typical handle remains in that state until the transfer is reused and restarted, at which point it is set back to the INIT state again and the journey begins anew. Possibly with other transfer parameters and another URL this time. Or perhaps not.

State machines within each state

What this state diagram and explanation don’t show is, of course, that each of these states can involve protocol specific handling, and each of those functions may in turn have its own state machine to control what to do and how to handle the protocol details.

Each protocol in libcurl has its own “protocol handler” and most of the protocol specific work in libcurl is done when the generic parts call into the protocol specific parts, with calls like protocol_handler->proto_connect() invoking the protocol specific connection procedure.
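
As a rough illustration, such a handler is essentially a struct of function pointers that the generic state machine invokes in the right states. This is a simplified sketch with approximate names; the real struct in libcurl’s source has many more members:

#include <stdbool.h>
#include <curl/curl.h> /* for CURLcode */

struct Curl_easy; /* libcurl's per-transfer (easy handle) struct */

/* a simplified sketch of a protocol handler "vtable" */
struct protocol_handler {
  const char *scheme; /* "http", "ftp", "imap", ... */
  CURLcode (*proto_connect)(struct Curl_easy *data, bool *done);
  CURLcode (*do_it)(struct Curl_easy *data, bool *done);
  CURLcode (*done)(struct Curl_easy *data, CURLcode result);
};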

This allows the generic state machine described in this blog post to stay unaware of the protocol specifics, and yet all of the 26 currently supported transfer protocols work with it.

libcurl under the hood – the video

Here’s the full video of libcurl under the hood.

If you want to skip directly to the state machine diagram and the following explanation, go here.

Credits

Image by doria150 from Pixabay

curl up 2021

curl up 2021 happened today.

We had five presentations, all prerecorded and made available before the event. On the Sunday afternoon we gathered to discuss the presentations and everything around those topics.

The presentations

  1. The state of curl 2021 – Daniel Stenberg
  2. curl security 2021 – Daniel Stenberg
  3. libcurl under the hood – Daniel Stenberg
  4. Interfacing rust – Stefan Eissing
  5. Curl profiling – Jim Fuller

Discussions

We were not very many who actually joined the meeting, and out of the people in the meeting a majority decided to be spectators only and remained muted with their cameras off.

It turned out as a two hour long mostly casual talk among me, Stefan Eissing and Emil Engler about the presentations and related topics. Toward the end, Kamil Dudka appeared.

The three of us got to talk about roadmap items, tests, security, writing code that interfaces with modules written in rust, and which details of the libcurl internals could use further description and documentation.

The video

The agenda in the video roughly follows the agenda order on the 2021 wiki page and the discussion topics mentioned there.

Sponsored

Thanks to wolfSSL for sponsoring the video meeting account used!

curl pictures

“Memes” or other fun images involving curl. Please send or direct me to other ones you think belong in this collection! Kept here solely to boost my ego.

All modern digital infrastructure

This is the famous xkcd strip number 2347, modified to say Sweden and 1997 by @tsjost. I’ve seen this picture taking some “extra rounds” in various places, somehow also being claimed to be xkcd 2347 when people haven’t paid attention to the “patch” in the text.

Entire web infrastructure

Image by @matthiasendler

Car contract

This photo of a rental car contract with an error message on the printed paper was given to me by a good person I’ve unfortunately lost track of.

The developer dice

Thanks to Cassidy. (For purchase here.)

Don’t use -X

Remember that using curl -X is very often just the wrong thing to do. Jonas Forsberg helps us remember:

The curl

In an email from NASA that I received and shared, the person asked about details for “the curl”.

Image by eichkat3r at mastodon.

You’re sure this is safe?

Piping curl output straight into a shell is a much debated practice…

Picture by Tim Chase.

curl, reinvented by…

Remember the powershell curl alias?

Picture by Shashimal Senarath.

Billboard

This is an old classic:

curl please

Related

Screenshotted curl credits.

curl is just the hobby

Every base is base 10

This image originally comes from cowbirdsinlove.com but sadly it seems the page that once showed it is no longer there. I saved it from that site already back in 2015, but I cannot recall the exact URL it used. The image is still available at https://cowbirdsinlove.com/comics/base10[1].png.

Since I consider this picture such an iconic classic and masterpiece, I decided I better host it here in a small attempt to preserve it for everyone to enjoy.

Because, you know, every base is base 10.

Update: the original page on archive.org.

fixed vulnerabilities were once created

In the curl project we make great efforts to store a lot of meta data about each and every vulnerability that we have fixed over the years – and curl is over 23 years old. This data set includes CVE id, first vulnerable version, last vulnerable version, name, announce date, report to the project date, CWE, reward amount, code area and “C mistake kind”.

We also keep detailed data about releases, making it easy to look up for example release dates for specific versions.

Dashboard

All this, combined with my fascination with graphs (some would call it obsession), is what pushed me into creating the curl project dashboard, with an ever-growing number of daily updated graphs showing various data about the curl project in visual ways. (All scripts for that are of course also freely available.)

What to show is interesting but of course it is sometimes even more important how to show particular data. I don’t want the graphs just to show off the project. I want the graphs to help us view the data and make it possible for us to draw conclusions based on what the data tells us.

Vulnerabilities

The worst bugs possible in a project are the ones that are found to be security vulnerabilities. Those are the kind we want to work really hard to never introduce – but we basically cannot reach that point. This special status makes us focus a lot on these particular flaws and we of course treat them special.

For a while we’ve had two particular vulnerability graphs in the dashboard. One showed the number of fixed issues over time and another one showed how long each reported vulnerability had existed in released source code until a fix for it shipped.

CVE age in code until report

The CVE age in code until report graph shows that, in general, reported vulnerabilities were introduced into the code base many years before they are found and fixed. In fact, the all-time average suggests they are present for more than 2,700 days – more than seven years. Looking at the reports from the last 12 months only, the average is almost 1,000 days more!

It takes a very long time for vulnerabilities to get found and reported.

When were the vulnerabilities introduced

Just the other day it struck me that even though I already had a lot of graphs showing in the dashboard, there was none that showed, in any nice way, at what dates we created the vulnerabilities we spend so much time and effort hunting down, documenting and talking about.

I decided to use the meta data we already have and add a second plot line to the already existing graph. Now we have the previous line (shown in green) that shows the number of fixed vulnerabilities bumped at the date when a fix was released.

Added is the new line (in red) that instead is bumped for every date we know a vulnerability was first shipped in a release. We know the version number from the vulnerability meta data, we know the release date of that version from the release meta data.

This all new graph helps us see that out of the current 100 reported vulnerabilities, half of them were introduced into the code before 2010.

Using this graph, it is also very clear to me that the increased CVE reporting we can spot in the green line, which started to accelerate in 2016, was not because the bugs were introduced then. The creation of vulnerabilities rather seems to be fairly evenly distributed over time – with occasional bumps, but I think those are more related to particular releases that introduced a larger amount of features and code.

As the average vulnerability takes 2,700 days to get reported, this could indicate that flaws that landed since 2014 are simply too young to have been reported yet. Or it could mean that we’ve improved over time so that new code is better than old, and thus when we find flaws, they’re more likely to be in old code paths… I don’t think the red graph suggests any particularly notable improvement over time though. Possibly it does if we take into account the massive code growth we’ve also had over this period.

The green “fixed” line at least has a much better trend and growth angle.

Present in which releases

As we have the range of vulnerable releases stored in the meta data file for each CVE, we can add up the number of flaws present in every past release.

Together with the release dates of the versions, we can then graph the number of reported vulnerabilities present in each past release over time.

You can see that some labels end up overwriting each other somewhat for the occasions when we’ve done two releases very close in time.


Please select your TLS

tldr: starting now, you need to select which TLS to use when you run curl’s configure script.

How it started

In June 1998, three months after the first release of curl, we added support for HTTPS. We decided to use an external library for this purpose – for providing SSL support – and with that, the first dependency was added. The build would optionally use SSLeay: if you wanted HTTPS support enabled, we would use that external library.

SSLeay ended development at the end of that same year, and OpenSSL rose as a new project and library from its ashes. Later, of course, the term “SSL” would also be replaced by “TLS”, but the world has kept using the two interchangeably.

Building curl

The initial configure script we wrote and provided back then (it appeared for the first time in November 1998) would look for OpenSSL and use it if found.

In the spring of 2005, we merged support for an alternative TLS library, GnuTLS, and now you would have to tell the configure script to not use OpenSSL but instead use GnuTLS if you wanted that in your build. That was the humble beginning of the explosion of TLS libraries supported by curl.

As time went on we added support for more and more TLS libraries, giving the users the choice to select exactly which particular one they wanted their curl build to use. At the time of this writing, we support 14 different TLS libraries.

TLS backends supported in curl, over time

OpenSSL was still default

The original logic from when we added GnuTLS back in 2005 was however still kept, so whichever library you wanted to use, you would have to tell configure to not use OpenSSL and to instead use your preferred library.

Also, since the default configure script would try to find and use OpenSSL, it could result in surprises: for users who maybe didn’t want TLS in their build at all, or when something was just not correctly set up and configure unexpectedly didn’t find OpenSSL, the build would go on and be made completely without TLS support! Sometimes even without getting noticed for a long time.

Not doing it anymore

Starting now, curl’s configure will not select TLS backend by default.

It will not decide for you which one to use, as there are many considerations involved when selecting a TLS backend and there are many users who prefer something other than OpenSSL. We will no longer give that library any special treatment at build time. We will not impose our bias onto others anymore.

Not selecting any TLS backend at all will just make configure exit quickly with a help message prompting you to make a decision. Note that going completely without a TLS library is still fine, but it similarly requires an active decision (--without-ssl).

The list of available TLS backends is sorted alphabetically.
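
For example, picking a backend (or none) explicitly looks like this; check configure --help in your curl version for the exact set of options:

./configure --with-openssl   # use OpenSSL (or one of its forks)
./configure --with-gnutls    # use GnuTLS
./configure --without-ssl    # explicitly build without TLS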

Effect on configure users

With this change, every configure invocation needs to clearly state which TLS library – or even libraries, in plural, since curl supports building with more than one – to use.

The biggest change is of course for everyone who invokes configure and wants to build with OpenSSL since they now need to add an option and say that explicitly. For virtually everyone else life can just go on like before.

Everyone who builds curl automatically from source code might need to update their build scripts.

The first release shipping with this change will be curl 7.77.0.

Credits

Image by Free-Photos from Pixabay

“So what exactly is curl?”

You know the question. It can get asked casually by a person you’ve never met before, or by someone you’ve known for a long time but haven’t really talked to about this before. Perhaps at a social event. Perhaps at a family dinner.

– So what do you do?

The implication is of course what you work with. Or as. Perhaps a title.

Software Engineer

In my case I typically start out by saying I’m a software engineer. (And no, I don’t use a title.)

If the person who asked is a non-techie, this can then take off in basically any direction: from questions about the Internet, to how their printer acts up sometimes, to finicky details about Wifi installations or their parents’ problems installing anti-virus. In other words: into areas that have virtually nothing to do with software engineering but are related to computers.

If the person is somewhat knowledgeable or interested in technology or computers they know both what software and engineering are. Then the question can get deepened.

What kind of software?

Alternatively they ask for what company I work for, but it usually ends up on the same point anyway, just via this extra step.

I work on curl. (Saying I work for wolfSSL rarely helps.)

Business cards of mine

So what is curl?

curl is a command line tool used by a small set of people (possibly several thousand or even millions), plus the library libcurl that is installed in billions of places.

I often try to compare libcurl with how companies build for example cars out of many components from different manufacturers and companies. They use different pieces from many separate sources put together into a single machine to produce the end product.

libcurl is like one of those little components that a car manufacturer needs. It isn’t the only choice, but it is a well known, well tested and familiar one. It’s a safe choice.

Internet what?

I’ve realized that lots of people, even many with experience, knowledge or jobs in the IT industry, don’t know what an Internet transfer is. Me describing curl as doing such doesn’t really help in those cases.

An Internet transfer is the bridge between “the cloud” and your devices or applications. curl is a bridge.

Everything wants Internet these days

In general, anything today that has power moves toward becoming networked. Everything that can connect to the Internet will do so, sooner or later. Maybe not always because it’s a good idea, but because it gives your thing a (perceived) advantage over your competitors.

Things you wouldn’t have dreamed would do that a while ago now do Internet transfers. Toothbrushes, ovens, washing machines etc.

If you want to build a new device or application today and you want it to be successful and more popular than your competitors, you will probably have to make it Internet-connected.

You need a “bridge”.

Making things today is like doing a puzzle

Everyone who makes devices or applications today has a wide variety of different components and pieces of the big “puzzle” to select from.

You can opt to write many pieces yourself, but virtually nobody today creates anything digital entirely on their own. We lean on others. We stand on others’ shoulders. Open source software in particular has grown up to provide a vast ocean of puzzle pieces to use and leverage.

One of the little pieces in your device puzzle is probably Internet transfers, because you want your thing to get updates, upload telemetry and who knows what else.

The picture then needs a piece inserted in the right spot to get complete. The Internet transfers piece. That piece can be curl. We’ve made curl to be a good such piece.

This perfect picture is just missing one little piece…

Relying on pieces provided by others

Lots have been said about the fact that companies, organizations and entire ecosystems rely on pieces and components written, maintained and provided by someone else. Some of them are open source components written by developers in their spare time, yet they are used by thousands of companies shipping commercial products.

curl is one such component. It’s not “just” a spare time project anymore of course, but the point remains. We estimate that curl runs in some ten billion installations these days, so quite a lot of current Internet infrastructure uses our little puzzle piece in their pictures.

Modified version of the original xkcd 2347 comic

So you’re rich

I rarely get to this point in any conversation because I would have already bored my company into a coma by now.

The concept of giving away a component like this as open source under a liberal license is a very strange concept to most people. Maybe also because I say that I work on this and I created it – yet I’m not at all the only contributor, and we wouldn’t have gotten to this point without the help of several hundred other developers.

“- No, I give it away for free. Yes really, entirely and totally free for anyone and everyone to use. Correct, even the largest and richest mega-corporations of the world.”

The ten billion installations work as marketing, getting companies to understand that curl is a solid puzzle piece so that more will use it – and some of those will end up discovering they need help or assistance and purchase support for curl from me!

I’m not rich, but I do perfectly fine. I consider myself very lucky and fortunate to get to work on curl for a living.

A curl world

There are about 5 billion Internet using humans in the world. There are about 10 billion curl installations.

The puzzle piece curl is there in the middle.

This is how they’re connected. This is the curl world map 2021.

Or put briefly

libcurl is a library for doing transfers specified with a URL, using one of the supported protocols. It is fast, reliable, very portable, well documented and feature rich. A de-facto standard API available for everyone.
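
The canonical minimal use of that API looks something like this (error handling trimmed for brevity):

#include <curl/curl.h>

int main(void)
{
  CURL *curl = curl_easy_init();
  if(curl) {
    curl_easy_setopt(curl, CURLOPT_URL, "https://example.com/");
    curl_easy_perform(curl); /* runs the whole transfer */
    curl_easy_cleanup(curl);
  }
  return 0;
}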

Credits

The original island image is by Julius Silver from Pixabay. xkcd strip edits were done by @tsjost.

Mars 2020 Helicopter Contributor

Friends of mine know that I’ve tried for a long time to get confirmation that curl is used in space. We’ve believed it to be likely but I’ve wanted to get a clear confirmation that this is indeed the fact.

Today GitHub posted their article about open source in the Mars mission, and they now provide a badge on their site for contributors of projects that are used in that mission.

I have one of those badges now. Only a few others of the current 879 recorded curl authors got it, which seems to be because the mission used a very old curl release (curl 7.19, released in September 2008) and not all contributors could be matched with email addresses, or the authors didn’t have their emails verified on GitHub, etc.

According to that GitHub blog post, we are “almost 12,000” developers who got it.

While this strictly speaking doesn’t say that curl is actually used in space, I think it can probably be assumed to be.

Here’s the interplanetary curl development displayed in a single graph:

See also: screenshotted curl credits and curl supports NASA.

Credits

Image by Aynur Zakirov from Pixabay

curl those funny IPv4 addresses

Everyone knows that on most systems you can specify IPv4 addresses as just 4 decimal numbers separated with periods (dots). Example:

192.168.0.1

Useful when you for example want to ping your local wifi router or similar: “ping 192.168.0.1”

Other bases

The IPv4 string is usually parsed by the inet_addr() function, or at times passed straight into a name resolver function like getaddrinfo().

This address parser supports more ways to specify the address. You can for example specify each number using either octal or hexadecimal.

Write the numbers with zero-prefixes to have them interpreted as octal numbers:

0300.0250.0.01

Write them with 0x-prefixes to specify them in hexadecimal:

0xc0.0xa8.0x00.0x01

You will find that ping can deal with all of these.
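
You can also verify it in a few lines of C, since the parsing is done by the same libc functions. A minimal sketch:

#include <stdio.h>
#include <arpa/inet.h>

int main(void)
{
  const char *forms[] = { "192.168.0.1", "0300.0250.0.01",
                          "0xc0.0xa8.0x00.0x01" };
  struct in_addr addr;
  int i;
  for(i = 0; i < 3; i++)
    if(inet_aton(forms[i], &addr)) /* accepts all these bases */
      printf("%-20s => %s\n", forms[i], inet_ntoa(addr));
  return 0;
}

All three lines print “=> 192.168.0.1”.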

As a 32 bit number

An IPv4 address is a 32 bit number that, when written as 4 separate numbers, is split into 4 parts with 8 bits represented by each number. Each separate number in “a.b.c.d” is the 8 bits that combined make up the whole 32 bits. Sometimes the four parts are called quads.

The typical IPv4 address parser however handles more than just the 4-way split. It can also deal with the address specified as one, two or three numbers (separated with dots, unless it’s just one).

If given as a single number, it is treated as one unsigned 32 bit number. The top-most eight bits store what we “normally” write as the first number, and so on. The address shown above, if we keep it in hexadecimal, would then become:

0xc0a80001

And you can of course write it in octal as well:

030052000001

and plain old decimal:

3232235521
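
The arithmetic behind that decimal form is straightforward:

192 × 2^24 + 168 × 2^16 + 0 × 2^8 + 1 = 3221225472 + 11010048 + 0 + 1 = 3232235521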

As two numbers

If you instead write the IP address as two numbers with a dot in between, the first number is assumed to be 8 bits and the next one 24 bits. And you can keep on mixing the bases as you like. The same address again, now in a hexadecimal + octal combo:

0xc0.052000001

This allows for some fun shortcuts when the 24 bit number contains a lot of zeroes. Like you can shorten “127.0.0.1” to just “127.1” and it still works and is perfectly legal.
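
Checking the math on that combo: 0xc0 is 192, and octal 052000001 is 11010049, which split over 24 bits is 168 × 2^16 + 0 × 2^8 + 1, i.e. 168.0.1 – so the whole thing is 192.168.0.1 again. The “127.1” shortcut works the same way: 127 × 2^24 + 1 = 2130706433, the single-number form of 127.0.0.1.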

As three numbers

Now the parts are split up in bits like this: 8.8.16. Here’s the example address again in hex, octal and decimal:

0xc0.0250.1

Bypassing filters

All of these versions shown above work with most tools that accept IPv4 addresses and sometimes you can bypass filters and protection systems by switching to another format so that you don’t match the filters. It has previously caused problems in node and perl packages and I’m guessing numerous others. It’s a feature that is often forgotten, ignored or just not known.

It begs the question why this very liberal support was once added and allowed, but I’ve not been able to figure that out – maybe it has to do with how addresses matched class A/B/C networks. The support for this syntax seems to have been introduced with the inet_aton() function in the 4.2BSD release in 1983.

IPv4 in URLs

URLs have a host name in them and it can be specified as an IPv4 address.

RFC 3986

The RFC 3986 URL specification says in section 3.2.2 that an IPv4 address must be specified as:

dec-octet "." dec-octet "." dec-octet "." dec-octet

… but in reality very few clients that accept such URLs actually restrict the addresses to that format. I believe mostly because many programs will pass on the host name to a name resolving function that itself will handle the other formats.

The WHATWG URL Spec

The Host Parsing section of this spec allows the many variations of IPv4 addresses. (If you’re anything like me, you might need to read that spec section about three times or so before that’s clear).

Since the browsers all obey this spec, there’s no surprise that they all allow these kinds of IP numbers in the URLs they handle.

curl before

curl has traditionally been in the camp that mostly accidentally somewhat supported the “flexible” IPv4 address formats. It did this because if you built curl to use the system resolver functions (which it does by default), those system functions handle these formats for curl. If curl was built to use c-ares (one of curl’s optional name resolver backends), using such address formats just made the transfer fail.

The drawback with allowing the system resolver functions to deal with the formats is that curl itself then works with the original formatted host name so things like HTTPS server certificate verification and sending Host: headers in HTTP don’t really work the way you’d want.

curl now

Starting in curl 7.77.0 (since this commit), curl will “natively” understand these IPv4 formats and normalize them itself.

There are several benefits of doing this ourselves:

  1. Applications using the URL API will get the normalized host name out (see the sketch below)
  2. curl will work the same independently of the selected name resolver backend
  3. HTTPS works fine even when the address uses one of the other formats
  4. HTTP virtual host headers get the “correct” formatted host name
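
To illustrate the first point, here is a minimal sketch using the URL API; the normalization requires curl 7.77.0 or later:

#include <stdio.h>
#include <curl/curl.h>

int main(void)
{
  CURLU *h = curl_url();
  char *host = NULL;
  /* hand the URL API a hexadecimal-quad IPv4 host */
  curl_url_set(h, CURLUPART_URL, "http://0xc0.0xa8.0x00.0x01/", 0);
  curl_url_get(h, CURLUPART_HOST, &host, 0);
  printf("%s\n", host); /* expected output: 192.168.0.1 */
  curl_free(host);
  curl_url_cleanup(h);
  return 0;
}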

Fun example command line to see if it works:

curl -L 16843009

16843009 gets normalized to 1.1.1.1, which then gets used as http://1.1.1.1 (because curl assumes HTTP for this URL when no scheme is given), which returns a 301 redirect over to https://1.1.1.1/, which -L makes curl follow…

Credits

Image from Pixabay

curl 7.76.1 – h2 works again

I’m happy to once again present a new curl release to the world. This time we decided to cut the release cycle short and do a quick patch release only two weeks since the previous release. The primary reason was the rather annoying and embarrassing HTTP/2 bug. See below for all the details.

Release presentation

Numbers

the 199th release
0 changes
14 days (total: 8,426)

21 bug-fixes (total: 6,833)
30 commits (total: 27,008)
0 new public libcurl function (total: 85)
0 new curl_easy_setopt() option (total: 288)

0 new curl command line option (total: 240)
23 contributors, 10 new (total: 2,366)
14 authors, 6 new (total: 878)
0 security fixes (total: 100)
0 USD paid in Bug Bounties (total: 5,200 USD)

Bug-fixes

This was a very short cycle but we still managed to merge a few interesting fixes. Here are some:

HTTP/2 selection over HTTPS

This regression is the main reason for this patch release. I fixed an issue just before 7.76.0 was released, and due to a lack of covering tests with other TLS backends, nobody noticed that my fix also broke HTTP/2 selection over HTTPS when curl was built to use one of GnuTLS, BearSSL, mbedTLS, NSS, Schannel, Secure Transport or wolfSSL!

The problem I fixed for 7.76.0: I made sure that no internal code updates the HTTP version choice the user has set, but that it instead only updates the internal “we want this version” field. Without this fix, an application that reuses an easy handle could, without specifically asking for it, get another HTTP version in subsequent requests if a previous transfer had been downgraded. Clearly, that fix was only partial.

The new fix should make HTTP/2 work and make sure the “wanted version” is used correctly. Fingers crossed!
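
To illustrate the reuse scenario, here is a hedged sketch (the example.com URLs are placeholders): the application pins its HTTP version preference once and expects it to hold for every transfer made with the handle:

#include <curl/curl.h>

int main(void)
{
  CURL *curl = curl_easy_init();
  if(curl) {
    /* the user's choice, set once, must survive handle reuse */
    curl_easy_setopt(curl, CURLOPT_HTTP_VERSION,
                     CURL_HTTP_VERSION_2TLS);
    curl_easy_setopt(curl, CURLOPT_URL, "https://example.com/a");
    curl_easy_perform(curl); /* first transfer */
    curl_easy_setopt(curl, CURLOPT_URL, "https://example.com/b");
    curl_easy_perform(curl); /* reused handle: should still try h2 */
    curl_easy_cleanup(curl);
  }
  return 0;
}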

Progress meter final update in parallel mode

When doing small and quick transfers in parallel mode with the command line tool, the logic could make the final progress update call get skipped!

file: support getting directories again

Another regression. A recent fix made curl stop reporting a size for directories accessed over FILE:// (when -I or -i is used). That fix did however also completely break “getting” such a directory…

HTTP proxy: only loop on 407 + close if we have credentials

When an HTTP(S) proxy returns a 407 response and closes the connection, curl would retry the request even if it had no credentials to use. If the proxy just consistently did the same 407 + close, curl would get stuck in a retry loop…

The fixed version now only retries the connection (with auth) if curl actually has credentials to use!

Next release cycle

The plan is to make the next cycle two weeks shorter, to get us back on the previously scheduled path. This means that when we open the feature window on Monday, it will be open for just a little over two weeks, which then gives us three weeks of only bug-fixes before we ship the next release on May 26.

The next one is expected to become 7.77.0. Due to the rather short feature window this coming cycle I also fear that we might not be able to merge all the new features that are waiting to get merged.
