Deviating from specs

tldr: we do not particular keep track nor document curl’s exact spec compliance. I cannot fathom how we could.

Today, in October 2022, curl and libcurl combined consist of nearly 150,000 lines of source code (not counting blank lines). 19% of those are comments.

This source code pile was carefully crafted with the sole purpose of performing Internet transfers using one or more of the 28 separate supported protocols. (There are 28 different supported URL schemes, it can be discussed if they are also 28 protocols or not.)

Which specs does curl use

It was recently proposed to me that we should document which RFCs curl adheres to and follows, and what deviants there are. In the name of helping the users understand what to expect from curl and educating the world how curl will behave.

This is indeed a noble idea and a worthy goal. We do not want to surprise users. We want them to know.

It was suggested that it might have a security impact if curl would deviate from a spec and if this is not documented clearly, users could be mislead.

What specs

curl speaks TCP/IP (and UDP or QUIC at times), it does DNS and DNS-over-HTTPS, it speaks over proxies and it speaks a range of various application protocols to perform what asked of it. There are literally hundreds of RFCs to read to catch up on all the details.

A while ago I collected the what I consider most important RFCs to read to figure out how curl works and why. That is right now 149 specification documents at a total of over 300,000 lines of text. (It was not done very scientifically.)

Counting the words in these 149 documents, they add up to a total of many more words than the entire Harry Potter series, and the Lord of the Rings series (including the Hobbit) is far behind together with War and Peace: 1.6 megawords.

Luckily, specs are mostly reference literature and we rarely have to read through them all to start our journey, but we often need to go back to check details.

Everything changes over time

The origins of curl trace back to late 1996 and it has been in constant development since then. curl, the Internet and the specifications have all changed significantly over these years.

The specifications that were around when we started have generally been updated multiple times, while we struggle to maintain behavior and functionality for our users. It is hard to spot and react to minor changes in specification updates. They might have been done to clarify a situation, but sometimes such a clarification ends up triggering a functionality change in our code.

Sometimes an update to a spec is even largely ignored by fellow protocol implementers out there in the world, and for the sake of interoperability we too then need to adjust our interpretations so that we work similarly to our peers.

Expectations from users change as values and terms are established in people’s minds rather than in specs. For example: what exactly is the “URL” you see in the browser’s top bar?

Over time, other tools and programs that also work on URLs and on the Internet, gradually change as they too development and slowly morph into the new beings we did not foresee decades ago. This change perceptions and expectations in the user base at large.

The always changing nature of the Internet creates interoperability challenges ever so often: out of the blue a team of protocol implementers can decide to interpret an existing term or a passage in a specification differently one day. When the whole world takes a turn like that, we are sometimes forced to follow along as that is then the new world view.

Another complication is also that curl uses (several) third party libraries for parts of its operations, and some of those details are of course also covered by RFCs.

Guidelines

Our primary guidelines when performing Internet transfers are:

  1. Follow established standard protocol specifications
  2. Security is a first-tier property
  3. Interop widely
  4. Maintain behavior for existing features

As you can figure out for yourself, these four bullet points often collide with each other. Checking off all four is not always possible. They can be hard enough on their own.

Protocol specifications

There are conflicting specifications. Specifications vary over time. They can be hard to interpret to figure out exactly what they say one should do.

Security

Increasing security might at the same time break existing use cases for existing users. It might violate what the specs say. It might add friction in the ability to interoperate with others. It might not even be allowed according to specifications.

Interop

This often mean to not follow specifications they way we want to read them, because apparently others do not read them the same way or sometimes they just disregard what the specs say. At times, it is hard to increase security levels by default because it would hamper interop with others.

Maintain behavior

The scripts written 15 years ago that use curl should continue working. The applications written to use libcurl can upgrade libcurl and its Internet transfers just continue. We do not break existing established behaviors. This may very well conflict both with interop and protocol updates, and sometimes it is hard to tighten the security because it would hurt a certain share of existing users.

How does curl deviate from which specs?

I consider this question more or less impossible to answer to, to document and to keep accurate over time. At least it would be a huge and energy-consuming effort both to get the list done but it would also be a monster task to maintain. And it would involve a lot of gray zones.

What is important to me is not what RFCs curl follows nor what or how it deviates from them. I have also basically never gotten that question from a user.

Users want reliable Internet transfers that are secure and interoperate correctly and conveniently with other “players” out there. They want consistent behavior and backwards compatibility.

If you use curl to perform feature X over protocol Y version Z, does it matter which set of RFCs that this would touch and does anyone care about the struggles we have been through when we implemented this set? How many users can even grasp or follow the implication of mentioning that for RFC XYX section A.B we decided to disregard a SHOULD NOT at times?

And how on earth would we keep that up-to-date when we do bugfixes and RFCs are updated down the line?

No one else documents this

The browsers have several hundreds of paid engineers on staff involved and they do not provide documentation like this. Neither does any curl alternative or competitor to my knowledge.

I don’t know of any tool or software anywhere that offer such a deviance documentation and I can perfectly understand and sympathize with why that is so.

There is a tab in my cookie

An HTTP cookie, is just a name + value pair sent from the server to the client. That pair is stored and is sent back to the server in subsequent requests when conditions match.

Cookies were first invented and used in the 1990s. Sources seem to agree that the first browser to support them, was Netscape 0.9beta released in September 1994. Internet Explorer added support in October 1995.

After many years and several failed specification attempts, they were eventually documented in RFC 6265 in 2011. They have been debated, criticized and misunderstood since virtually forever. Mostly because of the abuse/tracking they (used to?) allow in browsers. Less so because of how they actually work over the wire.

curl

curl has supported cookies since the 1990s as well (October 1998 to be specific), and it is a frequently used feature among curl users everywhere. Not the least because the login pattern of the web has become HTTP POST with credentials with a session cookie returned on success – and curl is often used to mimic or reproduce such operations to allow for automated processes and more.

For curl, it is important to remain interoperable and compatible with cookies the same way browsers do them so that users can keep doing these things.

What to accept

Not very long ago, I blogged about a cookie change we had to do because curl’s former liberal attitude towards what a cookie might contain turned out to be a possible vector for mischief. That flaw was basically a direct result of curl never totally adapting to the language in RFC 6265 because we typically do not change what seems to work and has not been reported to be wrong.

In that curl change, we started rejecting incoming cookies that contain “control codes” but we let ASCII code 9 through. ASCII code 9 is the tab character. Generally considered white space. We let it through because we checked the source code for two major browsers and they since they do, curl does as well.

Tab where?

But accepting tabs in the cookie line is one thing. The next question then comes where exactly in the cookie line should it be acceptable?

In the currently ongoing security audit for curl, our friends at Trail of Bits figured out that if a cookie is sent to curl with a tab in the name (literally inside the name and not before or after, like foo<tab>bar) curl would save that wrongly if saved to a file. This, because curl saves cookies in a tab-separated file format, the so called Netscape cookie file format, and it has no escape mechanism or anything. A tab in the cookie name causes the cookie to later get treated wrongly when the file is later loaded again.

Interop status

The file format curl uses for saving cookies is the same as the original format Netscape and then early Firefox used back in the old days. Since this format does not support tabs, I believe it is reasonable to assume the early browsers did not accept tabs in names or content.

I checked how (current) Chrome and Firefox handle cookies like this by creating a test page that sends cookies to the browser.

Chrome rejects cookies with tabs in name or content. They are simply not accepted or stored.

Firefox rejects cookies with tabs in name, but strangely enough it strips tabs from the content and otherwise let them through.

(Safari doesn’t work on Linux so I ignored that)

This, even if my reading of the RFC 6265 seems to say that they should be fine. They should be kept when “internal” to a name or content. I believe curl followed the spec here better than the browsers.

But clearly tabs in cookie names or content is not an interoperable concept on the web.

Adding or removing

With all this in my luggage, I decided to bring this question to the team working on the cookie spec update. The rfc6265bis effort.

I figured that this non-interoperable state and support situation could be worth highlighting or perhaps make a bit stricter in the spec update.

In that issue on GitHub, I was instead informed that recently changed language in their RFC draft rather made browser implementers keen on adding support for this kind of tabs in cookies.

Instead of admitting defeat and documenting that tabs in cookie names and values do not work correctly, we would rather continue limping along pretending this will work.

The HTTP community have supported cookies for almost thirty years without tabs working correctly inside cookies.

Adding (clearer) support for them in a spec would be good in the sense that more defined behavior is a good thing, but since we have decades of this non or perhaps spotty support the already deployed software and long tail of clients will not adapt to any such new wording rapidly. Even if accepted in the new spec, it will take ages until cookies could be done interoperable with tabs inside.

I believe we are better off just documenting that tabs SHOULD NOT be used in cookie name or content as they will not interop. Because that is the truth and will be for a long time no matter what we do today or tomorrow.

My comments in that issue at least seem to have brought reason for maybe reconsider the draft wording. Maybe.

Why!

Someone might ask the excellent question “why would anyone want to use a tab in the name or content?”, but quite frankly, I cannot think of any good or legitimate reason other than maybe laziness or a lack of proper filtering.

There is no sound technical reason why this needs to be done.

An executive decision

To fix the breakage curl does when saving cookies like this, and to align better with existing browsers, we decided to rather make curl reject incoming cookies that have tabs in names or content. Starting in the next release: 7.86.0.

We needed to do something. I believe going with the more strict approach is the better one here and now.

If the rfc6265bis draft ends up ultimately keeping the language “encouraging” tab support and browser authors decide to follow, then I presume we will have reason to revisit this decision later on and perhaps take curl in the other direction instead. I think we can handle that, even if I believe it would be the wrong thing for the ecosystem.

Credits

Cookie image by StockSnap from Pixabay

The first 300 setopts

Already when the first version of curl shipped in 1998, I had plans and ideas in the back of my head to turn it to a library at some point. I had already before worked on providing libraries with APIs for applications and I appreciated their powers.

During the summer of 2000 I refactored the curl internals so that it would become a library with an exposed API that we could provide to the world and then let applications get the same file transfer capabilities that the curl command line tool has.

libwww

I was not aware of any existing library alternative that provided a plain transfer-oriented functionality. There was libwww, but that seemed to have a rather different focus and other users in mind. I wanted something simpler.

ioctl

I found the inspiration for the libcurl *setopt() concept in how ioctl() and fcntl() work. They set options for generic file descriptors. A primary idea would be to not have to add new function calls or change the API when we invent new options that can be set.

easy

As I designed the first functions for libcurl, I anticipated that we perhaps would want to add a more advanced API at a later point. The first take would make a straight forward way to synchronous internet transfers. As this was the initial basic API I decided to call it the “easy” interface. Several of the functions in libcurl are hence prefixed with “curl_easy”.

curl_easy_setopt() became the foundational function in the libcurl API. The one that sets “options” for a libcurl easy handle.:

CURLcode curl_easy_setopt(CURL *handle,
                          CURLoption option,   
                          parameter);

We called the first libcurl version 7.1 in August 2000. I decided to skip 7.0 completely just to avoid confusions as I had shipped a series of pre-releases using that version number.

libcurl version 7.1 supported 59 different options for curl_easy_setopt. They were basically all the command line options existing at the time converted to API mechanisms, and then the command line options were mapped to those options. In many ways that mapping has continued since then, as the command line tool remains to a large extent a wrapper to allow the command line tool to set and use the necessary libcurl options.

Growth

It took four years to double the amount of options and ten years alter the official count was at 180.

Today, in September 2022, we recently merged code that made the setopt counter reach 300 and this is the number of options that will ship in the pending 7.86.0 release. After 22 years we’ve added 241 new options, almost 11 new options per year on average.

Every new option comes with a cost: more code, more tests, more documentation and an even larger forest in which users can get lost when they try to figure out how to tell libcurl to behave the way the want it. The benefit of course being that libcurl gets one more capability and new chances to fulfill users’ wishes. New options certainly are both a blessing and a curse.

Deprecating

We have decided to never break existing behaviors, which means that we don’t remove old options – ever – but we may deprecate them. This also contributes to the large amount, as for many new options we add, we have documented that an older one should not be used but it still exists for backwards compatibility.

Downsides

A benefit with using this API concept is that we can easily add new options without introducing new function calls.

A downside with using this API concept is that I made the function curl_easy_setopt accept a “vararg”, meaning that the third argument passed to this function can be any type, and what type that is supposed to be used is dictated by the particular option that is used as the second argument.

While varargs is a cool C feature, it is bad in the sense that it takes away the compiler’s ability to check the argument at compile-time and instead makes it error-prone for users and forces libcurl to try to work around this limitation. If I would redo the API today, I would probably not do it exactly like this, as too many users shoot themselves in the foot with this.

Future

Predicting what comes next is impossible, but if I were to guess I would say that we are likely to keep on adding options even in the future.

Looking back, we can see a fairly steady growth and I cannot see any recent developments in the project or in the surrounding ecosystems that would make us deviate from this path in the short term future at least.

Taking curl documentation quality up one more notch

Tldr; test and verify as much as possible also in the documentation.

I’m a sloppy typist. When I write several words in a row, like for example when creating complete sentences for something like a blog post, one or two of the words end up slightly misspelled.

Sure, many editors and systems have runtime spellchecks these days and they make it easy to quickly fix typos, but not all systems are like that and there are also situations where there are many false positives due to formatting or just the range of “special” words. They also rarely yell at me when I overuse the word “very” or start sentences with “But”.

curl documentation

I work fiercely on making the curl and libcurl documentation top-notch state of the art good and complete. I want my users to feel that. Everything is documented; clearly and with details and examples.

I want and aim for libcurl to be the best documented software library in the world.

Good documentation does not come for free or easily. It requires dedicated work and a lot of effort put into it.

This is of course a never-ending effort as things change over time and we have an almost ridiculous amount of options and details to document.

The key to improve ourselves is of course two good old classics: tests and CI jobs. This works great even for documentation, and perhaps in particular for technical documentation that includes lots of symbols and name references that need to be correct.

As I have recently worked on tightening some bolts and made it harder to land typos, I wanted to take the opportunity to describe some of our ways.

symbols-in-versions

Early 2009 I had some interactions with people in the git project and we discussed their use of libcurl. As we introduce new features to curl over time, users who build with curl may want to write their code to conditionally use the new stuff if they have a new enough libcurl installed, or just skip those features if the installation is too old. git is an application like that. They use libcurl a lot and they offer to build with libcurl installations that are maybe a dozen years old.

I then created a file in the libcurl git repository that I named symbols-in-versions. It lists all publicly provided curl symbols and in which libcurl release they were introduced. A good resource for libcurl users. It took quite an effort to figure them all out after the fact.

Over time, the number of entries in this file has grown significantly.

Tests

Of course, in order to do good CI jobs, they need to have tests to run so we start there.

I will mention some test numbers below. The test numbers in curl do not have any inherent meaning, they are just unique identifiers. To help us find the test source files and refer to tests and their failures easily.

Test 1119

Test 1119 was introduced in November 2010 as I realized I needed to make sure that symbols-in-versions (SIV) is kept up-to-date. It will be a useless document if it lags behind or misses symbols. It needs to include them all and the info needs to be correct.

I wrote a script that extracts all globally provided symbols in some curl header files and then verifies that they are all listed in SIV.

This test now made it very clear when we forgot to add a name to SIV, and it also pointed out if one of the names in SIV for example had a typo.

Test 1139

Scan SIV, figure out all existing options provided for three key libcurl functions: curl_easy_setopt, curl_multi_setopt and curl_easy_getinfo. Then verify that they all are mentioned in the respective “main” man page (curl_easy_setopt.3 etc), where they refer to the individual separate page for the option.

This test also verifies that the curl tool’s man page (curl.1) lists exactly the same set of command line options as is listed in the tool’s source code file tool_getparam.c and that is shown in the tool’s --help output. Consistency is king.

Test 1167

To make sure the symbols we provide in libcurl header files all use the correct name space we created test 1167. Using the correct name space in this context means that all publicly provided symbols need to start with curl of libcurl, case insensitively. It is important for several reasons, first of course because a good library does not pollute the name space to risk collisions and problems, but also using the correct prefix is important so that test 1119 finds all the symbols correctly. So they need to use the right prefix, and when they do, they are scanned and verified correctly.

Test 1173

For libcurl we have several function calls that take options. In some cases these functions accept a very large amount of different options. Every such option is documented in its own dedicated man page. Over time, with lots of contributors working on the project, the different man pages were not all including the same information in the same order and a huge portion of them even missed one of the most important details in programming documentation: examples.

Test 1173 checks all libcurl man pages and verifies that they have the eight mandatory libcurl sections present (NAME, SYNOPSIS, DESCRIPTION etc) and that they all are in the right order and that there is an example section that is more than 2 lines.

This test also does basic nroff formatting verification so that we know the page will look decent in a man page viewer too.

Helps us greatly – especially when we add new man pages.

Proselint

The tool that taught me to stop using the word “very” also finds a lot of other common bad takes on English is called proselint. Since a while back we run a CI job that runs proselint on all markdown files in the curl git repository. It helps us detect and edit away some amount of bad language.

Spellcheck

At the time of this writing, there are 482 individual libcurl related man pages and there is a total of around 85,000 lines of documentation in the project. I decided we should run a spellcheck on these man pages in an attempt to reduce the number of typos and mistakes.

The CI job I created for this first strips out some sections from the man pages that we deem too hard to spellcheck: the SYNOPSIS and the EXAMPLES sections for example. The script also removes all names that look like public curl symbols, as spellchecking them with a normal spellchecker is just impossible and they need special treatment. See further below for that.

Finally, we convert the stripped man page versions into markdown – because we have no spellchecker tools for nroff – and then spellcheck those.

It took far many more hours than I had anticipated to eradicate all the spelling mistakes and we ended up with an custom dictionary with over 800 words that aspell does not like but that I insist are valid for us.

Verify curl symbols

As I mentioned above, we strip out the curl symbols to hide them from the spellchecker.

Instead I extended the test 1119 mentioned above to also scan through all the libcurl man pages and find every single mention of something that looks like the name of a public curl symbol – and then match those against the names present in SIV and output an error if a symbol was referenced that was not documented already and therefore not actually a public curl symbol. With this, no man page can reference a non-existing curl symbol. Every such typo is detected.

Links for reporting on docs bugs

No matter how hard we try, there will always be errors that sneak in anyway and there will be sentences and phrasing that might have felt good at the time of writing but later, in the view of someone else, do not communicate the right message or maybe mislead users to misunderstand functionality.

Bug reports on documentation is key to finding such warts so that we can correct them. In the curl project we make it as easy as possible to report bugs in documentation by providing direct links on virtually all man pages shown on the website. The link takes you directly to the “new issue” page with a template subject filled in with the man page’s name.

This convenience unfortunately leads to a certain amount of “issue spam” but I think that is still a fairly cheap price to pay.

Everything curl

The book is a treasure trove of additional and complementary curl documentation but it is actually written and maintained outside of the curl repository. It has its own set of CI tests, including proselint and spellchecks.

Further

All these tests have been added gradually and slowly over a long period. It gives us time to polish and work out possible flaws in the tests and lets us make sure the work as intended and don’t block development.

I don’t have any immediate pending new pull requests for checking the curl documentation but if there still are details in there that we can check that we currently do not, I am sure that we will find those over time and make sure we verify them too.

If you have ideas and suggestions, I am all ears.

Related

Making world-class docs takes effort

convert a curl cmdline to libcurl source code

The dash-dash-libcurl is the sometimes missed curl gem that you might want to know about and occasionally maybe even use.

How do I convert

There is a very common pattern in curl land: a user who is writing an application using language L first does something successfully with the curl command line tool and then the person wants to convert that command line into the program they are writing.

How do you convert this curl command line into a transfer using programming language XYZ?

Language bindings

There is a huge amount of available bindings for libcurl. Bindings, as in adjustments and glue code for languages to use libcurl for Internet transfers so that they do not have to write the application in C.

At a recent count I found more than 60 libcurl bindings. They allow authors of virtually any programming language to do internet transfers using the powers of libcurl.

Most of these bindings are “thin” and follow the same style of the original C API: You create a handle for which you set options and then you do the transfer. This way they are fairly easy to keep up to date with the always-changing and always-improving libcurl releases.

Say hi to --libcurl

Most curl command line options are mapped one-to-one to an underlying libcurl option so for some time I tried to help users by explaining what they map to. Until one day I realized that

Hey! curl itself already has this mapping in source code, it just needs a way to show it to users!

How would it best output this mapping?

The --libcurl command line option was added to curl already back in 2007 and has been been part of the tool since version 7.16.1.

Show this command line as libcurl code

It is really easy to use too.

  1. Create a complicated curl command line that does what you need it to do.
  2. Append --libcurl example.c to the command line and run it again.
  3. Inspect your newly generated file example.c for how you could write your application to do the same thing with libcurl.

If you want to use a libcurl binding rather than the C API, like perhaps write your code in Python or PHP, you need to convert the example code into your programming language but much thanks to the options keeping their names across the different bindings it is usually a trivial task.

Example

The simplest possible command line:

curl curl.se

It gets HTTP from the site “curl.se”. If you try it, you will not see anything because it just replies with a redirect to the HTTPS version of the page. But ignoring that, you can convert that action into a libcurl program like this:

curl curl.se --libcurl code.c

The newly created file code.c now contains a program that you can compile :

gcc -o getit code.c -lcurl

and then run

./getit

You might want your program to rather follow the redirect? Maybe even show debug output? Let’s rerun the command line and get a code update:

curl curl.se --verbose --location --libcurl code.c

Now, if you rebuild your program and run it again, it shows you the front page HTML of the curl website on stdout.

The code

The exact code my curl version 7.85.0 produced in the command line above is shown below.

You see several options that are commented out. Those were used by the command line tool but there is no easy or convenient way to show their use in the example. Often you can start out by just skipping those .

http://http://http://@http://http://?http://#http://

The other day I sent out this tweet

"http://http://http://@http://http://?http://#http://"

is a legitimate URL

As it took off, got an amazing attention and I received many different comments and replies, I felt a need to elaborate a little. To add some meat to this.

Is this string really a legitimate URL? What is a URL? How is it parsed?

http://http://http://@http://http://?http://#http://

curl

Let’s start with curl. It parses the string as a valid URL and it does it like this. I have color coded the sections below a bit to help illustrate:

http://http://http://@http://http://?http://#http://

Going from left to right.

http – the scheme of the URL. This speaks HTTP. The “://” separates the scheme from the following authority part (that’s basically everything up to the path).

http – the user name up to the first colon

//http:// – the password up to the at-sign (@)

http: – the host name, with a trailing colon. The colon is a port number separator, but a missing number will be treated by curl as the default number. Browsers also treat missing port numbers like this. The default port number for the http scheme is 80.

//http:// – the path. Multiple leading slashes are fine, and so is using a colon in the path. It ends at the question mark separator (?).

http:// – the query part. To the right of the question mark, but before the hash (#).

http:// – the fragment, also known as “anchor”. Everything to the right of the hash sign (#).

To use this URL with curl and serve it on your localhost try this command line:

curl "http://http://http://@http://http://?http://#http://" --resolve http:80:127.0.0.1

The curl parser

The curl parser has been around for decades already and do not break existing scripts and applications is one of our core principles. Thus, some of the choices in the URL parser we did a long time ago and we stand by them, even if they might be slightly questionable standards wise. As if any standard meant much here.

The curl parser started its life as lenient as it could possibly be. While it has been made stricter over the years, traces of the original design still shows. In addition to that, we have also felt that we have been forced to make adaptions to the parser to make it work with real world URLs thrown at it. URLs that maybe once was not deemed fine, but that have become “fine” because they are accepted in other tools and perhaps primarily in browsers.

URL standards

I have mentioned it many times before. What we all colloquially refer to as a URL is not something that has a firm definition:

There is the URI (not URL!) definition from IETF RFC 3986, there is the WHATWG URL Specification that browsers (try to) adhere to and there are numerous different implementations of parsers being more or less strict when following one or both of the above mentioned specifications.

You will find that when you scrutinize them into the details, hardly any two URL parsers agree on every little character.

Therefore, if you throw the above mentioned URL on any random URL parser they may reject it (like the Twitter parser didn’t seem to think it was a URL) or they might come to a different conclusion about the different parts than curl does. In fact, it is likely that they will not do exactly as curl does.

Python’s urllib

April King threw it at Python’s urllib. A valid URL! While it accepted it as a URL, it split it differently:

ParseResult(scheme='http',
            netloc='http:',
            path='//http://@http://http://',
            params='',
            query='http://',
            fragment='http://')

Given the replies to my tweet, several other parsers did the similar interpretation. Possibly because they use the same underlying parser…

Javacript

Meduz showed how the JavaScript URL object treats it, and it looks quite similar to the Python above. Still a valid URL.

Firefox and Chrome

I added the host entry ‘127.0.0.1 http‘ into /etc/hosts and pasted the URL into the address bar of Firefox. It rewrote it to

http://http//http://@http://http://?http://#http://

(the second colon from the left is removed, everything else is the same)

… but still considered it a valid URL and showed the contents from my web server.

Chrome behaved exactly the same. A valid URL according to it, and a rewrite of the URL like Firefox does.

RFC 3986

Some commenters mentioned that the unencoded “unreserved” letters in the authority part make it not RFC 3986 compliant. Section 3.2 says:

The authority component is preceded by a double slash ("//") and is terminated by the next slash ("/"), question mark ("?"), or number sign ("#") character, or by the end of the URI.

Meaning that the password should have its slashes URL-encoded as %2f to make it valid. At least. So maybe not a valid URL?

Update: it actually still qualifies as “valid”, it just is parsed a little differently than how curl does it. I do not think there is any question that curl’s interpretation is not matching RFC 3986.

HTTPS

The URL works equally fine with https.

https://https://https://@https://https://?https://#https://

The two reasons I did not use https in the tweet:

  1. It made it 7 characters longer for no obvious extra fun
  2. It is harder to prove working by serving your own content as you would need curl -k or similar to make the host name ‘https’ be OK vs the name used in the actual server you would target.

The URL Buffalo buffalo

A surprisingly large number of people thought it reminded them of the old buffalo buffalo thing.

A bug that was 23 years old or not

This is a tale of cookies, Internet code and a CVE. It goes back a long time so please take a seat, lean back and follow along.

The scene is of course curl, the internet transfer tool and library I work on.

1998

In October 1998 we shipped curl 4.9. In 1998. Few people had heard of curl or used it back then. This was a few months before the curl website would announce that curl achieved 300 downloads of a new release. curl was still small in every meaning of the word at that time.

curl 4.9 was the first release that shipped with the “cookie engine”. curl could then receive HTTP cookies, parse them, understand them and send back cookies properly in subsequent requests. Like the browsers did. I wrote the bigger part of the curl code for managing cookies.

In 1998, the only specification that existed and described how cookies worked was a very brief document that Netscape used to host called cookie_spec. I keep a copy of that document around for curious readers. It really does not document things very well and it leaves out enormous amounts information that you had to figure out by inspecting other clients.

The cookie code I implemented than was based on that documentation and what the browsers seemed to do at the time. It seemed to work with numerous server implementations. People found good use for the feature.

2000s

This decade passed with a few separate efforts in the IETF to create cookie specifications but they all failed. The authors of these early cookie specs probably thought they could create standards and the world would magically adapt to them, but this did not work. Cookies are somewhat special in the regard that they are implemented by so many different authors, code bases and websites that fundamentally changing the way they work in a “decree from above” like that is difficult if not downright impossible.

RFC 6265

Finally, in 2011 there was a cookie rfc published! This time with the reversed approach: it primarily documented and clarified how cookies were actually already being used.

I was there and I helped it get made by proving my views and opinions. I did not agree to everything that the spec includes (you can find blog posts about some of those details), but finally having a proper spec was still a huge improvement to the previous state of the world.

Double syntax

What did not bother me much at the time, but has been giving me a bad rash ever since, is the peculiar way the spec is written: it provides one field syntax for how servers should send cookies, and a different one for what syntax clients should accept for cookies.

Two syntax for the same cookies.

This has at least two immediate downsides:

  1. It is hard to read the spec as it is very easy to to fall over one of those and assume that syntax is valid for your use case and accidentally get the wrong role’s description.
  2. The syntax defining how to send cookie is not really relevant as the clients are the ones that decide if they should receive and handle the cookies. The existing large cookie parsers (== browsers) are all fairly liberal in what they accept so nobody notices nor cares about if the servers don’t follow the stricter syntax in the spec.

RFC 6265bis

Since a few years back, there is ongoing work in IETF on revising and updating the cookie spec of 2011. Things have evolved and some extensions to cookies have been put into use in the world and deserves to be included in the spec. If you would to implement code today that manage cookies, the old RFC is certainly not enough anymore. This cookie spec update work is called 6265bis.

curl is up to date and compliant with what the draft versions of RFC 6265bis say.

The issue about the double syntax from above is still to be resolved in the document, but I faced unexpectedly tough resistance when I recently shared my options and thoughts about that spec peculiarity.

It can be noted that fundamentally, cookies still work the same way as they did back in 1998. There are added nuances and knobs sure, but the basic principles have remained. And will so even in the cookie spec update.

One of oddities of cookies is that they don’t work on origins like most other web features do.

HTTP Request tunneling

While cookies have evolved slowly over time, the HTTP specs have also been updated and refreshed a few times over the decades, but perhaps even more importantly the HTTP server implementations have implemented stricter parsing policies as they have (together with the rest of the world) that being liberal in what you accept (Postel’s law) easily lead to disasters. Like the dreaded and repeated HTTP request tunneling/smuggling attacks have showed us.

To combat this kind of attack, and probably to reduce the risk of other issues as well, HTTP servers started to reject incoming HTTP requests early if they appear “illegal” or malformed. Block them already at the door and not letting obvious crap in. In particular this goes for control codes in requests. If you try to send a request to a reasonably new HTTP server today that contains a control code, chances are very high that the server will reject the request and just return a 400 response code.

With control code I mean a byte value between 1 and 31 (excluding 9 which is TAB)

The well known HTTP server Apache httpd has this behavior enabled by default since 2.4.25, shipped in December 2016. Modern nginx versions seem to do this as well, but I have not investigated since exactly when.

Cookies for other hosts

If cookies were designed today for the first time, they certainly would be made different.

A website that sets cookies sends cookies to the client. For each cookie it sends, it sets a number of properties for the cookie. In particular it sets matching parameters for when the cookie should be sent back again by the client.

One of these cookie parameters set for a cookie is the domain that need to match for the client to send it. A server that is called www.example.com can set a cookie for the entire example.com domain, meaning that the cookie will then be sent by the client also when visiting second.example.com. Servers can set cookies for “sibling sites!

Eventually the two paths merged

The cookie code added to curl in 1998 was quite liberal in what content it accepted and while it was of course adjusted and polished over the years, it was working and it was compatible with real world websites.

The main driver for changes in that area of the code has always been to make sure that curl works like and interoperates with other cookie-using agents out in the wild.

CVE-2022-35252

In the end of June 2022 we received a report of a suspected security problem in curl, that would later result in our publication of CVE-2022-35252.

As it turned out, the old cookie code from 1998 accepted cookies that contained control codes. The control codes could be part of the name or the the content just fine, and if the user enabled the “cookie engine” curl would store those cookies and send them back in subsequent requests.

Example of a cookie curl would happily accept:

Set-Cookie: name^a=content^b; domain=.example.com

The ^a and ^b represent control codes, byte code one and two. Since the domain can mark the cookie for another host, as mentioned above, this cookie would get included for requests to all hosts within that domain.

When curl sends a cookie like that to a HTTP server, it would include a header field like this in its outgoing request:

Cookie: name^a=content^b

400

… to which a default configure Apache httpd and other servers will respond 400. For a script or an application that received theses cookies, further requests will be denied for as long as the cookies keep getting sent. A denial of service.

What does the spec say?

The client side part of RFC 6265, section 5.2 is not easy to decipher and figuring out that a client should discard cookies with control cookies requires deep studies of the document. There is in fact no mention of “control codes” or this byte range in the spec. I suppose I am just a bad spec reader.

Browsers

It is actually easier to spot what the popular browsers do since their source codes are easily available, and it turns out of course that both Chrome and Firefox already ignore incoming cookies that contain any of the bytes

%01-%08 / %0b-%0c / %0e-%1f / %7f

The range does not include %09, which is TAB and %0a / %0d which are line endings.

The fix

The curl fix was not too surprisingly and quite simply to refuse cookie fields that contain one or more of those banned byte values. As they are not accepted by the browser’s already, the risk that any legitimate site are using them for any benign purpose is very slim and I deem this change to be nearly risk-free.

The age of the bug

The vulnerable code has been in curl versions since version 4.9 which makes it exactly 8,729 days (23.9 years) until the shipped version 7.85.0 that fixed it. It also means that we introduced the bug on project day 201 and fixed it on day 8,930.

The code was not problematic when it shipped and it was not problematic during a huge portion of the time it has been used by a large amount of users.

It become problematic when HTTP servers started to refuse HTTP requests they suspected could be malicious. The way this code turned into a denial of service was therefore more or less just collateral damage. An unfortunate side effect.

Maybe the bug was born first when RFC 6265 was published. Maybe it was born when the first widely used HTTP server started to reject these requests.

Project record

8,729 days is a new project record age for a CVE to have been present in the code until found. It is still the forth CVE that were lingering around for over 8,000 days until found.

Credits

Thanks to Stefan Eissing for digging up historic Apache details.

Axel Chong submitted the CVE-2022-35252 report.

Campfire image by Martin Winkler from Pixabay

curl’s TLS fingerprint

Every human has a unique fingerprint. With only an impression of a person’s fingertip, it is possible to follow the lead back to the single specific individual wearing that unique pattern.

TLS fingerprints

The phrase TLS fingerprint is of course in this spirit. A pattern in a TLS handshake that allows an involved party to tell or at least guess with a certain level of accuracy what client software that performed it – purely based on how exactly the TLS magic is done. There are numerous different ways and variations a client can perform a TLS handshake and still be standards compliant. There is a long list of extensions that can vary in content, the order of the list of extensions, the ciphers to accept, the allowed TLS versions, steps performed, the order and sequence of those steps and more.

When a network client connects to a remote site and makes a TLS handshake with the server, the server can basically add up all those details and make an educated guess exactly which client that connects to it. One method to do it is called JA3 and produces a 32 digit hexadecimal number as output. (The three creators of this algorithm all have JA as their initials!)

In use out there

I have recently talked with customers and users who have faced servers that refused them access when they connected to sites using curl, but allowed them access to the site when they instead use one of the popular browsers – or if curl was tweaked to look like one of those browsers. It might be a trend in the making. There might be more sites out there now that reject clients that produce the wrong fingerprint then there used to be.

Why

Presumably there are many reasons why servers want to limit access to a subset of clients, but I think the general idea is that they want to prevent “illegitimate” user agents from accessing their sites.

For example, I have seen online market sites use this method in an what I have perceived as an attempt to block bots and web scrapers. Or they do it to block malware or other hostile clients that scour their website.

How

There’s this JA3 page that shows lots of implementations for many services that can figure out clients’ TLS fingerprints and act on them. And there’s nothing that says you have to do it with JA3. There’s likely to be numerous other ways and algorithms as well.

There are also companies that offer commercial services to filter off mismatching clients from your site. This is real business.

A TLS Client hello message has lots of info.

Other fingerprinting

In the earlier days of the web, web sites used more basic ways to detect and filter out bots and non-browser user clients. The original and much simpler way is to check the User-Agent: field that HTTP clients pass on, but has also sometimes been extended to check the order of the sent HTTP headers and in some cases, servers have used elaborate JavaScript schemes in order to try to “smoke out” the clients that don’t seem to act like full-fledged browsers.

If the clients use HTTP/2, that too allows for more details to fingerprint.

As the web has transitioned over to almost exclusively use HTTPS, it has severely increased the ways a server can fingerprint clients, and at the same time made it harder for non-browser clients to look exactly like browsers.

Allow list or block list

Sites that use TLS fingerprints to allow access, of course do not want too many false positives. They want to allow all “normal” browser-based visitors, even if they use a little older versions and also if they use somewhat older or less common operating systems.

This means that they either have to work hard to get an extensive list of acceptable hashes in an accept list or they add known non-desired clients in a block list. I would imagine that if you go the accept list route, that’s how companies can sell this services as that is maintenance intensive work.

Users of alternative and niche browsers are sometimes also victims in this scheme if they stand out enough.

Altering the fingerprint

The TLS fingerprints have the interesting feature compared to human fingertip prints, that they are the result of a set of deliberate actions and not just a pattern you are born to wear. They are therefore a lot easier to change.

With curl version C using TLS library T of version V, the TLS fingerprint is a function that involves C, T and V. And the options O set by curl. By changing one or more of those variables, you are likely to alter the TLS fingerprint.

Match a browser exactly

To be most certain that no site will reject your curl request because of its TLS fingerprint, you would alter the print to look exactly like the one of a popular browser. You can suspect that most sites want their regular human browser-using visitors to be able to access them.

To make curl look exactly like a browser you also likely need to do more than just change C, O, T and V from the section above. You also need to make sure that the TLS library you use produces its lists of extensions and ciphers in exactly the same order etc. This may require that you alter options and maybe even source code.

curl-impersonate

This is a custom build of curl that can impersonate the four major browsers: Chrome, Edge, Safari & Firefox. curl-impersonate performs TLS and HTTP handshakes that are identical to that of a real browser.

curl-impersonate is a modified curl build and the project also provides docker images and more to help users to use it easily.

I cannot say right now if any of the changes done for curl-impersonate will get merged into the upstream curl project, but it will also depend on what users want and how the use of TLS fingerprinting spread or changes going forward.

Program a browser

Another popular way to work around this kind of blocking is to simply program a browser to do the job. Either a headless browser or with tools like Selenium. Since these methods make the TLS handshake using a browser “engine”, they are unlikely to get blocked by these filters.

Cat and mouse

Servers add more hurdles to attempt to block unwanted clients.

Clients change to keep up with the servers and to still access the sites in spite of what the server admins want.

Future

As early as only a few years ago I had never heard of any site that blocked clients because of their TLS handshake. Through recent years I have seen it happen and the use of it seems to have increased. I don’t know of any way to measure if this is actually true or just my feeling.

I cannot rule out that we are going to see this more going forward, even if I also believe that the work on circumventing these fingerprinting filters is just getting started. If the circumvention grows and becomes easy enough, maybe it will stifle servers from adding these filters as they will not be effective anyway?

Let us come back to this topic in a few years and see where it went.

Credits

Fingerprint image by Hebi B. from Pixabay

curl 7.85.0 for you

Welcome to a new curl release, the result of a slightly extend release cycle this time.

Release presentation

Numbers

the 210th release
3 changes
65 days (total: 8,930)

165 bug-fixes (total: 8,145)
230 commits (total: 29,017)
0 new public libcurl function (total: 88)
2 new curl_easy_setopt() option (total: 299)

0 new curl command line option (total: 248)
79 contributors, 38 new (total: 2,690)
44 authors, 22 new (total: 1,065)
1 security fixes (total: 126)
Bug Bounties total: 40,900 USD

Security

We have yet another CVE to disclose.

control code in cookie denial of service

CVE-2022-35252 allows a server to send cookies to curl that contain ASCII control codes. When such cookies subsequently are sent back to a server, they will cause 400 responses from servers that downright refuse such requests. Severity: low. Reward: 480 USD.

Changes

This release counts three changes. They are:

schannel backend supports TLS 1.3

For everyone who uses this backend (which include everyone who uses the curl that Microsoft bundles with Windows) this is great news: now you too can finally use TLS 1.3 with curl. Assuming that you use a new enough version of Windows 10/11 that has the feature present. Let’s hope Microsoft updates the bundled version soon.

CURLOPT_PROTOCOLS_STR and CURLOPT_REDIR_PROTOCOLS_STR

These are two new options meant to replace and be used instead of the options with the same names without the “_STR” extension.

While working on support for new future protocols for libcurl to deal with, we realized that the old options were filled up and there was no way we could safely extend them with additional entries. These new functions instead work on text input and have no limit in number of protocols they can be made to support.

QUIC support via wolfSSL

The ngtcp2 backend can now also be built to use wolfSSL for the TLS parts.

Bugfixes

This was yet again a cycle packed with bugfixes. Here are some of my favorites:

asyn-thread: fix socket leak on OOM

Doing proper and complete memory cleanup even when we exist due to out of memory is sometimes difficult. I found and fixed this very old bug.

cmdline-opts/gen.pl: improve performance

The script that generates the curl.1 man page from all its sub components was improved and now typically executes several times faster then before. curl developers all over rejoice.

configure: if asked to use TLS, fail if no TLS lib was detected

Previously, the configure would instead just silently switch off TLS support which was not always easy to spot and would lead to users going further before they eventually realize this.

configure: introduce CURL_SIZEOF

The configure macro that checks for size of variable types was rewritten. It was the only piece left in the source tree that had the mention of GPL left. The license did not affect the product source code or the built outputs, but it caused questions and therefore some friction we could easily avoid by me completely writing away the need for the license mention.

close the happy eyeballs loser connection when using QUIC

A silly memory-leak when doing HTTP/3 connections on dual-stack machines.

treat a blank domain in Set-Cookie: as non-existing

Another one of those rarely used and tiny little details about following what the spec says.

configure: check whether atomics can link

This, and several other smaller fixes together improved the atomics support in curl quite a lot since the previous version. We conditionally use this C11 feature if present to make the library initialization function thread-safe without requiring a separate library for it.

digest: fix memory leak, fix not quoted ‘opaque’

There were several fixes and cleanups done in the digest department this time around.

remove examples/curlx.c

Another “victim” of the new license awareness in the project. This example was the only file present in the repository using this special license, and since it was also a bit convoluted example we decided it did not really have to be included.

resolve *.localhost to 127.0.0.1/::1

curl is now slightly more compliant with RFC 6761, follows in the browsers’ footsteps and resolves all host names in the “.localhost” domain to the fixed localhost addresses.

enable obs-folded multiline headers for hyper

curl built with hyper now also supports “folded” HTTP/1 headers.

libssh2:+libssh make atime/mtime date overflow return error

Coverity had an update in August and immediately pointed out these two long-standing bugs – in two separate SSH backends – related to time stamps and 32 bits.

curl_multi_remove_handle closes CONNECT_ONLY transfer

When an applications sets the CONNECT_ONLY option for a transfer within a multi stack, that connection was not properly closed until the whole multi handle was closed even if the associated easy handle was terminated. This lead to connections being kept around unnecessarily long (and wasting resources).

use pipe instead of socketpair on apple platforms

Apparently those platform likes to close socketpairs when the application is pushed into the background, while pipes survive the same happening… This is a change that might be preferred for other platforms as well going forward.

use larger dns hash table for multi interface

The hash table used for the DNS cache is now made larger for the multi interface than when created to be used by the easy interface, as it simply is more likely to be used by many host names then and then it performs better like this.

reject URLs with host names longer than 65535 bytes

URLs actually have no actual maximum size in any spec and neither does the host name within one, but the maximum length of a DNS name is 253 characters. Assuming you can resolve the name without DNS, another length limit is the SNI field in TLS that is an unsigned 16 bit number: 65535 bytes. This implies that clients cannot connect to any SNI-using TLS protocol with a longer name. Instead of checking for that limit in many places, it is now done early.

reduce size of several struct fields

As part of the repeated iterative work of making sure the structs are kept as small as possible, we have again reduced the size of numerous struct fields and rearranged the order somewhat to save memory.

Next

The next release is planned to ship on October 25, 2022.

tech, open source and networking