To celebrate the hubris-infested comment from a few years ago, I created this book cover mock up.
Turned out very popular on Twitter.
Open Source, Free Software, and similar
To celebrate the hubris-infested comment from a few years ago, I created this book cover mock up.
Turned out very popular on Twitter.
An HTTP cookie, is just a name + value pair sent from the server to the client. That pair is stored and is sent back to the server in subsequent requests when conditions match.
Cookies were first invented and used in the 1990s. Sources seem to agree that the first browser to support them, was Netscape 0.9beta released in September 1994. Internet Explorer added support in October 1995.
After many years and several failed specification attempts, they were eventually documented in RFC 6265 in 2011. They have been debated, criticized and misunderstood since virtually forever. Mostly because of the abuse/tracking they (used to?) allow in browsers. Less so because of how they actually work over the wire.
curl has supported cookies since the 1990s as well (October 1998 to be specific), and it is a frequently used feature among curl users everywhere. Not the least because the login pattern of the web has become HTTP POST with credentials with a session cookie returned on success – and curl is often used to mimic or reproduce such operations to allow for automated processes and more.
For curl, it is important to remain interoperable and compatible with cookies the same way browsers do them so that users can keep doing these things.
Not very long ago, I blogged about a cookie change we had to do because curl’s former liberal attitude towards what a cookie might contain turned out to be a possible vector for mischief. That flaw was basically a direct result of curl never totally adapting to the language in RFC 6265 because we typically do not change what seems to work and has not been reported to be wrong.
In that curl change, we started rejecting incoming cookies that contain “control codes” but we let ASCII code 9 through. ASCII code 9 is the tab character. Generally considered white space. We let it through because we checked the source code for two major browsers and they since they do, curl does as well.
But accepting tabs in the cookie line is one thing. The next question then comes where exactly in the cookie line should it be acceptable?
In the currently ongoing security audit for curl, our friends at Trail of Bits figured out that if a cookie is sent to curl with a tab in the name (literally inside the name and not before or after, like foo<tab>bar) curl would save that wrongly if saved to a file. This, because curl saves cookies in a tab-separated file format, the so called Netscape cookie file format, and it has no escape mechanism or anything. A tab in the cookie name causes the cookie to later get treated wrongly when the file is later loaded again.
The file format curl uses for saving cookies is the same as the original format Netscape and then early Firefox used back in the old days. Since this format does not support tabs, I believe it is reasonable to assume the early browsers did not accept tabs in names or content.
I checked how (current) Chrome and Firefox handle cookies like this by creating a test page that sends cookies to the browser.
Chrome rejects cookies with tabs in name or content. They are simply not accepted or stored.
Firefox rejects cookies with tabs in name, but strangely enough it strips tabs from the content and otherwise let them through.
(Safari doesn’t work on Linux so I ignored that)
This, even if my reading of the RFC 6265 seems to say that they should be fine. They should be kept when “internal” to a name or content. I believe curl followed the spec here better than the browsers.
But clearly tabs in cookie names or content is not an interoperable concept on the web.
With all this in my luggage, I decided to bring this question to the team working on the cookie spec update. The rfc6265bis effort.
I figured that this non-interoperable state and support situation could be worth highlighting or perhaps make a bit stricter in the spec update.
In that issue on GitHub, I was instead informed that recently changed language in their RFC draft rather made browser implementers keen on adding support for this kind of tabs in cookies.
Instead of admitting defeat and documenting that tabs in cookie names and values do not work correctly, we would rather continue limping along pretending this will work.
The HTTP community have supported cookies for almost thirty years without tabs working correctly inside cookies.
Adding (clearer) support for them in a spec would be good in the sense that more defined behavior is a good thing, but since we have decades of this non or perhaps spotty support the already deployed software and long tail of clients will not adapt to any such new wording rapidly. Even if accepted in the new spec, it will take ages until cookies could be done interoperable with tabs inside.
I believe we are better off just documenting that tabs SHOULD NOT be used in cookie name or content as they will not interop. Because that is the truth and will be for a long time no matter what we do today or tomorrow.
My comments in that issue at least seem to have brought reason for maybe reconsider the draft wording. Maybe.
Someone might ask the excellent question “why would anyone want to use a tab in the name or content?”, but quite frankly, I cannot think of any good or legitimate reason other than maybe laziness or a lack of proper filtering.
There is no sound technical reason why this needs to be done.
To fix the breakage curl does when saving cookies like this, and to align better with existing browsers, we decided to rather make curl reject incoming cookies that have tabs in names or content. Starting in the next release: 7.86.0.
We needed to do something. I believe going with the more strict approach is the better one here and now.
If the rfc6265bis draft ends up ultimately keeping the language “encouraging” tab support and browser authors decide to follow, then I presume we will have reason to revisit this decision later on and perhaps take curl in the other direction instead. I think we can handle that, even if I believe it would be the wrong thing for the ecosystem.
Already when the first version of curl shipped in 1998, I had plans and ideas in the back of my head to turn it to a library at some point. I had already before worked on providing libraries with APIs for applications and I appreciated their powers.
During the summer of 2000 I refactored the curl internals so that it would become a library with an exposed API that we could provide to the world and then let applications get the same file transfer capabilities that the curl command line tool has.
I was not aware of any existing library alternative that provided a plain transfer-oriented functionality. There was libwww, but that seemed to have a rather different focus and other users in mind. I wanted something simpler.
I found the inspiration for the libcurl *setopt() concept in how ioctl() and fcntl() work. They set options for generic file descriptors. A primary idea would be to not have to add new function calls or change the API when we invent new options that can be set.
As I designed the first functions for libcurl, I anticipated that we perhaps would want to add a more advanced API at a later point. The first take would make a straight forward way to synchronous internet transfers. As this was the initial basic API I decided to call it the “easy” interface. Several of the functions in libcurl are hence prefixed with “curl_easy”.
curl_easy_setopt() became the foundational function in the libcurl API. The one that sets “options” for a libcurl easy handle.:
CURLcode curl_easy_setopt(CURL *handle, CURLoption option, parameter);
We called the first libcurl version 7.1 in August 2000. I decided to skip 7.0 completely just to avoid confusions as I had shipped a series of pre-releases using that version number.
libcurl version 7.1 supported 59 different options for curl_easy_setopt
. They were basically all the command line options existing at the time converted to API mechanisms, and then the command line options were mapped to those options. In many ways that mapping has continued since then, as the command line tool remains to a large extent a wrapper to allow the command line tool to set and use the necessary libcurl options.
It took four years to double the amount of options and ten years alter the official count was at 180.
Today, in September 2022, we recently merged code that made the setopt counter reach 300 and this is the number of options that will ship in the pending 7.86.0 release. After 22 years we’ve added 241 new options, almost 11 new options per year on average.
Every new option comes with a cost: more code, more tests, more documentation and an even larger forest in which users can get lost when they try to figure out how to tell libcurl to behave the way the want it. The benefit of course being that libcurl gets one more capability and new chances to fulfill users’ wishes. New options certainly are both a blessing and a curse.
We have decided to never break existing behaviors, which means that we don’t remove old options – ever – but we may deprecate them. This also contributes to the large amount, as for many new options we add, we have documented that an older one should not be used but it still exists for backwards compatibility.
A benefit with using this API concept is that we can easily add new options without introducing new function calls.
A downside with using this API concept is that I made the function curl_easy_setopt
accept a “vararg”, meaning that the third argument passed to this function can be any type, and what type that is supposed to be used is dictated by the particular option that is used as the second argument.
While varargs is a cool C feature, it is bad in the sense that it takes away the compiler’s ability to check the argument at compile-time and instead makes it error-prone for users and forces libcurl to try to work around this limitation. If I would redo the API today, I would probably not do it exactly like this, as too many users shoot themselves in the foot with this.
Predicting what comes next is impossible, but if I were to guess I would say that we are likely to keep on adding options even in the future.
Looking back, we can see a fairly steady growth and I cannot see any recent developments in the project or in the surrounding ecosystems that would make us deviate from this path in the short term future at least.
Tldr; test and verify as much as possible also in the documentation.
I’m a sloppy typist. When I write several words in a row, like for example when creating complete sentences for something like a blog post, one or two of the words end up slightly misspelled.
Sure, many editors and systems have runtime spellchecks these days and they make it easy to quickly fix typos, but not all systems are like that and there are also situations where there are many false positives due to formatting or just the range of “special” words. They also rarely yell at me when I overuse the word “very” or start sentences with “But”.
I work fiercely on making the curl and libcurl documentation top-notch state of the art good and complete. I want my users to feel that. Everything is documented; clearly and with details and examples.
I want and aim for libcurl to be the best documented software library in the world.
Good documentation does not come for free or easily. It requires dedicated work and a lot of effort put into it.
This is of course a never-ending effort as things change over time and we have an almost ridiculous amount of options and details to document.
The key to improve ourselves is of course two good old classics: tests and CI jobs. This works great even for documentation, and perhaps in particular for technical documentation that includes lots of symbols and name references that need to be correct.
As I have recently worked on tightening some bolts and made it harder to land typos, I wanted to take the opportunity to describe some of our ways.
Early 2009 I had some interactions with people in the git project and we discussed their use of libcurl. As we introduce new features to curl over time, users who build with curl may want to write their code to conditionally use the new stuff if they have a new enough libcurl installed, or just skip those features if the installation is too old. git is an application like that. They use libcurl a lot and they offer to build with libcurl installations that are maybe a dozen years old.
I then created a file in the libcurl git repository that I named symbols-in-versions. It lists all publicly provided curl symbols and in which libcurl release they were introduced. A good resource for libcurl users. It took quite an effort to figure them all out after the fact.
Over time, the number of entries in this file has grown significantly.
Of course, in order to do good CI jobs, they need to have tests to run so we start there.
I will mention some test numbers below. The test numbers in curl do not have any inherent meaning, they are just unique identifiers. To help us find the test source files and refer to tests and their failures easily.
Test 1119 was introduced in November 2010 as I realized I needed to make sure that symbols-in-versions
(SIV) is kept up-to-date. It will be a useless document if it lags behind or misses symbols. It needs to include them all and the info needs to be correct.
I wrote a script that extracts all globally provided symbols in some curl header files and then verifies that they are all listed in SIV.
This test now made it very clear when we forgot to add a name to SIV, and it also pointed out if one of the names in SIV for example had a typo.
Scan SIV, figure out all existing options provided for three key libcurl functions: curl_easy_setopt, curl_multi_setopt and curl_easy_getinfo. Then verify that they all are mentioned in the respective “main” man page (curl_easy_setopt.3
etc), where they refer to the individual separate page for the option.
This test also verifies that the curl tool’s man page (curl.1
) lists exactly the same set of command line options as is listed in the tool’s source code file tool_getparam.c
and that is shown in the tool’s --help
output. Consistency is king.
To make sure the symbols we provide in libcurl header files all use the correct name space we created test 1167. Using the correct name space in this context means that all publicly provided symbols need to start with curl of libcurl, case insensitively. It is important for several reasons, first of course because a good library does not pollute the name space to risk collisions and problems, but also using the correct prefix is important so that test 1119 finds all the symbols correctly. So they need to use the right prefix, and when they do, they are scanned and verified correctly.
For libcurl we have several function calls that take options. In some cases these functions accept a very large amount of different options. Every such option is documented in its own dedicated man page. Over time, with lots of contributors working on the project, the different man pages were not all including the same information in the same order and a huge portion of them even missed one of the most important details in programming documentation: examples.
Test 1173 checks all libcurl man pages and verifies that they have the eight mandatory libcurl sections present (NAME, SYNOPSIS, DESCRIPTION etc) and that they all are in the right order and that there is an example section that is more than 2 lines.
This test also does basic nroff formatting verification so that we know the page will look decent in a man page viewer too.
Helps us greatly – especially when we add new man pages.
The tool that taught me to stop using the word “very” also finds a lot of other common bad takes on English is called proselint. Since a while back we run a CI job that runs proselint on all markdown files in the curl git repository. It helps us detect and edit away some amount of bad language.
At the time of this writing, there are 482 individual libcurl related man pages and there is a total of around 85,000 lines of documentation in the project. I decided we should run a spellcheck on these man pages in an attempt to reduce the number of typos and mistakes.
The CI job I created for this first strips out some sections from the man pages that we deem too hard to spellcheck: the SYNOPSIS and the EXAMPLES sections for example. The script also removes all names that look like public curl symbols, as spellchecking them with a normal spellchecker is just impossible and they need special treatment. See further below for that.
Finally, we convert the stripped man page versions into markdown – because we have no spellchecker tools for nroff – and then spellcheck those.
It took far many more hours than I had anticipated to eradicate all the spelling mistakes and we ended up with an custom dictionary with over 800 words that aspell does not like but that I insist are valid for us.
As I mentioned above, we strip out the curl symbols to hide them from the spellchecker.
Instead I extended the test 1119 mentioned above to also scan through all the libcurl man pages and find every single mention of something that looks like the name of a public curl symbol – and then match those against the names present in SIV and output an error if a symbol was referenced that was not documented already and therefore not actually a public curl symbol. With this, no man page can reference a non-existing curl symbol. Every such typo is detected.
No matter how hard we try, there will always be errors that sneak in anyway and there will be sentences and phrasing that might have felt good at the time of writing but later, in the view of someone else, do not communicate the right message or maybe mislead users to misunderstand functionality.
Bug reports on documentation is key to finding such warts so that we can correct them. In the curl project we make it as easy as possible to report bugs in documentation by providing direct links on virtually all man pages shown on the website. The link takes you directly to the “new issue” page with a template subject filled in with the man page’s name.
This convenience unfortunately leads to a certain amount of “issue spam” but I think that is still a fairly cheap price to pay.
The book is a treasure trove of additional and complementary curl documentation but it is actually written and maintained outside of the curl repository. It has its own set of CI tests, including proselint and spellchecks.
All these tests have been added gradually and slowly over a long period. It gives us time to polish and work out possible flaws in the tests and lets us make sure the work as intended and don’t block development.
I don’t have any immediate pending new pull requests for checking the curl documentation but if there still are details in there that we can check that we currently do not, I am sure that we will find those over time and make sure we verify them too.
If you have ideas and suggestions, I am all ears.
The dash-dash-libcurl is the sometimes missed curl gem that you might want to know about and occasionally maybe even use.
There is a very common pattern in curl land: a user who is writing an application using language L first does something successfully with the curl command line tool and then the person wants to convert that command line into the program they are writing.
How do you convert this curl command line into a transfer using programming language XYZ?
There is a huge amount of available bindings for libcurl. Bindings, as in adjustments and glue code for languages to use libcurl for Internet transfers so that they do not have to write the application in C.
At a recent count I found more than 60 libcurl bindings. They allow authors of virtually any programming language to do internet transfers using the powers of libcurl.
Most of these bindings are “thin” and follow the same style of the original C API: You create a handle for which you set options and then you do the transfer. This way they are fairly easy to keep up to date with the always-changing and always-improving libcurl releases.
--libcurl
Most curl command line options are mapped one-to-one to an underlying libcurl option so for some time I tried to help users by explaining what they map to. Until one day I realized that
Hey! curl itself already has this mapping in source code, it just needs a way to show it to users!
How would it best output this mapping?
The --libcurl
command line option was added to curl already back in 2007 and has been part of the tool since version 7.16.1.
It is really easy to use too.
--libcurl example.c
to the command line and run it again.If you want to use a libcurl binding rather than the C API, like perhaps write your code in Python or PHP, you need to convert the example code into your programming language but much thanks to the options keeping their names across the different bindings it is usually a trivial task.
The simplest possible command line:
curl curl.se
It gets HTTP from the site “curl.se”. If you try it, you will not see anything because it just replies with a redirect to the HTTPS version of the page. But ignoring that, you can convert that action into a libcurl program like this:
curl curl.se --libcurl code.c
The newly created file code.c
now contains a program that you can compile :
gcc -o getit code.c -lcurl
and then run
./getit
You might want your program to rather follow the redirect? Maybe even show debug output? Let’s rerun the command line and get a code update:
curl curl.se --verbose --location --libcurl code.c
Now, if you rebuild your program and run it again, it shows you the front page HTML of the curl website on stdout.
The exact code my curl version 7.85.0 produced in the command line above is shown below.
You see several options that are commented out. Those were used by the command line tool but there is no easy or convenient way to show their use in the example. Often you can start out by just skipping those .
The other day I sent out this tweet
As it took off, got an amazing attention and I received many different comments and replies, I felt a need to elaborate a little. To add some meat to this.
Is this string really a legitimate URL? What is a URL? How is it parsed?
http://http://http://@http://http://?http://#http://
Let’s start with curl. It parses the string as a valid URL and it does it like this. I have color coded the sections below a bit to help illustrate:
http://http://http://@http://http://?http://#http://
Going from left to right.
http – the scheme of the URL. This speaks HTTP. The “://” separates the scheme from the following authority part (that’s basically everything up to the path).
http – the user name up to the first colon
//http:// – the password up to the at-sign (@)
http: – the host name, with a trailing colon. The colon is a port number separator, but a missing number will be treated by curl as the default number. Browsers also treat missing port numbers like this. The default port number for the http scheme is 80.
//http:// – the path. Multiple leading slashes are fine, and so is using a colon in the path. It ends at the question mark separator (?).
http:// – the query part. To the right of the question mark, but before the hash (#).
http:// – the fragment, also known as “anchor”. Everything to the right of the hash sign (#).
To use this URL with curl and serve it on your localhost try this command line:
curl "http://http://http://@http://http://?http://#http://" --resolve http:80:127.0.0.1
The curl parser has been around for decades already and do not break existing scripts and applications is one of our core principles. Thus, some of the choices in the URL parser we did a long time ago and we stand by them, even if they might be slightly questionable standards wise. As if any standard meant much here.
The curl parser started its life as lenient as it could possibly be. While it has been made stricter over the years, traces of the original design still shows. In addition to that, we have also felt that we have been forced to make adaptions to the parser to make it work with real world URLs thrown at it. URLs that maybe once was not deemed fine, but that have become “fine” because they are accepted in other tools and perhaps primarily in browsers.
I have mentioned it many times before. What we all colloquially refer to as a URL is not something that has a firm definition:
There is the URI (not URL!) definition from IETF RFC 3986, there is the WHATWG URL Specification that browsers (try to) adhere to and there are numerous different implementations of parsers being more or less strict when following one or both of the above mentioned specifications.
You will find that when you scrutinize them into the details, hardly any two URL parsers agree on every little character.
Therefore, if you throw the above mentioned URL on any random URL parser they may reject it (like the Twitter parser didn’t seem to think it was a URL) or they might come to a different conclusion about the different parts than curl does. In fact, it is likely that they will not do exactly as curl does.
April King threw it at Python’s urllib. A valid URL! While it accepted it as a URL, it split it differently:
ParseResult(scheme='http', netloc='http:', path='//http://@http://http://', params='', query='http://', fragment='http://')
Given the replies to my tweet, several other parsers did the similar interpretation. Possibly because they use the same underlying parser…
Meduz showed how the JavaScript URL object treats it, and it looks quite similar to the Python above. Still a valid URL.
I added the host entry ‘127.0.0.1 http
‘ into /etc/hosts and pasted the URL into the address bar of Firefox. It rewrote it to
http://http//http://@http://http://?http://#http://
(the second colon from the left is removed, everything else is the same)
… but still considered it a valid URL and showed the contents from my web server.
Chrome behaved exactly the same. A valid URL according to it, and a rewrite of the URL like Firefox does.
Some commenters mentioned that the unencoded “unreserved” letters in the authority part make it not RFC 3986 compliant. Section 3.2 says:
The authority component is preceded by a double slash ("//") and is terminated by the next slash ("/"), question mark ("?"), or number sign ("#") character, or by the end of the URI.
Meaning that the password should have its slashes URL-encoded as %2f to make it valid. At least. So maybe not a valid URL?
Update: it actually still qualifies as “valid”, it just is parsed a little differently than how curl does it. I do not think there is any question that curl’s interpretation is not matching RFC 3986.
The URL works equally fine with https.
https://https://https://@https://https://?https://#https://
The two reasons I did not use https in the tweet:
curl -k
or similar to make the host name ‘https’ be OK vs the name used in the actual server you would target.A surprisingly large number of people thought it reminded them of the old buffalo buffalo thing.
This is a tale of cookies, Internet code and a CVE. It goes back a long time so please take a seat, lean back and follow along.
The scene is of course curl, the internet transfer tool and library I work on.
In October 1998 we shipped curl 4.9. In 1998. Few people had heard of curl or used it back then. This was a few months before the curl website would announce that curl achieved 300 downloads of a new release. curl was still small in every meaning of the word at that time.
curl 4.9 was the first release that shipped with the “cookie engine”. curl could then receive HTTP cookies, parse them, understand them and send back cookies properly in subsequent requests. Like the browsers did. I wrote the bigger part of the curl code for managing cookies.
In 1998, the only specification that existed and described how cookies worked was a very brief document that Netscape used to host called cookie_spec
. I keep a copy of that document around for curious readers. It really does not document things very well and it leaves out enormous amounts information that you had to figure out by inspecting other clients.
The cookie code I implemented than was based on that documentation and what the browsers seemed to do at the time. It seemed to work with numerous server implementations. People found good use for the feature.
This decade passed with a few separate efforts in the IETF to create cookie specifications but they all failed. The authors of these early cookie specs probably thought they could create standards and the world would magically adapt to them, but this did not work. Cookies are somewhat special in the regard that they are implemented by so many different authors, code bases and websites that fundamentally changing the way they work in a “decree from above” like that is difficult if not downright impossible.
Finally, in 2011 there was a cookie rfc published! This time with the reversed approach: it primarily documented and clarified how cookies were actually already being used.
I was there and I helped it get made by proving my views and opinions. I did not agree to everything that the spec includes (you can find blog posts about some of those details), but finally having a proper spec was still a huge improvement to the previous state of the world.
What did not bother me much at the time, but has been giving me a bad rash ever since, is the peculiar way the spec is written: it provides one field syntax for how servers should send cookies, and a different one for what syntax clients should accept for cookies.
Two syntax for the same cookies.
This has at least two immediate downsides:
Since a few years back, there is ongoing work in IETF on revising and updating the cookie spec of 2011. Things have evolved and some extensions to cookies have been put into use in the world and deserves to be included in the spec. If you would to implement code today that manage cookies, the old RFC is certainly not enough anymore. This cookie spec update work is called 6265bis.
curl is up to date and compliant with what the draft versions of RFC 6265bis say.
The issue about the double syntax from above is still to be resolved in the document, but I faced unexpectedly tough resistance when I recently shared my options and thoughts about that spec peculiarity.
It can be noted that fundamentally, cookies still work the same way as they did back in 1998. There are added nuances and knobs sure, but the basic principles have remained. And will so even in the cookie spec update.
One of oddities of cookies is that they don’t work on origins like most other web features do.
While cookies have evolved slowly over time, the HTTP specs have also been updated and refreshed a few times over the decades, but perhaps even more importantly the HTTP server implementations have implemented stricter parsing policies as they have (together with the rest of the world) that being liberal in what you accept (Postel’s law) easily lead to disasters. Like the dreaded and repeated HTTP request tunneling/smuggling attacks have showed us.
To combat this kind of attack, and probably to reduce the risk of other issues as well, HTTP servers started to reject incoming HTTP requests early if they appear “illegal” or malformed. Block them already at the door and not letting obvious crap in. In particular this goes for control codes in requests. If you try to send a request to a reasonably new HTTP server today that contains a control code, chances are very high that the server will reject the request and just return a 400 response code.
With control code I mean a byte value between 1 and 31 (excluding 9 which is TAB)
The well known HTTP server Apache httpd has this behavior enabled by default since 2.4.25, shipped in December 2016. Modern nginx versions seem to do this as well, but I have not investigated since exactly when.
If cookies were designed today for the first time, they certainly would be made different.
A website that sets cookies sends cookies to the client. For each cookie it sends, it sets a number of properties for the cookie. In particular it sets matching parameters for when the cookie should be sent back again by the client.
One of these cookie parameters set for a cookie is the domain that need to match for the client to send it. A server that is called www.example.com
can set a cookie for the entire example.com
domain, meaning that the cookie will then be sent by the client also when visiting second.example.com
. Servers can set cookies for “sibling sites!
The cookie code added to curl in 1998 was quite liberal in what content it accepted and while it was of course adjusted and polished over the years, it was working and it was compatible with real world websites.
The main driver for changes in that area of the code has always been to make sure that curl works like and interoperates with other cookie-using agents out in the wild.
In the end of June 2022 we received a report of a suspected security problem in curl, that would later result in our publication of CVE-2022-35252.
As it turned out, the old cookie code from 1998 accepted cookies that contained control codes. The control codes could be part of the name or the the content just fine, and if the user enabled the “cookie engine” curl would store those cookies and send them back in subsequent requests.
Example of a cookie curl would happily accept:
Set-Cookie: name^a=content^b; domain=.example.com
The ^a and ^b represent control codes, byte code one and two. Since the domain can mark the cookie for another host, as mentioned above, this cookie would get included for requests to all hosts within that domain.
When curl sends a cookie like that to a HTTP server, it would include a header field like this in its outgoing request:
Cookie: name^a=content^b
… to which a default configure Apache httpd and other servers will respond 400. For a script or an application that received theses cookies, further requests will be denied for as long as the cookies keep getting sent. A denial of service.
The client side part of RFC 6265, section 5.2 is not easy to decipher and figuring out that a client should discard cookies with control cookies requires deep studies of the document. There is in fact no mention of “control codes” or this byte range in the spec. I suppose I am just a bad spec reader.
It is actually easier to spot what the popular browsers do since their source codes are easily available, and it turns out of course that both Chrome and Firefox already ignore incoming cookies that contain any of the bytes
%01-%08 / %0b-%0c / %0e-%1f / %7f
The range does not include %09, which is TAB and %0a / %0d which are line endings.
The curl fix was not too surprisingly and quite simply to refuse cookie fields that contain one or more of those banned byte values. As they are not accepted by the browser’s already, the risk that any legitimate site are using them for any benign purpose is very slim and I deem this change to be nearly risk-free.
The vulnerable code has been in curl versions since version 4.9 which makes it exactly 8,729 days (23.9 years) until the shipped version 7.85.0 that fixed it. It also means that we introduced the bug on project day 201 and fixed it on day 8,930.
The code was not problematic when it shipped and it was not problematic during a huge portion of the time it has been used by a large amount of users.
It become problematic when HTTP servers started to refuse HTTP requests they suspected could be malicious. The way this code turned into a denial of service was therefore more or less just collateral damage. An unfortunate side effect.
Maybe the bug was born first when RFC 6265 was published. Maybe it was born when the first widely used HTTP server started to reject these requests.
8,729 days is a new project record age for a CVE to have been present in the code until found. It is still the forth CVE that were lingering around for over 8,000 days until found.
Thanks to Stefan Eissing for digging up historic Apache details.
Axel Chong submitted the CVE-2022-35252 report.
Campfire image by Martin Winkler from Pixabay
Every human has a unique fingerprint. With only an impression of a person’s fingertip, it is possible to follow the lead back to the single specific individual wearing that unique pattern.
The phrase TLS fingerprint is of course in this spirit. A pattern in a TLS handshake that allows an involved party to tell or at least guess with a certain level of accuracy what client software that performed it – purely based on how exactly the TLS magic is done. There are numerous different ways and variations a client can perform a TLS handshake and still be standards compliant. There is a long list of extensions that can vary in content, the order of the list of extensions, the ciphers to accept, the allowed TLS versions, steps performed, the order and sequence of those steps and more.
When a network client connects to a remote site and makes a TLS handshake with the server, the server can basically add up all those details and make an educated guess exactly which client that connects to it. One method to do it is called JA3 and produces a 32 digit hexadecimal number as output. (The three creators of this algorithm all have JA as their initials!)
I have recently talked with customers and users who have faced servers that refused them access when they connected to sites using curl, but allowed them access to the site when they instead use one of the popular browsers – or if curl was tweaked to look like one of those browsers. It might be a trend in the making. There might be more sites out there now that reject clients that produce the wrong fingerprint then there used to be.
Presumably there are many reasons why servers want to limit access to a subset of clients, but I think the general idea is that they want to prevent “illegitimate” user agents from accessing their sites.
For example, I have seen online market sites use this method in an what I have perceived as an attempt to block bots and web scrapers. Or they do it to block malware or other hostile clients that scour their website.
There’s this JA3 page that shows lots of implementations for many services that can figure out clients’ TLS fingerprints and act on them. And there’s nothing that says you have to do it with JA3. There’s likely to be numerous other ways and algorithms as well.
There are also companies that offer commercial services to filter off mismatching clients from your site. This is real business.
In the earlier days of the web, web sites used more basic ways to detect and filter out bots and non-browser user clients. The original and much simpler way is to check the User-Agent:
field that HTTP clients pass on, but has also sometimes been extended to check the order of the sent HTTP headers and in some cases, servers have used elaborate JavaScript schemes in order to try to “smoke out” the clients that don’t seem to act like full-fledged browsers.
If the clients use HTTP/2, that too allows for more details to fingerprint.
As the web has transitioned over to almost exclusively use HTTPS, it has severely increased the ways a server can fingerprint clients, and at the same time made it harder for non-browser clients to look exactly like browsers.
Sites that use TLS fingerprints to allow access, of course do not want too many false positives. They want to allow all “normal” browser-based visitors, even if they use a little older versions and also if they use somewhat older or less common operating systems.
This means that they either have to work hard to get an extensive list of acceptable hashes in an accept list or they add known non-desired clients in a block list. I would imagine that if you go the accept list route, that’s how companies can sell this services as that is maintenance intensive work.
Users of alternative and niche browsers are sometimes also victims in this scheme if they stand out enough.
The TLS fingerprints have the interesting feature compared to human fingertip prints, that they are the result of a set of deliberate actions and not just a pattern you are born to wear. They are therefore a lot easier to change.
With curl version C using TLS library T of version V, the TLS fingerprint is a function that involves C, T and V. And the options O set by curl. By changing one or more of those variables, you are likely to alter the TLS fingerprint.
To be most certain that no site will reject your curl request because of its TLS fingerprint, you would alter the print to look exactly like the one of a popular browser. You can suspect that most sites want their regular human browser-using visitors to be able to access them.
To make curl look exactly like a browser you also likely need to do more than just change C, O, T and V from the section above. You also need to make sure that the TLS library you use produces its lists of extensions and ciphers in exactly the same order etc. This may require that you alter options and maybe even source code.
This is a custom build of curl that can impersonate the four major browsers: Chrome, Edge, Safari & Firefox. curl-impersonate performs TLS and HTTP handshakes that are identical to that of a real browser.
curl-impersonate is a modified curl build and the project also provides docker images and more to help users to use it easily.
I cannot say right now if any of the changes done for curl-impersonate will get merged into the upstream curl project, but it will also depend on what users want and how the use of TLS fingerprinting spread or changes going forward.
Another popular way to work around this kind of blocking is to simply program a browser to do the job. Either a headless browser or with tools like Selenium. Since these methods make the TLS handshake using a browser “engine”, they are unlikely to get blocked by these filters.
Servers add more hurdles to attempt to block unwanted clients.
Clients change to keep up with the servers and to still access the sites in spite of what the server admins want.
As early as only a few years ago I had never heard of any site that blocked clients because of their TLS handshake. Through recent years I have seen it happen and the use of it seems to have increased. I don’t know of any way to measure if this is actually true or just my feeling.
I cannot rule out that we are going to see this more going forward, even if I also believe that the work on circumventing these fingerprinting filters is just getting started. If the circumvention grows and becomes easy enough, maybe it will stifle servers from adding these filters as they will not be effective anyway?
Let us come back to this topic in a few years and see where it went.
Welcome to a new curl release, the result of a slightly extend release cycle this time.
the 210th release
3 changes
65 days (total: 8,930)
165 bug-fixes (total: 8,145)
230 commits (total: 29,017)
0 new public libcurl function (total: 88)
2 new curl_easy_setopt() option (total: 299)
0 new curl command line option (total: 248)
79 contributors, 38 new (total: 2,690)
44 authors, 22 new (total: 1,065)
1 security fixes (total: 126)
Bug Bounties total: 40,900 USD
We have yet another CVE to disclose.
CVE-2022-35252 allows a server to send cookies to curl that contain ASCII control codes. When such cookies subsequently are sent back to a server, they will cause 400 responses from servers that downright refuse such requests. Severity: low. Reward: 480 USD.
This release counts three changes. They are:
For everyone who uses this backend (which include everyone who uses the curl that Microsoft bundles with Windows) this is great news: now you too can finally use TLS 1.3 with curl. Assuming that you use a new enough version of Windows 10/11 that has the feature present. Let’s hope Microsoft updates the bundled version soon.
These are two new options meant to replace and be used instead of the options with the same names without the “_STR
” extension.
While working on support for new future protocols for libcurl to deal with, we realized that the old options were filled up and there was no way we could safely extend them with additional entries. These new functions instead work on text input and have no limit in number of protocols they can be made to support.
The ngtcp2 backend can now also be built to use wolfSSL for the TLS parts.
This was yet again a cycle packed with bugfixes. Here are some of my favorites:
Doing proper and complete memory cleanup even when we exist due to out of memory is sometimes difficult. I found and fixed this very old bug.
The script that generates the curl.1 man page from all its sub components was improved and now typically executes several times faster then before. curl developers all over rejoice.
Previously, the configure would instead just silently switch off TLS support which was not always easy to spot and would lead to users going further before they eventually realize this.
The configure macro that checks for size of variable types was rewritten. It was the only piece left in the source tree that had the mention of GPL left. The license did not affect the product source code or the built outputs, but it caused questions and therefore some friction we could easily avoid by me completely writing away the need for the license mention.
A silly memory-leak when doing HTTP/3 connections on dual-stack machines.
Another one of those rarely used and tiny little details about following what the spec says.
This, and several other smaller fixes together improved the atomics support in curl quite a lot since the previous version. We conditionally use this C11 feature if present to make the library initialization function thread-safe without requiring a separate library for it.
There were several fixes and cleanups done in the digest department this time around.
Another “victim” of the new license awareness in the project. This example was the only file present in the repository using this special license, and since it was also a bit convoluted example we decided it did not really have to be included.
curl is now slightly more compliant with RFC 6761, follows in the browsers’ footsteps and resolves all host names in the “.localhost” domain to the fixed localhost addresses.
curl built with hyper now also supports “folded” HTTP/1 headers.
Coverity had an update in August and immediately pointed out these two long-standing bugs – in two separate SSH backends – related to time stamps and 32 bits.
When an applications sets the CONNECT_ONLY option for a transfer within a multi stack, that connection was not properly closed until the whole multi handle was closed even if the associated easy handle was terminated. This lead to connections being kept around unnecessarily long (and wasting resources).
Apparently those platform likes to close socketpairs when the application is pushed into the background, while pipes survive the same happening… This is a change that might be preferred for other platforms as well going forward.
The hash table used for the DNS cache is now made larger for the multi interface than when created to be used by the easy interface, as it simply is more likely to be used by many host names then and then it performs better like this.
URLs actually have no actual maximum size in any spec and neither does the host name within one, but the maximum length of a DNS name is 253 characters. Assuming you can resolve the name without DNS, another length limit is the SNI field in TLS that is an unsigned 16 bit number: 65535 bytes. This implies that clients cannot connect to any SNI-using TLS protocol with a longer name. Instead of checking for that limit in many places, it is now done early.
As part of the repeated iterative work of making sure the structs are kept as small as possible, we have again reduced the size of numerous struct fields and rearranged the order somewhat to save memory.
The next release is planned to ship on October 25, 2022.
A little over a year ago, we ditched Travis CI as a service to use for the curl project.
Up until that day, it had been our preferred and favored CI service for many years. At most, we ran 34 CI jobs on it, for every pull-request and commit. It was the service that we leaned on when we transitioned the curl project into a CI-heavy user. Our use of CI really took off 2017 and has been increasing ever since.
We abruptly cut off over 30 jobs from the service just one day. At the time, that was a third of all our CI. The CI jobs that we rely on to verify our work and to keep things working and stable in the project.
At the time of the amputation, we run 99 CI jobs distributed over 5 services, so even with one of them cut off we still ran jobs on AppVeyor, Azure Pipelines, GitHub Actions and Cirrus CI. We were not completely stranded.
Luckily for us, when one solution goes sour there are often alternatives out there that we can move over to and continue our never-ending path forward.
In our case, friendly people helped moving over almost all ex-Travis jobs to the (for us) new service Zuul CI. In July 2021, we had 29 jobs on it. We also added Circle CI to the mix and started running jobs there.
In July 2021 – a month after the cut – we counted 96 running jobs (a few old jobs were just dropped as we reconsidered their value). While the work involved a lot of adjusting scripts, pulling hair over yaml files and more, it did not cause any significant service loss over an extended period. We managed pretty good.
There was no noticeable glitch in quality or backed up “guilt” in the project because of the transition and small period of lesser CI either. Thanks to the other services still running, we were still in a good shape.
In the end of August 2022 we still use 6 different CI services and we now run 113 CI jobs on them, for every push to master and to pull-requests.
There are primarily three reasons why we still use a variety of services.
Over the year since the amputation, we have learned that our new friend Zuul CI has turned out to not work quite as reliable and convenient as we would like it. Since a few months back, we are now gradually moving away from this service. Slowly moving over jobs from there to run on one the other five instead.
Over time, our new most preferred CI service has turned out to be GitHub Actions. At the latest count, it now run 44 CI jobs for us. We still have 12 jobs on Zuul targeted for transition.
Services come and go. We have different ideas and our requirements and ambitions change. I am sure we will continue to service-jump when needed. It is just a natural development. A part of a software development life.
A big challenge and hurdle with our CI setup remains: to maintain the builds and keep them stable and functioning. With over a hundred jobs running on six services and our code and test suite being portable and things being networked and running on many platforms, it is job that we quite often fail at. It has turned out mighty difficult to avoid that at least a few of the jobs are constantly red, “permafailing”, at any single moment.
If this is stuff you like to tinker with, we could use your help!