The first 300 setopts

Already when the first version of curl shipped in 1998, I had plans and ideas in the back of my head to turn it into a library at some point. I had previously worked on providing libraries with APIs for applications and I appreciated their powers.

During the summer of 2000 I refactored the curl internals so that it would become a library with an exposed API that we could provide to the world and then let applications get the same file transfer capabilities that the curl command line tool has.

libwww

I was not aware of any existing library alternative that provided a plain transfer-oriented functionality. There was libwww, but that seemed to have a rather different focus and other users in mind. I wanted something simpler.

ioctl

I found the inspiration for the libcurl *setopt() concept in how ioctl() and fcntl() work. They set options for generic file descriptors. A primary idea would be to not have to add new function calls or change the API when we invent new options that can be set.

easy

As I designed the first functions for libcurl, I anticipated that we perhaps would want to add a more advanced API at a later point. The first take would provide a straightforward way to do synchronous internet transfers. As this was the initial basic API I decided to call it the “easy” interface. Several of the functions in libcurl are hence prefixed with “curl_easy”.

curl_easy_setopt() became the foundational function in the libcurl API. The one that sets “options” for a libcurl easy handle:

CURLcode curl_easy_setopt(CURL *handle,
                          CURLoption option,   
                          parameter);

We called the first libcurl version 7.1 in August 2000. I decided to skip 7.0 completely just to avoid confusions as I had shipped a series of pre-releases using that version number.

libcurl version 7.1 supported 59 different options for curl_easy_setopt. They were basically all the command line options existing at the time converted to API mechanisms, and then the command line options were mapped to those options. In many ways that mapping has continued since then, as the command line tool remains to a large extent a wrapper that lets users set and use the necessary libcurl options.

Growth

It took four years to double the number of options and ten years later the official count was at 180.

Today, in September 2022, we recently merged code that made the setopt counter reach 300 and this is the number of options that will ship in the pending 7.86.0 release. After 22 years we’ve added 241 new options, almost 11 new options per year on average.
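The average quoted above follows directly from the numbers in the text. A small sketch of the arithmetic, using only those figures:

```python
# Figures quoted in the text above.
options_in_7_1 = 59       # options supported by libcurl 7.1 (August 2000)
options_in_7_86_0 = 300   # options in the pending 7.86.0 release
years = 2022 - 2000

added = options_in_7_86_0 - options_in_7_1
per_year = added / years

print(f"{added} options added, {per_year:.2f} per year on average")
```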

Every new option comes with a cost: more code, more tests, more documentation and an even larger forest in which users can get lost when they try to figure out how to tell libcurl to behave the way they want it. The benefit of course being that libcurl gets one more capability and new chances to fulfill users’ wishes. New options certainly are both a blessing and a curse.

Deprecating

We have decided to never break existing behaviors, which means that we don’t remove old options – ever – but we may deprecate them. This also contributes to the large number, as for many new options we add, we have documented that an older one should not be used but it still exists for backwards compatibility.

Downsides

A benefit with using this API concept is that we can easily add new options without introducing new function calls.

A downside with using this API concept is that I made the function curl_easy_setopt accept a “vararg”, meaning that the third argument passed to this function can be any type, and what type that is supposed to be used is dictated by the particular option that is used as the second argument.

While varargs is a cool C feature, it is bad in the sense that it takes away the compiler’s ability to check the argument at compile-time and instead makes it error-prone for users and forces libcurl to try to work around this limitation. If I were to redo the API today, I would probably not do it exactly like this, as too many users shoot themselves in the foot with this.
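To make the trade-off concrete, here is a toy sketch of the setopt pattern written in Python rather than C. It is not libcurl code: the Option names, the EXPECTED_TYPE table and the ToyHandle class are all made up for illustration. It shows the two properties discussed above: a single entry point that never needs a new function when an option is added, and a value whose required type depends on which option is passed. In C the compiler cannot check that pairing for a vararg; the toy below checks it at run time instead.

```python
from enum import Enum


class Option(Enum):
    """Hypothetical option identifiers, loosely modeled on CURLOPT_* names."""
    URL = 1
    VERBOSE = 2
    TIMEOUT = 3


# Each option dictates the type of the value that goes with it.
EXPECTED_TYPE = {
    Option.URL: str,
    Option.VERBOSE: bool,
    Option.TIMEOUT: int,
}


class ToyHandle:
    """A stand-in for an easy handle; it only stores options."""

    def __init__(self):
        self.options = {}

    def setopt(self, option, value):
        # One entry point for every option: adding an option means adding
        # a table entry, not a new function.
        expected = EXPECTED_TYPE[option]
        if type(value) is not expected:
            raise TypeError(
                f"{option.name} expects {expected.__name__}, "
                f"got {type(value).__name__}"
            )
        self.options[option] = value


handle = ToyHandle()
handle.setopt(Option.URL, "https://curl.se/")
handle.setopt(Option.TIMEOUT, 30)

try:
    handle.setopt(Option.TIMEOUT, "30")  # wrong type for this option
except TypeError as error:
    print(error)
```

The last call is the kind of mistake that a C compiler silently accepts when the third argument is a vararg.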

Future

Predicting what comes next is impossible, but if I were to guess I would say that we are likely to keep on adding options even in the future.

Looking back, we can see a fairly steady growth and I cannot see any recent developments in the project or in the surrounding ecosystems that would make us deviate from this path in the short term future at least.

Taking curl documentation quality up one more notch

TL;DR: test and verify as much as possible also in the documentation.

I’m a sloppy typist. When I write several words in a row, like for example when creating complete sentences for something like a blog post, one or two of the words end up slightly misspelled.

Sure, many editors and systems have runtime spellchecks these days and they make it easy to quickly fix typos, but not all systems are like that and there are also situations where there are many false positives due to formatting or just the range of “special” words. They also rarely yell at me when I overuse the word “very” or start sentences with “But”.

curl documentation

I work fiercely on making the curl and libcurl documentation top-notch: state of the art, good and complete. I want my users to feel that. Everything is documented, clearly and with details and examples.

I want and aim for libcurl to be the best documented software library in the world.

Good documentation does not come for free or easily. It requires dedicated work and a lot of effort put into it.

This is of course a never-ending effort as things change over time and we have an almost ridiculous amount of options and details to document.

The key to improve ourselves is of course two good old classics: tests and CI jobs. This works great even for documentation, and perhaps in particular for technical documentation that includes lots of symbols and name references that need to be correct.

As I have recently worked on tightening some bolts and made it harder to land typos, I wanted to take the opportunity to describe some of our ways.

symbols-in-versions

Early 2009 I had some interactions with people in the git project and we discussed their use of libcurl. As we introduce new features to curl over time, users who build with curl may want to write their code to conditionally use the new stuff if they have a new enough libcurl installed, or just skip those features if the installation is too old. git is an application like that. They use libcurl a lot and they offer to build with libcurl installations that are maybe a dozen years old.

I then created a file in the libcurl git repository that I named symbols-in-versions. It lists all publicly provided curl symbols and in which libcurl release they were introduced. A good resource for libcurl users. It took quite an effort to figure them all out after the fact.

Over time, the number of entries in this file has grown significantly.

Tests

Of course, in order to do good CI jobs, they need to have tests to run so we start there.

I will mention some test numbers below. The test numbers in curl do not have any inherent meaning, they are just unique identifiers that help us find the test source files and refer to tests and their failures easily.

Test 1119

Test 1119 was introduced in November 2010 as I realized I needed to make sure that symbols-in-versions (SIV) is kept up-to-date. It will be a useless document if it lags behind or misses symbols. It needs to include them all and the info needs to be correct.

I wrote a script that extracts all globally provided symbols in some curl header files and then verifies that they are all listed in SIV.

This test now made it very clear when we forgot to add a name to SIV, and it also pointed out if one of the names in SIV for example had a typo.
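The idea behind such a check can be sketched in a few lines. This is not the actual test script; the header text, the SIV text, the version numbers and the assumed layout (symbol name first on each line, then the version) are all made up for illustration.

```python
import re

# Made-up stand-ins for a public header and for symbols-in-versions.
HEADER_TEXT = """
#define CURLOPT_EXAMPLE_ONE 1
#define CURLOPT_EXAMPLE_TWO 2
#define CURLINFO_EXAMPLE 3
"""

SIV_TEXT = """
CURLOPT_EXAMPLE_ONE   7.1
CURLINFO_EXAMPLE      7.4.1
CURLOPT_EXAMPLE_TYPO  7.10
"""

# Assumed pattern for "looks like a public symbol" in this sketch.
SYMBOL = re.compile(r"\b(?:LIB)?CURL[A-Z0-9_]*\b")


def header_symbols(text):
    """Collect every name in the header that looks like a public symbol."""
    return set(SYMBOL.findall(text))


def siv_symbols(text):
    """The first column of each non-empty line is the symbol name."""
    return {line.split()[0] for line in text.splitlines() if line.strip()}


in_header = header_symbols(HEADER_TEXT)
in_siv = siv_symbols(SIV_TEXT)

# Symbols we forgot to list, and listed names that match no real symbol.
missing_from_siv = sorted(in_header - in_siv)
unknown_in_siv = sorted(in_siv - in_header)

print("missing from SIV:", missing_from_siv)
print("in SIV but not in headers:", unknown_in_siv)
```

Both directions matter: the first list catches a forgotten entry, the second catches a misspelled one.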

Test 1139

Scan SIV, figure out all existing options provided for three key libcurl functions: curl_easy_setopt, curl_multi_setopt and curl_easy_getinfo. Then verify that they all are mentioned in the respective “main” man page (curl_easy_setopt.3 etc), where they refer to the individual separate page for the option.

This test also verifies that the curl tool’s man page (curl.1) lists exactly the same set of command line options as is listed in the tool’s source code file tool_getparam.c and that is shown in the tool’s --help output. Consistency is king.

Test 1167

To make sure the symbols we provide in libcurl header files all use the correct name space we created test 1167. Using the correct name space in this context means that all publicly provided symbols need to start with curl or libcurl, case insensitively. It is important for several reasons: first of course because a good library does not pollute the name space to risk collisions and problems, but also because using the correct prefix is important so that test 1119 finds all the symbols correctly. So they need to use the right prefix, and when they do, they are scanned and verified correctly.
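The rule itself is small enough to show as a sketch. The symbol names below are hypothetical and the function is an illustration of the prefix rule, not the code test 1167 runs.

```python
def outside_namespace(symbols):
    """Return the symbols that do not start with "curl" or "libcurl",
    compared case-insensitively."""
    return [
        name for name in symbols
        if not name.lower().startswith(("curl", "libcurl"))
    ]


# Hypothetical symbol names, for illustration only.
candidates = [
    "curl_example_call",
    "CURLOPT_EXAMPLE",
    "LIBCURL_EXAMPLE",
    "example_helper",
]
offenders = outside_namespace(candidates)
print(offenders)
```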

Test 1173

For libcurl we have several function calls that take options. In some cases these functions accept a very large amount of different options. Every such option is documented in its own dedicated man page. Over time, with lots of contributors working on the project, the different man pages were not all including the same information in the same order and a huge portion of them even missed one of the most important details in programming documentation: examples.

Test 1173 checks all libcurl man pages and verifies that they have the eight mandatory libcurl sections present (NAME, SYNOPSIS, DESCRIPTION etc) and that they all are in the right order and that there is an example section that is more than 2 lines.

This test also does basic nroff formatting verification so that we know the page will look decent in a man page viewer too.

Helps us greatly – especially when we add new man pages.
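A rough sketch of this kind of structural check is shown below. It is an illustration under stated assumptions, not the real test: only the three section names mentioned above are checked (the remaining mandatory ones are not listed here), the example section is assumed to be titled EXAMPLE, and the sample page is made up.

```python
# Only the three sections named in the text; the real list has eight.
REQUIRED_IN_ORDER = ["NAME", "SYNOPSIS", "DESCRIPTION"]
EXAMPLE_SECTION = "EXAMPLE"  # assumed section title


def sections(page):
    """Map each .SH section title to the lines that follow it, in page order."""
    found = {}
    current = None
    for line in page.splitlines():
        if line.startswith(".SH "):
            current = line[4:].strip().strip('"')
            found[current] = []
        elif current is not None:
            found[current].append(line)
    return found


def check_page(page):
    """Return a list of problems; an empty list means the page passes."""
    found = sections(page)
    problems = [
        f"missing section {name}"
        for name in REQUIRED_IN_ORDER
        if name not in found
    ]
    present = [name for name in found if name in REQUIRED_IN_ORDER]
    expected = [name for name in REQUIRED_IN_ORDER if name in found]
    if present != expected:
        problems.append("sections out of order")
    # The example has to be more than 2 lines long.
    if len(found.get(EXAMPLE_SECTION, [])) <= 2:
        problems.append("example section missing or too short")
    return problems


# A made-up page that satisfies the checks above.
GOOD_PAGE = """.TH EXAMPLE_OPTION 3
.SH NAME
EXAMPLE_OPTION \\- a made-up option
.SH SYNOPSIS
#include <curl/curl.h>
.SH DESCRIPTION
Describes the made-up option.
.SH EXAMPLE
line one
line two
line three
"""

print(check_page(GOOD_PAGE))
```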

Proselint

The tool that taught me to stop using the word “very” is called proselint, and it also finds a lot of other common bad takes on English. Since a while back we run a CI job that runs proselint on all markdown files in the curl git repository. It helps us detect and edit away some amount of bad language.

Spellcheck

At the time of this writing, there are 482 individual libcurl related man pages and there is a total of around 85,000 lines of documentation in the project. I decided we should run a spellcheck on these man pages in an attempt to reduce the number of typos and mistakes.

The CI job I created for this first strips out some sections from the man pages that we deem too hard to spellcheck: the SYNOPSIS and the EXAMPLES sections for example. The script also removes all names that look like public curl symbols, as spellchecking them with a normal spellchecker is just impossible and they need special treatment. See further below for that.

Finally, we convert the stripped man page versions into markdown – because we have no spellchecker tools for nroff – and then spellcheck those.
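The stripping step could look something like the sketch below. It only covers the two things described above, dropping whole sections and removing names that look like curl symbols; it does not do the markdown conversion. The section titles, the symbol pattern and the sample page are assumptions made for this illustration.

```python
import re

# Sections assumed too hard to spellcheck in this sketch.
SKIPPED_SECTIONS = {"SYNOPSIS", "EXAMPLE", "EXAMPLES"}

# Assumed pattern for names that look like public curl symbols.
SYMBOL = re.compile(r"\b(?:CURL[A-Z0-9_]+|curl_[a-z0-9_]+)\b")


def text_to_spellcheck(page):
    """Drop the skipped sections and blank out symbol-like names."""
    kept = []
    skipping = False
    for line in page.splitlines():
        if line.startswith(".SH "):
            title = line[4:].strip().strip('"')
            skipping = title in SKIPPED_SECTIONS
            continue  # section headers themselves are not spellchecked
        if not skipping:
            kept.append(SYMBOL.sub("", line))
    return "\n".join(kept)


# A made-up man page fragment.
PAGE = """.SH NAME
CURLOPT_EXAMPLE \\- a made-up option
.SH SYNOPSIS
CURLcode curl_easy_setopt(CURL *handle, CURLOPT_EXAMPLE, long value);
.SH DESCRIPTION
Pass a long. Use curl_easy_setopt to set it.
"""

cleaned = text_to_spellcheck(PAGE)
print(cleaned)
```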

It took far more hours than I had anticipated to eradicate all the spelling mistakes and we ended up with a custom dictionary with over 800 words that aspell does not like but that I insist are valid for us.

Verify curl symbols

As I mentioned above, we strip out the curl symbols to hide them from the spellchecker.

Instead I extended the test 1119 mentioned above to also scan through all the libcurl man pages and find every single mention of something that looks like the name of a public curl symbol – and then match those against the names present in SIV and output an error if a symbol was referenced that was not documented already and therefore not actually a public curl symbol. With this, no man page can reference a non-existing curl symbol. Every such typo is detected.

Links for reporting on docs bugs

No matter how hard we try, there will always be errors that sneak in anyway and there will be sentences and phrasing that might have felt good at the time of writing but later, in the view of someone else, do not communicate the right message or maybe mislead users to misunderstand functionality.

Bug reports on documentation are key to finding such warts so that we can correct them. In the curl project we make it as easy as possible to report bugs in documentation by providing direct links on virtually all man pages shown on the website. The link takes you directly to the “new issue” page with a template subject filled in with the man page’s name.

This convenience unfortunately leads to a certain amount of “issue spam” but I think that is still a fairly cheap price to pay.

Everything curl

The book is a treasure trove of additional and complementary curl documentation but it is actually written and maintained outside of the curl repository. It has its own set of CI tests, including proselint and spellchecks.

Further

All these tests have been added gradually and slowly over a long period. It gives us time to polish and work out possible flaws in the tests and lets us make sure they work as intended and don’t block development.

I don’t have any immediate pending new pull requests for checking the curl documentation but if there still are details in there that we can check that we currently do not, I am sure that we will find those over time and make sure we verify them too.

If you have ideas and suggestions, I am all ears.

Related

Making world-class docs takes effort

convert a curl cmdline to libcurl source code

The dash-dash-libcurl is the sometimes missed curl gem that you might want to know about and occasionally maybe even use.

How do I convert

There is a very common pattern in curl land: a user who is writing an application using language L first does something successfully with the curl command line tool and then the person wants to convert that command line into the program they are writing.

How do you convert this curl command line into a transfer using programming language XYZ?

Language bindings

There is a huge amount of available bindings for libcurl. Bindings, as in adjustments and glue code for languages to use libcurl for Internet transfers so that they do not have to write the application in C.

At a recent count I found more than 60 libcurl bindings. They allow authors of virtually any programming language to do internet transfers using the powers of libcurl.

Most of these bindings are “thin” and follow the same style of the original C API: You create a handle for which you set options and then you do the transfer. This way they are fairly easy to keep up to date with the always-changing and always-improving libcurl releases.

Say hi to --libcurl

Most curl command line options are mapped one-to-one to an underlying libcurl option so for some time I tried to help users by explaining what they map to. Until one day I realized that

Hey! curl itself already has this mapping in source code, it just needs a way to show it to users!

How would it best output this mapping?

The --libcurl command line option was added to curl already back in 2007 and has been part of the tool since version 7.16.1.

Show this command line as libcurl code

It is really easy to use too.

  1. Create a complicated curl command line that does what you need it to do.
  2. Append --libcurl example.c to the command line and run it again.
  3. Inspect your newly generated file example.c for how you could write your application to do the same thing with libcurl.

If you want to use a libcurl binding rather than the C API, like perhaps write your code in Python or PHP, you need to convert the example code into your programming language, but thanks to the options keeping their names across the different bindings it is usually a trivial task.

Example

The simplest possible command line:

curl curl.se

It gets HTTP from the site “curl.se”. If you try it, you will not see anything because it just replies with a redirect to the HTTPS version of the page. But ignoring that, you can convert that action into a libcurl program like this:

curl curl.se --libcurl code.c

The newly created file code.c now contains a program that you can compile:

gcc -o getit code.c -lcurl

and then run

./getit

You might want your program to rather follow the redirect? Maybe even show debug output? Let’s rerun the command line and get a code update:

curl curl.se --verbose --location --libcurl code.c

Now, if you rebuild your program and run it again, it shows you the front page HTML of the curl website on stdout.

The code

The exact code my curl version 7.85.0 produced in the command line above is shown below.

You see several options that are commented out. Those were used by the command line tool but there is no easy or convenient way to show their use in the example. Often you can start out by just skipping those.

http://http://http://@http://http://?http://#http://

The other day I sent out this tweet

"http://http://http://@http://http://?http://#http://"

is a legitimate URL

As it took off, got an amazing attention and I received many different comments and replies, I felt a need to elaborate a little. To add some meat to this.

Is this string really a legitimate URL? What is a URL? How is it parsed?

http://http://http://@http://http://?http://#http://

curl

Let’s start with curl. It parses the string as a valid URL and it does it like this. I have color coded the sections below a bit to help illustrate:

http://http://http://@http://http://?http://#http://

Going from left to right.

http – the scheme of the URL. This speaks HTTP. The “://” separates the scheme from the following authority part (that’s basically everything up to the path).

http – the user name up to the first colon

//http:// – the password up to the at-sign (@)

http: – the host name, with a trailing colon. The colon is a port number separator, but a missing number will be treated by curl as the default number. Browsers also treat missing port numbers like this. The default port number for the http scheme is 80.

//http:// – the path. Multiple leading slashes are fine, and so is using a colon in the path. It ends at the question mark separator (?).

http:// – the query part. To the right of the question mark, but before the hash (#).

http:// – the fragment, also known as “anchor”. Everything to the right of the hash sign (#).
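The left-to-right breakdown above can be written out as a small sketch. This is emphatically not curl's parser: the function name and the order of the split steps are my own assumptions, and it is only meant to handle this one string. It merely shows that the list above accounts for every character in the URL.

```python
def split_like_the_list(url):
    """Split the URL into the parts listed above, for this string only."""
    scheme, rest = url.split("://", 1)
    rest, _, fragment = rest.partition("#")     # right of the hash
    rest, _, query = rest.partition("?")        # right of the question mark
    userinfo, _, rest = rest.partition("@")     # up to the at-sign
    user, _, password = userinfo.partition(":")  # split at the first colon
    slash = rest.index("/")
    host_and_port, path = rest[:slash], rest[slash:]
    host, _, port = host_and_port.partition(":")
    return {
        "scheme": scheme,
        "user": user,
        "password": password,
        "host": host,
        # A missing port number means the default, 80 for the http scheme.
        "port": int(port) if port else 80,
        "path": path,
        "query": query,
        "fragment": fragment,
    }


parts = split_like_the_list(
    "http://http://http://@http://http://?http://#http://"
)
for name, value in parts.items():
    print(f"{name:9} {value}")
```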

To use this URL with curl and serve it on your localhost try this command line:

curl "http://http://http://@http://http://?http://#http://" --resolve http:80:127.0.0.1

The curl parser

The curl parser has been around for decades already, and not breaking existing scripts and applications is one of our core principles. Thus, some of the choices in the URL parser were made a long time ago and we stand by them, even if they might be slightly questionable standards-wise. As if any standard meant much here.

The curl parser started its life as lenient as it could possibly be. While it has been made stricter over the years, traces of the original design still show. In addition to that, we have also felt forced to make adaptations to the parser to make it work with real world URLs thrown at it. URLs that maybe once were not deemed fine, but that have become “fine” because they are accepted in other tools and perhaps primarily in browsers.

URL standards

I have mentioned it many times before. What we all colloquially refer to as a URL is not something that has a firm definition:

There is the URI (not URL!) definition from IETF RFC 3986, there is the WHATWG URL Specification that browsers (try to) adhere to and there are numerous different implementations of parsers being more or less strict when following one or both of the above mentioned specifications.

You will find that when you scrutinize them into the details, hardly any two URL parsers agree on every little character.

Therefore, if you throw the above mentioned URL at any random URL parser, it may reject it (like the Twitter parser, which didn’t seem to think it was a URL) or it might come to a different conclusion about the different parts than curl does. In fact, it is likely that it will not do exactly as curl does.

Python’s urllib

April King threw it at Python’s urllib. A valid URL! While it accepted it as a URL, it split it differently:

ParseResult(scheme='http',
            netloc='http:',
            path='//http://@http://http://',
            params='',
            query='http://',
            fragment='http://')

Given the replies to my tweet, several other parsers made a similar interpretation. Possibly because they use the same underlying parser…
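The urllib result is easy to reproduce with the Python standard library. A minimal sketch, assuming a Python 3 version that splits the string the same way as the result shown above:

```python
from urllib.parse import urlparse

url = "http://http://http://@http://http://?http://#http://"
parts = urlparse(url)

print(parts)
```

Here the authority ends at the first slash after the scheme separator, so the netloc is just “http:” and the at-sign ends up inside the path instead of separating credentials from the host.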

JavaScript

Meduz showed how the JavaScript URL object treats it, and it looks quite similar to the Python above. Still a valid URL.

Firefox and Chrome

I added the host entry ‘127.0.0.1 http’ into /etc/hosts and pasted the URL into the address bar of Firefox. It rewrote it to

http://http//http://@http://http://?http://#http://

(the second colon from the left is removed, everything else is the same)

… but still considered it a valid URL and showed the contents from my web server.

Chrome behaved exactly the same. A valid URL according to it, and a rewrite of the URL like Firefox does.

RFC 3986

Some commenters mentioned that the unencoded “unreserved” letters in the authority part make it not RFC 3986 compliant. Section 3.2 says:

The authority component is preceded by a double slash ("//") and is terminated by the next slash ("/"), question mark ("?"), or number sign ("#") character, or by the end of the URI.

Meaning that the password should have its slashes URL-encoded as %2f to make it valid. At least. So maybe not a valid URL?

Update: it actually still qualifies as “valid”, it just is parsed a little differently than how curl does it. I do not think there is any question that curl’s interpretation is not matching RFC 3986.

HTTPS

The URL works equally fine with https.

https://https://https://@https://https://?https://#https://

The two reasons I did not use https in the tweet:

  1. It made it 7 characters longer for no obvious extra fun
  2. It is harder to prove working by serving your own content as you would need curl -k or similar to make the host name ‘https’ be OK vs the name used in the actual server you would target.

The URL Buffalo buffalo

A surprisingly large number of people thought it reminded them of the old buffalo buffalo thing.

A bug that was 23 years old or not

This is a tale of cookies, Internet code and a CVE. It goes back a long time so please take a seat, lean back and follow along.

The scene is of course curl, the internet transfer tool and library I work on.

1998

In October 1998 we shipped curl 4.9. In 1998. Few people had heard of curl or used it back then. This was a few months before the curl website would announce that curl achieved 300 downloads of a new release. curl was still small in every meaning of the word at that time.

curl 4.9 was the first release that shipped with the “cookie engine”. curl could then receive HTTP cookies, parse them, understand them and send back cookies properly in subsequent requests. Like the browsers did. I wrote the bigger part of the curl code for managing cookies.

In 1998, the only specification that existed and described how cookies worked was a very brief document that Netscape used to host called cookie_spec. I keep a copy of that document around for curious readers. It really does not document things very well and it leaves out enormous amounts of information that you had to figure out by inspecting other clients.

The cookie code I implemented then was based on that documentation and what the browsers seemed to do at the time. It seemed to work with numerous server implementations. People found good use for the feature.

2000s

This decade passed with a few separate efforts in the IETF to create cookie specifications but they all failed. The authors of these early cookie specs probably thought they could create standards and the world would magically adapt to them, but this did not work. Cookies are somewhat special in the regard that they are implemented by so many different authors, code bases and websites that fundamentally changing the way they work in a “decree from above” like that is difficult if not downright impossible.

RFC 6265

Finally, in 2011 there was a cookie RFC published! This time with the reversed approach: it primarily documented and clarified how cookies were actually already being used.

I was there and I helped it get made by providing my views and opinions. I did not agree to everything that the spec includes (you can find blog posts about some of those details), but finally having a proper spec was still a huge improvement to the previous state of the world.

Double syntax

What did not bother me much at the time, but has been giving me a bad rash ever since, is the peculiar way the spec is written: it provides one field syntax for how servers should send cookies, and a different one for what syntax clients should accept for cookies.

Two syntaxes for the same cookies.

This has at least two immediate downsides:

  1. It is hard to read the spec as it is very easy to fall over one of those and assume that syntax is valid for your use case and accidentally get the wrong role’s description.
  2. The syntax defining how to send cookies is not really relevant as the clients are the ones that decide if they should receive and handle the cookies. The existing large cookie parsers (== browsers) are all fairly liberal in what they accept so nobody notices nor cares if the servers don’t follow the stricter syntax in the spec.

RFC 6265bis

Since a few years back, there is ongoing work in IETF on revising and updating the cookie spec of 2011. Things have evolved and some extensions to cookies have been put into use in the world and deserve to be included in the spec. If you were to implement code today that manages cookies, the old RFC is certainly not enough anymore. This cookie spec update work is called 6265bis.

curl is up to date and compliant with what the draft versions of RFC 6265bis say.

The issue about the double syntax from above is still to be resolved in the document, but I faced unexpectedly tough resistance when I recently shared my opinions and thoughts about that spec peculiarity.

It can be noted that fundamentally, cookies still work the same way as they did back in 1998. There are added nuances and knobs sure, but the basic principles have remained. And will do so even in the cookie spec update.

One of the oddities of cookies is that they don’t work on origins like most other web features do.

HTTP Request tunneling

While cookies have evolved slowly over time, the HTTP specs have also been updated and refreshed a few times over the decades, but perhaps even more importantly the HTTP server implementations have implemented stricter parsing policies as they have (together with the rest of the world) learned that being liberal in what you accept (Postel’s law) easily leads to disasters. Like the dreaded and repeated HTTP request tunneling/smuggling attacks have shown us.

To combat this kind of attack, and probably to reduce the risk of other issues as well, HTTP servers started to reject incoming HTTP requests early if they appear “illegal” or malformed. Block them already at the door and do not let obvious crap in. In particular this goes for control codes in requests. If you try to send a request to a reasonably new HTTP server today that contains a control code, chances are very high that the server will reject the request and just return a 400 response code.

By control code I mean a byte value between 1 and 31 (excluding 9 which is TAB).

The well known HTTP server Apache httpd has this behavior enabled by default since 2.4.25, shipped in December 2016. Modern nginx versions seem to do this as well, but I have not investigated since exactly when.

Cookies for other hosts

If cookies were designed today for the first time, they certainly would be made differently.

A website that sets cookies sends cookies to the client. For each cookie it sends, it sets a number of properties for the cookie. In particular it sets matching parameters for when the cookie should be sent back again by the client.

One of these cookie parameters set for a cookie is the domain that needs to match for the client to send it. A server that is called www.example.com can set a cookie for the entire example.com domain, meaning that the cookie will then be sent by the client also when visiting second.example.com. Servers can set cookies for “sibling sites”!
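A simplified sketch of that matching rule is shown below. The function name is made up, and the sketch deliberately leaves out details a real client has to handle, such as IP address hosts and domains that are too broad to be allowed.

```python
def domain_matches(cookie_domain, host):
    """True if a cookie set for cookie_domain should be sent to host.

    Simplified: the host matches when it is the domain itself or any
    name below it. A leading dot in the cookie domain is ignored.
    """
    cookie_domain = cookie_domain.lstrip(".").lower()
    host = host.lower()
    return host == cookie_domain or host.endswith("." + cookie_domain)


# A cookie set by www.example.com for the whole example.com domain.
for host in ("www.example.com", "second.example.com", "badexample.com"):
    print(host, domain_matches("example.com", host))
```

The check on the dot before the domain is what keeps badexample.com from receiving cookies meant for example.com.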

Eventually the two paths merged

The cookie code added to curl in 1998 was quite liberal in what content it accepted and while it was of course adjusted and polished over the years, it was working and it was compatible with real world websites.

The main driver for changes in that area of the code has always been to make sure that curl works like and interoperates with other cookie-using agents out in the wild.

CVE-2022-35252

At the end of June 2022 we received a report of a suspected security problem in curl, that would later result in our publication of CVE-2022-35252.

As it turned out, the old cookie code from 1998 accepted cookies that contained control codes. The control codes could be part of the name or the content just fine, and if the user enabled the “cookie engine” curl would store those cookies and send them back in subsequent requests.

Example of a cookie curl would happily accept:

Set-Cookie: name^a=content^b; domain=.example.com

The ^a and ^b represent control codes, byte code one and two. Since the domain can mark the cookie for another host, as mentioned above, this cookie would get included for requests to all hosts within that domain.

When curl sends a cookie like that to an HTTP server, it would include a header field like this in its outgoing request:

Cookie: name^a=content^b

400

… to which a default configured Apache httpd and other servers will respond 400. For a script or an application that received these cookies, further requests will be denied for as long as the cookies keep getting sent. A denial of service.

What does the spec say?

The client side part of RFC 6265, section 5.2 is not easy to decipher and figuring out that a client should discard cookies with control codes requires deep studies of the document. There is in fact no mention of “control codes” or this byte range in the spec. I suppose I am just a bad spec reader.

Browsers

It is actually easier to spot what the popular browsers do since their source codes are easily available, and it turns out of course that both Chrome and Firefox already ignore incoming cookies that contain any of the bytes

%01-%08 / %0b-%0c / %0e-%1f / %7f

The range does not include %09, which is TAB and %0a / %0d which are line endings.

The fix

The curl fix was, not too surprisingly and quite simply, to refuse cookie fields that contain one or more of those banned byte values. As they are not accepted by the browsers already, the risk that any legitimate site is using them for any benign purpose is very slim and I deem this change to be nearly risk-free.
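The check is simple enough to sketch. This is an illustration of the idea in Python, not the C code that went into curl; the names are made up, and the byte ranges are the ones listed for the browsers above.

```python
# The bytes the browsers refuse: %01-%08, %0b-%0c, %0e-%1f and %7f.
BANNED = (
    set(range(0x01, 0x09))
    | {0x0B, 0x0C}
    | set(range(0x0E, 0x20))
    | {0x7F}
)


def has_banned_byte(field: bytes) -> bool:
    """True if the cookie field contains any of the banned byte values."""
    return any(byte in BANNED for byte in field)


def accept_cookie(name: bytes, value: bytes) -> bool:
    """Refuse the cookie if either its name or its content is tainted."""
    return not (has_banned_byte(name) or has_banned_byte(value))


print(accept_cookie(b"name", b"content"))
print(accept_cookie(b"name\x01", b"content\x02"))
```

The second call corresponds to the Set-Cookie example with ^a and ^b shown earlier.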

The age of the bug

The vulnerable code has been in curl versions since version 4.9 which makes it exactly 8,729 days (23.9 years) until the shipped version 7.85.0 that fixed it. It also means that we introduced the bug on project day 201 and fixed it on day 8,930.

The code was not problematic when it shipped and it was not problematic during a huge portion of the time it has been used by a large number of users.

It became problematic when HTTP servers started to refuse HTTP requests they suspected could be malicious. The way this code turned into a denial of service was therefore more or less just collateral damage. An unfortunate side effect.

Maybe the bug was born first when RFC 6265 was published. Maybe it was born when the first widely used HTTP server started to reject these requests.

Project record

8,729 days is a new project record age for a CVE to have been present in the code until found. It is still only the fourth CVE that lingered around for over 8,000 days until found.

Credits

Thanks to Stefan Eissing for digging up historic Apache details.

Axel Chong submitted the CVE-2022-35252 report.

Campfire image by Martin Winkler from Pixabay

curl’s TLS fingerprint

Every human has a unique fingerprint. With only an impression of a person’s fingertip, it is possible to follow the lead back to the single specific individual wearing that unique pattern.

TLS fingerprints

The phrase TLS fingerprint is of course in this spirit. A pattern in a TLS handshake that allows an involved party to tell, or at least guess with a certain level of accuracy, which client software performed it, purely based on how exactly the TLS magic is done. There are numerous different ways and variations in which a client can perform a TLS handshake and still be standards compliant. There is a long list of extensions that can vary in content, the order of the list of extensions, the ciphers to accept, the allowed TLS versions, steps performed, the order and sequence of those steps and more.

When a network client connects to a remote site and makes a TLS handshake with the server, the server can basically add up all those details and make an educated guess at exactly which client connects to it. One method to do it is called JA3 and produces a 32-digit hexadecimal number as output. (The three creators of this algorithm all have JA as their initials!)
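As far as I understand the published JA3 description, the input to the hash is a string of five comma-separated fields from the Client Hello (TLS version, ciphers, extensions, elliptic curves, elliptic curve point formats), each holding decimal values joined with dashes, and the 32 hex digits are the MD5 of that string. Below is a minimal sketch in C of building that string. The struct and function names are made up for this example, and the MD5 step is left out.

```c
#include <stddef.h>
#include <stdio.h>

/* One Client Hello field as a list of numeric values (hypothetical type). */
struct hello_field {
  const unsigned int *vals;
  size_t count;
};

/* Build the JA3 input string from five fields, in this order: TLS version,
   ciphers, extensions, elliptic curves, point formats.
   Returns the string length, or -1 if 'out' is too small. */
int ja3_string(const struct hello_field fields[5], char *out, size_t size)
{
  size_t pos = 0;
  if(size == 0)
    return -1;
  out[0] = '\0';
  for(int f = 0; f < 5; f++) {
    if(f > 0) {
      /* fields are separated by a comma */
      if(pos + 1 >= size)
        return -1;
      out[pos++] = ',';
      out[pos] = '\0';
    }
    for(size_t i = 0; i < fields[f].count; i++) {
      /* values within a field are separated by a dash */
      int n = snprintf(out + pos, size - pos, "%s%u",
                       i ? "-" : "", fields[f].vals[i]);
      if(n < 0 || (size_t)n >= size - pos)
        return -1;
      pos += (size_t)n;
    }
  }
  return (int)pos;
}
```

With the arbitrary values TLS version 769, ciphers 47 and 53, extensions 0, 10 and 11, curves 23, 24 and 25 and point format 0, this builds the string 769,47-53,0-10-11,23-24-25,0. Change the order of two extensions and the string, and therefore the hash, comes out different.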

In use out there

I have recently talked with customers and users who have faced servers that refused them access when they connected to sites using curl, but allowed them access to the site when they instead used one of the popular browsers, or if curl was tweaked to look like one of those browsers. It might be a trend in the making. There might be more sites out there now that reject clients that produce the wrong fingerprint than there used to be.

Why

Presumably there are many reasons why servers want to limit access to a subset of clients, but I think the general idea is that they want to prevent “illegitimate” user agents from accessing their sites.

For example, I have seen online market sites use this method in what I have perceived as an attempt to block bots and web scrapers. Or they do it to block malware or other hostile clients that scour their website.

How

There’s this JA3 page that shows lots of implementations for many services that can figure out clients’ TLS fingerprints and act on them. And there’s nothing that says you have to do it with JA3. There’s likely to be numerous other ways and algorithms as well.

There are also companies that offer commercial services to filter out mismatching clients from your site. This is real business.

A TLS Client hello message has lots of info.

Other fingerprinting

In the earlier days of the web, web sites used more basic ways to detect and filter out bots and non-browser clients. The original and much simpler way is to check the User-Agent: field that HTTP clients pass on, but this has also sometimes been extended to check the order of the sent HTTP headers, and in some cases servers have used elaborate JavaScript schemes in order to try to “smoke out” the clients that don’t seem to act like full-fledged browsers.

If the clients use HTTP/2, that too allows for more details to fingerprint.

As the web has transitioned over to almost exclusively using HTTPS, the number of ways a server can fingerprint clients has increased severely, and at the same time it has become harder for non-browser clients to look exactly like browsers.

Allow list or block list

Sites that use TLS fingerprints to allow access of course do not want too many false positives. They want to allow all “normal” browser-based visitors, even if they use slightly older versions and also if they use somewhat older or less common operating systems.

This means that they either have to work hard to get an extensive list of acceptable hashes in an accept list or they add known non-desired clients in a block list. I would imagine that the accept list route is how companies can sell this as a service, as it is maintenance-intensive work.

Users of alternative and niche browsers are sometimes also victims in this scheme if they stand out enough.
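The difference between the two approaches shows up for clients nobody has seen before. A minimal sketch in C, assuming the fingerprints are kept as plain hex strings; all function names here are made up for this example and do not come from any real filtering product.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: is this fingerprint hash present in the list? */
bool in_list(const char *const *list, size_t count, const char *hash)
{
  for(size_t i = 0; i < count; i++) {
    if(!strcmp(list[i], hash))
      return true;
  }
  return false;
}

/* Accept list: only known-good fingerprints get in, so an unknown
   client is rejected. */
bool accept_list_allows(const char *const *good, size_t count,
                        const char *hash)
{
  return in_list(good, count, hash);
}

/* Block list: everything gets in except known-bad fingerprints, so an
   unknown client is allowed. */
bool block_list_allows(const char *const *bad, size_t count,
                       const char *hash)
{
  return !in_list(bad, count, hash);
}
```

A niche browser with a fingerprint on neither list is turned away by the accept list but let through by the block list, which is why the accept list needs so much more upkeep.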

Altering the fingerprint

The TLS fingerprints have the interesting feature compared to human fingertip prints, that they are the result of a set of deliberate actions and not just a pattern you are born to wear. They are therefore a lot easier to change.

With curl version C using TLS library T of version V, the TLS fingerprint is a function that involves C, T and V, and the options O set by curl. By changing one or more of those variables, you are likely to alter the TLS fingerprint.

Match a browser exactly

To be most certain that no site will reject your curl request because of its TLS fingerprint, you would alter the print to look exactly like the one of a popular browser. You can suspect that most sites want their regular human browser-using visitors to be able to access them.

To make curl look exactly like a browser you also likely need to do more than just change C, O, T and V from the section above. You also need to make sure that the TLS library you use produces its lists of extensions and ciphers in exactly the same order etc. This may require that you alter options and maybe even source code.

curl-impersonate

This is a custom build of curl that can impersonate the four major browsers: Chrome, Edge, Safari & Firefox. curl-impersonate performs TLS and HTTP handshakes that are identical to those of a real browser.

curl-impersonate is a modified curl build and the project also provides docker images and more to help users to use it easily.

I cannot say right now if any of the changes done for curl-impersonate will get merged into the upstream curl project. It will depend on what users want and how the use of TLS fingerprinting spreads or changes going forward.

Program a browser

Another popular way to work around this kind of blocking is to simply program a browser to do the job. Either a headless browser or with tools like Selenium. Since these methods make the TLS handshake using a browser “engine”, they are unlikely to get blocked by these filters.

Cat and mouse

Servers add more hurdles to attempt to block unwanted clients.

Clients change to keep up with the servers and to still access the sites in spite of what the server admins want.

Future

As recently as a few years ago I had never heard of any site that blocked clients because of their TLS handshake. Through recent years I have seen it happen and the use of it seems to have increased. I don’t know of any way to measure if this is actually true or just my feeling.

I cannot rule out that we are going to see this more going forward, even if I also believe that the work on circumventing these fingerprinting filters is just getting started. If the circumvention grows and becomes easy enough, maybe it will deter servers from adding these filters as they will not be effective anyway?

Let us come back to this topic in a few years and see where it went.

Credits

Fingerprint image by Hebi B. from Pixabay