All posts by Daniel Stenberg

800 authors and counting

Today marks the day when we merged the commit authored by the 800th person in the curl project.

We turned 22 years ago this spring but it really wasn’t until 2010 when we switched to git when we started to properly keep track of every single author in the project. Since then we’ve seen a lot of new authors and a lot of new code.

The “explosion” is clearly visible in this graph generated with fresh data just this morning (while we were still just 799 authors). See how we’ve grown maybe 250 authors since 1 Jan 2018.

Author number 800 is named Nicolas Sterchele and he submitted an update of the TODO document. Appreciated!

As the graph above also shows, a majority of all authors only ever authored a single commit. If you did 10 commits in the curl project, you reach position #61 among all the committers while 100 commits takes you all the way up to position #13.

Become one!

If you too want to become one of the cool authors of curl, I fine starting point for that journey could be the Help Us document. If that’s not enough, you’re also welcome to contact me privately or maybe join the IRC channel for some socializing and “group mentoring”.

If we keep this up, we could reach a 1,000 authors in 2022…

curl ootw: –ftp-skip-pasv-ip

(Other command line options of the week.)

--ftp-skip-pasv-ip has no short option and it was added to curl in 7.14.2.

Crash course in FTP

Remember how FTP is this special protocol for which we create two connections? One for the “control” where we send commands and read responses and then a second one for the actual data transfer.

When setting up the second connection, there are two ways to do it: the active way and the passive way. The wording there is basically in the eyes of the FTP server: should the server be active or passive in the creation and that’s the key. The traditional underlying FTP commands to do this is either PORT or PASV.

Due to the prevalence of firewalls and other network “complications” these days, the passive style is dominant for FTP. That’s when the client asks the server to listen on a new port (by issuing the PASV command) and then the client connects to the server with a second connection.

The PASV response

When a server responds to a PASV command that the client sends to it, it sends back an IPv4 address and a port number for the client to connect to – in a rather arcane way that looks like this:

227 Entering Passive Mode (192,168,0,1,156,64)

This says the server listens to the IPv4 address 192.168.0.1 on port 40000 (== 156 x 256 + 64).

However, sometimes the server itself isn’t perfectly aware of what IP address it actually is accessible as “from the outside”. Maybe there’s a NAT involved somewhere, maybe there are even more than one NAT between the client and the server.

We know better

For the cases when the server responds with a crazy address, curl can be told to ignore the address in the response and instead assume that the IP address used for the control connection will in fact work for the data connection as well – this is generally true and has actually become even more certain over time as FTP servers these days typically never return a different IP address for PASV.

Enter the “we know better than you” option --ftp-skip-pasv-ip.

What about IPv6 you might ask

The PASV command, as explained above, is explicitly only working with IPv4 as it talks about numerical IPv4 addresses. FTP was actually first described in the early 1970s, quite a lot time before IPv6 was born.

When FTP got support for IPv6, another command was introduced as a PASV replacement.: the EPSV command. If you run curl with -v (verbose mode) when doing FTP transfers, you will see that curl does indeed first try to use EPSV before it eventually falls back and tries PASV if the previous command doesn’t work.

The response to the EPSV command doesn’t even include an IP address but then it always assumes the same address as the control connection and it only returns back a TCP port number.

Example

Download a file from that server giving you a crazy PASV response:

curl --ftp-skip-pasv-ip ftp://example.com/file.txt

Related options

Change to active FTP mode with --ftp-port, switch off EPSV attempts with --disable-epsv.

on-demand buffer alloc in libcurl

Okay, so I’ll delve a bit deeper into the libcurl internals than usual here. Beware of low-level talk!

There’s a never-ending stream of things to polish and improve in a software project and curl is no exception. Let me tell you what I fell over and worked on the other day.

Smaller than what holds Linux

We have users who are running curl on tiny devices, often put under the label of Internet of Things, IoT. These small systems typically have maybe a megabyte or two of ram and flash and are often too small to even run Linux. They typically run one of the many different RTOS flavors instead.

It is with these users in mind I’ve worked on the tiny-curl effort. To make curl a viable alternative even there. And believe me, the world of RTOSes and IoT is literally filled with really low quality and half-baked HTTP client implementations. Often certainly very small but equally as often with really horrible shortcuts or protocol misunderstandings in them.

Going with curl in your IoT device means going with decades of experience and reliability. But for libcurl to be an option for many IoT devices, a libcurl build has to be able to get really small. Both the footprint on storage but also in the required amount of dynamic memory used while executing.

Being feature-packed and attractive for the high-end users and yet at the same time being able to get really small for the low-end is a challenge. And who doesn’t like a good challenge?

Reduce reduce reduce

I’ve set myself on a quest to make it possible to build libcurl smaller than before and to use less dynamic memory. The first tiny-curl releases were only the beginning and I already then aimed for a libcurl + TLS library within 100K storage size. I believe that goal was met, but I also think there’s more to gain.

I will make tiny-curl smaller and use less memory by making sure that when we disable parts of the library or disable specific features and protocols at build-time, they should no longer affect storage or dynamic memory sizes – as far as possible. Tiny-curl is a good step in this direction but the job isn’t done yet – there’s more “dead meat” to carve off.

One example is my current work (PR #5466) on making sure there’s much less proxy remainders left when libcurl is built without support for such. This makes it smaller on disk but also makes it use less dynamic memory.

To decrease the maximum amount of allocated memory for a typical transfer, and in fact for all kinds of transfers, we’ve just switched to a model with on-demand download buffer allocations (PR #5472). Previously, the download buffer for a transfer was allocated at the same time as the handle (in the curl_easy_init call) and kept allocated until the handle was cleaned up again (with curl_easy_cleanup). Now, we instead lazy-allocate it first when the transfer starts, and we free it again immediately when the transfer is over.

It has several benefits. For starters, the previous initial allocation would always first allocate the buffer using the default size, and the user could then set a smaller size that would realloc a new smaller buffer. That double allocation was of course unfortunate, especially on systems that really do want to avoid mallocs and want a minimum buffer size.

The “price” of handling many handles drastically went down, as only transfers that are actively in progress will actually have a receive buffer allocated.

A positive side-effect of this refactor, is that we could now also make sure the internal “closure handle” actually doesn’t use any buffer allocation at all now. That’s the “spare” handle we create internally to be able to associate certain connections with, when there’s no user-provided handles left but we need to for example close down an FTP connection as there’s a command/response procedure involved.

Downsides? It means a slight increase in number of allocations and frees of dynamic memory for doing new transfers. We do however deem this a sensible trade-off.

Numbers

I always hesitate to bring up numbers since it will vary so much depending on your particular setup, build, platform and more. But okay, with that said, let’s take a look at the numbers I could generate on my dev machine. A now rather dated x86-64 machine running Linux.

For measurement, I perform a standard single transfer getting a 8GB file from http://localhost, written to stderr:

curl -s http://localhost/8GB -o /dev/null

With all the memory calls instrumented, my script counts the number of memory alloc/realloc/free/etc calls made as well as the maximum total memory allocation used.

The curl tool itself sets the download buffer size to a “whopping” 100K buffer (as it actually makes a difference to users doing for example transfers from localhost or other really high bandwidth setups or when doing SFTP over high-latency links). libcurl is more conservative and defaults it to 16K.

This command line of course creates a single easy handle and makes a single HTTP transfer without any redirects.

Before the lazy-alloc change, this operation would peak at 168978 bytes allocated. As you can see, the 100K receive buffer is a significant share of the memory used.

After the alloc work, the exact same transfer instead ended up using 136188 bytes.

102,400 bytes is for the receive buffer, meaning we reduced the amount of “extra” allocated data from 66578 to 33807. By 49%

Even tinier tiny-curl: in a feature-stripped tiny-curl build that does HTTPS GET only with a mere 1K receive buffer, the total maximum amount of dynamically allocated memory is now below 25K.

Caveats

The numbers mentioned above only count allocations done by curl code. It does not include memory used by system calls or, when used, third party libraries.

Landed

The changes mentioned in this blog post have landed in the master branch and will ship in the next release: curl 7.71.0.

curl ootw: –socks5

(Previous option of the week posts.)

--socks5 was added to curl back in 7.18.0. It takes an argument and that argument is the host name (and port number) of your SOCKS5 proxy server. There is no short option version.

Proxy

A proxy, often called a forward proxy in the context of clients, is a server that the client needs to connect to in order to reach its destination. A middle man/server that we use to get us what we want. There are many kinds of proxies. SOCKS is one of the proxy protocols curl supports.

SOCKS

SOCKS is a really old proxy protocol. SOCKS4 is the predecessor protocol version to SOCKS5. curl supports both and the newer version of these two, SOCKS5, is documented in RFC 1928 dated 1996! And yes: they are typically written exactly like this, without any space between the word SOCKS and the version number 4 or 5.

One of the more known services that still use SOCKS is Tor. When you want to reach services on Tor, or the web through Tor, you run the client on your machine or local network and you connect to that over SOCKS5.

Which one resolves the host name

One peculiarity with SOCKS is that it can do the name resolving of the target server either in the client or have it done by the proxy. Both alternatives exists for both SOCKS versions. For SOCKS4, a SOCKS4a version was created that has the proxy resolve the host name and for SOCKS5, which is really the topic of today, the protocol has an option that lets the client pass on the IP address or the host name of the target server.

The --socks5 option makes curl itself resolve the name. You’d instead use --socks5-hostname if you want the proxy to resolve it.

--proxy

The --socks5 option is basically considered obsolete since curl 7.21.7. This is because starting in that release, you can now specify the proxy protocol directly in the string that you specify the proxy host name and port number with already. The server you specify with --proxy. If you use a socks5:// scheme, curl will go with SOCKS5 with local name resolve but if you instead use socks5h:// it will pick SOCKS5 with proxy-resolved host name.

SOCKS authentication

A SOCKS5 proxy can also be setup to require authentication, so you might also have to specify name and password in the --proxy string, or set separately with --proxy-user. Or with GSSAPI, so curl also supports --socks5-gssapi and friends.

Examples

Fetch HTTPS from example.com over the SOCKS5 proxy at socks5.example.org port 1080. Remember that –socks5 implies that curl resolves the host name itself and passes the address to to use to the proxy.

curl --socks5 socks5.example.org:1080 https://example.com/

Or download FTP over the SOCKS5 proxy at socks5.example port 9999:

curl --socks5 socks5.example:9999 ftp://ftp.example.com/SECRET

Useful trick!

A very useful trick that involves a SOCKS proxy is the ability OpenSSH has to create a SOCKS tunnel for us. If you sit at your friends house, you can open a SOCKS proxy to your home machine and access the network via that. Like this. First invoke ssh, login to your home machine and ask it to setup a SOCKS proxy:

ssh -D 8080 user@home.example.com

Then tell curl (or your browser, or both) to use this new SOCKS proxy when you want to access the Internet:

curl --socks5 localhost:8080 https:///www.example.net/

This will effectively hide all your Internet traffic from your friends snooping and instead pass it all through your encrypted ssh tunnel.

Related options

As already mentioned above, --proxy is typically the preferred option these days to set the proxy. But --socks5-hostname is there too and the related --socks4 and --sock4a.

AI-powered code submissions

Who knows, maybe May 18 2020 will mark some sort of historic change when we look back on this day in the future.

On this day, the curl project received the first “AI-powered” submitted issues and pull-requests. They were submitted by MonocleAI, which is described as:

MonocleAI, an AI bug detection and fixing platform where we use AI & ML techniques to learn from previous vulnerabilities to discover and fix future software defects before they cause software failures.

I’m sure these are still early days and we can’t expect this to be perfected yet, but I would still claim that from the submissions we’ve seen so far that this is useful stuff! After I tweeted about this “event”, several people expressed interest in how well the service performs, so let me elaborate on what we’ve learned already in this early phase. I hope I can back in the future with updates.

Disclaimers: I’ve been invited to try this service out as an early (beta?) user. No one is saying that this is complete or that it replaces humans. I have no affiliation with the makers of this service other than as a receiver of their submissions to the project I manage. Also: since this service is run by others, I can’t actually tell how much machine vs humans this actually is or how much human “assistance” the AI required to perform these actions.

I’m looking forward to see if we get more contributions from this AI other than this first batch that we already dealt with, and if so, will the AI get better over time? Will it look at how we adjusted its suggested changes? We know humans adapt like that.

Pull-request quality

Monocle still needs to work on adapting its produced code to follow the existing code style when it submits a PR, as a human would. For example, in curl we always write the assignment that initializes a variable to something at declaration time immediately on the same line as the declaration. Like this:

int name = 0;

… while Monocle, when fixing cases where it thinks there was an assignment missing, adds it in a line below, like this:

int name;
name = 0;

I can only presume that in some projects that will be the preferred style. In curl it is not.

White space

Other things that maybe shouldn’t be that hard for an AI to adapt to, as you’d imagine an AI should be able to figure out, is other code style issues such as where to use white space and where not no. For example, in the curl project we write pointers like char * or void *. That is with the type, a space and then an asterisk. Our code style script will yell if you do this wrong. Monocle did it wrong and used it without space: void*.

C89

We use and stick to the most conservative ANSI C version in curl. C89/C90 (and we have CI jobs failing if we deviate from this). In this version of C you cannot mix variable declarations and code. Yet Monocle did this in one of its PRs. It figured out an assignment was missing and added the assignment in a new line immediately below, which of course is wrong if there are more variables declared below!

int missing;
missing = 0; /* this is not C89 friendly */
int fine = 0;

NULL

We use the symbol NULL in curl when we zero a pointer . Monocle for some reason decided it should use (void*)0 instead. Also seems like something virtually no human would do, and especially not after having taken a look at our code…

The first issues

MonocleAI found a few issues in curl without filing PRs for them, and they were basically all of the same kind of inconsistency.

It found function calls for which the return code wasn’t checked, while it was checked in some other places. With the obvious and rightful thinking that if it was worth checking at one place it should be worth checking at other places too.

Those kind of “suspicious” code are also likely much harder fix automatically as it will include decisions on what the correct action should actually be when checks are added, or perhaps the checks aren’t necessary…

Credits

Image by Couleur from Pixabay

curl ootw: –range

--range or -r for short. As the name implies, this option is for doing “range requests”. This flag was available already in the first curl release ever: version 4.0. This option requires an extra argument specifying the specific requested range. Read on the learn how!

What exactly is a range request?

Get a part of the remote resource

Maybe you have downloaded only a part of a remote file, or maybe you’re only interested in getting a fraction of a huge remote resource. Those are two situations in which you might want your internet transfer client to ask the server to only transfer parts of the remote resource back to you.

Let’s say you’ve tried to download a 32GB file (let’s call it a huge file) from a slow server on the other side of the world and when you only had 794 bytes left to transfer, the connection broke and the transfer was aborted. The transfer took a very long time and you prefer not to just restart it from the beginning and yet, with many file formats those final 794 bytes are critical and the content cannot be handled without them.

We need those final 794 bytes! Enter range requests.

With range requests, you can tell curl exactly what byte range to ask for from the server. “Give us bytes 12345-12567” or “give us the last 794 bytes”. Like this:

curl --range 12345-12567 https://example.com/

and:

curl --range -794 https://example.com/

This works with curl with several different protocols: HTTP(S), FTP(S) and SFTP. With HTTP(S), you can even be more fancy and ask for multiple ranges in the same request. Maybe you want the three sections of the resource?

curl --range 0-1000,2000-3000,4000-5000 https://example.com/

Let me again emphasize that this multi-range feature only exists for HTTP(S) with curl and not with the other protocols, and the reason is quite simply that HTTP provides this by itself and we haven’t felt motivated enough to implement it for the other protocols.

Not always that easy

The description above is for when everything is fine and easy. But as you know, life is rarely that easy and straight forward as we want it to be and nether is the --range option. Primarily because of this very important detail:

Range support in HTTP is optional.

It means that when curl asks for a particular byte range to be returned, the server might not obey or care and instead it delivers the whole thing anyway. As a client we can detect this refusal, since a range response has a special HTTP response code (206) which won’t be used if the entire thing is sent back – but that’s often of little use if you really want to get the remaining bytes of a larger resource out of which you already have most downloaded since before.

One reason it is optional for HTTP and why many sites and pages in the wild refuse range requests is that those sites and pages generate contend on demand, dynamically. If we ask for a byte range from a static file on disk in the server offering a byte range is easy. But if the document is instead the result of lots of scripts and dynamic content being generated uniquely in the server-side at the time of each request, it isn’t.

HTTP 416 Range Not Satisfiable

If you ask for a range that is outside of what the server can provide, it will respond with a 416 response code. Let’s say for example you download a complete 200 byte resource and then you ask that server for the range 200-202 – you’ll get a 416 back because 200 bytes are index 0-199 so there’s nothing available at byte index 200 and beyond.

HTTP other ranges

--range for HTTP content implies “byte ranges”. There’s this theoretical support for other units of ranges in HTTP but that’s not supported by curl and in fact is not widely used over the Internet. Byte ranges are complicated enough!

Related command line options

curl also offers the --continue-at (-C) option which is a perhaps more user-friendly way to resume transfers without the user having to specify the exact byte range and handle data concatenation etc.

Help curl: the user survey 2020

The annual curl user survey is up. If you ever used curl or libcurl during the last year, please consider donating ten minutes of your time and fill in the question on the link below!

[no longer open]

The survey will be up for 14 days. Please share this with your curl-using friends as well and ask them to contribute. This is our only and primary way to find out what users actually do with curl and what you want with it – and don’t want it to do!

The survey is hosted by Google forms. The curl project will not track users and we will not ask who you are (and than some general details to get a picture of curl users in general).

The analysis from the 2019 survey is available.

curl ootw: -Y, –speed-limit

(Previous options of the week)

Today we take a closer look at one of the real vintage curl options. It was added already in early 1999. -Y for short, --speed-limit is the long version. This option needs an additional argument: <speed>. Let me describe exactly what that speed is and how it works below.

Slow or stale transfers

Very early on in curl’s lifetime, it became obvious to us that lots of times when you use curl in order to do an Internet transfer, that transfer could sometimes take a long time. Occasionally even ridiculously long times and it could seem that the transfers just stalled without any hope of resurrecting and completing its mission.

How do you tell curl to abandon such lost-hope transfers? The options we provide for timeouts provide one answer. But since the transfer speeds can vary greatly from time to time and machine to machine, you have to use a timeout value that uses an insane margin, which for the cases when everything flies fast turns out annoying.

We needed another way to detect and abort these stale transfers. Enter speed limit.

Lower than speed-limit bytes per second during speed-time

The --speed-limit <speed> you tell curl is the transfer speed threshold below which you think the transfer is untypically slow, specified as bytes per second. If you have a really fast Internet, you might for example think that a transfer that is below 1000 bytes/second is a sign of something not being right.

But just measuring a the transfer speed to be below that special threshold in a single snapshot is not a strong enough signal for curl to act on it. The speed also needs to be measure below that threshold during --limit-time <seconds>. If the transfer speed just incidentally sometimes and very quickly drops below the threshold (bad wifi?) that’s not a reason for concern. The default limit time (when --limit-speed is used without --limit-time set) is 30. The transfer speed needs to be measured below the threshold for that many consecutive seconds (and it samples once per second).

If curl deems that your transfer speed was too slow during the given period, it will break the transfer and return 28. Timeout.

These two options are entirely protocol independent and work for all transfers using any of the protocols curl supports.

Examples

Tell curl to give up the transfer if slower than 1000 bytes per second during 20 seconds:

curl --speed-limit 1000 --speed-time 20 https://example.com

Tell curl to give up the transfer if slower than 100000 bytes per second during 60 seconds:

curl --speed-limit 100000 --speed-time 60 https://example.com

It also works the same for uploads. If the speed is below 2000 bytes per second during 45 seconds, abort:

curl --speed-limit 2000 --speed-time 45 ftp://example.com/upload/ -T sendaway.txt

Related options

--max-time and --connect-timeout are options with similar functionality and purpose, and you can indeed in many cases add those as well.

Manual cURL cURL

The HP Color LaserJet CP3525 Printer looks like any other ordinary printer done by HP. But there’s a difference!

A friend of mine fell over this gem, and told me.

TCP/IP Settings

If you go to the machine’s TCP/IP settings using the built-in web server, the printer offers the ordinary network configure options but also one that sticks out a little exta. The “Manual cURL cURL” option! It looks like this:

I could easily confirm that this is genuine. I did this screenshot above by just googling for the string and printer model, since there appears to exist printers like this exposing their settings web server to the Internet. Hilarious!

What?

How on earth did that string end up there? Certainly there’s no relation to curl at all except for the actual name used there? Is it a sign that there’s basically no humans left at HP that understand what the individual settings on that screen are actually meant for?

Given the contents in the text field, a URL containing the letters WPAD twice, I can only presume this field is actually meant for Web Proxy Auto-Discovery. I spent some time trying to find the user manual for this printer configuration screen but failed. It would’ve been fun to find “manual cURL cURL” described in a manual! They do offer a busload of various manuals, maybe I just missed the right one.

Does it use curl?

Yes, it seems HP generally use curl at least as I found the “Open-Source Software License Agreements for HP LaserJet and ScanJet Printers” and it contains the curl license:

The curl license as found in the HP printer open source report.

HP using curl for Print-Uri?

Independently, someone else recently told me about another possible HP + curl connection. This user said his HP printer makes HTTP requests using the user-agent libcurl-agent/1.0:

I haven’t managed to get this confirmed by anyone else (although the license snippet above certainly implies they use curl) and that particular user-agent string has been used everywhere for a long time, as I believe it is copied widely from the popular libcurl example getinmemory.c where I made up the user-agent and put it there already in 2004.

Credits

Frank Gevaerts tricked me into going down this rabbit hole as he told me about this string.