All posts by Daniel Stenberg

curl 8.2.0

Welcome to another curl release. You know how this dance goes…

Numbers

the 220th release
5 changes
50 days (total: 9,252)

122 bug-fixes (total: 9,167)
177 commits (total: 30,606)
0 new public libcurl function (total: 91)
1 new curl_easy_setopt() option (total: 303)

4 new curl command line options (total: 255)
55 contributors, 34 new (total: 2,922)
35 authors, 20 new (total: 1,170)
1 security fix (total: 146)

Release presentation

Security

fopen race condition (medium)

CVE-2023-32001. libcurl can be told to save cookies, HSTS and/or alt-svc data to files. When doing this, it called stat() followed by fopen() in a way that made it vulnerable to a TOCTOU (Time of Check, Time of Use) race condition problem.

By exploiting this flaw, an attacker could trick the victim into creating or overwriting protected files holding this data in ways that were not intended.
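The usual way to avoid this class of problem is to operate on an already opened file instead of on a file name: open the file first and then inspect the resulting descriptor with fstat(), so there is no window between the check and the use. A minimal sketch of that general pattern (not the actual curl patch) could look like this:

#include <stdio.h>
#include <sys/stat.h>

/* TOCTOU-safe pattern sketch: open first, then check the descriptor.
   The check and the use operate on the same open file, so nothing can
   be swapped in between. Not the exact fix used in curl. */
static FILE *open_data_file(const char *path)
{
  struct stat sb;
  FILE *fp = fopen(path, "ab");
  if(!fp)
    return NULL;
  if(fstat(fileno(fp), &sb) || !S_ISREG(sb.st_mode)) {
    fclose(fp);  /* refuse anything that is not a regular file */
    return NULL;
  }
  return fp;
}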

Changes

curl: add --ca-native and --proxy-ca-native

The command line tool (and library) got new options to ask it to use the system’s “native” CA store. These currently only work on Windows when curl is built to use OpenSSL (or one of its forks).
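On the library side, the flag that asks for the native CA store is, as far as I know, the CURLSSLOPT_NATIVE_CA bit for CURLOPT_SSL_OPTIONS; the new command line options simply expose it. A hedged sketch:

#include <curl/curl.h>

/* Sketch: ask libcurl to use the operating system's "native" CA store
   (the Windows store, when built with OpenSSL or one of its forks).
   The URL is just an example. */
int main(void)
{
  CURL *curl = curl_easy_init();
  if(curl) {
    curl_easy_setopt(curl, CURLOPT_URL, "https://example.com/");
    curl_easy_setopt(curl, CURLOPT_SSL_OPTIONS, (long)CURLSSLOPT_NATIVE_CA);
    curl_easy_perform(curl);
    curl_easy_cleanup(curl);
  }
  return 0;
}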

curl: add --trace-ids

This option makes the trace log files include connection and transfer identifiers, which greatly helps when debugging applications doing many (parallel) transfers.

CURLOPT_MAIL_RCPT_ALLOWFAILS replaces CURLOPT_MAIL_RCPT_ALLLOWFAILS

Provide the option without the typo!
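A hedged sketch of using the correctly spelled option in an SMTP transfer; the server and addresses are made up:

#include <curl/curl.h>

/* Sketch: let an SMTP transfer continue even if some RCPT TO
   recipients are rejected by the server. */
int main(void)
{
  CURL *curl = curl_easy_init();
  if(curl) {
    struct curl_slist *rcpt = NULL;
    rcpt = curl_slist_append(rcpt, "good@example.com");
    rcpt = curl_slist_append(rcpt, "possibly-bad@example.com");

    curl_easy_setopt(curl, CURLOPT_URL, "smtp://mail.example.com");
    curl_easy_setopt(curl, CURLOPT_MAIL_FROM, "<sender@example.com>");
    curl_easy_setopt(curl, CURLOPT_MAIL_RCPT, rcpt);
    curl_easy_setopt(curl, CURLOPT_MAIL_RCPT_ALLOWFAILS, 1L); /* new name */

    curl_easy_perform(curl);
    curl_slist_free_all(rcpt);
    curl_easy_cleanup(curl);
  }
  return 0;
}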

add --haproxy-clientip flag to set client IPs

Users of the tool (and library) can now pass on a specific IP address instead of simply using the current one.
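For library users, I believe the corresponding option is CURLOPT_HAPROXY_CLIENT_IP, used together with the existing HAProxy PROXY protocol option. A hedged sketch (the URL and address are made up):

#include <curl/curl.h>

/* Sketch: send the HAProxy PROXY protocol header, but report a chosen
   client IP instead of the connection's own source address. */
int main(void)
{
  CURL *curl = curl_easy_init();
  if(curl) {
    curl_easy_setopt(curl, CURLOPT_URL, "https://example.com/");
    curl_easy_setopt(curl, CURLOPT_HAPROXYPROTOCOL, 1L);
    curl_easy_setopt(curl, CURLOPT_HAPROXY_CLIENT_IP, "192.0.2.1");
    curl_easy_perform(curl);
    curl_easy_cleanup(curl);
  }
  return 0;
}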

add CURLINFO_CONN_ID and CURLINFO_XFER_ID

Two options that allow the application to extract the connection and transfer “id” of the current transfer, for example from a debug callback and the like.
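A hedged sketch of reading the two ids from within a debug callback and prefixing the trace output with them:

#include <stdio.h>
#include <curl/curl.h>

/* Sketch: fetch the connection and transfer ids inside a debug
   callback so trace lines can be tagged with them. */
static int debug_cb(CURL *handle, curl_infotype type, char *data,
                    size_t size, void *clientp)
{
  curl_off_t conn_id = -1;
  curl_off_t xfer_id = -1;
  (void)clientp;
  curl_easy_getinfo(handle, CURLINFO_CONN_ID, &conn_id);
  curl_easy_getinfo(handle, CURLINFO_XFER_ID, &xfer_id);
  if(type == CURLINFO_TEXT)
    fprintf(stderr, "[xfer %ld conn %ld] %.*s",
            (long)xfer_id, (long)conn_id, (int)size, data);
  return 0;
}

int main(void)
{
  CURL *curl = curl_easy_init();
  if(curl) {
    curl_easy_setopt(curl, CURLOPT_URL, "https://example.com/");
    curl_easy_setopt(curl, CURLOPT_DEBUGFUNCTION, debug_cb);
    curl_easy_setopt(curl, CURLOPT_VERBOSE, 1L);
    curl_easy_perform(curl);
    curl_easy_cleanup(curl);
  }
  return 0;
}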

Bugfixes

We have again fixed more than a hundred problems in this release cycle. Here follows a subset that I suspect might be among the most interesting ones.

examples: we’ve added and extended numerous

The ambition is to gradually over time provide examples that show use of all curl_easy_setopt options. We are still way off from that.

http2: numerous smaller and larger fixes

Several regression fixes and cleanups improve how HTTP/2 works compared to previous releases.

http2: send HEADER and DATA together

When sending POST requests, libcurl now does a better job of putting the initial outgoing HEADER and DATA frames together, most likely in the same TLS frame.

http3: upload EAGAIN handling

EAGAIN handling for HTTP/3 uploads was fixed, like it was for HTTP/2 as well.

http: fix the outgoing Cookie: header length check

The check that would prevent too long outgoing cookie headers was off by up to a few hundred bytes.

libssh2: use custom memory functions (again)

Bring back the use of custom memory functions with libssh2, as otherwise it actually cannot be used with a debug build of curl (or when libssh2 is used as a DLL on Windows) due to naive presumptions in the libssh2 API.

runtests: many improvements, leading to -j

Introducing parallel tests.

sectransp: fix EOF handling

A regression caused curl to misbehave on end of connection over TLS when built to use Secure Transport.

timeval: use CLOCK_MONOTONIC_RAW if available

For platforms with this clock option, curl now prefers it in an effort to avoid a time that can go backwards.
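A minimal sketch of the general idea (not curl’s exact timeval code): prefer the raw monotonic clock when the platform defines it and fall back to the ordinary monotonic clock otherwise.

#include <stdio.h>
#include <time.h>

/* Sketch: pick CLOCK_MONOTONIC_RAW when available, since it is not
   subject to NTP adjustments; use CLOCK_MONOTONIC otherwise. */
static void get_now(struct timespec *ts)
{
#ifdef CLOCK_MONOTONIC_RAW
  if(clock_gettime(CLOCK_MONOTONIC_RAW, ts) == 0)
    return;
#endif
  clock_gettime(CLOCK_MONOTONIC, ts);
}

int main(void)
{
  struct timespec ts;
  get_now(&ts);
  printf("%ld.%09ld seconds since an arbitrary point\n",
         (long)ts.tv_sec, (long)ts.tv_nsec);
  return 0;
}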

tool_writeout_json: fix encoding of control characters

The output of control codes in the generated JSON with --json now works better.

urlapi: have *set(PATH) prepend a slash if one is missing

Setting a path using the URL API without a leading slash would previously generate a broken URL when it was extracted. Starting now, libcurl will prepend a slash if there is none.
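A small sketch with the URL API showing the new behavior; the host and path are of course just examples:

#include <stdio.h>
#include <curl/curl.h>

/* Sketch: set a path that lacks a leading slash, then extract the full
   URL again. As of 8.2.0 libcurl prepends the missing slash. */
int main(void)
{
  CURLU *u = curl_url();
  char *full = NULL;
  curl_url_set(u, CURLUPART_URL, "https://example.com/", 0);
  curl_url_set(u, CURLUPART_PATH, "no/leading/slash", 0);
  if(curl_url_get(u, CURLUPART_URL, &full, 0) == CURLUE_OK) {
    printf("%s\n", full);  /* https://example.com/no/leading/slash */
    curl_free(full);
  }
  curl_url_cleanup(u);
  return 0;
}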

urlapi: scheme must start with alpha

The URL parser would previously allow a few other characters to start a scheme as well. No more.

tool_parsecfg: accept line lengths up to 10M

The config file parser now allows lines to be up to 10 megabytes, for those odd users who generate config files with huge data components embedded.

curl user survey 2023 analysis

The curl user survey 2023 ran for two full weeks at the end of May, in the same fashion we run it every year.

I have since collected all the answers, run the numbers, looked at the trends and put all the conclusions and graphs into a single document for everyone to enjoy.

Five quick things

If you are in too much of a hurry to read it all, here are five key facts this year’s survey revealed:

  1. curl users leave Twitter and join Mastodon in notable amounts
  2. Windows 11 is growing quickly as a platform curl users are on
  3. HTTP/3 is used by a quarter of all curl users
  4. WebSocket reached the top-10 of most used protocols before its first birthday
  5. The positive comments in section 21 are heart-warming

The document

The final document is a 3 MB, 36-page PDF with collected data and conclusions. You find it here:

curl-user-survey-2023-analysis

Enjoy!

Video presentation

I will do a dedicated live-streamed video presentation of this curl user survey 2023 analysis and talk about how I see the numbers, the trends and maybe also show some additional data that was left out from the final document.

Previous years

This is the 10th year we run the survey. Here are links to five previous analysis documents:

2022, 2021, 2020, 2019, 2018

WebSocket with libcurl webinar

June 15, 2023 10:00 AM PDT (19:00 CEST, 17:00 UTC)

Register Here

This will be an overview by Daniel Stenberg of the new WebSocket support in curl and in particular how to use this API with libcurl in your applications. It is followed by a live Q&A.

The session will be recorded and made available after the fact.

This webinar is done as a registration-only event on Zoom. If this is problematic for you, there will be a separate second version of this webinar done over Twitch at a later date.

NVD damage continued

There is something about having your product installed in over twenty billion instances all over the world and even beyond it. In my case it helps me remain focused on and committed to working on the security aspects of curl. Ideally, we will never have our heartbleed moment.

Security is also a generally growing concern in the world around us and Open Source security perhaps especially so. This is one reason why NVD making things up is such a big problem.

The National Vulnerability Database (NVD) has a global presence. They host and share information about security vulnerabilities. If you search for a CVE Id using your favorite search engine, it is likely that the first result you get is a link to NVD’s page with information about that specific CVE. They take it upon themselves to educate the world about security issues. A job that certainly is needed, but also one that puts a responsibility and requirement on them to be accurate. When they get things wrong, they help distribute misinformation. Misinformation potentially makes people draw the wrong conclusions or act in wrong, incomplete or exaggerated ways.

Low or Medium severity issues

There are well-known, recognized and reputable Open Source projects who by policy never issue CVEs for security vulnerabilities they rank as low or medium severity. (I will not identify such projects here because it is not the point of this post.)

Such a policy successfully avoids the risk that NVD will greatly inflate their issues since they can already only be high or critical. But is it helping the users and the ecosystem at large?

In the curl project we have a policy which makes us register a CVE for every single reported or self-detected problem that can have a security impact, whether it was introduced on purpose or by mistake. This includes a fair amount of low and medium issues. The share of low and medium issues among all published issues increases over time as we keep finding problems, while the really bad ones are reported less frequently.

As we have all data recorded and stored, we can visualize this development over time. Below is a graph showing the curl vulnerability and severity level trends since 2010.

Severity distribution in reported curl vulnerabilities since 2010

Out of the 145 published curl security vulnerabilities so far, 28% have been rated high or critical severity, while 104 of them were set to low or medium by the curl security team.

I think this trend is easy to explain. It is because of two separate developments:

  1. We as a project have matured and have learned over time how to test better and write code in ways that minimize risk, and we have existed long enough for a series of truly bad flaws to have already been found (and fixed). We introduce fewer serious bugs these days.
  2. Since 2010, many more people look for security problems and these days we are much better at identifying problems as security related and we have better tools, whereas a few years ago the same problem would just have become “a bug fix”.

Deciding severity

When a security problem is reported to curl, the curl security team and the reporter collaborate. First to make sure we understand the full width of the problem and its security impact. What can happen and what is required for that badness to trigger? Further, we assess the likelihood that this can be done on purpose or by mistake, and how common those situations and required configurations might be. We know curl, we know the code, but we also often go back and double-check exactly what the documentation says and promises, to better assess what users should be expected to know and do, and what is not expected from them, etc. And we re-read the involved code again and again.

curl is currently a little over 160,000 lines of feature-packed C code (excluding blank lines). It might not always be straightforward to a casual observer exactly how everything is glued together, even if we try to also document internals to help you find deeper knowledge.

I think it is fair to say that it requires a certain amount of experience and time spent with the code to be able to fully understand a curl security issue and what impact it might have. I believe it is difficult or next to impossible for someone without knowledge about how it works to just casually read our security advisories, second-guess our assessments and make their own.

Yet this is exactly what NVD does. They don’t even ask us for help or for clarifications of anything. They think they can assess the severity of our problems without knowing curl nor fully understanding the reported issues.

A case to prove my point

In March 2023 we published a security advisory for the problem commonly referred to as CVE-2023-27536.

This is potentially a security problem, probably never hurts anyone and is in fact quite unlikely to ever cause a problem. But it might. So after deliberating we accepted it and ranked it severity low.

Bear with me here. I’ll spend two paragraphs revealing some details from the internal libcurl engine:

The problem is of a kind we have had several times in the past: curl has a connection pool, and when a user makes a subsequent request with this particular option modified (compared to how it was set when the previous connection was established), curl would wrongly reuse the first connection, thinking the two had the exact same properties.

The second transfer would then accidentally get the wrong rights because the connection was set up differently. Still, the first connection would need correct credentials and everything, and so would the second one; they would just differ in what “GSSAPI delegation” is allowed.

NVD ranks this

The person or team at NVD whose job it is to make up stuff for security vulnerabilities ranked this as CRITICAL 9.8. Almost as bad as it gets apparently. 10 is the max as you might recall.

When realizing this, at the end of May, I first fell off my chair in shock at this insanity, but after a quick recovery I emailed them (again) and complained (yet again) about setting this severity for *27536. I used the word “ridiculous” in my email to describe their actions. Who benefits from them scaremongering the world like this, and why? It makes no sense. On the contrary, this is bad for everyone.

As a reaction to my complaint, someone at NVD went back and agreed to revise the CVSS string they had set and suddenly it was “only” ranked HIGH 7.2. I say “someone” because they never communicate with names and never sign their emails, so I never know whom I am talking to. They are just “NVD”.

I objected to their new CVSS string as well. It is just not a high severity security problem!

In my new argument I changed two particular details in the CVSS string (compared to the one they insisted was good) and presented arguments for that. For your pleasure, I include my exact wording below. (Some emphasis is added here for display purposes.)

How I motivated a downgrade

I could possibly live with: AV:N/AC:H/PR:H/UI:N/S:U/C:H/I:N/A:N (4.4) - even if that means Medium and we argue Low.

These are two changes and my motivations:

Attack complexity high - because this requires that you actually have a working first communication and then do a second one that is slightly changed; you would expect the second to be different, but in reality it accidentally reuses the first connection and therefore gives different/elevated rights.

It is a super-niche and almost impossible attack and there has been no report ever of anyone having suffered from this or even the existence of an application that actually would enable it to happen.

It is more likely to only happen by mistake by an application, but it also seems unlikely to ever be used by an application in a way that would trigger it, since having the same user credentials with different values for GSS delegation and assuming different access levels seems … weird.

This almost impossible chance of occurring is the primary reason we think this is a Low severity. With CVSS, it seems impossible to reach Low.

Privileges Required high - because the only way you can trigger this flaw is by having full privileges for the *same* user credentials that are later used again, but with a changed GSS API delegation setting. While the previous connection is still live in the connection pool.

It would also only be an attack or a flaw if that second transfer actually assumes it has different access properties, which is debatable whether users of the API would expect or not.

CVSS still sucks

CVSS is a crap system, so using this single-dimension number it seems next to impossible to actually get a Low severity report.

NVD wants “public sources”

NVD does not just take my word for how curl works. I mean, I only wrote a large chunk of it and am probably the single human that knows most about its internals and how it works. I also wrote the patch for this issue, I wrote the connection pool logic and I understand the problem exactly. Nope, just because I say so does not make it true.

My claims above about this issue can of course be verified by reading the publicly available source code and you can run tests to reproduce my claims. Not to mention that the functionality in question is documented.

But no.

They decided to agree to one of my proposed changes, which further downgraded the severity to MEDIUM 5.9. Quite far away from their initial stance. I think it is at least a partial victory.

For the second change to the CVSS string I requested, they demanded that I provide more information for them. In their words:

There is no publicly available information about the CVE that clarifies your statement so we must request clarification from you and additionally have this detail added to the HackerOne report or some other public interface for transparency purposes prior to making changes to the CVSS vector.

… which just emphasizes exactly what I have stated already in this post. They set a severity on this without understanding the issue, with no knowledge of the feature that gets this wrong and without a clue about what is actually necessary to trigger this flaw in the first place.

For people intimately familiar with curl internals, we actually don’t have to spell out all these facts in excruciating detail. We know how the connection pool works, how the reuse of connections should work and what it means when curl gets it wrong. We have also had several other issues in this area in the past. (It is a tricky area to get right.)

But it does not make this CVE more than a Low severity issue.

Conclusion

This issue is now stuck at MEDIUM 5.9 at NVD. Much less bad than where they started. Possibly the difference between Low and Medium does not matter much out there in the world.

I think it is outrageous that I need to struggle and argue for such a big and renowned organization to get this right. I cannot do this for every CVE we have reported because it takes serious time and energy, but at the same time I have zero expectation of them getting this right on their own. I can only assume that they are equally lost and bad when assessing security problems in other projects as well.

A completely broken and worthless system. That people seem to actually use.

It is certainly tempting to join the projects that do not report Low or Medium issues at all. If we stopped doing that, at least NVD could not cry wolf and foolishly claim they are critical.

My response

That is a ridiculous request.

I'm stating *verifiable facts* about the flaw and how curl is vulnerable to it. The publicly available information this is based on is the actual source code which is openly available. You can also verify my claims by running code and checking what happens and then you'd see that my statements match what the code does.

The fact that you assess the severity of this (and other) CVE without understanding the basic facts of how it works and what the vulnerability is, just emphasizes how futile your work is: it does not work. If you do not even bother to figure these things out then of course you cannot set a sensible severity level or CVSS score. Now I understand your failures much better.

We in the curl project's security team already know how curl works, we understand this vulnerability and we set the severity accordingly. We don't need to restate known facts. curl functionality is well documented and its source code has always been open and public.

If you have questions after having read that, feel free to reach out to the curl security team and we can help you. You reach us at security@curl.se

I recommend that you (NVD) always talk to us before you set CVSS scores for curl issues so that we can help guide you through them. I think that could make the world a better place and it would certainly benefit a world of curl users who trust the info you provide.

 / Daniel

Games curl too

Several years ago, when someone pointed out to me that curl was credited in the ending sequence of the megahit game Grand Theft Auto V, I got a brief moment of acknowledgement from my kids that I might be doing cool stuff, before they forgot and moved on.

GTA V ending sequence

Later I would find curl credits in more games and it started to become somewhat of a pattern. I collect screenshotted curl credits and awesome people help me point them out. Many of them are from games.

Finding out that a game uses curl is rarely straightforward. They virtually never tell us beforehand. Many list the curl license somewhere, sometimes you can find the DLL, and some actually include a mention in on-screen credits displays.

Fortnite uses libcurl
PUBG: Battlegrounds uses libcurl
ROBLOX uses libcurl

As this little subset shows, some of the most popular games in the world use libcurl. We rarely get to know exactly for what purpose they use libcurl, so we are left to guesses and assumptions. Modern games do a fair amount of internet transfers and what better library to do that with?

Game consoles bundle it

libcurl is also shipped as an OS component in several game consoles.

Nintendo Switch
Valve Steam Deck
Sony Playstation 5
Microsoft Xbox 360

List of curl credits

libcurl (often mentioned as plain curl) has been frequently used in games for almost twenty years already. Doom 3 from 2004 uses libcurl, just as well as Diablo IV, released just days ago in June 2023.

Doom 3 from 2004 used libcurl
Diablo IV from 2023 uses libcurl

The site mobygames.com maintains a database of people getting credited in games. The entry for me, since my name is the one used in the curl license, right now lists me (curl, really) as credited in no fewer than 136 games – and the two games listed immediately above are not even mentioned there, so there are reasons to suspect others are missing as well. Also, as mentioned before, the fact that a game uses curl is sometimes a well hidden secret.

What for?

I have not been closely involved with any of the makers of these games. I don’t have insights or special knowledge about exactly what they use libcurl for. Whatever download or upload purposes they have I guess.

Why curl?

Because curl is Capable, Ubiquitous, Reliable and Libre. It knows how to transfer data in many different ways, is feature packed and performs well. It runs everywhere, so the API works on all platforms. It has proven itself stable and solid for decades without breaking APIs or ABIs. Being available cheaply (at no purchase cost, at least) is probably also a strong contributing factor, combined with the others.

Parallel curl tests

The curl test suite was born in November 2000. We wrote our own custom system, dedicated for us.

In May 2001 we changed the file format for individual tests and this is still the format we use today. During the twenty-two years that have passed we have added some 1600 test cases to the collection. We make sure that they can run on virtually any platform and that each test case itself specifies which curl features it requires to work, so that builds with those features disabled can skip those tests.

Only a thorough test suite provides the confidence we need to promise users that we keep existing behaviors, while we still can and do repeatedly rewrite, refactor and replace large chunks of the internals.

Synchronous in a single thread

In 2000 we all had single cores and single CPUs. We made the test suite run the tests one by one, in a serial fashion. Some are quick, some take a little longer. While CPUs certainly have grown significantly faster over the lifetime of curl, the number of test cases has also grown.

Today, on my fast modern machine, running all test cases in the main test suite takes about 10 minutes. If we run them with valgrind enabled (it then invokes all curl related commands and functions with the valgrind tool to monitor that it doesn’t do any serious memory violations or leaks), the same process takes close to 30 minutes.

This might not sound terribly bad, but it is also not unusual to run the tests on slower machines that take two or maybe even five times longer to complete. If you want to run the tests on a few different build combinations to make sure they are all happy, you may need to rerun the set a number of times. It all adds up.

This is a rather ineffective use of time and available system resources. In researching and measuring the current state of curl testing, Dan Fandrich figured out that in a normal test round the CPU is idle 80% of the time! And that’s just one core.

Illustration from Dan’s “curl Parallel Testing Proposal”

Going parallel

In March 2023, Dan brought his curl Parallel Testing proposal (11 page PDF) to us, outlining an idea on how to convert the current single-threaded serial test runner into one that runs many separate worker processes and can run several test cases in parallel.

The general idea being that even on a single-core machine, running tests in parallel has the chance to speed up the process a lot. Because of that 80% number if nothing else.

Most (curl) developers of course also have machines with several or even many cores, making parallelism an even better idea.

We all loved the idea, gave Dan our thumbs up and arranged to fund his work on this improvement.

Port numbers

curl does Internet transfers, and for testing curl we have a set of test servers implemented that curl can talk to and get responses back from. The specific tests control exactly how these servers respond and act for each test, to make sure that curl speaks the protocols correctly and consistently in both good and bad situations.

A challenge with this is that the test suite actually has to fire up and run actual networking servers on the local machine for this purpose. Each such server has to listen to a dedicated TCP or UDP port for as long as the tests are still going.

Luckily, we reworked the port number use for test servers recently. Using fixed port numbers for test servers was problematic already with single threaded tests because you could not run a separate test case in a different shell on the same machine etc. They would also sometimes collide with other random services running on developers’ machines.

Since August 2020 all test servers listen on random port numbers. A fundamental criterion for being able to run tests in parallel.

Landed

After a lot of hard work to refactor the test internals, the test suite can now fire up N worker processes, where each such process runs its own set of test servers, while the main scheduler hands out test cases to all of the workers and collects and outputs the test results from all of them. On June 5, Dan merged the commits to master that made it possible for all of us to start test (!) driving this.

First impressions

Dan recommends maybe 7 workers per core, but the number might be limited by how much system memory you have, since every such worker might end up running a fairly large number of test servers. It also depends on whether you run the tests with or without valgrind.

I ran a first simple test shot on my machine using 80 workers. A full valgrind enabled round with 1606 tests completed in 87 seconds. That is more than twenty times faster than previously.

Some further polish needed

There are still some issues left that make the parallel test setup a little shakier than the normal serial style, so we do not yet enable this by default for people. We will work on fixing those issues and iron out the last wrinkles so that we can soon get everyone onboard on this.

But man, this is a good step forward!

How?

make -sj                 # build curl and libcurl
cd tests
make -sj                 # build the test servers and helper tools
./runtests.pl -j80       # run the test suite with 80 parallel workers

curl 8.1.2 ate one too

This is the second follow-up patch release in the 8.1.x series due to regressions and bugs that are too annoying to leave lingering around.

Release video

Numbers

the 219th release
0 changes
7 days (total: 9,202)

14 bug-fixes (total: 9,045)
22 commits (total: 30,429)
0 new public libcurl function (total: 91)
0 new curl_easy_setopt() option (total: 302)

0 new curl command line option (total: 251)
13 contributors, 3 new (total: 2,888)
5 authors, 2 new (total: 1,150)
0 security fixes (total: 145)

Bugfixes

configure: quote the assignments for run-compiler

A regression introduced in the previous release made configure fail if the $CC shell variable was set to something other than just a single command name. This now quotes the variable correctly.

configure: without pkg-config and no custom path, use -lnghttp2

Installations without pkg-config where nghttp2 is installed in a default directory would get a link error in the build.

http2: fix EOF handling on uploads with auth negotiation

This was a regression when using HTTP/2 for doing multi-phase authentication methods with POST, like for example Digest.

http3: send EOF indicator early as possible

By better tracking the amount of upload data, curl can avoid a superfluous final zero-length DATA packet and instead send the EOF sooner.

libcurl.m4: remove trailing ‘dnl’ that causes this to break autoconf

The configure macro we ship for other projects to use to detect installed libcurl version now works better.

libssh: when keyboard-interactive auth fails, try password

When an SSH server allows multiple auth methods and curl tried keyboard-interactive, it would wrongly skip trying the password method – if built to use libssh. This bug had been present ever since libssh support shipped.

The Gemini protocol seen by this HTTP client person

There is again a pull-request submitted to the curl project to bring support for the Gemini protocol. It seems like a worthwhile effort that I support, even if there is a lot of work involved and it might take some time before it reaches a state in which it can be merged. A previous attempt at doing this was abandoned a while ago.

This renewed interest made me take a fresh tour through the current Gemini protocol spec and I decided to write down some observations for you. So here I am. These are comments based on my reading of the 0.16.1 version of the protocol spec. I have implemented Internet application protocols client side for some thirty years. I have not actually implemented the Gemini protocol.

Motivations for existence

Gemini is the result of a kind of movement that tries to act against some developments they think are wrong on the current web. Gemini is not only a new wire protocol, but also features a new document format and more. They also say it is not “the web” at all but a new thing. As a sign of this, the protocol is designed by the pseudonymous “Solderpunk” – and the IETF or other suitable or capable organizations have not been involved – and it shows.

Counter surveillance

Gemini has no cookies, no negotiations, no authentication, no compression and basically no (other) headers either in a stated effort to prevent surveillance and tracking. It instead insists on using TLS client certificates (!) for keeping state between requests.

A Gemini response from a server is just a two-digit response code, a single media type and the binary payload. Nothing else.
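Based only on my reading of the spec, the entire application layer exchange can be sketched like this (the host and file name are made up):

/* Rough sketch of a Gemini exchange as seen on the wire, inside TLS.
   The request is just the URL plus CRLF; the response is one status
   line followed by the raw body, ended by closing the connection. */
static const char request[] =
  "gemini://example.org/hello.gmi\r\n";     /* at most 1024 bytes */

static const char response[] =
  "20 text/gemini\r\n"                      /* <STATUS> <META> CRLF */
  "# Hello\n"
  "This is the whole payload.\n";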

Reduce complexity

They argue that the reduced complexity enables more implementations, both servers and clients, and that seems logical. The reduced complexity however also makes it less visually pleasing to users, and by taking shortcuts in the protocol, it risks adding complexities elsewhere instead. It is quite similar to going back to GOPHER.

Form over content

This value judgement is repeated among Gemini fans. They think “the web” favors form over content and they say Gemini intentionally is the opposite. It seems to be true because Gemini documents certainly are never visually very attractive. Like GOPHER.

But of course, the protocol is also so simple that it lacks the power to do a lot of things you can otherwise do on the web.

The spec

The only protocol specification is a single fairly short page that documents the over-the-wire format mostly in plain English (undoubtedly featuring interpretation conflicts), includes the URL format specification (very briefly) and oddly enough also features the text/gemini media type: a new document format that is “a kind of lightweight hypertext format, which takes inspiration from gophermaps and from Markdown“.

The spec says “Although not finalised yet, further changes to the specification are likely to be relatively small.” The protocol itself however has no version number or anything and there is no room for doing a Gemini v2 in a forward-compatible way. This way of a “living document” seems to be popular these days, even if rather problematic for implementers.

Gopher revival

The Gemini protocol reeks of GOPHER and HTTP/0.9 vibes. Application protocol style anno mid 1990s with TLS on top. Designed to serve single small text documents from servers you have a relation to.

Short-lived connections

The protocol enforces closing the connection after every response, making connection reuse impossible. This is terrible for performance if you ever want to get more than one resource from a server. I also presume (but there is no mention of this in the spec) that they discourage use of TLS session ids/tickets for subsequent transfers (since they can be used for tracking), making subsequent transfers even slower.

We know from HTTP, and it was a primary reason for the introduction of HTTP/1.1 back in 1997, that doing short-lived bursty TCP connections makes it almost impossible to reach high transfer speeds due to the slow-starts. Also, re-doing the TCP and TLS handshakes over and over could be seen as a plain energy waste.

The main reason they went with this design seems to be to avoid having a way to signal the size of payloads or do some kind of “chunked” transfers. Easier to document and to implement: yes. But also slower and more wasteful.

Serving an average HTML page using a number of linked resources/images over this protocol is going to be significantly slower than with HTTP/1.1 or later. Especially for servers far away. My guess is that people will not serve “normal” HTML content over this protocol.

Gemini only exists over TLS. There is no clear-text version.

GET-only

There are no other methods or ways to send data to the server besides the query component of the URL. There are no POST or PUT equivalents. There is basically only a GET method. In fact, there is no method at all; it is just implied to be “GET”.

The request is also size-limited to a 1024 byte URL so even using the query method, a Gemini client cannot send much data to a server. More on the URL further down.

Query

There is a mechanism for a server to send back a single-line prompt asking for “text input” which a client then can pass to it in the URL query component in a follow-up request. But there is no extra meta data or syntax, just a single line text prompt (no longer than 1024 bytes) and free form “text” sent back.

There is nothing written about how a client should deal with an existing query part in this situation, like if you want to send a query and answer the prompt, or how to deal with the fact that the entire URL, including the now added query part, still needs to fit within the URL size limit.

Better use a short host name and a short path name to be able to send as much data as possible.

TOFU

the strongly RECOMMENDED approach is to implement a lightweight “TOFU” certificate-pinning system which treats self-signed certificates as first-class citizens.

(From the Gemini protocol spec 0.16.1 section 4.2)

Trust on first use (TOFU) as a concept works fairly well when you interface with a limited set of servers with which you have some relationship. Therefore it often works fine for SSH, for example. (I say “fine” because even with ssh, people often have the habit of just saying yes and accepting changed keys even when they perhaps should not.)

There are multiple problems with doing TOFU for a client/server document browsing system like Gemini.

A challenge is of course that on the first visit a client cannot spot an impostor, and neither can it when the server updates its certificates down the line. Maybe an attacker did it? It trains users to just say “yes” when asked if they should trust it, since you as a user might not have a clue about who runs that particular server or whatever the reason is why the certificate changed.

The concept of storing certificates to compare against later is a scaling challenge in multiple dimensions:

  • Certificates need to be stored for a long time (years?)
  • Each host name + port number combination has its own certificate. In a world that goes beyond thousands of Gemini hosts, this becomes a challenge for clients to deal with in a convenient (and fast) manner.
  • Presumably each user on a system has their own certificate store. What user A trusts, user B does not necessarily have to trust.
  • Does each Gemini client keep its own certificate store? Do they share? Who can update? How do they update the store? What’s the file format? A common db somehow?
  • When storing the certificates, you might also want to do like modern SSH does: not store the host names in cleartext as it is a rather big privacy leak showing exactly which servers you have visited.

I strongly suspect that many existing Gemini clients avoid this huge mess by simply not verifying the server certificates at all or by just storing the certificates temporarily in memory.

You can opt to store a hash or fingerprint of the certificate instead of the whole one, but that does not change things much.

I think insisting on TOFU is one of Gemini’s weakest links and I cannot see how this system can ever scale to a larger audience or even just many servers. I foresee that they need to accept Certificate Authorities or use DANE in the future.

Gemini Proxying

Insisting on passing the entire URL in the requests is primarily a way to solve name-based virtual hosting, but it also makes it easy for a Gemini server to act as a proxy for other servers. On purpose. And maybe I should write “easy”.

Since Gemini is (supposed to be) end-to-end TLS, proxying requests to another server is not actually possible while also maintaining security. The proxy would for example have to respond with the certificate retrieved from the remote server (in addition to its own), but the spec mentions nothing of this, so we can guess existing clients and proxies don’t do it. I think this can be fixed by just adjusting the spec, but it would add some rather kludgy complexity for a maybe not too exciting feature.

Proxying to gopher:// URLs should be possible with the existing wording because there is no TLS to the server end. It could also proxy http:// URLs, but risks having to download the entire thing first before it can send the response.

URLs

The Gemini URL scheme is explained in 138 words, which is of course very terse and assumes quite a lot. It includes “This scheme is syntactically compatible with the generic URI syntax defined in RFC 3986“.

The spec then goes on to explain that the URL needs to be UTF-8 encoded when sent over the wire, which I find peculiar because a normal RFC 3986 URL is just a set of plain octets. A Gemini client thus needs to know or assume the charset that was used for the original URL in order to convert it to UTF-8.

Example: if there is a %C5 in the URL and the charset was ISO-8859-1, that octet means LATIN CAPITAL LETTER A WITH RING ABOVE. The UTF-8 version of that character is the two-byte sequence 0xC3 0x85. But if the original charset instead was ISO-8859-6, the same %C5 octet means ARABIC LETTER ALEF WITH HAMZA BELOW, encoded as 0xD8 0xA5 in UTF-8.

To me this does not rhyme well with reduced complexity. This conversion alone will cause challenges when done in curl because applications pass an RFC 3986 URL to the library and it does not currently have enough information on how to convert that to UTF-8. Not to mention that libcurl completely lacks UTF-8 conversion functions.

This makes me suspect that the intention is probably that only the host name in the URL should be UTF-8 encoded for IDN reasons and the rest should be left as-is? The spec could use a few more words to explain this.

One of the Gemini clients that I checked out to see how they do this, in order to better understand the spec, even uses the punycode version of the host name, quoting “Pending possible Gemini spec change”. What is left to UTF-8 encode then? That client did not UTF-8 encode anything of the URL, which adds to my suspicion that people don’t actually follow this spec detail but rather just interoperate…

The UTF-8 converted version of the URL must not be longer than 1024 bytes when included in a Gemini request.

The fact that the URL size limit is for the UTF-8 encoded version of the URL makes it hard to error out early because the source version of the URL might be shorter than 1024 bytes only to have it grow past the size limit in the encoding phase.

Origin

The document carelessly treats “host name” as a good authority boundary for TLS client certificates, totally ignoring the fact that “the web” learned this lesson a long time ago. It needs to restrict it to the host name plus port number. Not doing that opens up Gemini to rather bad security flaws. This can be fixed by improving the spec.

Media type

The text/gemini media type should simply be moved out of the protocol spec and be put elsewhere. It documents content that may or may not be transferred over Gemini. Similarly, we don’t document HTML in the HTTP spec.

Misunderstandings?

I am fairly sure that once I press publish on this blog post, some people will insist that I have misunderstood parts or most of the protocol spec. I think that is entirely plausible and kind of my point: the spec is written in such an open-ended way that it will not avoid this. We basically cannot implement this protocol by only reading the spec.

Future?

It is impossible to tell if this will fly for real or not. This is not a protocol designed for the masses to replace anything at high volumes. That is of course totally fine and it can still serve its community perfectly fine. There seems to be interest enough to keep the protocol and ecosystem alive for the moment at least. Possibly for a long time into the future as well.

What I would change

As I believe you might have picked up by now, I am not a big fan of this protocol but I still believe it can work and serve its community. If anyone would ask me, here are a few things I would consider changing in order to take it up a few notches.

  1. Split the spec into three separate ones: protocol, URL syntax, media type. Expand the protocol parts with more exact syntax descriptions and examples to supplement the English.
  2. Clarify the client certificate use to be origin based, not host name.
  3. Drop the TOFU idea, it makes for a too weak security story that does not scale and introduces massive complexities for clients.
  4. Clarify the UTF-8 encoding requirement for URLs. It is confusing and possibly bringing in a lot of complexity. Simplify?
  5. Clarify how proxying is actually supposed to work in regards to TLS and secure connections. Maybe drop the proxy idea completely to keep the simplicity.
  6. Consider a way to re-use connections, even if that means introducing some kind of “chunks” HTTP-style.

Discussion

Hacker news

curl user survey 2023

For a widely used, widely distributed open source project such as curl, we often have little to no relationship at all with our users, and therefore it is hard to get feedback and learn what works and what is less good.

Our best and primary way is thus simply to ask users every year how they use curl.

user survey

For the tenth consecutive year, we have put together a survey and we ask everyone we know and can reach who has used curl or the library within the last year to donate a few minutes of their precious time and give us their honest opinions.

The survey is anonymous but hosted by Google. We do not care who you are, but we want to know how you think curl works for you.

The survey will remain online for submissions for 14 days: from Thursday May 25 2023 until midnight (CEST) Wednesday June 7 2023. Please tell your friends about it!


Post survey analysis

On June 5, the painstaking work of analyzing the results and putting together a summary and presentation begins. It usually takes me a few weeks to complete. Once that is done, the results will be shared for the entire world to enjoy.

Then we see what the curl project should take home and do as a direct result of what users say. Updating procedures, writing documentation and adding features to the roadmap are among the things that can happen, and have happened, after previous surveys.


Polhemsrådet

I was invited, and I have accepted, to become a member of Polhemsrådet, the “Polhem Council”, which works as the Polhem Prize nomination committee and serves to appoint the award winners.

I consider it a great honor to get to serve on this board. I am not an engineer by education, but I do know my way around a few engineering topics, in particular things around software and computer related technologies.

This assignment is done on a voluntary basis, there is no money involved. I am joining a council chock-full of intimidatingly impressive people as its seventh member.

The Polhem Prize, which I was awarded in 2017, is Sweden’s oldest engineering award. It was first awarded in 1878.

The Polhem Prize is awarded for “a [Swedish] high-level technical innovation or an ingenious solution to a technical problem. The innovation must be available on the open market and be competitive. It has to be sustainable and environmentally friendly.”

More details about the prize, how it works and other council members can be found on the Swedish site for Polhemspriset.