
curl’s built-in manual without nroff

On December 14 1998 we released curl 5.2.

The project was still early back then and lots of things had not settled yet. In that release, which came only two weeks after 5.1, we introduced the --manual option, or -M for short.

Long before I started working on curl I learnt to value and appreciate Unix manpages. I more or less learned C programming using them, and I certainly learned my first ways around Unix shells and command lines reading manpages. The first Unix I spent a lot of time on was AIX, in the early 1990s, several years before I first used Linux.

Since some systems don’t have the fine concept of manpages, I decided I would help those users by bundling the curl man page into the tool itself. You can ask curl to show its own manpage with the -M option: the entire thing, looking very similar to the original, mostly just lacking font details such as bold, italics and underline.

How do you bundle a manpage?

I suppose there are many ways to go about making such a thing happen. In our case, we were already making and shipping a manpage in the nroff manpage format, so it became a question of generating a text version using that page as a source and then converting the text version into C source code.
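The end result of that conversion is conceptually simple: the manual text ends up stored as string data in a generated C source file that the tool can print on demand. A minimal sketch of the idea (the symbol names and layout here are made up for illustration; the actual generated file in curl looks different):

/* Hypothetical sketch of a generated "embedded manual" source file.
   A real generator would emit one string literal per line of the
   rendered text manual. */
#include <stdio.h>

static const char manual_text[] =
  "NAME\n"
  "    curl - transfer a URL\n"
  "\n"
  "SYNOPSIS\n"
  "    curl [options / URLs]\n"
  "\n"
  /* ...thousands of generated lines follow here... */
  ;

void show_manual(void)
{
  fputs(manual_text, stdout);  /* what an option like -M would trigger */
}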

Converting the manpage to text was done with nroff. nroff is an ancient Unix tool that existed on virtually every Unix flavor already back then. It seemed like a no-brainer to go with it, so that is what the curl build system would use.

Once the build scripts were tweaked it continued to just work. It became problematic only on platforms that lacked nroff – but to help smooth over that obstacle we also shipped the generated source file in distribution tarballs.

nroff really?

nroff is a quirky tool. It generates its output differently based on environment details, and over the years it also subtly changed its output several times, which forced us to adjust the scripts as well.

Still, for as long as the curl manpage was primarily written in nroff format, it was challenging to generate the ASCII version any other way. We stuck with nroff.

Source format change

Earlier this year I blogged about how we finally changed the format of all the curl documentation files that create man pages in the curl project. We switched over to using markdown all over.

Even after that switch we still generated the built-in manual with nroff from the curl.1 manpage, which in turn was created entirely from a large set of source files written in markdown. The manpage itself was generated by our own custom tool.

The time was ripe

With firm control of the input file format, and generating the output entirely with our own tool, it became a viable (and attractive) option to tweak the tool to offer an alternative output format: render the output either as a manpage formatted file or as an ASCII text file, without involving or using nroff.

The time had come. We had suffered long enough. It was time to address this friction in the build system.

Yesterday, I merged the pull-request that finally, after 25 years, 2 months and 21 days, removed the use of nroff from the curl build scripts.

The curl -M output after this change is not 100% identical, but it is close enough and looks very good and similar in style to before. I did not actually even try to make it a complete clone. In fact, when we generate the output directly from markdown instead of going via the manpage, we can make it a better text-only version than we could before.

I opted to keep the justified right margin of the text, because that is what the output has always used, and after some casual initial comparisons I think it looks better than an unaligned right column.

nroff does hyphenation of words, which helps somewhat to make justified text easier and nicer, and our own script does not – at least not until I have figured out a decent way to do it. For example, if the word “variable” is the last word on a line, it could be written as “vari-” at the end of the line and “able” could start the next line. I believe doing it badly is worse than not doing it at all.

Building this is easier

It is now (much) easier to build this from source, even on esoteric platforms like Windows.

I don’t think a single person will miss the old way of doing this.

curl docs format evolution

I trust you have figured out already that I have the highest ambitions for the curl documentation. I want everything documented in a clear, easy-to-read and easy-to-find manner. This takes a lot of work and is not something that happens without effort.

Part of providing best-in-class documentation has in my mind always been to provide and ship man pages. And to make sure that those latest up-to-date man pages are always offered to users as good-looking webpages on the curl site.

Over twenty years ago I decided to render the web versions of the man pages by creating my own tool for that purpose: roffit. We still use it on the site today. It is a simple tool that converts nroff formatted man pages to HTML and adds cross-referencing links to other man pages. There is no concept of links in nroff, but I decided to use a combination of text and markup as a signal, and it has worked out fine ever since. In part of course because I have subsequently made sure to format the curl man pages in this roffit-friendly way.

curl.1

We started writing the curl man pages in the nroff format some time in the beginning of 1999. Mostly because we did not know of or have any tool that would convert nicely into nroff, and later on to make sure we got things formatted the way we wanted. This approach works and can be used, but nroff is not a terribly human-friendly format, and over time we decided that documenting several hundred command line options in a single nroff file was not ideal.

In November 2016 we revamped the system and started to generate the curl.1 man page at build time from multiple source files. Every command line option would then be documented in its own stand-alone file. Easier to read, easier to manage. Yet we controlled the output nroff format so that the web version would still look excellent. The individual files got .d extensions (for document) but were mostly still written in nroff, and were basically all appended to create the result.

Over time, we introduced more convenience formatting aids to the .d files so that we could write them using less and less nroff, and as a result they also grew more readable in their source format.

The .d file format never had its own name. It is just how we documented 250+ command line options for the curl command line tool.

The libcurl docs however remained authored in nroff. In 2014, the number of libcurl man pages exploded when we started doing a separate document for every libcurl option. We went from 59 man pages in the project to 270 between two releases. Since then we have added many more. Today in early 2024, we have a total of 496 individual man pages in the project. 493 of them concern libcurl.

Markdown

When curl started, markdown did not exist – that format was first created in 2004. The early documentation we created in the project that was not nroff, we made using plain text files. It was not until 2014 that we slowly started to convert some of our text documents into markdown.

Markdown has several upsides compared to plain old text. In particular it is easier to render good-looking web versions of the documents, which we do on the curl web site, and it offers cross-document linking etc. Over the years we have converted almost all our text documents into markdown. Markdown is also a user-friendly format as it is easy to read and easy to edit for contributors, new and old.

curldown

The move away from using nroff for the libcurl man pages came with the introduction of the “curldown” document format in early 2024. It is meant to make it easy to generate the same kind of nroff files we previously hand-edited, but using “markdown”. I use quotes around it because it is not full-fledged markdown support, but it looks similar enough to trick casual users, and we decided to use “.md” file extensions to make editors, GitHub and more treat and show the files as such.

The file format is inspired by the one we created for the command line tool: a header holding meta-data followed by simplified markdown. Easy to read, easy to edit – for everyone. No more strange nroff syntax to copy and paste. The fact that GitHub renders the source files clean and good-looking is also attractive.

Switching to a markdown lookalike format had some other side-effects: suddenly a lot of CI jobs we have running caught these files and had a go at them: spellchecks, prose checks, checks for capitalized words after periods and more. This improved the language significantly.

Also, generating the nroff files rather than editing them has more benefits than just avoiding the baroque syntax: it also allows us to change styles and escape certain sequences throughout all the documentation with just a few fixes in the scripts producing the output.

After this patch, with 47,000 lines added and removed, was merged, we have a completely new source format system for the libcurl docs that generates the same style of nroff output we had previously. With improved language.

But then why not…

With the huge switch to “curldown” for libcurl options, it felt wrong to keep the old curl .d files for the curl.1 man page, seeing as they were almost there already. I did the work and converted them over to the same markdown lookalike style I used for libcurl: they are now also .md files and they too get spell-checked and scrutinized much more.

I took this move one step further: previously we generated the single curl.1 man page using a fixed header file and a footer file that we prepended and appended to the nroff generated from the (converted) .d documentation.

Now, we instead have a mainpage.idx file that lists which (markdown) file names to include and convert to nroff. This allows us to split the two big headers and footers into several smaller markdown files, each with its own header. Like “DESCRIPTION”, “URL”, and “GLOBBING”. Separating them makes them easier to manage for contributors and the index file makes it easy to change the order of sections etc.

We end up with a curl.1 man page that looks quite similar to before, but with improved language and a format that is easier to edit going forward.

When I type this, we have 836 files with .md extensions in the curl git repository. At the time of the most recent release, we had 61.

Future

I’m sure it does not stop here. I already have some ideas on how to improve this setup further.

I’m considering doing something to completely avoid the nroff step for the website HTML versions of the man pages and instead go straight from markdown.

The curl tool has the manpage built in (shown with the -M option), and I would like to fix the build to do this without using nroff.

I have some more CI jobs to add to verify and check the language in the markdown files. Mostly to make sure the language is consistent.

mastering libcurl

On November 16 2023, I will do a multi-hour tutorial video on how to use libcurl. How to master it. As a follow-up to my previous video class called mastering the curl command line (at 13,000+ views right now).

This event will run as a live-stream and webinar combination. You can opt to join via Zoom or Twitch. Zoom users can ask questions using voice, Twitch viewers can do the same using the chat. To attend via Zoom you need to register; for Twitch you can just show up.

It starts at 18:00 CET (09:00 PST, 17:00 UTC)

This session is part one of the planned two-part series. The amount of content is simply too much for me to deliver in a single sitting with intact quality. The second part will happen the following Monday, on November 20. Same time.

Both sessions will be recorded and will be made available for browsing after the fact.

At the time of putting this blog post up, the presentation and therefore the agenda is not complete. I have also not yet figured out exactly how to do the split between the two episodes. Below you will find the planned topics that will be covered over the two episodes. Details will of course change in the final presentation.

The idea is that very little about developing with libcurl should be left out. This is the most thorough, most advanced, most in-depth libcurl video you ever saw. And it will be chock-full of source code examples. All examples that are shown in the video are also provided stand-alone for easy browsing, copy and paste and later reading in a dedicated mastering libcurl GitHub repository.

Part one

The slides for part one.

Part two

The slides for part two.

Mastering libcurl

The project

  • Reminders about the curl project

getting it

  • installing
  • building
  • debugging
  • free support
  • paid support

API and ABI

  • compatibility
  • versions
  • the API is for C
  • header files
  • compiling libcurl programs
  • documentation

Architecture

  • C89
  • backends
  • everything is non-blocking

API fundamentals

  • content ignorant / URLs / callbacks
  • basic by default
  • global init
  • easy handles
  • easy options
  • curl_easy_setopt
  • curl_easy_perform
  • curl_easy_cleanup
  • write callback
  • multi handle
  • multi perform
  • curl_multi_info_read
  • curl_multi_cleanup
  • a multi example
  • caches
  • curl_multi_socket_action
  • curl_easy_getinfo
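To give a taste of the easy interface items listed above, here is a minimal sketch of the typical init/setopt/perform/cleanup flow. This is a simplified example of my own, not taken from the presentation, and the URL is just a placeholder:

#include <stdio.h>
#include <curl/curl.h>

int main(void)
{
  CURL *curl;
  CURLcode res;

  /* global init: once per program, before any other libcurl call */
  curl_global_init(CURL_GLOBAL_DEFAULT);

  curl = curl_easy_init();                /* create an easy handle */
  if(curl) {
    curl_easy_setopt(curl, CURLOPT_URL, "https://example.com/");
    res = curl_easy_perform(curl);        /* the blocking transfer */
    if(res != CURLE_OK)
      fprintf(stderr, "transfer failed: %s\n", curl_easy_strerror(res));
    curl_easy_cleanup(curl);              /* free the handle */
  }

  curl_global_cleanup();
  return 0;
}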

Puppeteering

  • verbose
    debug function
    tracing
  • curl_version / curl_version_info
  • persistent connections
  • multiplexing
  • Downloads
  • Storing downloads
  • Compression
  • Multiple downloads
  • Maximum file size
  • Resuming and ranges
  • Uploads
  • Multiple uploads
  • Transfer controls
  • Stop slow transfers
  • Rate limiting
  • Connections
  • Name resolve tricks
  • Connection timeout
  • Network interface
  • Local port number
  • Keep alive
  • Timeouts
  • Authentication
  • .netrc
  • return codes
  • --libcurl
  • post transfer meta-data
  • caches
  • some words on threads
  • error handling with libcurl

Share API

  • sharing data between easy handles

TLS

  • enable TLS
  • ciphers
  • verifying server certificates
    custom checks
  • client certificates
  • on TLS backends
  • SSLKEYLOGFILE

Proxies

  • Proxy type
  • HTTP proxy
  • SOCKS proxy (tor)
  • Authentication
  • HTTPS proxy
  • Proxy environment variables
  • Proxy headers

HTTP

  • Ranges
  • HTTP versions
  • Conditionals
  • HTTP POST
    data with callback
  • Multipart formpost
  • Redirects
  • Modify the HTTP request
  • HTTP PUT
  • Cookies
  • Alternative Services
  • HSTS
  • HTTP/2
  • HTTP/3

HTTP header API

  • get specific header field after transfer
  • iterate over many headers

URL API

  • Parse a URL
  • extract components
  • update components
  • URL encoding/decoding
  • IDN encoding/decoding
  • redirects

WebSocket

  • just a quickie, I did a separate websocket video recently
    https://youtu.be/NLIhd0wYO24

Deviating from specs

tl;dr: we do not particularly keep track of, nor document, curl’s exact spec compliance. I cannot fathom how we could.

Today, in October 2022, curl and libcurl combined consist of nearly 150,000 lines of source code (not counting blank lines). 19% of those are comments.

This source code pile was carefully crafted with the sole purpose of performing Internet transfers using one or more of the 28 separate supported protocols. (There are 28 different supported URL schemes; whether they also count as 28 protocols can be debated.)

Which specs does curl use

It was recently proposed to me that we should document which RFCs curl adheres to and follows, and what deviations there are, in the name of helping users understand what to expect from curl and educating the world about how curl will behave.

This is indeed a noble idea and a worthy goal. We do not want to surprise users. We want them to know.

It was suggested that it might have a security impact if curl deviates from a spec, and if this is not documented clearly, users could be misled.

What specs

curl speaks TCP/IP (and UDP or QUIC at times), it does DNS and DNS-over-HTTPS, it speaks over proxies and it speaks a range of various application protocols to perform what is asked of it. There are literally hundreds of RFCs to read to catch up on all the details.

A while ago I collected what I consider the most important RFCs to read to figure out how curl works and why. That is right now 149 specification documents at a total of over 300,000 lines of text. (It was not done very scientifically.)

Counting the words in these 149 documents, they add up to 1.6 megawords in total: many more than the entire Harry Potter series, with the Lord of the Rings series (including the Hobbit) together with War and Peace far behind.

Luckily, specs are mostly reference literature and we rarely have to read through them all to start our journey, but we often need to go back to check details.

Everything changes over time

The origins of curl trace back to late 1996 and it has been in constant development since then. curl, the Internet and the specifications have all changed significantly over these years.

The specifications that were around when we started have generally been updated multiple times, while we struggle to maintain behavior and functionality for our users. It is hard to spot and react to minor changes in specification updates. They might have been done to clarify a situation, but sometimes such a clarification ends up triggering a functionality change in our code.

Sometimes an update to a spec is even largely ignored by fellow protocol implementers out there in the world, and for the sake of interoperability we too then need to adjust our interpretations so that we work similarly to our peers.

Expectations from users change as values and terms are established in people’s minds rather than in specs. For example: what exactly is the “URL” you see in the browser’s top bar?

Over time, other tools and programs that also work on URLs and on the Internet gradually change as they too develop and slowly morph into new beings we did not foresee decades ago. This changes perceptions and expectations in the user base at large.

The ever-changing nature of the Internet creates interoperability challenges every so often: out of the blue, a team of protocol implementers can decide to interpret an existing term or a passage in a specification differently one day. When the whole world takes a turn like that, we are sometimes forced to follow along as that is then the new world view.

Another complication is also that curl uses (several) third party libraries for parts of its operations, and some of those details are of course also covered by RFCs.

Guidelines

Our primary guidelines when performing Internet transfers are:

  1. Follow established standard protocol specifications
  2. Security is a first-tier property
  3. Interop widely
  4. Maintain behavior for existing features

As you can figure out for yourself, these four bullet points often collide with each other. Checking off all four is not always possible. They can be hard enough on their own.

Protocol specifications

There are conflicting specifications. Specifications vary over time. They can be hard to interpret to figure out exactly what they say one should do.

Security

Increasing security might at the same time break existing use cases for existing users. It might violate what the specs say. It might add friction in the ability to interoperate with others. It might not even be allowed according to specifications.

Interop

This often means not following specifications the way we want to read them, because apparently others do not read them the same way, or sometimes they just disregard what the specs say. At times, it is hard to increase security levels by default because it would hamper interop with others.

Maintain behavior

The scripts written 15 years ago that use curl should continue working. The applications written to use libcurl can upgrade libcurl and its Internet transfers just continue. We do not break existing established behaviors. This may very well conflict both with interop and protocol updates, and sometimes it is hard to tighten the security because it would hurt a certain share of existing users.

How does curl deviate from which specs?

I consider this question more or less impossible to answer, to document and to keep accurate over time. It would be a huge and energy-consuming effort to get the list done in the first place, and a monster task to maintain. And it would involve a lot of gray zones.

What is important to me is not what RFCs curl follows nor what or how it deviates from them. I have also basically never gotten that question from a user.

Users want reliable Internet transfers that are secure and interoperate correctly and conveniently with other “players” out there. They want consistent behavior and backwards compatibility.

If you use curl to perform feature X over protocol Y version Z, does it matter which set of RFCs this would touch, and does anyone care about the struggles we have been through when we implemented this set? How many users can even grasp or follow the implications of mentioning that for RFC XYX section A.B we decided to disregard a SHOULD NOT at times?

And how on earth would we keep that up-to-date when we do bugfixes and RFCs are updated down the line?

No one else documents this

The browsers have several hundred paid engineers on staff and they do not provide documentation like this. Neither does any curl alternative or competitor, to my knowledge.

I don’t know of any tool or software anywhere that offers such deviation documentation, and I can perfectly understand and sympathize with why that is so.

Taking curl documentation quality up one more notch

tl;dr: test and verify as much as possible, also in the documentation.

I’m a sloppy typist. When I write several words in a row, like for example when creating complete sentences for something like a blog post, one or two of the words end up slightly misspelled.

Sure, many editors and systems have runtime spellchecks these days and they make it easy to quickly fix typos, but not all systems are like that and there are also situations where there are many false positives due to formatting or just the range of “special” words. They also rarely yell at me when I overuse the word “very” or start sentences with “But”.

curl documentation

I work fiercely on making the curl and libcurl documentation top-notch, state of the art, good and complete. I want my users to feel that. Everything is documented: clearly, with details and examples.

I want and aim for libcurl to be the best documented software library in the world.

Good documentation does not come for free or easily. It requires dedicated work and a lot of effort put into it.

This is of course a never-ending effort as things change over time and we have an almost ridiculous amount of options and details to document.

The key to improve ourselves is of course two good old classics: tests and CI jobs. This works great even for documentation, and perhaps in particular for technical documentation that includes lots of symbols and name references that need to be correct.

As I have recently worked on tightening some bolts and made it harder to land typos, I wanted to take the opportunity to describe some of our ways.

symbols-in-versions

In early 2009 I had some interactions with people in the git project and we discussed their use of libcurl. As we introduce new features to curl over time, users who build with curl may want to write their code to conditionally use the new stuff if they have a new enough libcurl installed, or just skip those features if the installation is too old. git is an application like that. They use libcurl a lot and they offer to build with libcurl installations that are maybe a dozen years old.
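For an application in that situation, the common pattern is to check the libcurl version at build time and only use newer options when the installed headers know about them. A sketch of that pattern (the option chosen here, CURLOPT_UNIX_SOCKET_PATH from 7.40.0, is just one example of a later addition; the socket path is a placeholder):

#include <curl/curl.h>

static void apply_options(CURL *curl)
{
  curl_easy_setopt(curl, CURLOPT_URL, "https://example.com/");

  /* LIBCURL_VERSION_NUM is 0xXXYYZZ for version XX.YY.ZZ, so only
     compile in options that this libcurl actually provides */
#if LIBCURL_VERSION_NUM >= 0x072800 /* 7.40.0 or later */
  curl_easy_setopt(curl, CURLOPT_UNIX_SOCKET_PATH, "/var/run/example.sock");
#else
  /* older libcurl: skip the feature or use a fallback */
#endif
}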

I then created a file in the libcurl git repository that I named symbols-in-versions. It lists all publicly provided curl symbols and in which libcurl release they were introduced. A good resource for libcurl users. It took quite an effort to figure them all out after the fact.

Over time, the number of entries in this file has grown significantly.

Tests

Of course, in order to do good CI jobs, they need to have tests to run so we start there.

I will mention some test numbers below. The test numbers in curl do not have any inherent meaning; they are just unique identifiers that help us find the test source files and refer to tests and their failures easily.

Test 1119

Test 1119 was introduced in November 2010 as I realized I needed to make sure that symbols-in-versions (SIV) is kept up-to-date. It will be a useless document if it lags behind or misses symbols. It needs to include them all and the info needs to be correct.

I wrote a script that extracts all globally provided symbols in some curl header files and then verifies that they are all listed in SIV.

This test now made it very clear when we forgot to add a name to SIV, and it also pointed out if one of the names in SIV for example had a typo.

Test 1139

Scan SIV, figure out all existing options provided for three key libcurl functions: curl_easy_setopt, curl_multi_setopt and curl_easy_getinfo. Then verify that they all are mentioned in the respective “main” man page (curl_easy_setopt.3 etc), where they refer to the individual separate page for the option.

This test also verifies that the curl tool’s man page (curl.1) lists exactly the same set of command line options as is listed in the tool’s source code file tool_getparam.c and that is shown in the tool’s --help output. Consistency is king.

Test 1167

To make sure the symbols we provide in libcurl header files all use the correct name space, we created test 1167. Using the correct name space in this context means that all publicly provided symbols need to start with curl or libcurl, case insensitively. It is important for several reasons: first of course because a good library does not pollute the name space and risk collisions and problems, but also because using the correct prefix means that test 1119 finds and verifies all the symbols correctly.

Test 1173

For libcurl we have several function calls that take options. In some cases these functions accept a very large number of different options. Every such option is documented in its own dedicated man page. Over time, with lots of contributors working on the project, the different man pages did not all include the same information in the same order, and a huge portion of them even missed one of the most important details in programming documentation: examples.

Test 1173 checks all libcurl man pages and verifies that they have the eight mandatory libcurl sections present (NAME, SYNOPSIS, DESCRIPTION etc), that they all are in the right order, and that there is an example section that is more than two lines long.

This test also does basic nroff formatting verification so that we know the page will look decent in a man page viewer too.

Helps us greatly – especially when we add new man pages.

Proselint

The tool that taught me to stop using the word “very”, and that also finds a lot of other common bad takes on English, is called proselint. Since a while back, we run a CI job that runs proselint on all markdown files in the curl git repository. It helps us detect and edit away some amount of bad language.

Spellcheck

At the time of this writing, there are 482 individual libcurl related man pages and there is a total of around 85,000 lines of documentation in the project. I decided we should run a spellcheck on these man pages in an attempt to reduce the number of typos and mistakes.

The CI job I created for this first strips out some sections from the man pages that we deem too hard to spellcheck: the SYNOPSIS and the EXAMPLES sections for example. The script also removes all names that look like public curl symbols, as spellchecking them with a normal spellchecker is just impossible and they need special treatment. See further below for that.

Finally, we convert the stripped man page versions into markdown – because we have no spellchecker tools for nroff – and then spellcheck those.

It took far more hours than I had anticipated to eradicate all the spelling mistakes, and we ended up with a custom dictionary with over 800 words that aspell does not like but that I insist are valid for us.

Verify curl symbols

As I mentioned above, we strip out the curl symbols to hide them from the spellchecker.

Instead, I extended the test 1119 mentioned above to also scan through all the libcurl man pages and find every single mention of something that looks like the name of a public curl symbol – and then match those against the names present in SIV and output an error for any referenced symbol that is not documented there and therefore not actually a public curl symbol. With this, no man page can reference a non-existing curl symbol. Every such typo is detected.

Links for reporting on docs bugs

No matter how hard we try, there will always be errors that sneak in anyway and there will be sentences and phrasing that might have felt good at the time of writing but later, in the view of someone else, do not communicate the right message or maybe mislead users to misunderstand functionality.

Bug reports on documentation are key to finding such warts so that we can correct them. In the curl project we make it as easy as possible to report bugs in documentation by providing direct links on virtually all man pages shown on the website. The link takes you directly to the “new issue” page with a template subject filled in with the man page’s name.

This convenience unfortunately leads to a certain amount of “issue spam” but I think that is still a fairly cheap price to pay.

Everything curl

The book is a treasure trove of additional and complementary curl documentation but it is actually written and maintained outside of the curl repository. It has its own set of CI tests, including proselint and spellchecks.

Further

All these tests have been added gradually and slowly over a long period. It gives us time to polish and work out possible flaws in the tests and lets us make sure they work as intended and don’t block development.

I don’t have any immediate pending new pull requests for checking the curl documentation but if there still are details in there that we can check that we currently do not, I am sure that we will find those over time and make sure we verify them too.

If you have ideas and suggestions, I am all ears.

Related

Making world-class docs takes effort

New HTTP core specs

Before this, the latest refreshed specification of HTTP/1.1 was done in the RFC 7230 series, published in June 2014. After that, HTTP/2 was done in the spring of 2015 and recently the HTTP/3 spec has been a work in progress.

To better reflect this new world of multiple HTTP versions and an HTTP protocol ecosystem that has some parts that are common for all versions and some other parts that are specific for each particular version, the team behind this refresh has been working on this updated series.

My favorite documents in this “cluster” are:

HTTP Semantics

RFC 9110 basically describes how HTTP works independently of and across versions.

HTTP/1.1

RFC 9112 replaces 7230.

HTTP/2

RFC 9113 replaces 7540.

HTTP/3

RFC 9114 is finally the version three of the protocol in a published specification.

Credits

Top image by Gerhard G. from Pixabay. The HTTP stack image is done by me, Daniel.

Uncurled

– Everything I know and learned about running and maintaining Open Source projects for three decades.

For several years now, I have had a blog post series in mind to describe something about what people could expect to happen in Open Source projects. I had a few already half-started blog post drafts for some sub topics.

I couldn’t really make up my mind how to craft a series of blog posts about this wide topic in a sensible way so I kept postponing it for later. I did this for years.

A book, it has to be a book

It just dawned on me one day: the only way to get all this into a comprehensible form that also can hold all the thoughts I would like it to have is to put it into a book. By book, I mean a document. An essay. A collection of pages. A booklet maybe. I don’t know how many words it might end up becoming and I have no illusions of it ever ending up in print.

I mean to write the document in the open and provide it for free, online. Open Source style.

Day one

I grabbed my original draft for my blog series “You can expect this in your Open Source project”. I had worked on that document in the background for a long time, adding some little thing here and there over years – and it now had maybe twenty-five “lessons” listed with a short paragraph of text next to each.

I also had started three blog posts based on such lessons that were in pending state here on daniel.haxx.se in my queue of drafts.

I first copied the blog post content back into the text file from those potential blog posts, before I deleted them, and converted the entire file to markdown.

I then grouped the “lessons” I had listed in the markdown file and moved them into a few different sections. Like what to expect, code, money, people and project. I put subtitles into separate files for those five main areas.

How hard can it be?

I didn’t want to do a lot of work before I put the thing into git, and I didn’t want to run any private git repository so I had to make a new repo with a name. I went with “How hard can it be” as a working title and created the repo on GitHub. On April 6 I made the first git push with initial contents to that repository.

The first external contributor appeared after just a few minutes with the first pull-request fixing typos. Clearly people are following me on GitHub and spotted the creation of the repository and checked out what it was. I hadn’t told anyone or given any pointers.

I started expanding on subjects in the book.

Let’s get a real title

In the evening of April 7 I posted this question on Twitter:

"If I write a booklet collecting everything I know and learned about running and maintaining Open Source projects for three decades, what should I call it?"

I got a flood of replies. Lots of good ones and also lots of fun and sarcastic ones. The one that really spoke to me was also the shortest: Uncurled.

  • It’s short and sweet
  • It includes a reference to curl without saying it is “a curl book” (it isn’t)
  • The topic is a bit about “untangling” and curl is a project that probably has taught me the most of what I include here
  • It sounds a little like “debriefed” from the curl project, and it is…
  • I can put it up on the domain name un.curl.dev

I figured I could possibly go with a longer subtitle that could explain the book more: “Everything I know and learned about running and maintaining Open Source projects”.

A name

I renamed the GitHub repository and added a description there. I created the URL (by adding the “un” CNAME entry in the “curl.dev” domain) and I set up gitbook.com to render the content to appear on un.curl.dev.

With a little more thought, and after spilling some beans about my plans in my weekly report on April 8 (but not leaking the URL or repo to anyone yet), which made people provide some more ideas, I added more content.

10,000 words

By the evening of April 9, I had surpassed 10,000 words of content, still with the contents and the order of everything pretty much in flux and not yet sorted out.

20,000 words

On April 25, I surpassed 20,000 words. It starts to look like something I can announce soon.

Getting there, but not done

The uncurled book is now in a state I think I can show off without feeling embarrassed. I believe I will still need to work on it more going forward, to add and polish content and make it more coherent and less of a collection of snippets. I hope that I can settle down over time and gradually slow down the pace of change. It will of course also depend a lot on the feedback I get.

Cover

Since it doesn’t exist physically and probably never will, I don’t think it actually needs a cover image, but it would probably be cool to still have one to use as an image and symbol for the book. If someone has a good idea or feels artistically inclined to make one, let me know!

Making world-class docs takes effort

Here are six requirements that I have on a project for it to reach my gold approval for stellar docs. Then something about what I’ve done recently to further improve the docs for curl.

Your docs belong in the code repository

It needs to be next to the code so that authors and contributors can update/read the docs while working on the code or docs. Providing it in a separate repository or otherwise separated will undoubtedly lead to discrepancies sooner or later. Similar to how all wikis are always wrong.

Your docs are not extracted from code

Admit it. If you browse around you will realize that the best documented projects you find never provide docs generated directly from code. Such generated docs can still provide value, but you will not reach gold level without more effort.

Separated docs encourage more writing, and writing by people who are otherwise scared of touching the code or who think that fixing spelling errors in the docs isn’t worth patching the code for. It also allows for other and better formatting of the docs.

Your docs features examples

Users always ask for (more) examples. You can never go wrong by providing examples. If your docs don’t have enough examples, you’re not doing the docs well enough yet.

You document every API call you provide

Fewer things make me sigh more than when I have to dig up the source code and from that try to figure out exactly how an external and publicly provided API works. Yet this still happens regularly even for libraries that have been released and maintained for decades.

In one fairly recent instance I reported such an omission in a popular library. It took them over two years to add it.

Long-living libraries should also provide information about from which versions certain functionality or options exist or can be expected to work or not work.

Your docs are easily accessible and browsable

Good docs means that we can find what we’re looking for and that the documentation flows and is easily read and understood. Ideally, even simple google searches for API details in your library should lead us to suitable entry points.

Preferably, the documentation should also be provided for proper off-line reading, meaning man pages or something similar that can be browsed when disconnected from the Internet.

Your docs should be easy to contribute to

The docs should be easy for contributors to help out with (independently from the code if desired). That also includes that they should be easy for contributors to build and render locally so that they can test and view their updates while working on them.

Documentation in the curl project

I want the documentation for curl and libcurl to be known, recognized and widely admitted to be world-class.

I want the curl documentation to be of a quality and content to make users not able to find competitors or similar projects with better docs.

Documentation in curl is not an after-thought. It is not a second-tier component. It is a crucial and important foundation that allows users to use, trust and rely on our products. We require that new changes or improved functionality are provided with the corresponding updated and accurate documentation.

We also try to verify and check the docs as much as possible with scanners, tests and tools.

Non-stop iterating is key

I maintain that our documentation is as good as it is today a lot thanks to us very rigidly sticking to our guiding principles: compatibility and not breaking existing behavior. Documentation we wrote decades ago is still valid. It gives us plenty of time to keep refining and polishing the documentation of a feature that doesn’t change. No documentation was perfect already at the first attempt, but after numerous iterations and improvements chances are it is better. Time is on our side. And we are never done, documentation can always be improved.

I’m putting in the work

A few days ago I talked documentation with someone and when doing so I thought about what guiding principles I think we should put on project documentation. What I’ve listed above basically.

It then dawned on me that the current curl.1 man page is actually not featuring that many examples, in spite of it being 3535 lines long. I pondered a little bit on how best to add some, and then dove in and extended our system that generates it from the hundreds of individual files that each describe a single command line option. They should of course all offer at least one example!

Having 242 command line options (as of right now, it will be more soon), that sudden idea that seemed simple enough turned into quite a gruesome amount of work, and I spent many hours walking over the options to make up examples. I also made sure that our build system now returns an error if there’s a command line option without an example in the documentation! This way, we can be sure that all future command line options will also have examples in the man page.

This made the curl.1 man page grow by over 1200 lines!

libcurl options too

A few years ago I did a manual effort and made sure most man pages for libcurl options include examples, but I never made that into a test or anything so there’s nothing that forces us to stick to this.

Having started this journey, I decided now was the time to add that requirement to the scripts. I extended test case 1173, which already scans all man pages to verify some basic syntax, to also check that man pages for options feature an EXAMPLE section.

There are at this moment 374 stand-alone man pages for libcurl options. Only ten of them were detected to not feature good enough examples and it wasn’t very hard to fix this.

Consistency is good

Having just extended the man page checker, another idea came up.

I made the script also check that each man page features the correct set of sections, in the required order! I’m a true believer in consistency, and using the same set in the same order will make the docs easier to read and to find information in. Checking for these things will make sure that all future additions are forced to stick to the same.

The mandatory sections for libcurl option man pages are right now, in this order:

NAME
SYNOPSIS
DESCRIPTION
PROTOCOLS
EXAMPLE
AVAILABILITY
RETURN VALUE
SEE ALSO

These man pages are allowed to have other sections as well, and those can be placed anywhere among the mandatory ones, but the eight section headers that have to be there have to be in that order.

Cross-references

While at it, I also extended the man page scanner to check that all references in all curl man pages to libcurl options actually refer to existing options, to find typos. Ironically, this extra check turned out to find exactly no such typos in the current 463 man pages!

Future

The outcome of the work I write about here will of course be merged asap and will be part of future releases and on the website.

We should keep thinking of more ways to improve the documentation and for more ways to verify and cross-reference things mentioned in the docs to increase its accuracy and detect typos.

If you find a problem, an inferior wording or just something you think we should improve in any curl documentation, file a bug or a PR! We also try to make this as easy as possible for users to do directly from the curl website by providing bug-reporting direct links from the documentation pages.

Updates

I added the sixth “rule” days after the initial post after Willy Tarreau’s feedback.

The libcurl transfer state machine

I’ve worked hard on making the presentation I ended up calling libcurl under the hood. A part of that presentation is spent on explaining the main libcurl transfer state machine, and here I’ll try to document some of that in written form. Understanding the main transfer state machine in libcurl could be valuable and interesting for anyone who wants to work on libcurl internals and maybe improve it.

Background

The state is kept in the easy handle, in a struct field called mstate. The source file for this state machine is called multi.c.

An easy handle is always in exactly one of these states for as long as it exists.

This transfer state machine is designed to work for all protocols libcurl supports, but basically no protocol will transition through all states. As you can see in the drawing, there are many different possible transitions from a lot of the states.

libcurl transfer state machine


Start

A transfer starts up there above the surface in the INIT state. That’s a yellow box next to the little start button. Basically, the boat shows how it goes from INIT to the right over to MSGSENT with its finish flag, but the real path is all done under the surface.

The yellow boxes (states) are the ones that exist before or when a connection is set up. The striped background is for all states that have a single and specific connectdata struct associated with the transfer.

CONNECT

If there’s a connection limit, either in total or per host etc, the transfer can get sent to the PENDING state to wait for conditions to change. If not, the state probably moves on to one of the blue ones to resolve host name and connect to the server etc. If a connection could be reused, it can shortcut immediately over to the green DO state.

The green states are all about setting up the connection to a state of fully connected, authenticated and logged in. Ready to send the first request.

DO

The green DO states are all about sending the request with one or more commands so that the file transfer can begin. There are several such states to properly support all protocols but also for historical reasons. We could probably remove a state there by some clever reorgs if we wanted.

PERFORMING

When a request has been issued and the transfer starts, it transitions over to PERFORMING. In the white states data is flowing. Potentially a lot. Potentially in both or either direction. If during the transfer curl finds out that the transfer is faster than allowed, it will move into RATELIMITING until it has cooled down a bit.

DONE

All the post-transfer states are red in the picture. DONE is the first of them, and after having done what it needs to round off the transfer, it disassociates from the connection and moves to COMPLETED. There are no stripes behind that state. Disassociate here means that the connection is returned to the connection pool for later reuse, or closed in the worst case, if it is deemed that it can’t be reused or if the application has instructed it so.

As you’ll note, there’s no disconnect anywhere in the state machine. This is simply because the disconnect is not really a part of the transfer at all.

COMPLETED

This is the end of the road. In this state a message will be created and put in the outgoing queue for the application to read, and then as a final step it moves over to MSGSENT where nothing more happens.

A typical handle remains in this state until the transfer is reused and restarted, in which case it will be set back to the INIT state again and the journey begins anew. Possibly with other transfer parameters and URL this time. Or perhaps not.
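To summarize the journey in code form, here is a condensed, purely illustrative enum of the states this post walks through. The real enum in multi.c has more states and different names:

/* Illustrative only: not the actual names used in curl's multi.c */
typedef enum {
  XFER_INIT,         /* transfer created, nothing done yet */
  XFER_PENDING,      /* parked, waiting for a connection slot */
  XFER_CONNECT,      /* resolving the name and connecting */
  XFER_DO,           /* sending the request or commands */
  XFER_PERFORMING,   /* data flowing in either or both directions */
  XFER_RATELIMITING, /* cooling down a too fast transfer */
  XFER_DONE,         /* rounding off, handing back the connection */
  XFER_COMPLETED,    /* result message queued for the application */
  XFER_MSGSENT       /* end of the road, until the handle is reused */
} transfer_state;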

State machines within each state

What this state diagram and explanation do not show is of course that in each of these states there can be protocol specific handling, and each of those protocol specific functions might of course have their own internal state machines to control what to do and how to handle the protocol details.

Each protocol in libcurl has its own “protocol handler” and most of the protocol specific stuff in libcurl is then done by calls from the generic parts to the protocol specific parts with calls like protocol_handler->proto_connect() that calls the protocol specific connection procedure.

This allows the generic state machine described in this blog post to not really know the protocol specifics, and yet all of the currently supported 26 transfer protocols can be handled.
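A rough sketch of what such a handler table could look like, assuming a struct of function pointers per protocol. The names here are made up for illustration; the real struct in the curl source is larger and named differently:

struct transfer;   /* opaque per-transfer state, for illustration only */

/* One instance of this per supported protocol. The generic state
   machine calls through these pointers without knowing which protocol
   it is currently driving. */
struct protocol_handler {
  const char *scheme;                       /* "http", "ftp", "imap", ... */
  int (*proto_connect)(struct transfer *t); /* protocol-specific connect */
  int (*proto_do)(struct transfer *t);      /* issue the request/commands */
  int (*proto_done)(struct transfer *t);    /* protocol-specific wrap-up */
  int default_port;
};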

libcurl under the hood – the video

Here’s the full video of libcurl under the hood.

If you want to skip directly to the state machine diagram and the following explanation, go here.

Credits

Image by doria150 from Pixabay

curl help remodeled

curl 4.8 was released in 1998 and contained 46 command line options. curl --help would list them all. A decent set of options.

When we released curl 7.72.0 a few weeks ago, it contained 232 options… and curl --help still listed all available options.

What was once a long list of options grew over the decades into a crazy long wall of text that shocked users who entered this command and option thinking they would figure out what command line options to try next.

--help me if you can

We’ve known about this usability flaw for a while, but it took us some time to figure out how to approach it and decide what the best next step would be. Until this year, when long-time curl veteran Dan Fandrich did his presentation at curl up 2020 titled --help me if you can.

Emil Engler subsequently picked up the challenge and converted ideas surfaced by Dan into reality and proper code. Today we merged the refreshed and improved --help behavior in curl.

Perhaps the most notable change in curl for many users in a long time. Targeted for inclusion in the pending 7.73.0 release.

help categories

First out, curl --help will now by default only list a small subset of the most “important” and frequently used options. No massive wall, no shock. Not even necessary to pipe it to more or less to view it properly.

Then: each curl command line option now has one or more categories, and the help system can be asked to just show command line options belonging to the particular category that you’re interested in.

For example, let’s imagine you’re interested in seeing what curl options provide for your HTTP operations:

$ curl --help http
Usage: curl [options…]
http: HTTP and HTTPS protocol options
--alt-svc Enable alt-svc with this cache file
--anyauth Pick any authentication method
--compressed Request compressed response
-b, --cookie Send cookies from string/file
-c, --cookie-jar <filename> Write cookies to <filename> after operation
-d, --data HTTP POST data
--data-ascii HTTP POST ASCII data
--data-binary HTTP POST binary data
--data-raw HTTP POST data, '@' allowed
--data-urlencode HTTP POST data url encoded
--digest Use HTTP Digest Authentication
[...]

list categories

To figure out what help categories exist, just ask with curl --help category, which will show you a list of the current twenty-two categories: auth, connection, curl, dns, file, ftp, http, imap, misc, output, pop3, post, proxy, scp, sftp, smtp, ssh, telnet, tftp, tls, upload and verbose. It will also display a brief description of each category.

Each command line option can be put into multiple categories, so the same option may be displayed both in the “http” category and in “upload” or “auth” etc.
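One natural way to implement that kind of many-to-many tagging is to give each category a bit and let every option carry a bitmask. This is just a hypothetical sketch of the idea, not curl’s actual source; the option lines are borrowed from the --help output shown above:

/* Hypothetical sketch: one bit per help category, a bitmask per option */
#define CAT_HTTP      (1u << 0)
#define CAT_AUTH      (1u << 1)
#define CAT_UPLOAD    (1u << 2)
#define CAT_IMPORTANT (1u << 3)
/* ...one bit for each of the twenty-two categories... */

struct help_entry {
  const char *option;      /* option name(s) as shown by --help */
  const char *description; /* the one-line description */
  unsigned int categories; /* which categories this option belongs to */
};

static const struct help_entry help_table[] = {
  {"-d, --data", "HTTP POST data",                 CAT_HTTP | CAT_IMPORTANT},
  {"--anyauth",  "Pick any authentication method", CAT_HTTP | CAT_AUTH},
  {"--digest",   "Use HTTP Digest Authentication", CAT_HTTP | CAT_AUTH},
};

Printing the help for a category is then just a matter of looping over the table and showing the entries whose bitmask includes the requested bit.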

--help all

You can of course still get the old list of every single command line option by issuing curl --help all. Handy for grepping the list and more.

“important”

The meta category “important” is what we use for the options that we show when just curl --help is issued. Presumably those options should be the most important, in some ways.

Credits

Code by Emil Engler. Ideas and research by Dan Fandrich.