Tag Archives: cURL and libcurl

Fewer mallocs in curl

Today I landed yet another small change to libcurl internals that further reduces the number of small mallocs we do. This time the generic linked list functions got converted to become malloc-less (the way linked list functions should behave, really).

Instrument mallocs

I started out my quest a few weeks ago by instrumenting our memory allocations. This is easy since we have our own memory debug and logging system in curl since many years. Using a debug build of curl I run this script in my build dir:

#!/bin/sh
export CURL_MEMDEBUG=$HOME/tmp/curlmem.log
./src/curl http://localhost
./tests/memanalyze.pl -v $HOME/tmp/curlmem.log

For curl 7.53.1, this counted about 115 memory allocations. Is that many or a few?

The memory log is very basic. To give you an idea what it looks like, here’s an example snippet:

MEM getinfo.c:70 free((nil))
MEM getinfo.c:73 free((nil))
MEM url.c:294 free((nil))
MEM url.c:297 strdup(0x559e7150d616) (24) = 0x559e73760f98
MEM url.c:294 free((nil))
MEM url.c:297 strdup(0x559e7150d62e) (22) = 0x559e73760fc8
MEM multi.c:302 calloc(1,480) = 0x559e73760ff8
MEM hash.c:75 malloc(224) = 0x559e737611f8
MEM hash.c:75 malloc(29152) = 0x559e737a2bc8
MEM hash.c:75 malloc(3104) = 0x559e737a9dc8

Check the log

I then studied the log closer and I realized that there were many small memory allocations done from the same code lines. We clearly had some rather silly code patterns where we would allocate a struct and then add that struct to a linked list or a hash and that code would then subsequently add yet another small struct and similar – and then often do that in a loop.  (I say we here to avoid blaming anyone, but of course I myself am to blame for most of this…)

Those two allocations would always happen in pairs and they would be freed at the same time. I decided to address those. Doing very small (less than say 32 bytes) allocations is also wasteful just due to the very large amount of data in proportion that will be used just to keep track of that tiny little memory area (within the malloc system). Not to mention fragmentation of the heap.

So, fixing the hash code and the linked list code to not use mallocs were immediate and easy ways to remove over 20% of the mallocs for a plain and simple ‘curl http://localhost’ transfer.

At this point I sorted all allocations based on size and checked all the smallest ones. One that stood out was one we made in curl_multi_wait(), a function that is called over and over in a typical curl transfer main loop. I converted it over to use the stack for most typical use cases. Avoiding mallocs in very repeatedly called functions is a good thing.

Recount

Today, the script from above shows that the same “curl localhost” command is down to 80 allocations from the 115 curl 7.53.1 used. Without sacrificing anything really. An easy 26% improvement. Not bad at all!

But okay, since I modified curl_multi_wait() I wanted to also see how it actually improves things for a slightly more advanced transfer. I took the multi-double.c example code, added the call to initiate the memory logging, made it uses curl_multi_wait() and had it download these two URLs in parallel:

http://www.example.com/
http://localhost/512M

The second one being just 512 megabytes of zeroes and the first being a 600 bytes something public html page. Here’s the count-malloc.c code.

First, I brought out 7.53.1 and built the example against that and had the memanalyze script check it:

Mallocs: 33901
Reallocs: 5
Callocs: 24
Strdups: 31
Wcsdups: 0
Frees: 33956
Allocations: 33961
Maximum allocated: 160385

Okay, so it used 160KB of memory totally and it did over 33,900 allocations. But ok, it downloaded over 512 megabytes of data so it makes one malloc per 15KB of data. Good or bad?

Back to git master, the version we call 7.54.1-DEV right now – since we’re not quite sure which version number it’ll become when we release the next release. It can become 7.54.1 or 7.55.0, it has not been determined yet. But I digress, I ran the same modified multi-double.c example again, ran memanalyze on the memory log again and it now reported…

Mallocs: 69
Reallocs: 5
Callocs: 24
Strdups: 31
Wcsdups: 0
Frees: 124
Allocations: 129
Maximum allocated: 153247

I had to look twice. Did I do something wrong? I better run it again just to double-check. The results are the same no matter how many times I run it…

33,961 vs 129

curl_multi_wait() is called a lot of times in a typical transfer, and it had at least one of the memory allocations we normally did during a transfer so removing that single tiny allocation had a pretty dramatic impact on the counter. A normal transfer also moves things in and out of linked lists and hashes a bit, but they too are mostly malloc-less now. Simply put: the remaining allocations are not done in the transfer loop so they’re way less important.

The old curl did 263 times the number of allocations the current does for this example. Or the other way around: the new one does 0.37% the number of allocations the old one did…

As an added bonus, the new one also allocates less memory in total as it decreased that amount by 7KB (4.3%).

Are mallocs important?

In the day and age with many gigabytes of RAM and all, does a few mallocs in a transfer really make a notable difference for mere mortals? What is the impact of 33,832 extra mallocs done for 512MB of data?

To measure what impact these changes have, I decided to compare HTTP transfers from localhost and see if we can see any speed difference. localhost is fine for this test since there’s no network speed limit, but the faster curl is the faster the download will be. The server side will be equally fast/slow since I’ll use the same set for both tests.

I built curl 7.53.1 and curl 7.54.1-DEV identically and ran this command line:

curl http://localhost/80GB -o /dev/null

80 gigabytes downloaded as fast as possible written into the void.

The exact numbers I got for this may not be totally interesting, as it will depend on CPU in the machine, which HTTP server that serves the file and optimization level when I build curl etc. But the relative numbers should still be highly relevant. The old code vs the new.

7.54.1-DEV repeatedly performed 30% faster! The 2200MB/sec in my build of the earlier release increased to over 2900 MB/sec with the current version.

The point here is of course not that it easily can transfer HTTP over 20 Gigabit/sec using a single core on my machine – since there are very few users who actually do that speedy transfers with curl. The point is rather that curl now uses less CPU per byte transferred, which leaves more CPU over to the rest of the system to perform whatever it needs to do. Or to save battery if the device is a portable one.

On the cost of malloc: The 512MB test I did resulted in 33832 more allocations using the old code. The old code transferred HTTP at a rate of about 2200MB/sec. That equals 145,827 mallocs/second – that are now removed! A 600 MB/sec improvement means that curl managed to transfer 4300 bytes extra for each malloc it didn’t do, each second.

Was removing these mallocs hard?

Not at all, it was all straight forward. It is however interesting that there’s still room for changes like this in a project this old. I’ve had this idea for some years and I’m glad I finally took the time to make it happen. Thanks to our test suite I could do this level of “drastic” internal change with a fairly high degree of confidence that I don’t introduce too terrible regressions. Thanks to our APIs being good at hiding internals, this change could be done completely without changing anything for old or new applications.

(Yeah I haven’t shipped the entire change in a release yet so there’s of course a risk that I’ll have to regret my “this was easy” statement…)

Caveats on the numbers

There have been 213 commits in the curl git repo from 7.53.1 till today. There’s a chance one or more other commits than just the pure alloc changes have made a performance impact, even if I can’t think of any.

More?

Are there more “low hanging fruits” to pick here in the similar vein?

Perhaps. We don’t do a lot of performance measurements or comparisons so who knows, we might do more silly things that we could stop doing and do even better. One thing I’ve always wanted to do, but never got around to, was to add daily “monitoring” of memory/mallocs used and how fast curl performs in order to better track when we unknowingly regress in these areas.

Addendum, April 23rd

(Follow-up on some comments on this article that I’ve read on hacker news, Reddit and elsewhere.)

Someone asked and I ran the 80GB download again with ‘time’. Three times each with the old and the new code, and the “middle” run of them showed these timings:

Old code:

real    0m36.705s
user    0m20.176s
sys     0m16.072s

New code:

real    0m29.032s
user    0m12.196s
sys     0m12.820s

The server that hosts this 80GB file is a standard Apache 2.4.25, and the 80GB file is stored on an SSD. The CPU in my machine is a core-i7 3770K 3.50GHz.

Someone also mentioned alloca() as a solution for one of the patches, but alloca() is not portable enough to work as the sole solution, meaning we would have to do ugly #ifdef if we would want to use alloca() there.

curl bug bounty

The curl project is a project driven by volunteers with no financing at all except for a few sponsors who pay for the server hosting and for contributors to work on features and bug fixes on work hours. curl and libcurl are used widely by companies and commercial software so a fair amount of work is done by people during paid work hours.

This said, we don’t have any money in the project. Nada. Zilch. We can’t pay bug bounties or hire people to do specific things for us. We can only ask people or companies to volunteer things or services for us.

This is not a complaint – far from it. It works really well and we have a good stream of contributions, bugs reports and more. We are fortunate enough to make widely used software which gives our project a certain impact in the world.

Bug bounty!

Hacker One coordinates a bug bounty program for flaws that affects “the Internet”, and based on previously paid out bounties, serious flaws in libcurl match that description and can be deemed worthy of bounties. For example, 3000 USD was paid for libcurl: URL request injection (the curl advisory for that flaw) and 1000 USD was paid for libcurl duphandle read out of bounds (the corresponding curl advisory).

I think more flaws in libcurl could’ve met the criteria, but I suspect more people than me haven’t been aware of this possibility for bounties.

I was glad to find out that this bounty program pays out money for libcurl issues and I hope it will motivate people to take an extra look into the inner workings of libcurl and help us improve.

What qualifies?

The bounty program is run and administered completely out of control or insight from the curl project itself and I must underscore that while libcurl issues can qualify, their emphasis is on fixing vulnerabilities in Internet software that have a potentially big impact.

To qualify for this bounty, vulnerabilities must meet the following criteria:

  • Be implementation agnostic: the vulnerability is present in implementations from multiple vendors or a vendor with dominant market share. Do not send vulnerabilities that only impact a single website, product, or project.
  • Be open source: finding manifests itself in at least one popular open source project.

In addition, vulnerabilities should meet most of the following criteria:

  • Be widespread: vulnerability manifests itself across a wide range of products, or impacts a large number of end users.
  • Have critical impact: vulnerability has extreme negative consequences for the general public.
  • Be novel: vulnerability is new or unusual in an interesting way.

If your libcurl security flaw matches this, go ahead and submit your request for a bounty. If you’re at a company using libcurl at scale, consider joining that program as a bounty sponsor!

Yes C is unsafe, but…

I posted curl is C a few days ago and it raced on hacker news, reddit and elsewhere and got well over a thousand comments in those forums alone. The blog post has been read more than 130,000 times so far.

Addendum a few days later

Many commenters of my curl is C post struck down on my claim that most of our security flaws aren’t due to curl being written in C. It turned out into some sort of CVE counting game in some of the threads.

I think that’s missing the point I was trying to make. Even if 75% of them happened due to us using C, that fact alone would still not be a strong enough reason for me to reconsider our language of choice (at this point in time). We use C for a whole range of reasons as I tried to lay out there in spite of the security challenges the language brings. We know C has tricky corners and we know we are likely to do more mistakes going forward.

curl is currently one of the most distributed and most widely used software components in the universe, be it open or proprietary and there are easily way over three billion instances of it running in appliances, servers, computers and devices across the globe. Right now. In your phone. In your car. In your TV. In your computer. Etc.

If we then have had 40, 50 or even 60 security problems because of us using C, through-out our 19 years of history, it really isn’t a whole lot given the scale and time we’re talking about here.

Using another language would’ve caused at least some problems due to that language, plus I feel a need to underscore the fact that none of the memory safe languages anyone would suggest we should switch to have been around for 19 years. A portion of our security bugs were even created in our project before those alternatives you would suggest were available! Let alone as stable and functional alternatives.

This is of course no guarantee that there isn’t still more ugly things to discover or that we won’t mess up royally in the future, but who will throw the first stone when it comes to that? We will continue to work hard on minimizing risks, detecting problems early by ourselves and work closely together with everyone who reports suspected problems to us.

Number of problems as a measurement

The fact that we have 62 CVEs to date (and more will follow surely) is rather a proof that we work hard on fixing bugs, that we have an open process that deals with the problems in the most transparent way we can think of and that people are on their toes looking for these problems. You should not rate a project in any way purely based on the number of CVEs – you really need to investigate what lies behind the numbers if you want to understand and judge the situation.

Future

Let me clarify this too: I can very well imagine a future where we transition to another language or attempt various others things to enhance the project further – security wise and more. I’m not really ruling anything out as I usually only have very vague ideas of what the future might look like. I just don’t expect it to be happening within the next few years.

These “you should switch language” remarks are strangely enough from the backseat drivers of the Internet. Those who can tell us with confidence how to run our project but who don’t actually show us any code.

Languages

What perhaps made me most sad in the aftermath of said previous post, is everyone who failed to hold more than one thought at a time in their heads. In my post I wrote 800 words on some of the reasoning behind us sticking to the language C in the curl project. I specifically did not say that I dislike certain other languages or that any of those alternative languages are bad or should be avoided. Please friends, I wrote about why curl uses C. There are many fine languages out there and you should all use them as much as you possibly can, and I will too – but not in the curl project (at the moment). So no, I don’t hate language XXXX. I didn’t say so, and I didn’t imply it either. Don’t put that label on me, thanks.

curl is C

For some reason, this post got picked up again and is debated today in 2021, almost 4 years since I wrote it. Some things have changed in the mean time and I might’ve phrased a few things differently if I had written this today. But still, what’s here below is what I wrote back then. Enjoy!

Every once in a while someone suggests to me that curl and libcurl would do better if rewritten in a “safe language”. Rust is one such alternative language commonly suggested. This happens especially often when we publish new security vulnerabilities. (Update: I think Rust is a fine language! This post and my stance here has nothing to do with what I think about Rust or other languages, safe or not.)

curl is written in C

The curl code guidelines mandate that we stick to using C89 for any code to be accepted into the repository. C89 (sometimes also called C90) – the oldest possible ANSI C standard. Ancient and conservative.

C is everywhere

This fact has made it possible for projects, companies and people to adopt curl into things using basically any known operating system and whatever CPU architecture you can think of (at least if it was 32bit or larger). No other programming language is as widespread and easily available for everything. This has made curl one of the most portable projects out there and is part of the explanation for curl’s success.

The curl project was also started in the 90s, even long before most of these alternative languages you’d suggest, existed. Heck, for a truly stable project it wouldn’t be responsible to go with a language that isn’t even old enough to start school yet.

Everyone knows C

Perhaps not necessarily true anymore, but at least the knowledge of C is very widespread, where as the current existing alternative languages for sure have more narrow audiences or amount of people that master them.

C is not a safe language

Does writing safe code in C require more carefulness and more “tricks” than writing the same code in a more modern language better designed to be “safe” ? Yes it does. But we’ve done most of that job already and maintaining that level isn’t as hard or troublesome.

We keep scanning the curl code regularly with static code analyzers (we maintain a zero Coverity problems policy) and we run the test suite with valgrind and address sanitizers.

C is not the primary reason for our past vulnerabilities

There. The simple fact is that most of our past vulnerabilities happened because of logical mistakes in the code. Logical mistakes that aren’t really language bound and they would not be fixed simply by changing language.

Of course that leaves a share of problems that could’ve been avoided if we used another language. Buffer overflows, double frees and out of boundary reads etc, but the bulk of our security problems has not happened due to curl being written in C.

C is not a new dependency

It is easy for projects to add a dependency on a library that is written in C since that’s what operating systems and system libraries are written in, still today in 2017. That’s the default. Everyone can build and install such libraries and they’re used and people know how they work.

A library in another language will add that language (and compiler, and debugger and whatever dependencies a libcurl written in that language would need) as a new dependency to a large amount of projects that are themselves written in C or C++ today. Those projects would in many cases downright ignore and reject projects written in “an alternative language”.

curl sits in the boat

In the curl project we’re deliberately conservative and we stick to old standards, to remain a viable and reliable library for everyone. Right now and for the foreseeable future. Things that worked in curl 15 years ago still work like that today. The same way. Users can rely on curl. We stick around. We don’t knee-jerk react to modern trends. We sit still in the boat. We don’t rock it.

Rewriting means adding heaps of bugs

The plain fact, that also isn’t really about languages but is about plain old software engineering: translating or rewriting curl into a new language will introduce a lot of bugs. Bugs that we don’t have today.

Not to mention how rewriting would take a huge effort and a lot of time. That energy can instead today be spent on improving curl further.

What if

If I would start the project today, would I’ve picked another language? Maybe. Maybe not. If memory safety and related issues was the primary concern I had, then sure. But as I’ve mentioned above there are several others concerns too so it would really depend on my priorities.

Finally

At the end of the day the question that remains is: would we gain more than we would pay, and over which time frame? Who would gain and who would lose?

I’m sure that there will be or it may even already exist, curl and libcurl competitors and potent alternatives written in most of these new alternative languages. Some of them are absolutely really good and will get used and reach fame and glory. Some of them will be crap. Just like software always work. Let a thousand curl competitors bloom!

Will curl be rewritten at some point in the future? I won’t rule it out, but I find it unlikely. I find it even more unlikely that it will happen in the short term or within the next few years.

Discuss this post on Hacker news or Reddit!

Followup-post: Yes, C is unsafe, but…

curlup 2017: curl now

At curlup 2017 in Nuremberg, I did a keynote and talked a little about the road to what we are and where we are right now in the curl project. There will hopefully be a recording of this presentation made available soon, but I wanted to entertain you all by also presenting some of the graphs from that presentation in a blog format for easy access and to share the information.

Some stats and numbers from the curl project early 2017. Unless otherwise mentioned, this is based on the availability of data that we have. The git repository has data from December 1999 and we have detailed release information since version 6.0 (September 13, 1999).

Web traffic

First out, web site traffic to curl.haxx.se over the seven last full years that I have stats for. The switch to a HTTPS-only site happened in February 2016. The main explanation to the decrease in spent bandwidth in 2016 is us removing the HTML and PDF versions of all documentation from the release tarballs (October 2016).

My log analyze software also tries to identify “human” traffic so this graph should not include the very large amount of bots and automation that hits our site. In total we serve almost twice the amount of data to “bots” than to human. A large share of those download the cacert.pem file we host.

Since our switch to HTTPS we have a 301 redirect from the HTTP site, and we still suffer from a large number of user-agents hitting us over and over without seemingly following said redirect…

Number of lines in git

Since we also have documentation and related things this isn’t only lines of code. Plain and simply: lines added to files that we have in git, and how the number have increased over time.

There’s one notable dip and one climb and I think they both are related to how we have rearranged documentation and documentation formatting.

Top-4 author’s share

This could also talk about how seriously we suffer from “the bus factor” in this project. Look at how large share of all commits that the top-4 commiters have authored. Not committed; authored. Of course we didn’t have proper separation between authors and committers before git (March 2010).

Interesting to note here is also that the author listed second here is Yang Tse, who hasn’t authored anything since August 2013. Me personally seem to have plateaued at around 57% of all commits during the recent year or two and the top-4 share is slowly decreasing but is still over 80% of the commits.

I hope we can get the top-4 share well below 80% if I rerun this script next year!

Number of authors over time

In comparison to the above graph, I did one that simply counted the total number of unique authors that have contributed a change to git and look at how that number changes over time.

The time before git is, again, somewhat of a lie since we didn’t keep track of authors vs committers properly then so we shouldn’t put too much value into that significant knee we can see on the graph.

To me, the main take away is that in spite of the top-4 graph above, this authors-over-time line is interestingly linear and shows that the vast majority of people who contribute patches only send in one or maybe a couple of changes and then never appear again in the project.

My hope is that this line will continue to climb over the coming years.

Commits per release

We started doing proper git tags for release for curl 6.5. So how many commits have we done between releases ever since? It seems to have gone up and down over time and I added an average number line in this graph which is at about 150 commits per release (and remember that we attempt to do them every 8 weeks since a few years back).

Towards the right we can see the last 20 releases or so showing a pattern of high bar, low bar, and I’ll get to that more in a coming graph.

Of course, counting commits is a rough measurement as they can be big or small, easy or hard, good or bad and this only counts them.

Commits per day

As the release frequency has varied a bit over time I figured I should just check and see how many commits we do in the project per day and see how that has changed (or not) over time. Remember, we are increasing the number of unique authors fairly fast but the top-4 share of “authorship” is fairly stable.

Turns our the number of commits per day has gone up and down a little bit through the git history but I can’t spot any obvious trend here. In recent years we seem to keep up more than 2 commits per day and during intense periods up to 8.

Days per release

Our general plan is since bunch of years back to do releases every 8 weeks like a clock work. 8 weeks is 56 days.

When we run into serious problems, like bugs that are really annoying or tedious to users or if we get a really serious security problem reported, we sometimes decide to go outside of the regular release schedule and ship something before the end of the 8-week cycle.

This graph clearly shows that over the last, say 20, releases we clearly have felt ourselves “forced” to do follow-up releases outside of the regular schedule. The right end of the graph shows a very clear saw-tooth look that proves this.

We’ve also discussed this slightly on the mailing list recently, and I’m certainly willing to go back and listen to people as to what we can do to improve this situation.

Bugfixes per release

We keep close track of all bugfixes done in git and mark them up and mention them in the RELEASE-NOTES document that we ship in every new release.

This makes it possible for us to go back and see how many bug fixes we’ve recorded for each release since curl 6.5. This shows a clear growth over time. It’s interesting since we don’t see this when we count commits, so it may just be attributed to having gotten better at recording the bugs in the files. Or that we now spend fewer commits per bug fix. Hard to tell exactly, but I do enjoy that we fix a lot of bugs…

Days spent per bugfix

Another way to see the data above is to count the number of bug fixes we do over time and just see how many days we need on average to fix bugs.

The last few years we do more bug fixes than there are days so if we keep up the trend this shows for 2017 we might be able to reach down to 0.5 days per bug fix on average. That’d be cool!

Coverity scans

We run coverity scans on the curl cover regularly and this service keeps a little graph for us showing the number of found defects over time. These days we have a policy of never allowing a defect detected by Coverity to linger around. We fix them all and we should have zero detected defects at all times.

The second graph here shows a comparison line with “other projects of comparable size”, indicating that we’re at least not doing badly here.

Vulnerability reports

So in spite of our grand intentions and track record shown above, people keep finding security problems in curl in a higher frequency than every before.

Out of the 24 vulnerabilities reported to the curl project in 2016, 7 was the result of the special security audit that we explicitly asked for, but even if we hadn’t asked for that and they would’ve remained unknown, 17 would still have stood out in this graph.

I do however think that finding – and reporting – security problem is generally more good than bad. The problems these reports have found have generally been around for many years already so this is not a sign of us getting more sloppy in recent years, I take it as a sign that people look for these problems better and report them more often, than before. The industry as a whole looks on security problems and the importance of them differently now than it did years ago.

curl up 2017, the venue

The fist ever physical curl meeting took place this last weekend before curl’s 19th birthday. Today curl turns nineteen years old.

After much work behind the scenes to set this up and arrange everything (and thanks to our awesome sponsors to contributed to this), over twenty eager curl hackers and friends from a handful of countries gathered in a somewhat rough-looking building at curl://up 2017 in Nuremberg, March 18-19 2017.

The venue was in this old factory-like facility but we put up some fancy signs so that people would find it:

Yes, continue around the corner and you’ll find the entrance door for us:

I know, who’d guessed that we would’ve splashed out on this fancy conference center, right? This is the entrance door. Enter and look for the next sign.

Yes, move in here through this door to the right.

And now, up these stairs…

When you’ve come that far, this is basically the view you could experience (before anyone entered the room):

And when Igor Chubin presents about wttr,in and using curl to do console based applications, it looked like this:

It may sound a bit lame to you, but I doubt this would’ve happened at all and it certainly would’ve been less good without our great sponsors who helped us by chipping in what we didn’t want to charge our visitors.

Thank you very much Kippdata, Ergon, Sevenval and Haxx for backing us!

Some curl numbers

We released the 163rd curl release ever today. curl 7.53.0 – approaching 19 years since the first curl release (6914 days to be exact).

It took 61 days since the previous release, during which 47 individuals helped us fix 95 separate bugs. 25 of these contributors were newcomers. In total, we now count more than 1500 individuals credited for their help in the project.

One of those bug-fixes, one was a security vulnerability, upping our total number of vulnerabilities through the years to 62.

Since the previous release, 7.52.1, 155 commits were made to the source repository.

The next curl release, our 164th, is planned to ship in exactly 8 weeks.

Post FOSDEM 2017

I attended FOSDEM again in 2017 and it was as intense, chaotic and wonderful as ever. I met old friends, got new friends and I got to test a whole range of Belgian beers. Oh, and there was also a set of great open source related talks to enjoy!

On Saturday at 2pm I delivered my talk on curl in the main track in the almost frighteningly large room Janson. I estimate that it was almost half full, which would mean upwards 700 people in the audience. The talk itself went well. I got audible responses from the audience several times and I kept well within my given time with time over for questions. The trickiest problem was the audio from the people who asked questions because it wasn’t at all very easy to hear, while the audio is great for the audience and in the video recording. Slightly annoying because as everyone else heard, it made me appear half deaf. Oh well. I got great questions both then and from people approaching me after the talk. The questions and the feedback I get from a talk is really one of the things that makes me appreciate talking the most.

The video of the talk is available, and the slides can also be viewed.

So after I had spent some time discussing curl things and handing out many stickers after my talk, I managed to land in the cafeteria for a while until it was time for me to once again go and perform.

We’re usually a team of friends that hang out during FOSDEM and we all went over to the Mozilla room to be there perhaps 20 minutes before my talk was scheduled and wow, there was a huge crowd outside of that room already waiting by the time we arrived. When the doors then finally opened (about 10 minutes before my talk started), I had to zigzag my way through to get in, and there was a large amount of people who didn’t get in. None of my friends from the cafeteria made it in!

The Mozilla devroom had 363 seats, not a single one was unoccupied and there was people standing along the sides and the back wall. So, an estimated nearly 400 persons in that room saw me speak about HTTP/2 deployments numbers right now, how HTTP/2 doesn’t really work well under 2% packet loss situations and then a bit about how QUIC can solve some of that and what QUIC is and when we might see the first experiments coming with IETF-QUIC – which really isn’t the same as Google-QUIC was.

To be honest, it is hard to deliver a talk in twenty minutes and I  was only 30 seconds over my time. I got questions and after the talk I spent a long time talking with people about HTTP, HTTP/2, QUIC, curl and the future of Internet protocols and transports. Very interesting.

The video of my talk can be seen, and the slides are online too.

I’m not sure if I was just unusually unlucky in my choices, or if there really was more people this year, but I experienced that “FULL” sign more than usual this year.

I fully intend to return again next year. Who knows, maybe I’ll figure out something to talk about then too. See you there?

One URL standard please

Following up on the problem with our current lack of a universal URL standard that I blogged about in May 2016: My URL isn’t your URL. I want a single, unified URL standard that we would all stand behind, support and adhere to.

What triggers me this time, is yet another issue. A friendly curl user sent me this URL:

http://user@example.com:80@daniel.haxx.se

… and pasting this URL into different tools and browsers show that there’s not a wide agreement on how this should work. Is the URL legal in the first place and if so, which host should a client contact?

  • curl treats the ‘@’-character as a separator between userinfo and host name so ‘example.com’ becomes the host name, the port number is 80 followed by rubbish that curl ignores. (wget2, the next-gen wget that’s in development works identically)
  • wget extracts the example.com host name but rejects the port number due to the rubbish after the zero.
  • Edge and Safari say the URL is invalid and don’t go anywhere
  • Firefox and Chrome allow ‘@’ as part of the userinfo, take the ’80’ as a password and the host name then becomes ‘daniel.haxx.se’

The only somewhat modern “spec” for URLs is the WHATWG URL specification. The other major, but now somewhat aged, URL spec is RFC 3986, made by the IETF and published in 2005.

In 2015, URL problem statement and directions was published as an Internet-draft by Masinter and Ruby and it brings up most of the current URL spec problems. Some of them are also discussed in Ruby’s WHATWG URL vs IETF URI post from 2014.

What I would like to see happen…

Which group? A group!

Friends I know in the WHATWG suggest that I should dig in there and help them improve their spec. That would be a good idea if fixing the WHATWG spec would be the ultimate goal. I don’t think it is enough.

The WHATWG is highly browser focused and my interactions with members of that group that I have had in the past, have shown that there is little sympathy there for non-browsers who want to deal with URLs and there is even less sympathy or interest for URL schemes that the popular browsers don’t even support or care about. URLs cover much more than HTTP(S).

I have the feeling that WHATWG people would not like this work to be done within the IETF and vice versa. Since I’d like buy-in from both camps, and any other camps that might have an interest in URLs, this would need to be handled somehow.

It would also be great to get other major URL “consumers” on board, like authors of popular URL parsing libraries, tools and components.

Such a URL group would of course have to agree on the goal and how to get there, but I’ll still provide some additional things I want to see.

Update: I want to emphasize that I do not consider the WHATWG’s job bad, wrong or lost. I think they’ve done a great job at unifying browsers’ treatment of URLs. I don’t mean to belittle that. I just know that this group is only a small subset of the people who probably should be involved in a unified URL standard.

A single fixed spec

I can’t see any compelling reasons why a URL specification couldn’t reach a stable state and get published as *the* URL standard. The “living standard” approach may be fine for certain things (and in particular browsers that update every six weeks), but URLs are supposed to be long-lived and inter-operate far into the future so they really really should not change. Therefore, I think the IETF documentation model could work well for this.

The WHATWG spec documents what browsers do, and browsers do what is documented. At least that’s the theory I’ve been told, and it causes a spinning and never-ending loop that goes against my wish.

Document the format

The WHATWG specification is written in a pseudo code style, describing how a parser would “walk” over the string with a state machine and all. I know some people like that, I find it utterly annoying and really hard to figure out what’s allowed or not. I much more prefer the regular RFC style of describing protocol syntax.

IDNA

Can we please just say that host names in URLs should be handled according to IDNA2008 (RFC 5895)? WHATWG URL doesn’t state any IDNA spec number at all.

Move out irrelevant sections

“Irrelevant” when it comes to documenting the URL format that is. The WHATWG details several things that are related to URL for browsers but are mostly irrelevant to other URL consumers or producers. Like section “5. application/x-www-form-urlencoded” and “6. API”.

They would be better placed in a “URL considerations for browsers” companion document.

Working doesn’t imply sensible

So browsers accept URLs written with thousands of forward slashes instead of two. That is not a good reason for the spec to say that a URL may legitimately contain a thousand slashes. I’m totally convinced there’s no critical content anywhere using such formatted URLs and no soul will be sad if we’d restricted the number to a single-digit. So we should. And yeah, then browsers should reject URLs using more.

The slashes are only an example. The browsers have used a “liberal in what you accept” policy for a lot of things since forever, but we must resist to use that as a basis when nailing down a standard.

The odds of this happening soon?

I know there are individuals interested in seeing the URL situation getting worked on. We’ve seen articles and internet-drafts posted on the issue several times the last few years. Any year now I think we will see some movement for real trying to fix this. I hope I will manage to participate and contribute a little from my end.