Bug finding is slow in spite of many eyeballs

“given enough eyeballs, all bugs are shallow”

The saying (also known as Linus’ law) doesn’t say that the bugs are found fast and neither does it say who finds them. My version of the law would be much more cynical, something like: “eventually, bugs are found“, emphasizing the ‘eventually’ part.

(Jim Zemlin apparently said the other day that it can work the Linus way, if we just fund the eyeballs to watch. I don’t think that’s the way the saying originally intended.)

Because in reality, many many bugs are never really found by all those given “eyeballs” in the first place. They are found when someone trips over a problem and is annoyed enough to go searching for the culprit, the reason for the malfunction. Even if the code is open and has been around for years it doesn’t necessarily mean that any of all the people who casually read the code or single-stepped over it will actually ever discover the flaws in the logic. The last few years several world-shaking bugs turned out to have existed for decades until discovered. In code that had been read by lots of people – over and over.

So sure, in the end the bugs were found and fixed. I would argue though that it wasn’t because the projects or problems were given enough eyeballs. Some of those problems were found in extremely popular and widely used projects. They were found because eventually someone accidentally ran into a problem and started digging for the reason.

Time until discovery in the curl project

I decided to see how it looks in the curl project. A project near and dear to me. To take it up a notch, we’ll look only at security flaws. Not only because they are the probably most important bugs we’ve had but also because those are the ones we have the most carefully noted meta-data for. Like when they were reported, when they were introduced and when they were fixed.

We have no less than 30 logged vulnerabilities for curl and libcurl so far through-out our history, spread out over the past 16 years. I’ve spent some time going through them to see if there’s a pattern or something that sticks out that we should put some extra attention to in order to improve our processes and code. While doing this I gathered some random info about what we’ve found so far.

On average, each security problem had been present in the code for 2100 days when fixed – that’s more than five and a half years. On average! That means they survived about 30 releases each. If bugs truly are shallow, it is still certainly not a fast processes.

Perhaps you think these 30 bugs are really tricky, deeply hidden and complicated logic monsters that would explain the time they took to get found? Nope, I would say that every single one of them are pretty obvious once you spot them and none of them take a very long time for a reviewer to understand.

Vulnerability ages

This first graph (click it for the large version) shows the period each problem remained in the code for the 30 different problems, in number of days. The leftmost bar is the most recent flaw and the bar on the right the oldest vulnerability. The red line shows the trend and the green is the average.

The trend is clearly that the bugs are around longer before they are found, but since the project is also growing older all the time it sort of comes naturally and isn’t necessarily a sign of us getting worse at finding them. The average age of flaws is aging slower than the project itself.

Reports per year

How have the reports been distributed over the years? We have a  fairly linear increase in number of lines of code but yet the reports were submitted like this (now it goes from oldest to the left and most recent on the right – click for the large version):

vuln-trend

Compare that to this chart below over lines of code added in the project (chart from openhub and shows blanks in green, comments in grey and code in blue, click it for the large version):

curl source code growth

We received twice as many security reports in 2014 as in 2013 and we got half of all our reports during the last two years. Clearly we have gotten more eyes on the code or perhaps users pay more attention to problems or are generally more likely to see the security angle of problems? It is hard to say but clearly the frequency of security reports has increased a lot lately. (Note that I here count the report year, not the year we announced the particular problems, as they sometimes were done on the following year if the report happened late in the year.)

On average, we publish information about a found flaw 19 days after it was reported to us. We seem to have became slightly worse at this over time, the last two years the average has been 25 days.

Did people find the problems by reading code?

In general, no. Sure people read code but the typical pattern seems to be that people run into some sort of problem first, then dive in to investigate the root of it and then eventually they spot or learn about the security problem.

(This conclusion is based on my understanding from how people have reported the problems, I have not explicitly asked them about these details.)

Common patterns among the problems?

I went over the bugs and marked them with a bunch of descriptive keywords for each flaw, and then I wrote up a script to see how the frequent the keywords are used. This turned out to describe the flaws more than how they ended up in the code. Out of the 30 flaws, the 10 most used keywords ended up like this, showing number of flaws and the keyword:

9 TLS
9 HTTP
8 cert-check
8 buffer-overflow

6 info-leak
3 URL-parsing
3 openssl
3 NTLM
3 http-headers
3 cookie

I don’t think it is surprising that TLS, HTTP or certificate checking are common areas of security problems. TLS and certs are complicated, HTTP is huge and not easy to get right. curl is mostly C so buffer overflows is a mistake that sneaks in, and I don’t think 27% of the problems tells us that this is a problem we need to handle better. Also, only 2 of the last 15 flaws (13%) were buffer overflows.

The discussion following this blog post is on hacker news.

Tightening Firefox’s HTTP framing – again

An old http1.1 frameCall me crazy, but I’m at it again. First a little resume from our previous episodes in this exciting saga:

Chapter 1: I closed the 10+ year old bug that made the Firefox download manager not detect failed downloads, simply because Firefox didn’t care if the HTTP 1.1 Content-Length was larger than what was actually saved – after the connection potentially was cut off for example. There were additional details, but that was the bigger part.

Chapter 2: After having been included all the way to public release, we got a whole slew of bug reports immediately when Firefox 33 shipped and we had to revert parts of the fix I did.

Chapter 3.

Will it land before it turns 11 years old? The bug was originally submitted 2004-03-16.

Since chapter two of this drama brought back the original bugs again we still have to do something about them. I fully understand if not that many readers of this can even keep up of all this back and forth and juggling of HTTP protocol details, but this time we’re putting back the stricter frame checks with a few extra conditions to allow a few violations to remain but detect and react on others!

Here’s how I addressed this issue. I wanted to make the checks stricter but still allow some common protocol violations.

In particular I needed to allow two particular flaws that have proven to be somewhat common in the wild and were the reasons for the previous fix being backed out again:

A – HTTP chunk-encoded responses that lack the final 0-sized chunk.

B – HTTP gzipped responses where the Content-Length is not the same as the actual contents.

So, in order to allow A + B and yet be able to detect prematurely cut off transfers I decided to:

  1. Detect incomplete chunks then the transfer has ended. So, if a chunk-encoded transfer ends on exactly a chunk boundary we consider that fine. Good: This will allow case (A) to be considered fine. Bad: It will make us not detect a certain amount of cut-offs.
  2. When receiving a gzipped response, we consider a gzip stream that doesn’t end fine according to the gzip decompressing state machine to be a partial transfer. IOW: if a gzipped transfer ends fine according to the decompressor, we do not check for size misalignment. This allows case (B) as long as the content could be decoded.
  3. When receiving HTTP that isn’t content-encoded/compressed (like in case 2) and not chunked (like in case 1), perform the size comparison between Content-Length: and the actual size received and consider a mismatch to mean a NS_ERROR_NET_PARTIAL_TRANSFER error.

Firefox BallPrefs

When my first fix was backed out, it was actually not removed but was just put behind a config string (pref as we call it) named “network.http.enforce-framing.http1“. If you set that to true, Firefox will behave as it did with my original fix applied. It makes the HTTP1.1 framing fairly strict and standard compliant. In order to not mess with that setting that now has been around for a while (and I’ve also had it set to true for a while in my browser and I have not seen any problems with doing it this way), I decided to introduce my new changes pref’ed behind a separate variable.

network.http.enforce-framing.soft” is the new pref that is set to true by default with my patch. It will make Firefox do the detections outlined in 1 – 3 and setting it to false will disable those checks again.

Now I only hope there won’t ever be any chapter 4 in this story… If things go well, this will appear in Firefox 38.

Chromium

But how do they solve these problems in the Chromium project? They have slightly different heuristics (with the small disclaimer that I haven’t read their code for this in a while so details may have changed). First of all, they do not allow a missing final 0-chunk. Then, they basically allow any sort of misaligned size when the content is gzipped.

Update: this patch was subsequently backed out again due to several bug reports about it. I have yet to analyze exactly what went wrong.

HTTP/2 is at 5%

http2 logoHere follow some numbers extracted from my recent HTTP/2 presentation.

First: HTTP/2 is not finalized yet and it is not yet in RFC status, even though things are progressing nicely within the IETF. With some luck we reach RFC status within Q1 this year.

On January 13th 2015, Firefox 35 was released with HTTP/2 enabled by default. Firefox was already running it enabled before that in beta and development versions.

Chrome has also been sporting HTTP/2 support in development versions since many moths back where it could easily be manually enabled. Chrome 40 was the first main release shipped with HTTP/2 enabled by default, but it has so far only been enabled for a very small fraction of the user-base.

On January 28th 2015, Google reported to me by email that they saw HTTP/2 being used in 5% of their global traffic (que all relevant disclaimers that this is not statistically safe numbers). This, close after a shaky period with Google having had their HTTP/2 services disabled through parts of the Christmas holidays (due to bugs) – and as explained above, there’s been no time for any mainstream browser to use HTTP/2 by default for very long!

Further data points: Mozilla collects telemetry data from Firefox users who opted-in to it, and it collects numbers on “HTTP Protocol Version Used on Response”. On February 10, it reports that Firefox 35 users have got their responses to report HTTP/2 in 9% of all responses (out of more than 340 billion reported responses). The Telemetry for Firefox Nightly 38 even reports HTTP/2 in 14% of all responses (based on a much smaller sample collection), which I guess could very well be because users on such a bleeding edge version are more experimental by nature.

In these Firefox stats we see that recently, the number of HTTP/2 responses outnumber the HTTP/1.0 responses 9 to 1.

Changing networks with Linux

A rather long time ago I blogged about my work to better deal with changing networks while Firefox is running, and the change was then pushed for Android and I subsequently pushed the same functionality for Firefox on Mac.

Today I’ve landed yet another change, which detects network changes on Firefox OS and Linux.

Firefox Nightly screenshotAs Firefox OS uses a Linux kernel, I ended up doing the same fix for both the Firefox OS devices as for Firefox on Linux desktop: I open a socket in the AF_NETLINK family and listen on the stream of messages the kernel sends when there are network updates. This way we’re told when the routing tables update or when we get a new IP address etc. I consider this way better than the NotifyIpInterfaceChange() API Windows provides, as this allows us to filter what we’re interested in. The windows API makes that rather complicated and in fact a lot of the times when we get the notification on windows it isn’t clear to me why!

The Mac API way is what I would consider even more obscure, but then I’m not at all used to their way of doing things and how you add things to the event handlers etc.

The journey to the landing of this particular patch was once again long and bumpy and full of sweat in this tradition that seem seems to be my destiny, and this time I ran into problems with the Firefox OS emulator which seems to have some interesting bugs that cause my code to not work properly and as a result of that our automated tests failed: occasionally data sent over a pipe or socketpair doesn’t end up in the receiving end. In my case this means that my signal to the child thread to die would sometimes not be noticed and thus the thread wouldn’t exit and die as intended.

I ended up implementing a work-around that makes it work even if the emulator eats the data by also checking a shared should-I-shutdown-now flag every once in a while. For more specific details on that, see the bug.

http2 explained 1.8

I’ve been updating my “http2 explained” document every nohttp2 logow and then since my original release of it back in April 2014. Today I put up version 1.8 which is one of the bigger updates in a while:

http2 explained

The HTTP/2 Last Call within the IETF ended yesterday and the wire format of the protocol has remained fixed for quite some time now so it seemed like a good moment.

I updated some graphs and images to make them look better and be more personal, I added some new short sections in 8.4 and I refreshed the language in several places. Also, now all links mentioned in footnotes and elsewhere should be properly clickable to make following them a more pleasant experience. And page numbers!

As always, do let me know if you find errors, have questions on the content or think I should add something!

My talks at FOSDEM 2015

fosdem

Sunday 13:00, embedded room (Lameere)

Tile: Internet all the things – using curl in your device

Embedded devices are very often network connected these days. Network connected embedded devices often need to transfer data to and from them as clients, using one or more of the popular internet protocols.

libcurl is the world’s most used and most popular internet transfer library, already used in every imaginable sort of embedded device out there. How did this happen and how do you use libcurl to transfer data to or from your device?

Note that this talk was originally scheduled to be at a different time!

Sunday, 09:00 Mozilla room (UD2.218A)

Title: HTTP/2 right now

HTTP/2 is the new version of the web’s most important and used protocol. Version 2 is due to be out very soon after FOSDEM and I want to inform the audience about what’s going on with the protocol, why it matters to most web developers and users and not the last what its status is at the time of FOSDEM.

My first year at Mozilla

January 13th 2014 I started my fiMozilla dinosaur head logorst day at Mozilla. One year ago exactly today.

It still feels like it was just a very short while ago and I keep having this sense of being a beginner at the company, in the source tree and all over.

One year of networking code work that really at least during periods has not progressed as quickly as I would’ve wished for, and I’ve had some really hair-tearing problems and challenges that have taken me sweat and tears to get through. But I am getting through and I’m enjoying every (oh well, let’s say almost every) moment.

During the year I’ve had the chance to meetup with my team mates twice (in Paris and in Portland) and I’ve managed to attend one IETF (in London) and two special HTTP2 design meetings (in London and NYC).

openhub.net counts 47 commits by me in Firefox and that feels like counting high. bugzilla has tracked activity by me in 107 bug reports through the year.

I’ve barely started. I’ll spend the next year as well improving Firefox networking, hopefully with a higher turnout this year. (I don’t mean to make this sound as if Firefox networking is just me, I’m just speaking for my particular part of the networking team and effort and I let the others speak for themselves!)

Onwards and upwards!

curl, open source and networking