I did this 50 minute talk on May 21 2015 for a Swedish company. With tongue in cheek subtitled "from hobby to world domination". I think it turned out pretty decent and covers what the project is, how we work on it and what I do to make it run. Some of the questions are not easy to hear but in general it works out fine. Enjoy!
"given enough eyeballs, all bugs are shallow"
The saying (also known as Linus' law) doesn't say that the bugs are found fast and neither does it say who finds them. My version of the law would be much more cynical, something like: "eventually, bugs are found", emphasizing the 'eventually' part.
(Jim Zemlin apparently said the other day that it can work the Linus way, if we just fund the eyeballs to watch. I don't think that's what the saying originally intended.)
Because in reality, many many bugs are never really found by all those given "eyeballs" in the first place. They are found when someone trips over a problem and is annoyed enough to go searching for the culprit, the reason for the malfunction. Even if the code is open and has been around for years it doesn't necessarily mean that any of all the people who casually read the code or single-stepped over it will actually ever discover the flaws in the logic. The last few years several world-shaking bugs turned out to have existed for decades until discovered. In code that had been read by lots of people - over and over.
So sure, in the end the bugs were found and fixed. I would argue though that it wasn't because the projects or problems were given enough eyeballs. Some of those problems were found in extremely popular and widely used projects. They were found because eventually someone accidentally ran into a problem and started digging for the reason.
Time until discovery in the curl project
I decided to see how it looks in the curl project. A project near and dear to me. To take it up a notch, we'll look only at security flaws. Not only because they are probably the most important bugs we've had but also because those are the ones we have the most carefully noted meta-data for. Like when they were reported, when they were introduced and when they were fixed.
We have no less than 30 logged vulnerabilities for curl and libcurl so far throughout our history, spread out over the past 16 years. I've spent some time going through them to see if there's a pattern or something that sticks out that we should put some extra attention to in order to improve our processes and code. While doing this I gathered some random info about what we've found so far.
On average, each security problem had been present in the code for 2100 days when fixed - that's more than five and a half years. On average! That means they survived about 30 releases each. If bugs truly are shallow, it is still certainly not a fast process.
Perhaps you think these 30 bugs are really tricky, deeply hidden and complicated logic monsters that would explain the time they took to get found? Nope, I would say that every single one of them is pretty obvious once you spot it and none of them takes a very long time for a reviewer to understand.
This first graph shows the period each problem remained in the code for the 30 different problems, in number of days. The leftmost bar is the most recent flaw and the bar on the right the oldest vulnerability. The red line shows the trend and the green is the average.
The trend is clearly that the bugs are around longer before they are found, but since the project is also growing older all the time it sort of comes naturally and isn't necessarily a sign of us getting worse at finding them. The average age of the flaws grows more slowly than the age of the project itself.
Reports per year
How have the reports been distributed over the years? We have a fairly linear increase in number of lines of code, yet the reports were submitted like this (now it goes from oldest on the left to most recent on the right):
Compare that to this chart below over lines of code added in the project (chart from openhub; it shows blanks in green, comments in grey and code in blue):
We received twice as many security reports in 2014 as in 2013 and we got half of all our reports during the last two years. Clearly we have gotten more eyes on the code or perhaps users pay more attention to problems or are generally more likely to see the security angle of problems? It is hard to say but clearly the frequency of security reports has increased a lot lately. (Note that I here count the report year, not the year we announced the particular problems, as they were sometimes announced in the following year if the report happened late in the year.)
On average, we publish information about a found flaw 19 days after it was reported to us. We seem to have become slightly worse at this over time; the last two years the average has been 25 days.
Did people find the problems by reading code?
In general, no. Sure people read code but the typical pattern seems to be that people run into some sort of problem first, then dive in to investigate the root of it and then eventually they spot or learn about the security problem.
(This conclusion is based on my understanding from how people have reported the problems, I have not explicitly asked them about these details.)
Common patterns among the problems?
I went over the bugs and marked them with a bunch of descriptive keywords for each flaw, and then I wrote up a script to see how frequently the keywords are used. This turned out to describe the flaws more than how they ended up in the code. Out of the 30 flaws, the 10 most used keywords ended up like this, showing number of flaws and the keyword:
I don't think it is surprising that TLS, HTTP or certificate checking are common areas of security problems. TLS and certs are complicated, HTTP is huge and not easy to get right. curl is mostly C so buffer overflows are a mistake that sneaks in, and I don't think 27% of the problems tells us that this is a problem we need to handle better. Also, only 2 of the last 15 flaws (13%) were buffer overflows.
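A minimal sketch of what such a keyword-counting script could look like, in Python. This is not the original script: the function names and the sample tags in `flaw_keywords` are made up for illustration and are not the real curl data.

```python
from collections import Counter

# Hypothetical tagging data: one list of keywords per flaw.
# These tags are invented for illustration, not the real curl flaw list.
flaw_keywords = [
    ["tls", "certificate-check"],
    ["http", "buffer-overflow"],
    ["tls"],
    ["http", "cookie"],
]


def keyword_frequency(flaws):
    """Count in how many flaws each keyword appears."""
    counts = Counter()
    for keywords in flaws:
        # set() so a keyword listed twice on one flaw only counts once
        counts.update(set(keywords))
    return counts


def top_keywords(flaws, n=10):
    """Return the n most used keywords as (keyword, count, percent) tuples."""
    total = len(flaws)
    return [
        (keyword, count, round(100 * count / total))
        for keyword, count in keyword_frequency(flaws).most_common(n)
    ]


if __name__ == "__main__":
    # Print "number of flaws, share, keyword" per line
    for keyword, count, percent in top_keywords(flaw_keywords):
        print(f"{count:3d} {percent:3d}% {keyword}")
```

The percentage is taken against the number of flaws, not the number of tags, so a keyword on 8 of 30 flaws comes out as 27%.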
A while ago I wrote about my hunt for a new keyboard, and in my follow-up conversations with friends around that subject I quickly came to the conclusion I should get myself better analysis and data on how I actually use a keyboard and the individual keys on it. And if you know me, you know I like (useless) statistics.
So, I tried out the popular and widely used Linux key-logger software 'logkeys' and immediately figured out that it doesn't really support the precision and detail level I wanted so I forked the project and modified the code to work the way I want it: keyfreq was born. Code on github. (I forked it because I couldn't find any way to send back my modifications to the upstream project, I don't really feel a need for another project.)
Then I fired up the logging process and it has been running in the background for a while now, logging every key stroke with a time stamp.
Counting key frequency and how it gets distributed very quickly turns into basically seeing when I'm active in front of the computer and it also gave me thoughts around what a high key frequency actually means in terms of activity and productivity. Does a really high key frequency really mean that I was working intensely or isn't that perhaps more a sign of mail sending time? When I debug problems or research details, won't those periods result in slower key activity?
In the end I guess that over time, the key frequency chart basically says that if I have pressed a lot of keys during a period, I was working on something then. Hours or days with a very low average key frequency are probably times when I don't work as much.
The weekend key frequency is bound to be slightly wrong due to me sometimes doing weekend hacking on other computers where I don't log the keys since my results are recorded from a single specific keyboard only.
So what did I learn? Here are some conclusions and results from 1276614 keystrokes done over a period of the most recent 52 calendar days.
I have a 105-key keyboard, but during this period I only pressed 90 unique keys. Out of the 90 keys I pressed, 3 were pressed more than 5% of the time - each. In fact, those 3 keys are more than 20% of all keystrokes. Those keys are: <Space>, <Backspace> and the letter 'e'.
<Space> stands out from all the rest as it has been used more than 10%.
Only 29 keys were used more than 1% of the presses, giving this a really long tail with lots of keys hardly ever used.
Over this logged time, I have registered key strokes during 46% of all hours. Counting only the hours in which I actually used the keyboard, the average number of key strokes was 2185/hour, 36 keys/minute.
The average week day (excluding weekend days), I registered 32486 key presses. The most active single minute during this logging period, I hit 405 keys. The most active single hour I managed to do 7937 key presses. During weekends my activity is much lower, and then I average at 5778 keys/day (7.2% of all activity was on weekends).
When counting most active hours over the day, there are 14 hours that have more than 1% activity and there are 5 with less than 1%, leaving 5 hours with no keyboard activity at all (02:00-06:59). Interestingly, the hour between 23 and 24 at night is the single most busy hour for me, with 12.5% of all keypresses during the period.
Longest contiguous time without keys: 26.4 hours
Longest key sequence without backspace: 946
There are 7 keys I only pressed once during this period; 4 of them are on the numerical keypad and the other three are F10, F3 and <Pause>.
I'll try to keep the logging going and see if things change over time, or if patterns show up later in the data when it is looked at over a longer period.
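For the curious, here is a rough Python sketch of how numbers like the per-key share and the active-hour figures can be derived from a list of time-stamped key presses. It is not the keyfreq code: the in-memory `events` format, the function names and the sample data are assumptions made for this example.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample: (timestamp, key name) pairs, invented for illustration.
events = [
    (datetime(2015, 1, 5, 9, 0, 1), "<Space>"),
    (datetime(2015, 1, 5, 9, 0, 2), "e"),
    (datetime(2015, 1, 5, 9, 30, 0), "<Space>"),
    (datetime(2015, 1, 5, 11, 15, 0), "<Backspace>"),
]


def key_shares(events):
    """Map each key to its share, in percent, of all key presses."""
    counts = Counter(key for _, key in events)
    total = sum(counts.values())
    return {key: 100 * n / total for key, n in counts.items()}


def active_hour_stats(events, logged_hours):
    """Return (percent of logged hours with any key press,
    average number of presses per active hour)."""
    # Truncate every timestamp to its hour and count presses per hour
    per_hour = Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts, _ in events
    )
    active = len(per_hour)
    if active == 0:
        return 0.0, 0.0
    return 100 * active / logged_hours, sum(per_hour.values()) / active


if __name__ == "__main__":
    for key, share in sorted(key_shares(events).items(), key=lambda kv: -kv[1]):
        print(f"{share:5.1f}% {key}")
    print(active_hour_stats(events, logged_hours=4))
```

Dividing the per-hour average by 60 gives the keys/minute figure, which is how 2185/hour turns into roughly 36 keys/minute.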
On October 16th, I visited DSV at Stockholm University where I had the pleasure of holding a talk and discussion with students (and a few teachers) under the topic Contribute to Open Source. Around 30 persons attended.
Here are the slides I used. As usual they possibly don't tell the whole story stand-alone without the talk, but there was no recording made and I talked in Swedish anyway...
I'm officially not taking part in anything related to Rockbox anymore. I've unsubscribed and I'm out.
In the fall of 2001, my friend Linus and my brother Björn had both bought the portable Archos Player, a harddrive based mp3 player, and slightly underwhelmed by its firmware they decided they would have a go at trying to improve it. All three of us had been working with embedded systems for many years already and I was immediately attracted to the idea of reverse engineering this kind of device and trying to improve it. It sounded like a blast to me.
In December 2001 we had the first test program actually running on the device and flashing an LED. The first little step of what would become a rather big effort. We wrote a GPLed mp3 player firmware replacement, entirely from scratch without re-using any original parts. A full home-grown tiny multitasking operating system with a UI.
Fast-forwarding through history: we managed to get a really good firmware done for the early Archos players and we managed to move on to follow-up mp3 players too. After a decade or so, we supported well over 60 different mp3 player models, we played every music format known to man and we usually had better battery life than the original firmwares. We could run doom and we had a video player, a plugin system and a system full of crazy things.
We gathered large amounts of skilled and intelligent hackers from all over the world who contributed to make this possible. We had yearly meetups, or developer conferences, and we hung out on IRC every day of the week. I still hang out on our off-topic IRC channel!
Over time, smart phones emerged as the preferred devices people would use to play music while on the go. We ported Rockbox over to Android as an app, but our pixel-based UI was never really suitable for the flexible Android world and I also think that most contributors were more interested in hacking devices than writing Android apps. The app never really attracted many users or developers so while functional it never "took off".
mp3 players are now already a thing of the past and will soon fall into the cave of forgotten old things our children will never even know or care about.
Developers and users of Rockbox have mostly moved on to other ventures. I too stopped actually contributing to the project several years ago but I was running build clients for a long while and I've kept being subscribed to the development mailing list. Until now. I'm now finally cutting off the last rope. Good bye Rockbox, it was fun while it lasted. I had a massive amount of great fun and I learned a lot while in the project.
I maintain curl and lead the development there. This is how I spend my time on an ordinary day in the project. Maybe I don't do all of these things every single day, but sometimes I do and sometimes I just do a subset of them. I just want to give you a look into what I do and why I don't add new stuff more often or faster... I spend about one to three hours on the project every day. Let me also stress that curl is a tiny little project in comparison with many other open source projects. I'm certainly not saying otherwise.
the new bug
Someone submits a new bug in the bug tracker or on one of the mailing lists. Most initial bug reports lack sufficient details so the first thing I do is ask for more info and possibly ask the submitter to try a recent version as very often we get bugs reported on very old versions. Many bug reports take several demands for more info before the necessary details have been provided. I don't really start to investigate a problem until I feel I have a sufficient amount of details. We're a very small core team that acts on other people's bugs.
the question by a newbie in the project
A new person shows up with a question. The question is usually similar to a FAQ entry or an example but not exactly. It deserves a proper response. This kind of question can often be answered by anyone, but also most people involved in the project don't feel the need or "familiarity" to respond to such questions and therefore remain quiet.
the old mail I haven't responded to yet
I want every serious email that reaches the mailing lists to get a response, so all mails that neither I nor anyone else responds to I keep around in my inbox and when I have idle time over I go back and catch up on old mails. Some of them can then of course result in a new bug or patch or whatever. Occasionally I have to resort to simply saving away the old mail without responding in order to catch up, just to cut the list of outstanding things to do a little.
the TODO list for my own sake, things I'd like to get working on
There are always things I really want to see done in the project, and I work on them far too little really. But every once in a while I ignore everything else in my life for a couple of hours and spend them on adding a new feature or fixing something I've been missing. Actual development of new features is a very small fraction of all time I spend on this project.
the list of open bug reports
I regularly revisit this list to see what I can do to push the open ones forward. Follow-up questions, deep dives into source code and specifications or just the sad realization that a particular issue won't be fixed within the nearest time (year?) so that I close it as "future" and add the problem to our KNOWN_BUGS document. I strive to keep the bug list clean and only keep relevant bugs open. Those issues that are not reproducible, are left without the proper attention from the reporter or otherwise stall will get closed. In general I feel quite lonely as responder in the bug tracker...
the mailing list threads that are sort of dying but I do want some progress or feedback on
In my primary email inbox I usually keep ongoing threads around. Lots of discussions just silently stop getting more posts and thus slowly wither away further up the list to become forgotten and ignored. With some interval I go back to see if the posters are still around, if there's any more feedback or whatever in order to figure out how to proceed with the subject. Very often this makes me get nothing at all back and instead I just save away the entire conversation thread, forget about it and move on.
the blog post I want to do about a recent change or fix I did I'd like to highlight
I try to explain some changes to the world in blog posts. Not all changes but the ones that are somehow noteworthy as they perhaps change the way things have been or introduce new fun features perhaps not that easily spotted. Of course all features are always documented etc, but sometimes I feel I need to put some extra attention and focus on things in a more free-form style. Or I just write about meta stuff, like this very posting.
the reviewing and merging of patches
One of the most important tasks I have is to review patches. I'm basically the only person in the project who volunteers to review patches against any angle or corner of the project. When people have spent time and effort and gallantly send the results of their labor our way in the best possible format (a patch!), the submitter deserves a good review and proper feedback. Also, paving the road for more patches is one of the best ways to scale the project. Helping newcomers become productive is important.
Patches are preferably posted on the mailing lists but there's also some coming in via pull requests on github and while I strongly discourage that (due to them not getting the same attention and possible scrutiny on the list like the others) I sometimes let them through anyway just to be smooth.
When the patch looks good (or sometimes good enough and I just edit some minor detail), I merge it.
the non-disclosed discussions about a potential security problem
We're a small project with a wide reach and security problems can potentially have grave impact on users. We take security seriously, and we very often have at least one non-public discussion going on about a problem in curl that may have security implications. We then often work on phrasing security advisories, working out exactly which versions are vulnerable, producing patches for at least the most recent ones of those affected versions and so on.
stackoverflow.com has become almost like a wikipedia for source code and programming related issues (although it isn't a wiki), and that site is one of the primary referrers to curl's web site these days. I tend to glance over the curl and libcurl related questions and offer my answers at times. If nothing else, it is good to help keep the amount of disinformation at low levels.
I strongly disapprove of people filing bug reports on such places or even very detailed (lib)curl core questions that should've been asked on the curl-library list.
there are idle times too
Yeah. Not very often, but sometimes I actually just need a day off all this. Sometimes I just don't find motivation or energy enough to dig into that terrible seldom-happening bug on a platform I've never seen personally. A project like this never ends. The same day we release a new release, we just reset our clocks and we're back on improving curl, fixing bugs and cleaning up things for the next release. Forever and ever until the end of time.
Hey, when I just built my own Firefox OS (b2g) image for my Firefox OS Tablet (flatfish) straight from the latest sources, I ran into this (known) problem:
Can't find necessary file(s) of Bluedroid in the backup-flatfish folder. Please update the system image for supporting Bluedroid (Bug-986314), so that the needed binary files can be extracted from your flatfish device.
So, as I struggled to figure out the exact instructions on how to proceed from this, I figured I should jot down what I did in the hopes that it perhaps will help a fellow hacker at some point:
- Download the 3 *.img files from the dropbox site that is referenced from bug 986314.
- Download the flash-flatfish.sh script from the same dropbox place
- Make sure you have 'fastboot' installed (I'm mentioning this here because it turned out I didn't and yet I have already built and flashed my Flame phone successfully without having it). "apt-get install android-tools-fastboot" solved it for me. Note that if it isn't installed, the flash-flatfish.sh script will claim that the device is not in fastboot mode and stop with an error message saying so.
- Finally: run the script "./flash-flatfish.sh [dir with the 3 .img files]"
- Once it has succeeded, the tablet reboots
- Remove the backup-flatfish directory in the build dir.
- Restart the flatfish build again and now it should get past that Bluedroid nit
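Since a missing 'fastboot' produces a misleading error from the flash script, a tiny pre-flight check can save some head scratching. The following Python sketch only looks for the tool on PATH before you run the script. The helper name `missing_tools` is made up for this example, and it does not talk to the device at all.

```python
import shutil


def missing_tools(tools=("fastboot",)):
    """Return the tools from the given list that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
        # This is the package that solved it in the steps above
        print('Try: "apt-get install android-tools-fastboot"')
    else:
        print("fastboot found on PATH")
```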
When we receive patches, improvements, suggestions, advice and whatever that lead to a change in curl or libcurl, I make an effort to log the contributor's name in association with that change. Ideally, I add a line in the commit message. We use "Reported-by: <full name>" quite frequently but also other forms of "...-by: <full name>" too like when there was an original patch by someone or testing and similar. It shouldn't matter what the nature of the contribution is, if it helped us it is a contribution and we say thanks!
I want all patch providers and all of us who have push rights to use this approach so that we give credit where credit is due. Giving credit is the only payment we can offer in this project and we should do it with generosity.
The green bars on the right show the results from the question how good we are at giving credit in the project from the 2014 curl survey, where 5 is really good and 1 is really bad. Not too shabby, but I'd say we can do even better! (59% checked the top score, 15% checked the 3.)
I have a script called contributors.sh that extracts all contributors since a tag (typically the previous release) and I use that to get a list of names to thank in the RELEASE-NOTES file for the pending curl release. Easy and convenient.
After every release (which means every 8th week) I then copy the list of names from RELEASE-NOTES into docs/THANKS. So all contributors get remembered and honored after having helped us in one way or another.
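To illustrate the idea of harvesting names from the "...-by:" lines, here is a small Python sketch. It is not contributors.sh and makes no claim about how that script works: the regular expression, the `contributors` function and the sample commit messages are all assumptions made for this example, and it works on commit messages already collected as strings.

```python
import re

# Matches trailer lines such as "Reported-by: Jane Example"
TRAILER = re.compile(r"^[ \t]*([A-Za-z-]+-by):[ \t]*(.+?)[ \t]*$", re.MULTILINE)

# Hypothetical commit messages, invented for illustration.
messages = [
    "http: fix header parsing\n\nReported-by: Jane Example\n",
    "tls: check the cert\n\n"
    "Tested-by: John Sample <john@example.com>\n"
    "Reported-by: Jane Example\n",
]


def contributors(commit_messages):
    """Return the sorted, de-duplicated names from "...-by:" trailer lines."""
    names = set()
    for message in commit_messages:
        for _kind, name in TRAILER.findall(message):
            # Drop a trailing "<email>" part if there is one
            name = re.sub(r"\s*<[^>]*>\s*$", "", name)
            if name:
                names.add(name)
    return sorted(names)


if __name__ == "__main__":
    print(", ".join(contributors(messages)))
```

Collecting the names in a set means someone who both reported and tested a fix still shows up only once in the list of people to thank.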
When there's no name
When contributors don't provide a real name but only a nick name like foobar123, user_5678 and so on I tend to consider that as a request to not include the person's name anywhere and hence I tend to not include it in the THANKS or RELEASE-NOTES. This is also sometimes the result of me not always wanting to bother people by asking over and over again for their real name in case they want to be given proper and detailed credit for what they've provided to us.
Unfortunately, a notable share of all contributions we get to the project are provided by people "hiding" behind a made up handle. I'm fine with that as long as it truly is what the helpers actually want.
So please, if you help us out, we will happily credit you, but please tell us your name!
Number of followers on twitter: 1,302
Number of commits during the last 365 days at github: 686
Number of publicly visible open source commits counted by openhub: 36,769
Number of questions I've answered on stackoverflow: 403
Number of connections on LinkedIn: 608
Number of days I've committed something in the curl project: 2,869
Number of commits by me, merged into Mozilla Firefox: 9
Number of blog posts on daniel.haxx.se, including this: 734
Number of friends on Facebook: 150
Number of open source projects I've contributed to, openhub again: 35
Number of followers on Google+: 557
Number of tweets: 5,491
Number of mails sent to curl mailing lists: 21,989
TOTAL life achievement: 71,602
Reading through the answers to the curl project's survey "curl and libcurl 2014" is very interesting and educational.
After having led and participated in this project for so long I have my own picture of what we're good and bad at. That's not exactly the same image I get when I read the survey responses. That's of course the educating part and I really want to learn from this poll and see where to put in some efforts and attempt to improve. At the same time I've been working for a while to put together a roadmap for the project, and the survey will help guide us with that work as well.
The full generated summary of the answers can be found on the site, but I thought I'd make the extra effort here and try to extrapolate data, compare and try to get to the real story that lurks in the shadows.
Over the almost 10 days the poll was open, we received 194 responses. I was hoping for more participation, but on the other hand I don't think more people would've given a much different view. My only concern would be that I'm not sure exactly how well we reached out.
Almost all curl users use it for HTTP and HTTPS. Sure, we also use a lot of other protocols and in fact all supported protocols ended up having at least two users according to the survey, but only a single digit percentage did not mark HTTP and HTTPS as protocols they use. The least used supported protocol, gopher, is used by 1.5% of the users who responded.
FTPS and SFTP are basically equally much used and they are the 4th and 5th most used protocols. HTTP, HTTPS and FTP are clearly our most popular protocols.
Only one in five users use curl on a single platform. All others use it on two or more, and one in four use it on four or more with an unexpectedly high 11% saying they use it on 5 or more platforms! That's a pretty strong message to me that our multi-platform strategy is important.
Our users have been with us for a long time. Half of the users have been using curl for five years or more! A fifth has been with us for 8 years or more! And yet there seems to be a healthy amount of newcomers finding us as 14% is within their first year.
The above numbers combined, I'm not surprised but only happy to see that 4 out of 5 users are also involved in other open source projects. curl is just one piece in a large ecosystem and I think it is good that we all participate in several projects so that we learn and cross-pollinate where possible!
Less than half of the respondents are subscribed to a curl mailing list, and curl-library is the most popular one. This is also reflected in subscriber numbers on the actual mailing lists where curl-library with its 1400+ members has almost twice as many subscribers as curl-users. One way to view this is that we are old enough, established enough and working enough so that users don't have to subscribe to our lists to keep up. The less optimistic way to see it could be that this is because we haven't reached out well enough or that our mailing list culture/setup isn't welcoming enough.
Perhaps most surprising to me: that several persons got upset and reacted strongly to the question about how well we treat "female and other minorities" in the project. To me there's no doubt that female contributors are a minority in the curl community and I want to learn if we're doing our best to be inclusive and open to all possible contributors. Or at least how good/bad people think we are doing.
29% of the respondents have contributed patches, meaning 56 individuals. I think that tells more about the ones who took part in the survey than it measures participation level among "regular users".
A big revelation for me was the question where I asked people to identify the "worst parts" of the project. The image here below is the look of the summary.
It quite clearly identifies "documentation" as the area in most need of improvements.
I don't think the amount of docs is the problem. After discussing with people I think the primary issues are:
- Some collections of docs are just too big and hard to find things in, like the curl man page and the curl_easy_setopt man pages. We need to split them up and/or rearrange somehow to help people find the info they need. Work has started on this. I'll follow up with details later.
- We get slightly bad "reviews" on this when people mistake the libcurl bindings' lack of docs for our problem. Lots of libcurl bindings are not very well documented - but they are separate projects not controlled or documented by us. I don't know what we can do to help that situation. Suggestions are very welcome!
- We don't have many step-by-step tutorials on how to get started and how to knit things together. We mostly provide reference manuals. I will appreciate help with improving this!