Tag Archives: Development

Testing curl

In order to ship a quality product – once every eight weeks – we need lots of testing. This is what we do to test curl and libcurl.

checksrc

We have basic script that verifies that the source code adheres to our code standard. It doesn’t catch all possible mistakes, but usually it complains with enough details to help contributors to write their code to match the style we already use. Consistent code style makes the code easier to read. Easier reading makes less bugs and quicker debugging.

By doing this check with a script (that can be run automatically when building curl), it makes it easier for everyone to ship properly formatted code.

We have not (yet) managed to convince clang-format or other tools to reformat code to correctly match our style, and we don’t feel like changing it just for the sake of such a tool. I consider this a decent work-around.

make test

The test suite that we bundle with the source code in the git repository has a large number of tests that test…

  • curl – it runs the command line tool against test servers for a large range of protocols and verifies error code, the output, the protocol details and that there are no memory leaks
  • libcurl – we then build many small test programs that use the libcurl API and perform tests against test servers and verifies that they behave correctly and don’t leak memory etc.
  • unit tests – we build small test programs that use libcurl internal functions that aren’t exposed in the API and verify that they behave correctly and generate the presumed output.
  • valgrind – all the tests above can be run with and without valgrind to better detect memory issues
  • “torture” – a special mode that can run the tests above in a way that first runs the entire test, counts the number of memory related functions (malloc, strdup, fopen, etc) that are called and then runs the test again that number of times and for each run it makes one of the memory related functions fail – and makes sure that no memory is leaked in any of those situations and no crash occurs etc. It runs the test over and over until all memory related functions have been made to fail once each.

Right now, a single “make test” runs over 1100 test cases, varying a little depending on exactly what features that are enabled in the build. Without valgrind, running those tests takes about 8 minutes on a reasonably fast machine but still over 25 minutes with valgrind.

Then we of course want to run all tests with different build options…

CI

For every pull request and for every source code commit done, the curl source is built for Linux, mac and windows. With a large set of different build options and TLS libraries selected, and all the tests mentioned above are run for most of these build combinations. Running ‘checksrc’ on the pull requests is of course awesome so that humans don’t have to remark on code style mistakes much. There are around 30 different builds done and verified for each commit.

If any CI build fails, the pull request on github gets a red X to signal that something was not OK.

We also run test case coverage analyses in the CI so that we can quickly detect if we for some reason significantly decrease test coverage or similar.

We use Travis CI, Appveyor and Coveralls.io for this.

Autobuilds

Independently of the CI builds, volunteers run machines that regularly update from git, build and run the entire test suite and then finally email the results back to a central server. These setups help us cover even more platforms, architectures and build combinations. Just with a little longer turn around time.

With millions of build combinations and support for virtually every operating system and CPU architecture under the sun, we have to accept that not everything can be fully tested. But since almost all code is shared for many platforms, we can still be reasonably sure about the code even for targets we don’t test regularly.

Static code analyzing

We run the clang scan-build on the source code daily and we run Coverity scans on the code “regularly”, about once a week.

We always address defects detected by these analyzers immediately when notified.

Fuzzing

We’re happy to be part of Google’s OSS-fuzz effort, which with a little help with integration from us keeps hammering our code with fuzz to make sure we’re solid.

OSS-fuzz has so far resulted in two security advisories for curl and a range of other bug fixes. It hasn’t been going on for very long and based on the number it has detected so far, I expect it to keep finding flaws – at least for a while more into the future.

Fuzzing is really the best way to hammer out bugs. When we’re down to zero detected static analyzer detects and thousands of test cases that all do good, the fuzzers can still continue to find holes in the net.

External

Independently of what we test, there are a large amount of external testing going on, for each curl release we do.

In a presentation by Google at curl up 2017, they mentioned their use of curl in “hundreds of applications” and how each curl release they adopt gets tested more than 400,000 times. We also know a lot of other users also have curl as a core component in their systems and test their installations extensively.

We have a large set of security interested developers who run tests and fuzzers on curl at their own will.

(image from pixabay)

curl author activity illustrated

At the time of each commit, check how many unique authors that had a change committed within the previous 120, 90, 60, 30 and 7 days. Run the script on the curl git repository and then plot a graph of the data, ranging from 2010 until today. This is just under 10,000 commits.

(click for the full resolution version)

git-authors-active.pl is the little stand-alone script I wrote and used for this – should work fine for any git repository. I then made the graph from that using libreoffice.

On billions and “users”

At times when I’ve gone out (yes it happens), faced an audience and talked about my primary spare time project curl, I’ve said a few times in the past that we have one billion users.

Users?

many devices

OK, as this is open source I’m talking about, I can’t actually count my users and what really constitutes “a user” anyway?

If the same human runs multiple copies of curl (in different devices and applications), is that human then counted once or many times? If a single developer writes an application that uses libcurl and that application is used by millions of humans, is that one user or are they millions of curl users?

What about pure machine “users”? In the subway in one of the world’s largest cities, there’s an automated curl transfer being done for every person passing the ticket check point. Yet I don’t think we can count the passing (and unknowing) passengers as curl users…

I’ve had a few people approach me to object to my “curl has one billion users” statement. Surely not one in every seven humans on earth are writing curl command lines! We’re engineers and we’re picky with the definitions.

Because of this, I’m trying to stop talking about “number of users”. That’s not a proper metric for a project whose primary product is a library that is used by applications or within devices. I’m instead trying to assess the number of humans that are using services, tools or devices that are powered by curl. Fun challenge, right?

Who isn’t using?

a userI’ve tried to imagine of what kind of person that would not have or use any piece of hardware or applications that include curl during a typical day. I certainly can’t properly imagine all humans in this vast globe and how they all live their lives, but I quite honestly think that most internet connected humans in the world own or use something that runs my code. Especially if we include people who use online services that use curl.

curl is used in basically all modern TVs, a large percentage of all car infotainment systems, routers, printers, set top boxes, mobile phones and apps on them, tablets, video games, audio equipment, Blu-ray players, hundreds of applications, even in fridges and more. Apple alone have said they have one billion active devices, devices that use curl! Facebook uses curl extensively and they have 1.5 billion users every month. libcurl is commonly used by PHP sites and PHP empowers no less than 82% of the sites w3techs.com has figured out what they run (out of the 10 million most visited sites in the world).

There are about 3 billion internet users worldwide. I seriously believe that most of those use something that is running curl, every day. Where Internet is less used, so is of course curl.

Every human in the connected world, use something powered by curl every day

Frigging Amazing

It is an amazing feeling when I stop and really think about it. When I pause to let it sink in properly. My efforts and code have spread to almost every little corner of the connected world. What an amazing feat and of course I didn’t think it would reach even close to this level. I still have hard time fully absorbing it! What a collaborative success story, because I could never have gotten close to this without the help from others and the community we have around the project.

But it isn’t something I think about much or that make me act very different in my every day life. I still work on the bug reports we get, respond to emails and polish off rough corners here and there as we go forward and keep releasing new curl releases every 8 weeks. Like we’ve done for years. Like I expect us and me to continue doing for the foreseeable future.

It is also a bit scary at times to think of the massive impact it could have if or when a really terrible security flaw is discovered in curl. We’ve had our fair share of security vulnerabilities so far through our history, but we’ve so far been spared from the really terrible ones.

So I’m rich, right?

coins

If I ever start to describe something like this to “ordinary people” (and trust me, I only very rarely try that), questions about money is never far away. Like how come I give it away free and the inevitable “what if everyone using curl would’ve paid you just a cent, then…“.

I’m sure I don’t need to tell you this, but I’ll do it anyway: I give away curl for free as open source and that is a primary reason why it has reached to the point where it is today. It has made people want to help out and bring the features that made it attractive and it has made companies willing to use and trust it. Hadn’t it been open source, it would’ve died off already in the 90s. Forgotten and ignored. And someone else would’ve made the open source version and instead filled the void a curlless world would produce.

No more heartbleeds please

caution-quarantine-areaAs a reaction to the whole Heartbleed thing two years ago, The Linux Foundation started its Core Infrastructure Initiative (CII for short) with the intention to help track down well used but still poorly maintained projects or at least detect which projects that might need help. Where the next Heartbleed might occur.

A bunch of companies putting in money to improve projects that need help. Sounds almost like a fairy tale to me!

Census

In order to identify which projects to help, they run their Census Project: “The Census represents CII’s current view of the open source ecosystem and which projects are at risk.

The Census automatically extracts a lot of different meta data about open source projects in order to deduce a “Risk Index” for each project. Once you’ve assembled such a great data trove for a busload of projects, you can sort them all based on that risk index number and then you basically end up with a list of projects in a priority order that you can go through and throw code at. Or however they deem the help should be offered.

Which projects will fail?

The old blog post How you know your Free or Open Source Software Project is doomed to FAIL provides such a way, but it isn’t that easy to follow programmatically. The foundation has its own 88 page white paper detailing its methods and algorithm.

Risk Index

  • A project without a web site gets a point
  • If the project has had four or more CVEs (publicly disclosed security vulnerabilities) since 2010, it receives 3 points and if fewer than four there’s a diminishing scale.
  • The number of contributors the last 12 months is a rather heavy factor, which thus could make the index grow old fairly quick. 3 contributors still give 4 points.
  • Popular packages based on Debian’s popcon get points.
  • If the project’s main language is C or C++, it gets two points.
  • Network “exposed” projects get points.
  • some additional details like dependencies and how many outstanding patches not accepted upstream that exist

All combined, this grades projects’ “risk” between 0 and 15.

Not high enough resolution

Assuming that a larger number of CVEs means anything bad is just wrong. Even the most careful and active projects can potentially have large amounts of CVEs. It means they disclose what they find and that people are actually reviewing code, finding problems and are reporting problems. All good things.

Sure, security problems are not good but the absence of CVEs in a project doesn’t say that the project is one bit more secure. It could just mean that nobody ever looked closely enough or that the project doesn’t deal with responsible disclosure of the problems.

When I look through the projects they have right now, I get the feeling the resolution (0-15) is too low and they’ve shied away from more aggressively handing out penalty based on factors we all recognize in abandoned/dead projects (some of which are decently specified in Tom Calloway’s blog post mentioned above).

The result being that the projects get a score that is mostly based on what kind of project it is.

But this said, they have several improvements to their algorithm already suggested in their issue tracker. I firmly believe this will improve over time.

The riskiest ?

The top three projects, the only ones that scores 13 right now are expat, procmail and unzip. All of them really small projects (source code wise) that have been around since a very long time.

curl, being the project I of course look out for, scores a 9: many CVEs (3), written in C (2), network exposure (2), 5+ apps depend on it (2). Seriously, based on these factors, how would you say the project is situated?

In the sorted list with a little over 400 projects, curl is rated #73 (at the time of this writing at least). Just after reportbug but before libattr1. [curl summary – which is mentioning a very old curl release]

But the list of projects mysteriously lack many projects. Like I couldn’t find neither c-ares nor libssh2. They may not be super big, but they’re used by a bunch of smaller and bigger projects at least, including curl itself.

The full list of projects, their meta-data and scores are hosted in their repository on github.

Benefits for projects near me

I can see how projects in my own backyard have gotten some good out of this effort.

I’ve received some really great bug reports and gotten handed security problems in curl by an individual who did his digging funded by this project.

I’ve seen how the foundation sponsored a test suite for c-ares since the project lacked one. Now it doesn’t anymore!

Badges!

In addition to that, the Linux Foundation has also just launched the CII Best Practices Badge Program, to allow open source projects to fill in a bunch of questions and if meeting enough requirements, they will get a “badge” to boast to the world as a “well run project” that meets current open source project best practices.

I’ve joined their mailing list and provided some of my thoughts on the current set of questions, as I consider a few of them to be, well, lets call them “less than optimal”. But then again, which project doesn’t have bugs? We can fix them!

curl is just now marked as “100% compliance” with all the best practices listed. I hope to be able to keep it like that even with future and more best practices added.

libbrotli is brotli in lib form

Brotli is this new cool compression algorithm that Firefox now has support for in Content-Encoding, Chrome will too soon and Eric Lawrence wrote up this nice summary about.

So I’d love to see brotli supported as a Content-Encoding in curl too, and then we just basically have to write some conditional code to detect the brotli library, add the adaption code for it and we should be in a good position. But…

There is (was) no brotli library!

It turns out the brotli team just writes their code to be linked with their tools, without making any library nor making it easy to install and use for third party applications.

an unmotivated circle sawWe can’t have it like that! I rolled up my imaginary sleeves (imaginary since my swag tshirt doesn’t really have sleeves) and I now offer libbrotli to the world. It is just a bunch of files and a build system that sucks in the brotli upstream repo as a submodule and then it builds a decoder library (brotlidec) and an encoder library (brotlienc) out of them. So there’s no code of our own here. Just building on top of the great stuff done by others.

It’s not complicated. It’s nothing fancy. But you can configure, make and make install two libraries and I can now go on and write a curl adaption for this library so that we can get brotli support for it done. Ideally, this (making a library) is something the brotli project will do on their own at some point, but until they do I don’t mind handling this.

As always, dive in and try it out, file any issues you find and send us your pull-requests for everything you can help us out with!

A day in the curl project

cURLI maintain curl and lead the development there. This is how I spend my time an ordinary day in the project. Maybe I don’t do all of these things every single day, but sometimes I do and sometimes I just do a subset of them. I just want to give you a look into what I do and why I don’t add new stuff more often or faster… I spend about one to three hours on the project every day. Let me also stress that curl is a tiny little project in comparison with many other open source projects. I’m certainly not saying otherwise.

the new bug

Someone submits a new bug in the bug tracker or on one of the mailing lists. Most initial bug reports lack sufficient details so the first thing I do is ask for more info and possibly ask the submitter to try a recent version as very often we get bug reported on very old versions. Many bug reports take several demands for more info before the necessary details have been provided. I don’t really start to investigate a problem until I feel I have a sufficient amount of details. We’re a very small core team that acts on other people’s bugs.

the question by a newbie in the project

A new person shows up with a question. The question is usually similar to a FAQ entry or an example but not exactly. It deserves a proper response. This kind of question can often be answered by anyone, but also most people involved in the project don’t feel the need or “familiarity” to respond to such questions and therefore remain quiet.

the old mail I haven’t responded to yet

I want every serious email that reaches the mailing lists to get a response, so all mails that neither I nor anyone else responds to I keep around in my inbox and when I have idle time over I go back and catch up on old mails. Some of them can then of course result in a new bug or patch or whatever. Occasionally I have to resort to simply saving away the old mail without responding in order to catch up, just to cut the list of outstanding things to do a little.

the TODO list for my own sake, things I’d like to get working on

There are always things I really want to see done in the project, and I work on them far too little really. But every once in a while I ignore everything else in my life for a couple of hours and spend them on adding a new feature or fixing something I’ve been missing. Actual development of new features is a very small fraction of all time I spend on this project.

the list of open bug reports

I regularly revisit this list to see what I can do to push the open ones forward. Follow-up questions, deep dives into source code and specifications or just the sad realization that a particular issue won’t be fixed within the nearest time (year?) so that I close it as “future” and add the problem to our KNOWN_BUGS document. I strive to keep the bug list clean and only keep relevant bugs open. Those issues that are not reproducible, are left without the proper attention from the reporter or otherwise stall will get closed. In general I feel quite lonely as responder in the bug tracker…

the mailing list threads that are sort of dying but I do want some progress or feedback on

In my primary email inbox I usually keep ongoing threads around. Lots of discussions just silently stop getting more posts and thus slowly wither away further up the list to become forgotten and ignored. With some interval I go back to see if the posters are still around, if there’s any more feedback or whatever in order to figure out how to proceed with the subject. Very often this makes me get nothing at all back and instead I just save away the entire conversation thread, forget about it and move on.

the blog post I want to do about a recent change or fix I did I’d like to highlight

I try to explain some changes to the world in blog posts. Not all changes but the ones that are somehow noteworthy as they perhaps change the way things have been or introduce new fun features perhaps not that easily spotted. Of course all features are always documented etc, but sometimes I feel I need to put some extra attention on focus on things in a more free-form style. Or I just write about meta stuff, like this very posting.

the reviewing and merging of patches

One of the most important tasks I have is to review patches. I’m basically the only person in the project who volunteers to review patches against any angle or corner of the project. When people have spent time and effort and gallantly send the results of their labor our way in the best possible format (a patch!), the submitter deserves a good review and proper feedback. Also, paving the road for more patches is one of the best way to scale the project. Helping newcomers become productive is important.

Patches are preferably posted on the mailing lists but there’s also some coming in via pull requests on github and while I strongly discourage that (due to them not getting the same attention and possible scrutiny on the list like the others) I sometimes let them through anyway just to be smooth.

When the patch looks good (or sometimes good enough and I just edit some minor detail), I merge it.

the non-disclosed discussions about a potential security problem

We’re a small project with a wide reach and security problems can potentially have grave impact on users. We take security seriously, and we very often have at least one non-public discussion going on about a problem in curl that may have security implications. We then often work on phrasing security advisories, working down exactly which versions that are vulnerable, producing patches for at least the most recent ones of those affected versions and so on.

tame stackoverflow

stackoverflow.com has become almost like a wikipedia for source code and programming related issues (although it isn’t wiki), and that site is one of the primary referrers to curl’s web site these days. I tend to glance over the curl and libcurl related questions and offer my answers at times. If nothing else, it is good to help keeping the amount of disinformation at low levels.

I strongly disapprove of people filing bug reports on such places or even very detailed (lib)curl core questions that should’ve been asked on the curl-library list.

there are idle times too

Yeah. Not very often, but sometimes I actually just need a day off all this. Sometimes I just don’t find motivation or energy enough to dig into that terrible seldom-happening bug on a platform I’ve never seen personally. A project like this never ends. The same day we release a new release, we just reset our clocks and we’re back on improving curl, fixing bugs and cleaning up things for the next release. Forever and ever until the end of time.

keep-calm-and-improve-curl

The “right” keyboard layout

I’ve never considered myself very picky about the particular keyboard I use for my machines. Sure, I work full-time and spare time in front of the same computer and thus I easily spend 2500-3000 hours a year in front of it but I haven’t thought much about it. I wish I had some actual stats on how many key-presses I do on my keyboard on an average day or year or so.

Then, one of these hot summer days this summer I left the roof window above my work place a little bit too much open when a very intense rain storm hit our neighborhood when I was away for a brief moment and to put it shortly, the huge amounts of water that poured in luckily only destroyed one piece of electronics for me: my trusty old keyboard. The keyboard I just randomly picked from some old computer without any consideration a bunch of years ago.

So the old was dead, I just picked another keyboard I had lying around.

But man, very soft rubber-style keys are very annoying to work with. Then I picked another with a weird layout and a control-key that required a little too much pressure to work for it to be comfortable. So, my race for a good enough keyboard had begun. Obviously I couldn’t just pick a random cheap new one and be happy with it.

Nordic key layout

That’s what they call it. It is even a Swedish layout, which among a few other details means it features å, ä and ö keys at a rather prominent place. See illustration. Those letters are used fairly frequently in our language. We have a few peculiarities in the Swedish layout that is downright impractical for programming, like how the {[]} – symbols all require AltGr pressed and slash, asterisk and underscore require Shift to be pressed etc. Still, I’v’e learned to program on such a layout so I’m quite used to those odd choices by now…

kb-nordic

Cursor keys

I want the cursor keys to be of “standard size”, have the correct location and relative positions. Like below. Also, the page up and page down keys should not be located close to the cursor keys (like many laptop keyboards do).

keyboard with marked cursorkeys

Page up and down

The page up and page down keys should instead be located in the group of six keys above the cursor keys. The group should have a little gap between it and the three keys (print screen, scroll lock and pause/break) above them so that finding the upper row is easy and quick without looking.

page up and down keysBackspace

I’m not really a good keyboard typist. I do a lot of mistakes and I need to use the backspace key quite a lot when doing so. Thus I’m a huge fan of the slightly enlarged backspace key layout so that I can find and hit that key easily. Also, the return key is a fairly important one so I like the enlarged and strangely shaped version of that as well. Pretty standard.

kb-backspaceFurther details

The Escape key should have a little gap below it so that I can find it easily without looking.

The Caps lock key is completely useless for locking caps is not something a normal person does, but it can be reprogrammed for other purposes. I’ve still refrained from doing so, mostly to not get accustomed to “weird” setups that makes it (even) harder for me to move between different keyboards at different places. Just recently I’ve configured it to work as ctrl – let’s see how that works out.

The F-keys are pretty useless. I use F5 sometimes to refresh web pages but as ctrl-r works just as well I don’t see a strong need for them in my life.

Numpad – a completely useless piece of the keyboard that I would love to get rid of – I never use any of those key. Never. Unfortunately I haven’t found any otherwise decent keyboards without the numpad.

Func KB-460

The Func KB-460 is the keyboard I ended up with this time in my search. It has some fun extra cruft such as two USB ports and a red backlight (that can be made to pulse). The backlight gave me extra points from my kids.

Func KB-460 keyboard

It is “mechanical” which obviously is some sort of thing among keyboards that has followers and is supposed to be very good. I remain optimistic about this particular model, even if there are a few minor things with it I haven’t yet gotten used to. I hope I’ll just get used to them.

This keyboard has Cherry MX Red linear switches.

How it could look

Based on my preferences and what keys I think I use, I figure an ideal keyboard layout for me could very well look like this:

my keyboard layout

Keyfreq

I have decided to go further and “scientifically” measure how I use my keyboard, which keys I use the most and similar data and metrics. Turns out the most common keylog program on Linux doesn’t log enough details, so I forked it and created keyfreq for this purpose. I’ll report details about this separately – soon.

See also: fixing the Func KLB-460 key

Parallel Spaghetti – decoded

Here’s the decoding procedure for the Parallel Spaghetti Decode challenge.

Step 1, the answers to all the questions. You will notice that I did have some fun in D6 and E2, but since they were boxes that weren’t on the right track anyway I thought you’d still enjoy them.

Step 2, let me illustrate how the above answers will take you through the maze. The correct path is made up out of yellow boxes and the correct answers are shown with red arrows leading forward. Click it for full resolution version.

The parallel spaghetti challenge correct track shown

Step 3, those different colors in the “Word” column give you the words used for the two questions. If you rearrange them, the two questions become:

which tr command line option specifies delete characters

and

what curl command line option specifies POST requests

So, it took about 14 minutes at our event for Oscar Andersson to bring the correct answer to me:

-d

source code survival rate

The curl project has its roots in the late 1996, but we haven’t kept track of all of the early code history. We imported our code to Sourceforge late 1999 and that’s how far back we can see in our current git repository. The exact date is “Wed Dec 29 14:20:26 UTC 1999”. So, almost 14 years of development.

Warning: this blog post contains more useless info and graphs than many mortals can handle. Be aware!

How much old code remain in the current source tree? Or perhaps put differently: how is the refresh rate of the code? We fix bugs, we change things, we add features. Surely we’ll slowly over time rewrite the old code and replace it with new more shiny and better working code? I decided to check this. Here’s what I found!

The tools

We have all code in git. ‘git blame’ is the primary tool I used as it lists all lines of all source code and tells us when it was added. I did some additional perl scripting around it.

The code

I decided to check all code in the src/ and lib/ directories in the curl and libcurl source tree. The source code is used to create both the curl tool and the libcurl library and back in 1999 there was no libcurl like today so we do get a slightly better coverage of history this way.

In total this sums up to some 112000 lines in the current .c and .h files.

To count the total amount of commits done to those specific files through history I ran:

git log --oneline src/*.[ch] lib/*.[ch] | wc -l

6047 commits in total. (if I don’t specify the files and count all commits in the repo it ends up at 16954)

git stats

We run gitstats on the curl repo every day so you can go there for some more and current stats. Right now it tells us that average number of commits is 4.7 per active day (that means days when actually something was committed), or 3.4 per all days over the entire time. There was git activity 3576 days in total. By 224 authors.

Surviving commits

How much of the code would you think still remains that were present already that December day 1999?

How much of the code in the current code base would you think was written the last few years?

Commit vs Author vs Date

I wanted to see how much old code that exists, or perhaps how the age of the code is represented in the current code base. I decided to therefore base my logic on the author time that git tracks. It is basically the time when the author of a change commits it to his/her local tree as then the change can be applied later on by a committer that can be someone else, but the author time remains the same. Sometimes a committer commits multiple patches at once, possibly at a much later time etc so I figured the author time would be a better time stamp. I also decided to track the date instead of just the commit hash so that I can sort the changes properly and also make interesting graphs that are based on that time. I use the time with a second precision so changes done a second apart will be recorded as two separate changes while two commits done with the same author time stamp will be counted as the same time.

I had my script run ‘git blame –line-porcelain’ for all files and had my script sum up all changes done on the same time.

Some totals

The code base contains changes written at 4147 different times. Converted to UTC times, they happened on 2076 unique days. On 167 unique months. That’s every month since the beginning.

We’re talking about 312 files.

Number of lines changed over time

A graph with changes over time. The Y axis is number of lines that were changed on that particular time. (click for higher res)

Lines changed over time

Ok you object, that doesn’t look very appealing. So here’s the same data but with all the changes accumulated over time.

accumulated

Do you think the same as I do? Isn’t it strangely linear? It seems that the number of added lines that remain in the code today is virtually the same over time! But fair enough, the changes in the X axis are not distributed according to the time/date they represent so we shouldn’t be fooled by the time, but certainly we can see that changes in general only bring in a certain amount of surviving modified lines.

Another way to count the changes is then to check all the ~4000 change times of the present code, and see how many days between them there are:

delta

Ah, now finally we’re seeing something. Older code that is still present clearly was made with longer periods in between the changes that have lasted. It makes perfect sense to me, since the many years of development probably have later overwritten a lot of code that was written in between.

Also, it is clearly that among the more recent changes that have survived they were often done on the same day or just a few days away from another lasting change.

Grouped on date ranges

The number of modified lines split up on the individual year the change came in.

year

Interesting! The general trend is clear and not surprising. Two years stand out from the trend, 2004 and 2011. I have not yet investigated what particular larger changes that were made those years that have survived. The bump for 1999 is simply the original import and most of those lines are preprocessor lines like #ifdef and #include or just opening and closing braces { and }.

Splitting up the number of surviving lines on the specific year+month they were added:

month

This helps us analyze the previous chart. As we can see, the rather tall bars from 2004 and 2011 are actually several months wide and explains the bumps in the year-chart. Clearly we made some larger effort on those periods that were good enough to still remain in the code.

Correlate to added or removed lines?

So, can we perhaps see if some years’ more activity in number of added or removed source lines can be tracked back to explain the number of surviving source code lines? I ran “git diff [hash1]..[hash2] –stat — lib/*.[ch] src/*.[ch]” for all years to get a summary of number of added and removed source code lines that year. I added those number to the table with surviving lines and then I made another graph:

year-again

Funnily enough, we see almost an exact correlation there for the first eight years and then the pattern breaks. From the year 2009 the number of removed lines went down but still the amount of surving lines went up quite a bit and then the graphs jump around a bit.

My interpretation of this graph is this boring: the amount of surviving code in absolute numbers is clearly correlating to the amount of added code. And that we removed more code yearly in the 2000-2003 period than what has survived.

But notice how the blue line is closing the gap to the orange/red one over time, which means that percentage wise there’s more surviving code in more recent code! How much?

Here’s the amount of surviving lines/added lines and a second graph looking at surviving lines/(added + removed) to see if the mere source code activity would be a more suitable factor to compare against…

relation survival vs added and removed lines

Code committed within the last 5 years are basically 75% left but then it goes downhill down to the 18% survival rate of the 1999 code import.

If you can think of other good info to dig out, let me know!

1999,1699
2000,1115
2001,3061
2002,2432
2003,2578
2004,7644
2005,4016
2006,5101
2007,7665
2008,7292
2009,9460
2010,11762
2011,19642
2012,11842
2013,16844