Tag Archives: git

source code survival rate

The curl project has its roots in late 1996, but we haven’t kept track of all of the early code history. We imported our code to Sourceforge in late 1999 and that’s how far back we can see in our current git repository. The exact date is “Wed Dec 29 14:20:26 UTC 1999”. So, almost 14 years of development.

Warning: this blog post contains more useless info and graphs than many mortals can handle. Be aware!

How much old code remains in the current source tree? Or put differently: what is the refresh rate of the code? We fix bugs, we change things, we add features. Surely, over time, we slowly rewrite the old code and replace it with shinier, better-working code? I decided to check. Here’s what I found!

The tools

We have all code in git. ‘git blame’ is the primary tool I used, as it lists all lines of source code and tells us when each line was added. I did some additional perl scripting around it.

The code

I decided to check all code in the src/ and lib/ directories of the curl and libcurl source tree. That source code is used to build both the curl tool and the libcurl library, and back in 1999 there was no libcurl as we know it today, so we get slightly better coverage of the history this way.

In total this sums up to some 112000 lines in the current .c and .h files.

To count the total number of commits made to those specific files throughout history, I ran:

git log --oneline src/*.[ch] lib/*.[ch] | wc -l

6047 commits in total. (If I don’t specify the files and instead count all commits in the repo, the number is 16954.)

git stats

We run gitstats on the curl repo every day, so you can go there for more and current stats. Right now it tells us that the average number of commits is 4.7 per active day (days when something was actually committed), or 3.4 per day over the entire period. There has been git activity on 3576 days in total, by 224 authors.

Surviving commits

How much of the code that was present already that December day in 1999 would you guess still remains?

How much of the code in the current code base would you guess was written in the last few years?

Commit vs Author vs Date

I wanted to see how much old code still exists, or perhaps how the age of the code is represented in the current code base. I decided to base my logic on the author time that git tracks. That is basically the time when the author of a change commits it to his/her local tree; the change can be applied later by a committer who may be someone else, but the author time remains the same. A committer sometimes applies multiple patches at once, possibly at a much later time, so I figured the author time would be the better time stamp. I also decided to track the date instead of just the commit hash, so that I can sort the changes properly and make interesting graphs based on time. I use the time with second precision, so changes made one second apart are recorded as two separate changes, while two commits with the same author time stamp are counted as the same change.
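
As an illustration, git can show both time stamps for any given commit:

git log -1 --pretty=format:'%an %ad (author)%n%cn %cd (committer)'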

I had my script run ‘git blame --line-porcelain’ on all the files, and sum up all changes made at the same author time.
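
For illustration, here is a minimal sketch of what such a script can look like. This is a hypothetical reconstruction, not the exact script I used:

#!/usr/bin/env perl
# count surviving lines per author time stamp, as reported by git blame
use strict;
use warnings;

my %lines_at;   # author time (epoch seconds) => number of surviving lines

for my $file (glob "src/*.[ch] lib/*.[ch]") {
    open(my $blame, "-|", "git", "blame", "--line-porcelain", $file)
        or die "cannot run git blame on $file: $!";
    while (<$blame>) {
        # --line-porcelain emits an "author-time <epoch>" header for every line
        $lines_at{$1}++ if /^author-time (\d+)$/;
    }
    close($blame);
}

# print "epoch,lines" sorted by time, ready to feed a graphing tool
print "$_,$lines_at{$_}\n" for sort { $a <=> $b } keys %lines_at;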

Some totals

The code base contains changes written at 4147 different times. Converted to UTC, they happened on 2076 unique days, in 167 unique months. That’s every month since the beginning.

We’re talking about 312 files.

Number of lines changed over time

A graph of changes over time. The Y axis is the number of lines that were changed at that particular time.

Lines changed over time

OK, you object, that doesn’t look very appealing. So here’s the same data, but with all the changes accumulated over time.

accumulated

Do you think the same as I do? Isn’t it strangely linear? The number of added lines that remain in the code today seems virtually constant over time! To be fair, the changes along the X axis are not distributed according to the time/date they represent, so we shouldn’t be fooled by the time aspect, but we can certainly see that changes in general only bring in a fairly constant amount of surviving modified lines.

Another way to count the changes is to take all the ~4000 change times present in the code and see how many days lie between consecutive ones:

delta

Ah, now we’re finally seeing something. Older code that is still present was clearly written with longer periods between the changes that have lasted. That makes perfect sense to me, since the many years of development since then have probably overwritten a lot of the code that was written in between.

Also, it is clear that the more recent changes that have survived were often made on the same day as, or just a few days away from, another lasting change.
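
For completeness, here is a minimal sketch of how such day deltas can be computed, assuming the “epoch,lines” output of the script above is fed on stdin:

#!/usr/bin/env perl
# read "epoch,lines" records and print the number of days between
# consecutive change times
use strict;
use warnings;

my @times = map { (split /,/)[0] } <STDIN>;
for my $i (1 .. $#times) {
    printf "%.1f\n", ($times[$i] - $times[$i - 1]) / 86400;   # 86400 seconds per day
}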

Grouped on date ranges

The number of modified lines, split up by the individual year the change was made.

year

Interesting! The general trend is clear and not surprising. Two years stand out, 2004 and 2011. I have not yet investigated which particular larger changes made in those years have survived. The bump for 1999 is simply the original import, and most of those lines are preprocessor lines like #ifdef and #include, or just opening and closing braces { and }.

Splitting up the number of surviving lines by the specific year and month they were added:

month

This helps us analyze the previous chart. As we can see, the rather tall bars from 2004 and 2011 are actually several months wide, which explains the bumps in the year chart. Clearly we made some larger efforts during those periods, and they were good enough to still remain in the code.

Correlate to added or removed lines?

So, can we perhaps see whether certain years’ activity in the number of added or removed source lines explains the number of surviving source code lines? I ran “git diff [hash1]..[hash2] --stat -- lib/*.[ch] src/*.[ch]” for all years to get a summary of the number of source code lines added and removed each year. I added those numbers to the table of surviving lines and then made another graph:

year-again

Funnily enough, we see an almost exact correlation there for the first eight years, and then the pattern breaks. From 2009 the number of removed lines went down, yet the amount of surviving lines still went up quite a bit, and then the graphs jump around a little.

My interpretation of this graph is rather boring: the amount of surviving code in absolute numbers clearly correlates with the amount of added code. And in the 2000-2003 period we removed more code yearly than what has survived.

But notice how the blue line closes the gap to the orange/red one over time, which means that, percentage-wise, more of the recent code survives! How much?

Here’s the ratio of surviving lines to added lines, and a second graph looking at surviving lines/(added + removed), to see if the mere source code activity would be a more suitable factor to compare against…

relation survival vs added and removed lines

Of the code committed within the last five years, basically 75% is still left, but from there it goes downhill, down to the 18% survival rate of the 1999 code import.

If you can think of other good info to dig out, let me know!

The raw data behind the year chart (year, number of surviving lines):

1999,1699
2000,1115
2001,3061
2002,2432
2003,2578
2004,7644
2005,4016
2006,5101
2007,7665
2008,7292
2009,9460
2010,11762
2011,19642
2012,11842
2013,16844

some missing github features

I think github is a lovely resource for collaborating on source code with my friends all over the globe. Among other things, we host the primary curl repository there and we’ve been doing so for almost three years now. This experience has led me to discover a bunch of things I miss in the service…

github is clearly aimed at repositories run by one person or a small group, while in the projects I run I try to involve as many people as possible in wide collaboration, and I put effort into informing everyone in order to get the widest possible attention and feedback. I may have created the account and “own” the repository, but I want the work to be done by a large team and I want everything that happens to it to be seen by a large audience. This is not always easy to do with the existing github services.

To further this spirit and to widen cooperation more, I would like to see the following improvements:

  • pull requests can’t be disabled, nor can I control which email address the notification is sent to. In our project I want all patches posted to the mailing list for review, archiving and discussion before I get a pull request, and I don’t use GitHub’s merge feature since it is hardly ever good enough (I want fast-forward, and I usually feel a need to edit the commit message ever so slightly, etc). I want the pull request to get translated into a patch review submission to the mailing list.
  • similarly, I cannot redirect where notifications are sent when someone comments on a commit or a source line, and this is highly annoying since we merge a lot of outsiders’ patches; as those contributors may still read the mailing list, we want the discussion there! Many times the contributors don’t have GitHub accounts, and of course we don’t want to require that.
  • after the death of the CIA.vc service, the current IRC notification service offered by GitHub is simply inferior. The stupid bot has to join, deliver its message and leave again. That is not IRC-friendly behavior, and I can’t make it announce exactly what I’d like it to say…
  • I wish it had much better email notifications on commits, allowing me to customize what it sends out without forcing me to write a full-blown replacement. I want a unified diff included!

I realize GitHub offers me the option to create an “organization” to host a repository instead of it being owned by me as a person, but I don’t think that should be a requirement for this functionality. And I don’t know whether GitHub truly offers better group functionality that way either.

my first embedded Linux course

I’m happy to announce that I did my first ever full-day training course, for eleven embedded developers, on Monday November 15th 2010. I had the pleasure of writing all the materials myself, coming up with three exercises for them, and then actually standing in front of the team and delivering a complete 9-to-17 session.

I did my day as part of a three-day course, and I got to do the easy part: user-space development. My day covered the following topics: an introduction to embedded Linux development, how to build, autobuilding, how to run, git basics, debugging, profiling, and finally some brief words on testing.

Doing things outside of your ordinary schedule and “comfort zone” is certainly a bit scary, but it is exactly the sort of thing that makes you grow as a person and as a professional. I mean, I know the topics by heart, pretty much without even thinking (I’ve been working with embedded systems for over 17 years!), but going from that to a decent training course is not just a coffee break’s worth of work.

I was quite happy and satisfied that I pretty much kept to the program: I managed to get through all the topics I had set out to cover, and I think we had a really nice conversation going during the day. The audience gave me really good feedback and high “grades” in the evaluation forms they filled in before they left. Of course there were flaws in the presentation, and I got some valuable ideas from my audience on how to improve it.

Now I feel like doing it again!

git, patents, meego and android

This Tuesday afternoon, almost 100 people apparently managed to escape work and attend foss-sthlm’s fourth meeting, this time graciously sponsored by .SE, who provided the facilities, the coffee etc. Thank you .SE! Yours truly did his best to make it happen and to make sure we had a variety of talks by skilled people, and I think we did well this time as well! The meeting took place at the same time as the big Internetdagarna conference ran its 6(!) parallel tracks in the building right next to ours, so the competition for attention was rather fierce.

Robin with git

Robin Rosenberg started off the sessions by telling us about git and related dives into JGit, EGit, gerrit, code reviews and Eclipse. Robin is a core developer in the EGit/JGit projects. While I think I already know at least a little git, Robin provided a good overview of several different things. (It should be noted that Robin was called in very late in the game because another speaker had to drop out.)

As a side note, I will never cease to be amazed by the habit in Java land of re-implementing everything in “pure Java” instead of simply wrapping the existing code/tools and leveraging what already exists and is stable…

Jonas with patents

Jonas Bosson spoke about the dangers of software patents and how they do no good: they hinder innovation and cost a lot of money for everyone involved. He also urged the audience to join FFII to help the cause. You can tell Jonas is quite committed to this subject and really believes in it! And quite frankly, I don’t think many people in this crowd would argue against him…

Andreas with MeeGo

Andreas Jakl, freshly arrived from a rainy Helsinki, then told us (in English, while all the other talks were in Swedish) how to develop things with Qt for Symbian, MeeGo or the desktop using the same tools. He showed us the latest fancy GUI builder they have, called Qt Quick, and how they use QML to do fancy things in a fast manner. He also managed to show us the code running both in the simulator and on a device. Quite impressive. Andreas is a very good speaker and did a very complete session. As a bonus, he used ‘haxx.se’ as the test site for his little demo build, and of course that only makes you love him more, right? 😉

Johan with Android

Johan Nilsson started off just after the coffee break by educating us on how to push things from your server applications to your mobile device: how it works and how to control it in various ways. Johan’s presentation went into the details, in a way that I at least really appreciated, and the (hand-drawn on paper, then scanned) graphics used in the presentation were stunning! The fact that Johan sneaked in a couple of curl command lines of course earned him bonus points in my mind! 😉

Henrik with Fribid

Henrik Nordström took the stage and briefed us on the status of the Fribid project, a very Swedish-centric project working on full Linux support for “bankid”, an electronic identification system established by a consortium of Swedish banks and used by a wide range of authorities and organizations these days. The existing Linux client is poor (and hard to get working right), closed source, saves data encrypted with private hidden keys, etc.

Food, Talk, Tablets

In the restaurant after the seminars, we gathered in a basement with beer in our glasses and chili on our plates, and there was lots of open source and FOSS talk and we had a great time and good drinks. Two attendees had brought their new tablets, which let us play with them and compare: the Android-based Samsung Galaxy Tab and the German MeeGo-based WeTab.

To me there really wasn’t any competition. The Galaxy Tab is a slick, fast and nice device that feels like a big Android phone; it’s really usable and I could possibly see myself using it for email and browsing. It was a while since I tried an iPad, but it gave about the same speed impression.

The WeTab, however, even if it runs a modified MeeGo that isn’t “original” and might suffer from bugs and whatnot, presented a rough UI that looked far too much like my regular X Window System put into a touch device. For example, and I think this is telling, you scroll a web page down by using the right-side scroll bar, not by touching the middle of the screen and dragging down as you would on iOS or Android. In fact, when I dragged the scroll bar down like that, I found it far too easy to accidentally press one of the buttons that are always present immediately to the right of it. Of course, the Galaxy Tab is a smaller device and also much more expensive, so perhaps those factors will still bring a few users to the WeTab.

I don’t think I’ll get a tablet anytime soon though, as I just don’t see when I would use it.

Summary

I didn’t do any particular talk this time, but I felt we had a lot of good content and I can always blurb some other time anyway. I really like that we have so far managed to get lots of different speakers, and I hope we can continue to attract new speakers before we have to recycle.

It was a great afternoon and evening. All the good people and encouraging words inspire me to keep up my work and efforts on this, and I’m now aiming for another meeting in early 2011.

I will do another post later on when the videos from these talks go online.

roffit lives!

Many moons ago I created a little tool I named roffit. It is just a tiny perl script that converts a man page written in the nroff format into good-looking HTML. I should perhaps also add that I didn’t find any decent alternatives back then, so I wrote up my own version. I’ve since used it in projects such as curl, c-ares and libssh2 to produce web versions of the docs.
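
A typical invocation, assuming roffit reads the nroff source on stdin and writes the HTML to stdout:

roffit < curl.1 > curl.html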

It has just done its job and I haven’t had any need to fiddle with it. The project page lists it as last modified in 2004, even though I actually moved it to a Sourceforge CVS repo back in 2007.

Just a few days ago I got an email notifying me that Debian includes it as a package in the distribution, and that someone was annoyed by some particular flaws.

This resulted in a bunch of bugs being submitted to the Debian bug tracker. I started up the brand new roffit-devel mailing list to make hosting roffit discussions easier, and I switched the CVS repo over to a git one on github.

If you like seeing man pages turned into web pages, consider joining up and helping us improve this thing!

curl goes git

Just a few days ago the curl project turned twelve years old, and I decided that it was time for us to ditch our trusty old CVS setup and switch over to git for source code control.

Why Switch at All

I’ve been very content with CVS over the years and in our small project we don’t really have any particularly weird or high demands on the version control software.

Lately (as in recent years) I’ve dipped my toes into various projects that use git, and over time I’ve learned to appreciate the little goodies git offers that CVS simply cannot. I’m not even speaking about branches or merges, which git does a whole lot better and more easily than CVS; I’m in fact even more in love with git’s way of easing the handling of diffs sent by email, and its way of keeping track of authors separately from committers. git am and git commit --author are simply two very handy tools missing in CVS.
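
For example, applying an emailed patch keeps the original author intact, and the author can also be set explicitly. The patch file name and author here are hypothetical:

git am 0001-some-bugfix.patch
git commit --author="Jane Doe <jane@example.com>" -m "apply Jane's fix"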

Why Git

So if we want to switch from CVS to another tool, which one should we choose? That wasn’t really the question in my case, so I didn’t answer it. Rather, I’ve been using git in several projects, and it is used in some of the biggest projects I work with, so it was specifically git’s features I wanted. I didn’t consider any of the other distributed version control tools, as quite frankly they wouldn’t be much better for me than what CVS already is. I want to reduce the number of different tools I need, and I’m quite sure git would be among the top contenders anyway even if I did an actual comparison.

So the choice to go with git was quite selfish and made by me, but I felt that quite a few people in the curl community supported the decision, and very few actually believed remaining with CVS was a better idea.

The fact that git itself uses libcurl for its HTTP access of course also proves its good taste! 🙂

How did the conversion go

Very easily and swiftly. First, as I mentioned above, we never used branches much, so we basically had a linear development history with a set of tags. I did an rsync of the full repo to get a local copy to work with, then ran ‘git cvsimport’ on that to create a new repo. I ran it a couple of times to make sure I had correctly mapped all CVS user names to their git equivalents. Converting more than ten years of CVS commits took roughly ten minutes on my desktop machine, so it wasn’t even that tedious.
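
Roughly along these lines, where the host, paths, module name and author mapping file are all hypothetical:

rsync -av cvs.example.org::curl/ ./curl-cvsroot/
git cvsimport -d $PWD/curl-cvsroot -A authors.txt -C curl-git curl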

Once I had a local repo created with all authors looking good, I simply followed the instructions on github.com for adding a remote origin to a local branch, and when I pushed to it, git sent every commit ever made to curl off to the remote repo, now exposed to the world by github.com.
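
The github end of it boils down to something like this (repository URL hypothetical):

git remote add origin git@github.com:user/curl.git
git push origin master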

When that part was done, I had a quick read of the ‘git help daemon’ docs, and 30 seconds later I had a local repo set up as a mirror of the github one, so that users can still opt to get the code from haxx.se.
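
Such a mirror can be set up with something along these lines (URL and paths hypothetical):

git clone --mirror git://github.com/user/curl.git /srv/git/curl.git
git daemon --base-path=/srv/git --export-all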

Unchanged work flow

Git allows different ways of working with the code, but I’ve decided that, at least as a start, we won’t change the way we work. I’ll offer all committers push rights to the master branch of the repository, and we will all simply push to that as our main development branch.

We will prefer patches made with git format-patch and sent to the mailing list, but as before you can still produce patches by diffing source code from extracted tarballs or using whatever approach you prefer.
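
Producing such patches is a one-liner. For example, this creates one mail-ready patch file per local commit not yet in origin/master:

git format-patch origin/master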

All details on how to get the curl code using git are available online.

libcurl in version management

I have mentioned before that libcurl is becoming popular within package management.

libcurl is a generic library for file transfers over a wide variety of protocols. Over the years, some of the recent distributed version management tools have learned about libcurl’s powers and now use it:

darcs – born in 2003 and written in Haskell. I’m under the impression these guys wrote their own binding layer to interface with libcurl from Haskell.

git – possibly best known for being created by Linus Torvalds and for being used by the Linux kernel project, uses libcurl for HTTP(S) access.

bazaar – is written in Python and accordingly uses the pycurl binding for libcurl.

Anyone know of other version control systems using libcurl?

Ironies here include the fact that libcurl itself is still kept in a CVS repository, and quite possibly also that the first version management project I myself participated in is Subversion, which not only has two different HTTP dependencies, but neither of them is libcurl (they are neon and serf)…

Update: it seems that Mercurial is also using pycurl as an optional dependency.