Tag Archives: statistics

decomplexifying curl

(I wrote about this topic in my weekly email this week. This is the blog version, somewhat extended.)

Easy to read

Two contributing factors that make code hard to read are function length and function complexity. To keep source code easy to read, understand and debug we should strive towards keeping functions short and simple. Nothing ground-breaking in that conclusion.

I know, it sounds really simple and straightforward, but in a living project that goes on for decades, code develops, moves and grows over time. What started out small and simple risks gradually turning into something else.

This is of course because there are so many more factors involved that need to be given focus as well: security, bugfixes, performance, food on the table and getting more people involved.

Graphs graphs graphs

Last week I added two more graphs to the curl dashboard, showing how function complexity and function length have developed in the curl code over the decades: in each graph, one plot for the worst function and one for the 99th percentile. In both graphs the 99th percentile plot shrinks gradually over time, but the worst offenders grow. This means that there are a few functions that, given some attention, could improve readability and code maintainability, but that in general things are under control.

One of the main points for me with graphing the project from as many angles as possible is to unveil things like this: areas that might need attention, which I can then keep an eye on going forward. Details like these are otherwise rather subtle and not easily detected when manually browsing around.

It has been said that whatever measurement you use to track engineering progress will eventually become the goal engineers work towards. I hope to combat this by measuring (and graphing) as many angles of the curl project as possible, to help push us in the right direction in as many different areas as possible.

Improve

I took it upon myself to improve the situation: to reduce the size of the largest function in the code base and to simplify the most complex one. Incidentally they were different functions: the largest function was the big switch handling curl_easy_setopt options, and the most complex one was the main curl tool function setting up a single transfer.

These two functions had simply, slowly and consistently been growing over time, in size and complexity. No one’s “fault” really, and not the result of any specific plan or intention. The graph helped me decide to act and the pmccabe tool helped me identify them. We can of course argue about the specific method or number that pmccabe presents for complexity, but I think it is at least pretty good at identifying the right functions, and the exact score it assigns is not terribly important.
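To illustrate the idea, here is a rough sketch of how such a toplist can be produced – not the dashboard’s actual code, just my own approximation assuming pmccabe’s usual tab-separated per-function output (the field positions may need adjusting for other versions):

```python
#!/usr/bin/env python3
# Sketch: rank functions by pmccabe complexity and length.
# Assumes pmccabe prints tab-separated per-function lines in the order:
# modified complexity, traditional complexity, statements, first line,
# line count, "file(line): function". Adjust indexes if your version differs.
import statistics
import subprocess
import sys

def pmccabe(files):
    out = subprocess.run(["pmccabe"] + files, capture_output=True, text=True)
    rows = []
    for line in out.stdout.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue
        complexity, length = int(fields[0]), int(fields[4])
        rows.append((complexity, length, fields[5].strip()))
    return rows

rows = pmccabe(sys.argv[1:])  # e.g. lib/*.c src/*.c

print("Most complex functions:")
for complexity, length, name in sorted(rows, reverse=True)[:5]:
    print(f"{complexity:5}  {name}")

print("Longest functions:")
for complexity, length, name in sorted(rows, key=lambda r: r[1], reverse=True)[:5]:
    print(f"{length:5}  {name}")

# 99th percentile of function length, the kind of value plotted on the dashboard
lengths = [length for _, length, _ in rows]
print("99th percentile length:", statistics.quantiles(lengths, n=100)[98])
```

Run from the source root with the C files as arguments; the exact numbers will of course vary with pmccabe version and which files you feed it.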

Both pull requests became 2000+ modified-line monsters, but they also had immediate and distinct effects on the graphs, which ideally should mean that the code readability is now a little better than before, making these functions easier to improve and work with going forward.

Complexity

The single worst function in production code had gotten quite complex. I spent a work day on the case; look at the drop at the right edge of the graph below, made after my fix landed. Most of the job was to properly split the function into several smaller ones that made sense.

The single worst offender at this particular time was the function in the curl tool that sets up a single transfer job.

There are still some pretty complex ones remaining. Room for further improvements no doubt.

Function length

The worst offenders in terms of function size in curl have been of two kinds: state machines with many states and functions handling big switches for options.

In this particular case, this was the big function handling curl_easy_setopt(), and as we have over three hundred options, having them all handled in a single function made it very big. The new setup splits that handling up into multiple smaller functions, one for each kind of input.

The largest one is now at over 1,500 lines. Still on the too large side of things but way better than before.

Going forward

Yes, I am a graphaholic and I seem to keep finding new ways to illustrate project status and development using plots on timelines. I am also most likely the biggest consumer of these graphs, as I monitor them daily to make sure I have a good grip on where we are in the project, in every imaginable aspect.

I intend to try to continue simplifying a few more of the functions in the pmccabe toplist.

Let’s see what the graph shows in another three years.

Keyboard key frequency

A while ago I wrote about my hunt for a new keyboard, and in my follow-up conversations with friends around that subject I quickly came to the conclusion I should get myself better analysis and data on how I actually use a keyboard and the individual keys on it. And if you know me, you know I like (useless) statistics.

Func KB-460 keyboard

So, I tried out the popular and widely used Linux key-logger software ‘logkeys’ and immediately figured out that it doesn’t really support the precision and detail level I wanted, so I forked the project and modified the code to work the way I want it: keyfreq was born. Code on github. (I forked it because I couldn’t find any way to send back my modifications to the upstream project; I don’t really feel a need for another project.)

Then I fired up the logging process and it has been running in the background for a while now, logging every key stroke with a time stamp.

Counting key frequency and how it is distributed very quickly turns into basically seeing when I’m active in front of the computer, and it also gave me thoughts about what a high key frequency actually means in terms of activity and productivity. Does a really high key frequency really mean that I was working intensely, or isn’t that perhaps more a sign of mail-sending time? When I debug problems or research details, won’t those periods result in slower key activity?

In the end I guess that over time, the key frequency chart basically says that if I have pressed a lot of keys during a period, I was working on something then. Hours or days with a very low average key frequency are probably times when I don’t work as much.
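The counting itself is trivial. Below is a minimal sketch of the sort of per-key and per-hour tallying described above; it assumes a hypothetical log format of one keystroke per line as “<unix timestamp> <key name>”, which is not necessarily what keyfreq actually writes.

```python
#!/usr/bin/env python3
# Sketch: per-key and per-hour keystroke counts from a timestamped log.
# Assumed (hypothetical) input on stdin, one keystroke per line:
#   "1400000000 KEY_SPACE"
from collections import Counter
from datetime import datetime
import sys

keys = Counter()    # presses per key
hours = Counter()   # presses per hour of day (0-23)

for line in sys.stdin:
    try:
        stamp, key = line.split(None, 1)
        when = datetime.fromtimestamp(int(stamp))
    except ValueError:
        continue                      # skip malformed lines
    keys[key.strip()] += 1
    hours[when.hour] += 1

total = sum(keys.values()) or 1
print(f"{total} keystrokes, {len(keys)} unique keys")
for key, count in keys.most_common(10):
    print(f"{key:15} {count:8} {100.0 * count / total:5.1f}%")
for hour in sorted(hours):
    print(f"{hour:02d}:00 {hours[hour]:8} {100.0 * hours[hour] / total:5.1f}%")
```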

The weekend key frequency is bound to be slightly wrong, since I sometimes do weekend hacking on other computers where I don’t log the keys; my results are recorded from one single specific keyboard only.

Conclusions

So what did I learn? Here are some conclusions and results from 1,276,614 keystrokes logged over the most recent 52 calendar days.

I have a 105-key keyboard, but during this period I only pressed 90 unique keys. Out of those 90 keys, 3 were pressed more than 5% of the time – each. In fact, those 3 keys account for more than 20% of all keystrokes. They are: <Space>, <Backspace> and the letter ‘e’.

<Space> stands out from all the rest, as it alone accounts for more than 10% of all presses.

Only 29 keys were used in more than 1% of the presses, giving this a really long tail with lots of keys hardly ever used.

Over this logged time, I registered keystrokes during 46% of all hours. Counting only the hours in which I actually used the keyboard, the average was 2185 keystrokes/hour, or 36 keys/minute.

On an average weekday (excluding weekend days) I registered 32486 key presses. In the most active single minute during this logging period I hit 405 keys, and in the most active single hour I managed 7937 key presses. During weekends my activity is much lower; there I average 5778 keys/day (weekends account for 7.2% of all activity).

Counting activity per hour of the day, there are 14 hours with more than 1% of the activity each, 5 with less than 1%, and 5 hours with no keyboard activity at all (02:00-06:59). Interestingly, the hour between 23:00 and 24:00 at night is my single busiest hour, with 12.5% of all keypresses during the period.

Random “anecdotes”

Longest contiguous time without keys: 26.4 hours

Longest key sequence without backspace: 946

There are 7 keys I pressed only once during this period; 4 of them are on the numerical keypad and the other 3 are F10, F3 and <Pause>.

More

I’ll try to keep the logging going and see if things change over time, or if anything new shows up in the data when it is looked at over a longer period.

curlers rest on Sundays and during July

We now run gitstats on the curl git repository daily, and it produces fun graphs.

We have almost 11 years of source code history covered, and I personally have done some 68% of all commits. Given this long history it is fun to see some very clear trends, like this first one: look at the distribution of commits per weekday over the entire period. The number of commits done during weekends is significantly lower than during the work week, and the Sunday count is clearly even lower than Saturday’s:

[graph: commits per day of week]
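For reference, this kind of per-weekday distribution is easy to reproduce straight from git, without gitstats. A minimal sketch (mine, not gitstats’ internals) using author timestamps:

```python
#!/usr/bin/env python3
# Sketch: count commits per weekday from "git log --format=%at"
# (author timestamps), run from inside the repository.
import subprocess
from collections import Counter
from datetime import datetime

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

log = subprocess.run(["git", "log", "--format=%at"],
                     capture_output=True, text=True, check=True)
per_day = Counter(datetime.fromtimestamp(int(stamp)).weekday()
                  for stamp in log.stdout.split())

total = sum(per_day.values()) or 1
for day in range(7):
    count = per_day[day]
    print(f"{DAYS[day]} {count:6} {100.0 * count / total:5.1f}%")
```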

Similarly, we can see how the activity is spread out over the calendar months. This shows an obvious correlation with the slower periods in my life: July is vacation time and the numbers show it:

[graph: commits per month of year]

curl: ten years of more code and contributors

It feels like I’ve been doing curl forever, while in fact it is “only” in its early teens. I decided to dig up some numbers on how development within the project has gone over the last decade. How have things changed during the 10 most recent years?

To spice up the numbers, I generated some graphs based on them, and to make them nice and presentable I then combined them all into a single image using my super gimp powers.

Bugs, lines of code and contributors over time in curl

Click the image to get a full resolution version. But even the small one shows the data I wanted to illustrate: we gain contributors at roughly the same speed as we grow in lines of code. And at the same time we get roughly the same amount of bug reports over the years, apparently independently of the amount of code and contributors! Note that I separate the bugs-fixed bars from the bug-report bars, because the bugs-fixed count is the number of bugfixes mentioned in release notes, while the bug-report count comes from the web based bug tracker. As seen, we fix a lot more bugs than get submitted in the bug tracker.

I should add that the reason the green contributor line starts out a little slow and gets a speed bump after a while is that I changed my way of working at that point and got much better at tracking all contributors. The general angle of the curve for the last 4-5 years is however what I find interesting: it is basically the same angle as the source code increase.

The bug report counter is merely taken from our bug tracker at sourceforge, which is a very inexact count, as a very large number of bugs are reported on the mailing lists only.

Data from the curl release table tells us that during these 10 years we’ve done 77 releases in which we fixed 1414 bugs. That’s 18.4 bug fixes per release, one release roughly every 47 days, and 141 bug fixes per year on average.

To see how this has changed over time, I decided to compare those numbers against those for the most recent 2.5 years. During this most recent quarter of the period we’ve done releases every 60 days on average but counted 155 bug fixes per year, which means the average number of bug fixes per release has gone up to 26: one bugfix every 2.3 days.
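For the curious, these averages fall straight out of the raw counts. A quick back-of-the-envelope check:

```python
# Back-of-the-envelope check of the averages quoted above.
years, releases, bugfixes = 10, 77, 1414

print(bugfixes / releases)        # ~18.4 bugfixes per release
print(years * 365.25 / releases)  # ~47 days between releases
print(bugfixes / years)           # ~141 bugfixes per year

# The most recent 2.5 years: a release every ~60 days, ~155 bugfixes/year.
recent_years, release_interval, fixes_per_year = 2.5, 60, 155
recent_releases = recent_years * 365.25 / release_interval   # ~15 releases
print(fixes_per_year * recent_years / recent_releases)       # ~25-26 fixes per release
print(365.25 / fixes_per_year)                                # one bugfix roughly every 2.3-2.4 days
```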

A more negative interpretation of this could be that we’re only capable of a certain number of bug fixes per unit of time, so no matter how much code we get, we fix bugs at roughly the same rate. The fact that we don’t get an increasing amount of bug reports of course speaks against this theory.

A view of a popular post

So I post frequently on this blog, but I’m not a particularly interesting person myself, I’m not really a master at writing and phrasing articles to make them thrilling and irresistible, and I basically only deal with really geeky and technical subjects. That means this blog averages perhaps 200 views per day.

The other day I wrote my multipath tcp post, and someone submitted it to reddit. It turned out to become my most read blog post ever. By far. I think the “views per day” graph looks pretty cool:

visitor graph from daniel.haxx.se/blog

Some stats on curl development

Counting from curl 6.0 up to curl 7.19.3, we’ve done 78 releases in the 9.4 years it took.

In this time, we’ve mentioned 1259 bugfixes and 389 notable changes.

This makes one bugfix every 2.7 days, and one release every 43 days with an average of 16 bugfixes in each. The longest interval ever between two curl releases was 139 days, back in 2000 when we worked to release the first version 7 release (known as 7.1).

To compare with how our work has looked more recently, the same math limited to the 20 latest releases only (the 3.3 years since and including 7.15.0) shows that we’re still at 2.7 days per bugfix (even though we know that the code base has grown steadily for years), but we’re now at 61 days between releases and 21 bugfixes/release…

All this info and more will be visible on a web page on the curl site soonish, I’m still working on polishing it up.

What other useful or useless but interesting numbers could be extracted from this?

4 ohloh improvements I’d like

I am a stats junkie, so I like my stats in large amounts. But I like the stats to be right and as accurate as possible, and when I look at what ohloh produces I like the concepts and ideas in general; I just think their implementation is lacking in a few vital areas that need improvement:

1. There are no dependencies or hierarchies between packages, so the “I use this” counters become worthless since people mark the end-user packages they use. Low-level support packages and libraries that are used indirectly don’t get many “use counts”.

2. Doing very few commits in a very widely used project with few authors gives you way, way more points than doing a bus-load of commits in something less used with many fellow contributors. This makes the top-list of people very skewed, as some of the top-64 people only ever did a few hundred commits. I doubt many mortals would consider someone who only ever did 300 commits to be a top community person. At the very moment I write this, the #1 ranked person has done 20 commits during 5 months…!

3. Too few version control systems are supported, leaving out huge chunks of the open source world. Bazaar, Mercurial and a few more are a bit too popular to be ignored without the results getting skewed.

4. I’d like to see the “number of users” of products as a percentage, since the total number of users they show includes all contributors to all projects. Out of the 140,000 users (which undoubtedly include a lot of duplicates), it would surprise me if more than 10,000 have actually registered which products they use. I’ve tried to find the exact number but I failed. So 3,000 users doesn’t mean 3,000 out of 140,000 but rather 3,000 out of 10,000…

ohloh vs statcvs

I’ve played a bit with statcvs lately and generated reports for the curl repository. It turned out rather interesting (well, assuming you’re a statistics geek like me), especially in comparison to the data and stats ohloh.net presents for the same code:

[the images have been lost in time, like tears in rain]

Executive summary:

  • I’ve done 82% of all code changes.
  • We seem to grow at roughly the same pace (both number of code lines and number of files) over the last years.
  • The lines of code per file count seems rather fixed.

Oh, that initial big bump in late 1999/early 2000 was due to a lot of “wrong” files such as configure, config.guess etc being committed and subsequently removed. It is a bit annoying to have there as it ruins the data somewhat, but I’ve not managed to fool statcvs into ignoring that part…

Rockbox downloads April 2008

I counted the Rockbox downloads from build.rockbox.org during April 2008, and while the results weren’t very different from past results, I thought I’d still show them. This month, 99874 downloads were counted and 30 different packages were downloaded. Back in January, we still only had 26 versions. The top-5 are identical to the last list.

The most popular newcomer since my last count is the Olympus Mrobe 100 which has more than twice the number of downloads compared to the second newcomer iAudio m3.

The list shows model and number of downloads. The newcomers since the last count are shown in bold.

  1. sansae200 22038
  2. ipodvideo 18289
  3. ipodvideo64mb 12392
  4. ipodnano 12261
  5. sansac200 4176
  6. h300 3071
  7. ipodcolor 2932
  8. ipodmini2g 2875
  9. gigabeatf 2848
  10. ipod4gray 2651
  11. h120 2506
  12. iaudiox5 2498
  13. ipod3g 1717
  14. ipodmini1g 1496
  15. ipod1g2g 1411
  16. h10 1361
  17. h10_5gb 1268
  18. mrobe100 1116
  19. player 564
  20. iaudiom3 528
  21. recorder 500
  22. iaudiom5 284
  23. h100 275
  24. recorder8mb 233
  25. recorderv2 157
  26. cowond2 138
  27. fmrecorder 116
  28. ondiofm 108
  29. ondiosp 58
  30. mrobe500 7

Swedish Broadband Usage

The other day I stumbled over this interesting report published by ITIF called Explaining International Broadband Leadership (108 pages, 3MB PDF) that lists the USA and 30 OECD countries and their broadband usage, and the report comes to numerous conclusions and offers advice on why the US is falling behind in the rankings and so on. Quite an interesting read in general.

In their ranking table, Sweden is listed at #6. I immediately noticed the column called “Household penetration” (subscribers per household). Hm, isn’t that the share of households that have broadband? It says 0.54 for Sweden. 54% broadband use among households in 2007?

We have an organization in Sweden called “Statistiska Centralbyrån” in Swedish and “Statistics Sweden” in English. They basically work with gathering and presenting statistics on Sweden and Swedish matters. They’ve produced a huge report (in Swedish – 1MB, 256 pages PDF) called “Private citizens’ use of computers and internet 2007” (my translation). It mentions that during spring 2007, 71% of Swedes used broadband internet from their homes. (Over 80% had internet access in their homes, which means some 12% of the internet users were not on broadband…)

Isn’t that a shockingly huge difference, 54 versus 71? And this is just a quick number I could check myself for my own country. How far off are the other countries’ values then? The ITIF report doesn’t even try to describe how they got their numbers, so it isn’t easy to verify them. The Swedish report does in fact also contain a comparison with other European countries, and the numbers shown for them don’t match the ones in the ITIF report either! (But the order of the top broadband-using countries is roughly the same.)

I’m also a bit curious about how they got the numbers for the “average download speed in Mbps” column, but I don’t have any numbers to cross-check that against.