Tag Archives: statistics

decomplexifying curl

(I wrote about this topic in my weekly email this week. This is the blog version, somewhat extended.)

Easy to read

Two contributing factors that make code hard to read are function length and function complexity. To keep source code easy to read, understand and debug we should strive towards keeping functions short and simple. Nothing ground-breaking in that conclusion.

I know, it sounds really simple and straightforward, but in a living project that goes on for decades, code develops, moves and grows over time. What started out small and simple risks gradually turning into something else.

This is of course because there are so many more factors involved that need to be given focus as well: security, bugfixes, performance, food on the table and getting more people involved.

Graphs graphs graphs

Last week I added two more graphs to the curl dashboard, showing how function complexity and function length have developed in the curl code over the decades: in each graph, one plot for the worst function and one for the 99th percentile. In both graphs the 99th percentile plot shrinks gradually over time, but the worst offenders grow. This means that there are a few functions that, given some attention, could improve readability and code maintainability, but that in general things are under control.

One of the main points for me with graphing the project from as many angles as possible is to unveil things like this: areas that might need attention, which I can then keep an eye on going forward. Details like these are otherwise rather subtle and not easily detected when manually browsing around.

It has been said that whatever measurement you use to track engineering progress will eventually become the goal engineers work towards. I hope to combat this by measuring (and graphing) as many angles of the curl project as possible, to help push us in the right direction in as many different areas as possible.

Improve

I took it upon myself to improve the situation: to reduce the size of the largest function in the code base and to simplify the most complex one. Incidentally they were different functions: the largest function was the big switch handling curl_easy_setopt options, and the most complex one was the main curl tool function setting up a single transfer.

These two functions had simply, slowly and consistently been growing over time, in size and complexity. No one’s “fault” really, and not the result of any specific plan or intention. The graph helped me decide to act and the pmccabe tool helped me identify them. We can of course argue about the specific method or number that pmccabe presents for complexity, but I think it is at least pretty good at identifying the right functions, and the exact score it assigns is not terribly important.
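To illustrate the idea, here is a rough sketch of how such a toplist can be produced – not the dashboard’s actual code, just my own approximation assuming pmccabe’s usual tab-separated per-function output (the field positions may need adjusting for other versions):

```python
#!/usr/bin/env python3
# Sketch: rank functions by pmccabe complexity and length.
# Assumes pmccabe prints tab-separated per-function lines in the order:
# modified complexity, traditional complexity, statements, first line,
# line count, "file(line): function". Adjust indexes if your version differs.
import statistics
import subprocess
import sys

def pmccabe(files):
    out = subprocess.run(["pmccabe"] + files, capture_output=True, text=True)
    rows = []
    for line in out.stdout.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue
        complexity, length = int(fields[0]), int(fields[4])
        rows.append((complexity, length, fields[5].strip()))
    return rows

rows = pmccabe(sys.argv[1:])  # e.g. lib/*.c src/*.c

print("Most complex functions:")
for complexity, length, name in sorted(rows, reverse=True)[:5]:
    print(f"{complexity:5}  {name}")

print("Longest functions:")
for complexity, length, name in sorted(rows, key=lambda r: r[1], reverse=True)[:5]:
    print(f"{length:5}  {name}")

# 99th percentile of function length, the kind of value plotted on the dashboard
lengths = [length for _, length, _ in rows]
print("99th percentile length:", statistics.quantiles(lengths, n=100)[98])
```

Run from the source root with the C files as arguments; the exact numbers will of course vary with pmccabe version and which files you feed it.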

Both pull requests became 2000+ modified-line monsters, but they also had immediate and distinct effects on the graphs, which ideally should mean that the code readability is now a little better than before, making these functions easier to improve and work with going forward.

Complexity

The single worst function in production code had gotten quite complex. I spent a work day on the case; look at the drop at the right edge of the graph below, made after my fix landed. Most of the job was to properly split the function into several smaller ones that made sense.

The single worst offender at this particular time was the function in the curl tool that sets up a single transfer job.

There are still some pretty complex ones remaining. Room for further improvements no doubt.

Function length

The worst offenders in terms of function size in curl have been of two kinds: state machines with many states and functions handling big switches for options.

In this particular case, this was the big function handling curl_easy_setopt(), and as we have over three hundred options, having them all handled in a single function made it very big. The new setup splits that handling up into multiple smaller functions, one for each kind of input.

The largest one is now at over 1,500 lines. Still on the too large side of things but way better than before.

Going forward

Yes, I am a graphaholic and I seem to keep finding new ways to illustrate project status and development using plots on timelines. I am also most likely the biggest consumer of these graphs, as I monitor them daily to make sure I have a good grip on where we are in the project, in every imaginable aspect.

I intend to try to continue simplifying a few more of the functions in the pmccabe toplist.

Let’s see what the graph shows in another three years.

Keyboard key frequency

A while ago I wrote about my hunt for a new keyboard, and in my follow-up conversations with friends around that subject I quickly came to the conclusion I should get myself better analysis and data on how I actually use a keyboard and the individual keys on it. And if you know me, you know I like (useless) statistics.

Func KB-460 keyboard

So, I tried out the popular and widely used Linux key-logger software ‘logkeys’ and immediately figured out that it doesn’t really support the precision and detail level I wanted, so I forked the project and modified the code to work the way I want it: keyfreq was born. Code on github. (I forked it because I couldn’t find any way to send back my modifications to the upstream project; I don’t really feel a need for another project.)

Then I fired up the logging process and it has been running in the background for a while now, logging every key stroke with a time stamp.

Counting key frequency and how it is distributed very quickly turns into basically seeing when I’m active in front of the computer, and it also gave me thoughts about what a high key frequency actually means in terms of activity and productivity. Does a really high key frequency really mean that I was working intensely, or isn’t that perhaps more a sign of mail-sending time? When I debug problems or research details, won’t those periods result in slower key activity?

In the end I guess that over time, the key frequency chart basically says that if I have pressed a lot of keys during a period, I was working on something then. Hours or days with a very low average key frequency are probably times when I don’t work as much.
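The counting itself is trivial. Below is a minimal sketch of the sort of per-key and per-hour tallying described above; it assumes a hypothetical log format of one keystroke per line as “<unix timestamp> <key name>”, which is not necessarily what keyfreq actually writes.

```python
#!/usr/bin/env python3
# Sketch: per-key and per-hour keystroke counts from a timestamped log.
# Assumed (hypothetical) input on stdin, one keystroke per line:
#   "1400000000 KEY_SPACE"
from collections import Counter
from datetime import datetime
import sys

keys = Counter()    # presses per key
hours = Counter()   # presses per hour of day (0-23)

for line in sys.stdin:
    try:
        stamp, key = line.split(None, 1)
        when = datetime.fromtimestamp(int(stamp))
    except ValueError:
        continue                      # skip malformed lines
    keys[key.strip()] += 1
    hours[when.hour] += 1

total = sum(keys.values()) or 1
print(f"{total} keystrokes, {len(keys)} unique keys")
for key, count in keys.most_common(10):
    print(f"{key:15} {count:8} {100.0 * count / total:5.1f}%")
for hour in sorted(hours):
    print(f"{hour:02d}:00 {hours[hour]:8} {100.0 * hours[hour] / total:5.1f}%")
```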

The weekend key frequency is bound to be slightly wrong, since I sometimes do weekend hacking on other computers where I don’t log the keys; my results are recorded from one single specific keyboard only.

Conclusions

So what did I learn? Here are some conclusions and results from 1,276,614 keystrokes logged over the most recent 52 calendar days.

I have a 105-key keyboard, but during this period I only pressed 90 unique keys. Out of those 90 keys, 3 were pressed more than 5% of the time – each. In fact, those 3 keys account for more than 20% of all keystrokes. They are: <Space>, <Backspace> and the letter ‘e’.

<Space> stands out from all the rest, as it alone accounts for more than 10% of all presses.

Only 29 keys were used in more than 1% of the presses, giving this a really long tail with lots of keys hardly ever used.

Over this logged time, I registered keystrokes during 46% of all hours. Counting only the hours in which I actually used the keyboard, the average was 2185 keystrokes/hour, or 36 keys/minute.

On an average weekday (excluding weekend days) I registered 32486 key presses. In the most active single minute during this logging period I hit 405 keys, and in the most active single hour I managed 7937 key presses. During weekends my activity is much lower; there I average 5778 keys/day (weekends account for 7.2% of all activity).

Counting activity per hour of the day, there are 14 hours with more than 1% of the activity each, 5 with less than 1%, and 5 hours with no keyboard activity at all (02:00-06:59). Interestingly, the hour between 23:00 and 24:00 at night is my single busiest hour, with 12.5% of all keypresses during the period.

Random “anecdotes”

Longest contiguous time without keys: 26.4 hours

Longest key sequence without backspace: 946

There are 7 keys I pressed only once during this period; 4 of them are on the numerical keypad and the other 3 are F10, F3 and <Pause>.

More

I’ll try to keep the logging going and see if things change over time, or if anything new shows up in the data when it is looked at over a longer period.

curlers rest on Sundays and during July

We now run gitstats on the curl git repository daily, and it produces fun graphs.

We have almost 11 years of source code history covered, and I personally have done some 68% of all commits. Given this long history it is fun to see some very clear trends, like this first one: look at the distribution of commits per weekday over the entire period. The number of commits done during weekends is significantly lower than during the work week, and the Sunday count is clearly even lower than Saturday’s:

[graph: commits per day of week]
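For reference, this kind of per-weekday distribution is easy to reproduce straight from git, without gitstats. A minimal sketch (mine, not gitstats’ internals) using author timestamps:

```python
#!/usr/bin/env python3
# Sketch: count commits per weekday from "git log --format=%at"
# (author timestamps), run from inside the repository.
import subprocess
from collections import Counter
from datetime import datetime

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

log = subprocess.run(["git", "log", "--format=%at"],
                     capture_output=True, text=True, check=True)
per_day = Counter(datetime.fromtimestamp(int(stamp)).weekday()
                  for stamp in log.stdout.split())

total = sum(per_day.values()) or 1
for day in range(7):
    count = per_day[day]
    print(f"{DAYS[day]} {count:6} {100.0 * count / total:5.1f}%")
```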

Similarly, we can see how the activity is spread out over the calendar months. This shows an obvious correlation with the slower periods in my life: July is vacation time and the numbers show it:

[graph: commits per month of year]

curl: ten years of more code and contributors

It feels like I’ve been doing curl forever, while in fact it is “only” in its early teens. I decided to dig up some numbers on how development within the project has gone over the last decade. How have things changed during the 10 most recent years?

To spice up the numbers, I generated some graphs based on them, and to make them nice and presentable I then combined them all into a single image using my super gimp powers.

Bugs, lines of code and contributors over time in curl

Click the image to get a full resolution version. But even the small one shows the data I wanted to illustrate: we gain contributors at roughly the same speed as we grow in lines of code. And at the same time we get roughly the same amount of bug reports over the years, apparently independently of the amount of code and contributors! Note that I separate the bugs-fixed bars from the bug-report bars, because the bugs-fixed count is the number of bugfixes mentioned in release notes, while the bug-report count comes from the web based bug tracker. As seen, we fix a lot more bugs than get submitted in the bug tracker.

I should add that the reason the green contributor line starts out a little slow and gets a speed bump after a while is that I changed my way of working at that point and got much better at tracking all contributors. The general angle of the curve for the last 4-5 years is however what I find interesting: it is basically the same angle as the source code increase.

The bug report counter is merely taken from our bug tracker at sourceforge, which is a very inexact count, as a very large number of bugs are reported on the mailing lists only.

Data from the curl release table tells us that during these 10 years we’ve done 77 releases in which we fixed 1414 bugs. That’s 18.4 bug fixes per release, one release roughly every 47 days, and 141 bug fixes per year on average.

To see how this has changed over time, I decided to compare those numbers against those for the most recent 2.5 years. During this most recent quarter of the period we’ve done releases every 60 days on average but counted 155 bug fixes per year, which means the average number of bug fixes per release has gone up to 26: one bugfix every 2.3 days.
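For the curious, these averages fall straight out of the raw counts. A quick back-of-the-envelope check:

```python
# Back-of-the-envelope check of the averages quoted above.
years, releases, bugfixes = 10, 77, 1414

print(bugfixes / releases)        # ~18.4 bugfixes per release
print(years * 365.25 / releases)  # ~47 days between releases
print(bugfixes / years)           # ~141 bugfixes per year

# The most recent 2.5 years: a release every ~60 days, ~155 bugfixes/year.
recent_years, release_interval, fixes_per_year = 2.5, 60, 155
recent_releases = recent_years * 365.25 / release_interval   # ~15 releases
print(fixes_per_year * recent_years / recent_releases)       # ~25-26 fixes per release
print(365.25 / fixes_per_year)                                # one bugfix roughly every 2.3-2.4 days
```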

A more negative interpretation of this could be that we’re only capable of a certain number of bug fixes per unit of time, so no matter how much code we get, we fix bugs at roughly the same rate. The fact that we don’t get an increasing amount of bug reports of course speaks against this theory.

A view of a popular post

So I post frequently on this blog, but I’m not a particularly interesting person myself, I’m not really a master at writing and phrasing articles to make them thrilling and irresistible, and I basically only deal with really geeky and technical subjects. That means this blog averages perhaps 200 views per day.

The other day I wrote my multipath tcp post, and someone submitted it to reddit. It turned out to become my most read blog post ever. By far. I think the “views per day” graph looks pretty cool:

visitor graph from daniel.haxx.se/blog

Some stats on curl development

Counting from curl 6.0 up to curl 7.19.3, we’ve done 78 releases in the 9.4 years it took.

In this time, we’ve mentioned 1259 bugfixes and 389 notable changes.

This makes one bugfix every 2.7 days, and one release every 43 days with an average of 16 bugfixes in each. The longest interval ever between two curl releases was 139 days, back in 2000 when we worked to release the first version 7 release (known as 7.1).

To compare with how our work has looked more recently, the same math limited to the 20 latest releases only (the 3.3 years since and including 7.15.0) shows that we’re still at 2.7 days per bugfix (even though we know that the code base has grown steadily for years), but we’re now at 61 days between releases and 21 bugfixes/release…

All this info and more will be visible on a web page on the curl site soonish, I’m still working on polishing it up.

What other useful or useless but interesting numbers could be extracted from this?

4 ohloh improvements I’d like

I am a stats junkie, so I like my stats in large amounts. But I like the stats to be right and as accurate as possible, and when I look at what ohloh produces I like the concepts and ideas in general; I just think their implementation is lacking in a few vital areas that need improvement:

1. There are no dependencies or hierarchies between packages, so the “I use this” counters become worthless since people mark the end-user packages they use. Low-level support packages and libraries that are used indirectly don’t get many “use counts”.

2. Doing very few commits in a very widely used project with few authors gives you way, way more points than doing a bus-load of commits in something less used with many fellow contributors. This makes the top-list of people very skewed, as some of the top-64 people only ever did a few hundred commits. I doubt many mortals would consider someone who only ever did 300 commits to be a top community person. At the very moment I write this, the #1 ranked person has done 20 commits during 5 months…!

3. Too few version control systems are supported, leaving out huge chunks of the open source world. Bazaar, Mercurial and a few more are a bit too popular to be ignored without the results getting skewed.

4. I’d like to see the “number of users” of products as a percentage, since the total number of users they show includes all contributors to all projects. Out of the 140,000 users (which undoubtedly include a lot of duplicates), it would surprise me if more than 10,000 have actually registered which products they use. I’ve tried to find the exact number but I failed. So 3,000 users doesn’t mean 3,000 out of 140,000 but rather 3,000 out of 10,000…

ohloh vs statcvs

I’ve played a bit with statcvs lately and generated reports for the curl repository. It turned out rather interesting (well, assuming you’re a statistics geek like me), especially in comparison to the data and stats ohloh.net presents for the same code:

[the images have been lost in time, like tears in rain]

Executive summary:

  • I’ve done 82% of all code changes.
  • We seem to grow at roughly the same pace (both number of code lines and number of files) over the last years.
  • The lines of code per file count seems rather fixed.

Oh, that initial big bump in late 1999/early 2000 was due to a lot of “wrong” files such as configure, config.guess etc being committed and subsequently removed. It is a bit annoying to have there as it ruins the data somewhat, but I’ve not managed to fool statcvs into ignoring that part…

Rockbox downloads April 2008

I counted the Rockbox downloads from build.rockbox.org during April 2008, and while the results weren’t very different from past results, I thought I’d still show them. This month, 99874 downloads were counted and 30 different packages were downloaded. Back in January, we still only had 26 versions. The top-5 are identical to the last list.

The most popular newcomer since my last count is the Olympus Mrobe 100 which has more than twice the number of downloads compared to the second newcomer iAudio m3.

The list shows model and number of downloads. The newcomers since the last count are shown in bold.

  1. sansae200 22038
  2. ipodvideo 18289
  3. ipodvideo64mb 12392
  4. ipodnano 12261
  5. sansac200 4176
  6. h300 3071
  7. ipodcolor 2932
  8. ipodmini2g 2875
  9. gigabeatf 2848
  10. ipod4gray 2651
  11. h120 2506
  12. iaudiox5 2498
  13. ipod3g 1717
  14. ipodmini1g 1496
  15. ipod1g2g 1411
  16. h10 1361
  17. h10_5gb 1268
  18. mrobe100 1116
  19. player 564
  20. iaudiom3 528
  21. recorder 500
  22. iaudiom5 284
  23. h100 275
  24. recorder8mb 233
  25. recorderv2 157
  26. cowond2 138
  27. fmrecorder 116
  28. ondiofm 108
  29. ondiosp 58
  30. mrobe500 7

Swedish Broadband Usage

The other day I stumbled over this interesting report published by ITIF called Explaining International Broadband Leadership (108 pages, 3MB PDF) that lists the USA and 30 OECD countries and their broadband usage, and the report comes to numerous conclusions and offers advice on why the US is falling behind in the rankings and so on. Quite an interesting read in general.

In their ranking table, Sweden is listed at #6. I immediately noticed the column called “Household penetration” (subscribers per household). Hm, isn’t that the share of households that have broadband? It says 0.54 for Sweden. 54% broadband use among households in 2007?

We have an organization in Sweden called “Statistiska Centralbyrån” in Swedish and “Statistics Sweden” in English. They basically work with gathering and presenting statistics on Sweden and Swedish matters. They’ve produced a huge report (in Swedish – 1MB, 256 pages PDF) called “Private citizens’ use of computers and internet 2007” (my translation). It mentions that during spring 2007, 71% of Swedes used broadband internet from their homes. (Over 80% had internet access in their homes, which means some 12% of the internet users were not on broadband…)

Isn’t that a shockingly huge difference, 54 versus 71? And this is just a quick number I could check myself for my own country. How far off are the other countries’ values then? The ITIF report doesn’t even try to describe how they got their numbers, so it isn’t easy to verify them. The Swedish report does in fact also contain a comparison with other European countries, and the numbers shown for them don’t match the ones in the ITIF report either! (But the order of the top broadband-using countries is roughly the same.)

I’m also a bit curious about how they got the numbers for the “average download speed in Mbps” column, but I don’t have any numbers to cross-check that against.