Tag Archives: Open Source

The Polhem prize, one year later

On September 25th 2017, I received the email that first explained to me that I had been awarded the Polhem Prize.

Du har genom ett omfattande arbete vaskats fram som en värdig mottagare av årets Polhemspris. Det har skett genom en nomineringskommitté och slutligen ett råd med bred sammansättning. Priset delas ut av Kungen den 19 oktober på Tekniska muséet.

My attempt of an English translation:

You have been selected as a worthy recipient of this year's Polhem prize through extensive work. It has been through a nomination committee and finally a council of broad composition. The prize is awarded by the King on October 19th at the Technical Museum.

A gold medal

At the award ceremony in October 2017 I received the gold medal at the most fancy ceremony I could ever wish for, where I was given the most prestigious award I couldn’t have imagined myself even being qualified for, handed over by no other than the Swedish King.

An entire evening with me in focus, where I was the final grand finale act and where my life’s work was the primary reason for all those people being dressed up in fancy clothes!

Things have settled down since. The gold medal has started to get a little dust on it where it lies here next to me on my work desk. I still glance at it every once in a while. It still feels surreal. It’s a fricking medal in pure gold with my name on it!

I almost forget the money part of the prize. I got a lot of money as well, but in retrospect it is really the honors, that evening and the gold medal that stick best in my memory. Money is just… well, money.

So did the award and prize make my life any different? Yes sure, a little, and I’ll tell you how.

What’s all that time spent on?

My closest surrounding of friends and family got a better understanding of what I’ve actually been doing all these long hours, all these years and more than one phrase in the style of “oh, so you actually did something useful?!” have been uttered.

Certainly I’ve tried to explain to them before, but nothing works as good as a gold medal from an award committee to say that what I do is actually appreciated “out there” and it has made a serious impact on the world.

I think I’m considered a little less weird now when I keep spending night hours in front of my computer when the house is otherwise dark and silent. Well, maybe still weird, but at least my weirdness has proven to result in something useful for mankind and that’s more than many other sorts of weird do… We all have hobbies.

What is curl?

Family and friends have gotten a rudimentary level of understanding of what curl is and what it does. I’m not suggesting they fully grasp it or know what an “internet protocol” is now, but at least a lot of people understand that it works with “internet transfers”. It’s not like people were totally uninterested before, but when I was given this prize – by a jury of engineers no less – that says this is a significant invention and accomplishment with a value that “can not be overestimated“, it made them more interested. The little video that was produced helped:

Some mysteries remain

People in general still have a hard time to grasp the reach of the project, how much time I’ve spent so far on it, how I can find motivation to keep up the work and not the least how this is all given away for free for everyone.

The simple fact that these are all questions that I’ve been asked I think is a small reward in itself. I think the fact that I was awarded this prize for my work on Open Source is awesome and I feel honored to be a person who introduces this way of thinking to some of the people who previously would think that you have to sell proprietary things or earn a lot of money for your products in order to impact and change society as a whole.

Not widely known

The Polhem prize is not widely known in Sweden among the general populace and thus neither is the fact that I won it. Only a very special subset of people know about this. Of course it is even less known outside of Sweden and in fact the information about the prize given in English is very sparse.

Next year’s winner

The other day I received my invitation to participate in this year’s award ceremony on November 14. Of course I’ll happily accept that and I will be there and celebrate the winner this year!

The curl project

How did the prize affect the project itself, the project that I was awarded for having cared for this long?

It hasn’t affected it much at all (as far as I can tell). The project has moved along like before and we’ve worked on fixing bugs and added features and cool things over time after my award just as we did before it. That’s how it has felt like. Business as usual.

If anything, I think I might have gotten some renewed energy and interest in the project and the commit author statistics actually show that my commit frequency has gone up since around the time I got the award. Our gitstats show that I’ve done more than half of the commits every single month the last year, most of this time even more than 70% of the commits.

I may have served twenty years here, but I’m not done yet!

Nordic Free Software Award reborn

Remember the glorious year 2009 when I won the Nordic Free Software Award?

This award tradition that was started in 2007 was put on a hiatus after 2010 (I believe) and there has not been any awards handed out since, and we have not properly shown our appreciation for the free software heroes of the Nordic region ever since.

The award has now been reignited by Jonas Öberg of FSFE and you’re all encourage to nominate your favorite Nordic free software people!

Go ahead and do it right away! You only have to the end of February so you better do it now before you forget about it.

I’m honored to serve on the award jury together with previous award winners.

This year’s Nordic Free Software Award winner will be announced and handed their prize at the FOSS-North conference on April 23, 2018.

(Okay, yes, the “photo” is a montage and not actually showing a real trophy.)

The backdoor threat

— “Have you ever detected anyone trying to add a backdoor to curl?”

— “Have you ever been pressured by an organization or a person to add suspicious code to curl that you wouldn’t otherwise accept?”

— “If a crime syndicate would kidnap your family to force you to comply, what backdoor would you be be able to insert into curl that is the least likely to get detected?” (The less grim version of this question would instead offer huge amounts of money.)

I’ve been asked these questions and variations of them when I’ve stood up in front of audiences around the world and talked about curl and how it is one of the most widely used software components in the world, counting way over three billion instances.

Back door (noun)
— a feature or defect of a computer system that allows surreptitious unauthorized access to data.

So how is it?

No. I’ve never seen a deliberate attempt to add a flaw, a vulnerability or a backdoor into curl. I’ve seen bad patches and I’ve seen patches that brought bugs that years later were reported as security problems, but I did not spot any deliberate attempt to do bad in any of them. But if done with skills, certainly I wouldn’t have noticed them being deliberate?

If I had cooperated in adding a backdoor or been threatened to, then I wouldn’t tell you anyway and I’d thus say no to questions about it.

How to be sure

There is only one way to be sure: review the code you download and intend to use. Or get it from a trusted source that did the review for you.

If you have a version you trust, you really only have to review the changes done since then.

Possibly there’s some degree of safety in numbers, and as thousands of applications and systems use curl and libcurl and at least some of them do reviews and extensive testing, one of those could discover mischievous activities if there are any and report them publicly.

Infected machines or owned users

The servers that host the curl releases could be targeted by attackers and the tarballs for download could be replaced by something that carries evil code. There’s no such thing as a fail-safe machine, especially not if someone really wants to and tries to target us. The safeguard there is the GPG signature with which I sign all official releases. No malicious user can (re-)produce them. They have to be made by me (since I package the curl releases). That comes back to trusting me again. There’s of course no safe-guard against me being forced to signed evil code with a knife to my throat…

If one of the curl project members with git push rights would get her account hacked and her SSH key password brute-forced, a very skilled hacker could possibly sneak in something, short-term. Although my hopes are that as we review and comment each others’ code to a very high degree, that would be really hard. And the hacked person herself would most likely react.

Downloading from somewhere

I think the highest risk scenario is when users download pre-built curl or libcurl binaries from various places on the internet that isn’t the official curl web site. How can you know for sure what you’re getting then, as you couldn’t review the code or changes done. You just put your trust in a remote person or organization to do what’s right for you.

Trusting other organizations can be totally fine, as when you download using Linux distro package management systems etc as then you can expect a certain level of checks and vouching have happened and there will be digital signatures and more involved to minimize the risk of external malicious interference.

Pledging there’s no backdoor

Some people argue that projects could or should pledge for every release that there’s no deliberate backdoor planted so that if the day comes in the future when a three-letter secret organization forces us to insert a backdoor, the lack of such a pledge for the subsequent release would function as an alarm signal to people that something is wrong.

That takes us back to trusting a single person again. A truly evil adversary can of course force such a pledge to be uttered no matter what, even if that then probably is more mafia level evilness and not mere three-letter organization shadiness anymore.

I would be a bit stressed out to have to do that pledge every single release as if I ever forgot or messed it up, it should lead to a lot of people getting up in arms and how would such a mistake be fixed? It’s little too irrevocable for me. And we do quite frequent releases so the risk for mistakes is not insignificant.

Also, if I would pledge that, is that then a promise regarding all my code only, or is that meant to be a pledge for the entire code base as done by all committers? It doesn’t scale very well…

Additionally, I’m a Swede living in Sweden. The American organizations cannot legally force me to backdoor anything, and the Swedish versions of those secret organizations don’t have the legal rights to do so either (caveat: I’m not a lawyer). So, the real threat is not by legal means.

What backdoor would be likely?

It would be very hard to add code, unnoticed, that sends off data to somewhere else. Too much code that would be too obvious.

A backdoor similarly couldn’t really be made to split off data from the transfer pipe and store it locally for other systems to read, as that too is probably too much code that is too different than the current code and would be detected instantly.

No, I’m convinced the most likely backdoor code in curl is a deliberate but hard-to-detect security vulnerability that let’s the attacker exploit the program using libcurl/curl by some sort of specific usage pattern. So when triggered it can trick the program to send off memory contents or perhaps overwrite the local stack or the heap. Quite possibly only one step out of several steps necessary for a successful attack, much like how a single-byte-overwrite can lead to root access.

Any past security problems on purpose?

We’ve had almost 70 security vulnerabilities reported through the project’s almost twenty years of existence. Since most of them were triggered by mistakes in code I wrote myself, I can be certain that none of those problems were introduced on purpose. I can’t completely rule out that someone else’s patch modified curl along the way and then by extension maybe made a vulnerability worse or easier to trigger, could have been made on purpose. None of the security problems that were introduced by others have shown any sign of “deliberateness”. (Or were written cleverly enough to not make me see that!)

Maybe backdoors have been planted that we just haven’t discovered yet?

Discussion

Follow-up discussion/comments on hacker news.

curl author activity illustrated

At the time of each commit, check how many unique authors that had a change committed within the previous 120, 90, 60, 30 and 7 days. Run the script on the curl git repository and then plot a graph of the data, ranging from 2010 until today. This is just under 10,000 commits.

(click for the full resolution version)

git-authors-active.pl is the little stand-alone script I wrote and used for this – should work fine for any git repository. I then made the graph from that using libreoffice.

The curl bus factor

bus factor: the minimum number of team members that have to suddenly disappear from a project before the project stalls due to lack of knowledgeable or competent personnel.

Projects should strive to survive

If a project is worth using and deploying today and if it is a project worth sending patches to right now, it is also a project that should position itself to survive a loss of key individuals. However unlikely or unfortunate such an event would be.

Tools to calculate bus factor

All the available tools that determine the bus factor for a given project only run on code and check for commits, code churn or check how many files each person has done a significant share of changes in etc.

This number is really impossible to figure out without tools and tools really cannot take “general knowledge” into account, or “this person answers a lot of email on the list”, or this person has 48k in reputation on stack overflow already for responding to questions about the project.

The bus factor as evaluated by a tool pretty much has to be about amount of code, size of code or number of code changes, which may or may not be a good indicator of who knows what about the code. Those who author and commit changes probably have a good idea but a real problem is that you can’t reverse that view and say that just because you didn’t commit or change something, you don’t know. Do you know more about the code if you did many commits? Do you know more about the code if you changed more lines of code?

We can’t prove or assume lack of knowledge or interest by an absence of commits, edits or changes. And yet we can’t calculate bus factor if there’s no tool or way to calculate it.

A look at curl

curl is soon 20 years old and boasts 22k something commits. I’m the author of about 57% of them, and the second-most committer (who’s not involved anymore) has about 12%. That makes two committers having done 15.3k commits out of the 22k. If we for simplicity calculate bus factor based on commit numbers, we’d need 8580 commits from others and I would stop completely, to reach bus factor >2 (when the 2 top committers have less than 50% of the commits), which at the current commit rate equals in about 5 years. And it would take about 3 years to just push the factor above 1. So even when new people joins the project, they have a really hard time to significantly change the bus factor…

The image above shows the relative share of commits done in the curl project’s git source code repository (as a share of the total amount) by the top 4 commiters from January 1 2010 to July 5 2017 (click for higher resolution). The top dotted line shows the combined share of all four (at 82% right now) and the dark blue line is my share. You can see how my commit share has shrunk from 72% down to 57% over these last 7.5 years. If this trend holds, I’ll have less than 50% of the total commits done in curl in 3-4 years.

At the same time, the thicker light blue line that climbs up into the right is the total number of authors in the git repository, which recently surpassed 500 as you can see. (The line uses the right Y-axes)

We’re approaching 1600 individually named contributors thanked in the project and every release we do (we ship one every 8 weeks) has around 40 contributors, out of which typically around half are newcomers. The long tail is very long and the amount of drive-by just-once contributors is high. Also note how the number 1600 is way higher than the 500 something that has authored commits. Lots of people contribute in other ways.

When we ask our users “why don’t you contribute (more) to the project?” (which we do annually) what do they answer? They say its because 1) everything works, 2) I don’t have time 3) things get fixed fast enough 4) I don’t know the programming language 5) I don’t have the energy.

First as the 6th answer (at 5% 2017) comes “other” where some people actually say they wouldn’t know where to start and so on.

All of this taken together: there are no visible signs of us suffering from having a low bus factor. Lots of signs that people can do things when they want to if I don’t do it. Lots of signs that the code and concepts are understood.

Lots of signs that a low bus factor is not a big problem here. Or perhaps rather that the bus factor isn’t really as low as any tool would calculate it.

What if I…

Do I know who would pick up the project and move on if I die today? No. We’re a 100% volunteer-driven project. We create one of the world’s most widely used software components (easily more than three billion instances and counting) but we don’t know who’ll be around tomorrow to work on it. I can’t know because that’s not how the project works.

Given the extremely wide use of our stuff, given the huge contributor base, given the vast amounts of documentation and tests I think it’ll work out.

Just because you have a large bus factor doesn’t necessarily make the project a better place to ask questions. We’ve seen projects in the past where N persons involved are all from the same company and when that company removes its support for that project those people all go away. High bus factor, no people to ask.

Finally, let me just add that I would of course love to have many more committers and contributors in the curl project, and I think we would be an even better project if we did. But that’s a separate issue.

Removing the PowerShell curl alias?

PowerShell is a spiced up command line shell made by Microsoft. According to some people, it is a really useful and good shell alternative.

Already a long time ago, we got bug reports from confused users who couldn’t use curl from their PowerShell prompts and it didn’t take long until we figured out that Microsoft had added aliases for both curl and wget. The alias had the shell instead invoke its own command called “Invoke-WebRequest” whenever curl or wget was entered. Invoke-WebRequest being PowerShell’s own version of a command line tool for fiddling with URLs.

Invoke-WebRequest is of course not anywhere near similar to neither curl nor wget and it doesn’t support any of the command line options or anything. The aliases really don’t help users. No user who would want the actual curl or wget is helped by these aliases, and user who don’t know about the real curl and wget won’t use the aliases. They were and remain pointless. But they’ve remained a thorn in my side ever since. Me knowing that they are there and confusing users every now and then – not me personally, since I’m not really a Windows guy.

Fast forward to modern days: Microsoft released PowerShell as open source on github yesterday. Without much further ado, I filed a Pull-Request, asking the aliases to be removed. It is a minuscule, 4 line patch. It took way longer to git clone the repo than to make the actual patch and submit the pull request!

It took 34 minutes for them to close the pull request:

“Those aliases have existed for multiple releases, so removing them would be a breaking change.”

To be honest, I didn’t expect them to merge it easily. I figure they added those aliases for a reason back in the day and it seems unlikely that I as an outsider would just make them change that decision just like this out of the blue.

But the story didn’t end there. Obviously more Microsoft people gave the PR some attention and more comments were added. Like this:

“You bring up a great point. We added a number of aliases for Unix commands but if someone has installed those commands on WIndows, those aliases screw them up.

We need to fix this.”

So, maybe it will trigger a change anyway? The story is ongoing…

On billions and “users”

At times when I’ve gone out (yes it happens), faced an audience and talked about my primary spare time project curl, I’ve said a few times in the past that we have one billion users.

Users?

many devices

OK, as this is open source I’m talking about, I can’t actually count my users and what really constitutes “a user” anyway?

If the same human runs multiple copies of curl (in different devices and applications), is that human then counted once or many times? If a single developer writes an application that uses libcurl and that application is used by millions of humans, is that one user or are they millions of curl users?

What about pure machine “users”? In the subway in one of the world’s largest cities, there’s an automated curl transfer being done for every person passing the ticket check point. Yet I don’t think we can count the passing (and unknowing) passengers as curl users…

I’ve had a few people approach me to object to my “curl has one billion users” statement. Surely not one in every seven humans on earth are writing curl command lines! We’re engineers and we’re picky with the definitions.

Because of this, I’m trying to stop talking about “number of users”. That’s not a proper metric for a project whose primary product is a library that is used by applications or within devices. I’m instead trying to assess the number of humans that are using services, tools or devices that are powered by curl. Fun challenge, right?

Who isn’t using?

a userI’ve tried to imagine of what kind of person that would not have or use any piece of hardware or applications that include curl during a typical day. I certainly can’t properly imagine all humans in this vast globe and how they all live their lives, but I quite honestly think that most internet connected humans in the world own or use something that runs my code. Especially if we include people who use online services that use curl.

curl is used in basically all modern TVs, a large percentage of all car infotainment systems, routers, printers, set top boxes, mobile phones and apps on them, tablets, video games, audio equipment, Blu-ray players, hundreds of applications, even in fridges and more. Apple alone have said they have one billion active devices, devices that use curl! Facebook uses curl extensively and they have 1.5 billion users every month. libcurl is commonly used by PHP sites and PHP empowers no less than 82% of the sites w3techs.com has figured out what they run (out of the 10 million most visited sites in the world).

There are about 3 billion internet users worldwide. I seriously believe that most of those use something that is running curl, every day. Where Internet is less used, so is of course curl.

Every human in the connected world, use something powered by curl every day

Frigging Amazing

It is an amazing feeling when I stop and really think about it. When I pause to let it sink in properly. My efforts and code have spread to almost every little corner of the connected world. What an amazing feat and of course I didn’t think it would reach even close to this level. I still have hard time fully absorbing it! What a collaborative success story, because I could never have gotten close to this without the help from others and the community we have around the project.

But it isn’t something I think about much or that make me act very different in my every day life. I still work on the bug reports we get, respond to emails and polish off rough corners here and there as we go forward and keep releasing new curl releases every 8 weeks. Like we’ve done for years. Like I expect us and me to continue doing for the foreseeable future.

It is also a bit scary at times to think of the massive impact it could have if or when a really terrible security flaw is discovered in curl. We’ve had our fair share of security vulnerabilities so far through our history, but we’ve so far been spared from the really terrible ones.

So I’m rich, right?

coins

If I ever start to describe something like this to “ordinary people” (and trust me, I only very rarely try that), questions about money is never far away. Like how come I give it away free and the inevitable “what if everyone using curl would’ve paid you just a cent, then…“.

I’m sure I don’t need to tell you this, but I’ll do it anyway: I give away curl for free as open source and that is a primary reason why it has reached to the point where it is today. It has made people want to help out and bring the features that made it attractive and it has made companies willing to use and trust it. Hadn’t it been open source, it would’ve died off already in the 90s. Forgotten and ignored. And someone else would’ve made the open source version and instead filled the void a curlless world would produce.

No more heartbleeds please

caution-quarantine-areaAs a reaction to the whole Heartbleed thing two years ago, The Linux Foundation started its Core Infrastructure Initiative (CII for short) with the intention to help track down well used but still poorly maintained projects or at least detect which projects that might need help. Where the next Heartbleed might occur.

A bunch of companies putting in money to improve projects that need help. Sounds almost like a fairy tale to me!

Census

In order to identify which projects to help, they run their Census Project: “The Census represents CII’s current view of the open source ecosystem and which projects are at risk.

The Census automatically extracts a lot of different meta data about open source projects in order to deduce a “Risk Index” for each project. Once you’ve assembled such a great data trove for a busload of projects, you can sort them all based on that risk index number and then you basically end up with a list of projects in a priority order that you can go through and throw code at. Or however they deem the help should be offered.

Which projects will fail?

The old blog post How you know your Free or Open Source Software Project is doomed to FAIL provides such a way, but it isn’t that easy to follow programmatically. The foundation has its own 88 page white paper detailing its methods and algorithm.

Risk Index

  • A project without a web site gets a point
  • If the project has had four or more CVEs (publicly disclosed security vulnerabilities) since 2010, it receives 3 points and if fewer than four there’s a diminishing scale.
  • The number of contributors the last 12 months is a rather heavy factor, which thus could make the index grow old fairly quick. 3 contributors still give 4 points.
  • Popular packages based on Debian’s popcon get points.
  • If the project’s main language is C or C++, it gets two points.
  • Network “exposed” projects get points.
  • some additional details like dependencies and how many outstanding patches not accepted upstream that exist

All combined, this grades projects’ “risk” between 0 and 15.

Not high enough resolution

Assuming that a larger number of CVEs means anything bad is just wrong. Even the most careful and active projects can potentially have large amounts of CVEs. It means they disclose what they find and that people are actually reviewing code, finding problems and are reporting problems. All good things.

Sure, security problems are not good but the absence of CVEs in a project doesn’t say that the project is one bit more secure. It could just mean that nobody ever looked closely enough or that the project doesn’t deal with responsible disclosure of the problems.

When I look through the projects they have right now, I get the feeling the resolution (0-15) is too low and they’ve shied away from more aggressively handing out penalty based on factors we all recognize in abandoned/dead projects (some of which are decently specified in Tom Calloway’s blog post mentioned above).

The result being that the projects get a score that is mostly based on what kind of project it is.

But this said, they have several improvements to their algorithm already suggested in their issue tracker. I firmly believe this will improve over time.

The riskiest ?

The top three projects, the only ones that scores 13 right now are expat, procmail and unzip. All of them really small projects (source code wise) that have been around since a very long time.

curl, being the project I of course look out for, scores a 9: many CVEs (3), written in C (2), network exposure (2), 5+ apps depend on it (2). Seriously, based on these factors, how would you say the project is situated?

In the sorted list with a little over 400 projects, curl is rated #73 (at the time of this writing at least). Just after reportbug but before libattr1. [curl summary – which is mentioning a very old curl release]

But the list of projects mysteriously lack many projects. Like I couldn’t find neither c-ares nor libssh2. They may not be super big, but they’re used by a bunch of smaller and bigger projects at least, including curl itself.

The full list of projects, their meta-data and scores are hosted in their repository on github.

Benefits for projects near me

I can see how projects in my own backyard have gotten some good out of this effort.

I’ve received some really great bug reports and gotten handed security problems in curl by an individual who did his digging funded by this project.

I’ve seen how the foundation sponsored a test suite for c-ares since the project lacked one. Now it doesn’t anymore!

Badges!

In addition to that, the Linux Foundation has also just launched the CII Best Practices Badge Program, to allow open source projects to fill in a bunch of questions and if meeting enough requirements, they will get a “badge” to boast to the world as a “well run project” that meets current open source project best practices.

I’ve joined their mailing list and provided some of my thoughts on the current set of questions, as I consider a few of them to be, well, lets call them “less than optimal”. But then again, which project doesn’t have bugs? We can fix them!

curl is just now marked as “100% compliance” with all the best practices listed. I hope to be able to keep it like that even with future and more best practices added.

Turn many pictures into a movie

Challenge: you have 90 pictures of various sizes, taken in different formats and shapes. Using all sorts strange file names. Make a movie out of all of them, with the images using the correct aspect ratio. And add music. Use only command line tools on Linux.

Solution: this is a solution, you can most likely solve this in 22 other ways as well. And by posting it here, I can find it myself if I ever want to do the same stunt again…

#!/bin/sh

j=0
# convert options
pic="-resize 1920x1080 -background black -gravity center -extent 1920x1080"

# loop over the images
for i in `ls *jpg | sort -R`; do
 echo "Convert $i"
 convert $pic $i "pic-$j.jpg"
 j=`expr $j + 1`
done

# now generate the movie
mp3="file.mp3"
echo "make movie"
ffmpeg -framerate 3 -i pic-%d.jpg -i $mp3 -acodec copy -c:v libx264 -r 30 -pix_fmt yuv420p -s 1920x1080 -shortest out.mp4

Explained

This is a shell script.

The ‘pic’ variable holds command line options for the ImageMagick ‘convert‘ tool. It resizes each picture to 1920×1080 while maintaining aspect ratio and if the pic gets smaller, it is centered and gets a black border.

The loop goes through all files matching *,jpg, randomizes the order with ‘sort’ and then runs ‘convert’ on them one by one and calls the output files pic-[number].jpg where number is increased by one for each image.

Once all images have the correct and same size, ‘ffmpeg‘ is invoked. It is told to produce a movie with 3 photos per second, how to find all the images, to include an mp3 file into the output and to stop encoding when one of the streams ends – this assumes the playing time of the mp3 file is longer than the total time the images are shown so the movie stops when we run out of images to show.

Result

The ‘out.mp4’ file, uploaded to youtube could then look like this:

(music by Bensound.com)