The Open Source Census Report

I’d never heard of the Open Source Census before I stumbled over a mention of their recent report somewhere. Their mission is to get “enterprises” to install their little client, which scans computers for open source products and reports the findings back to a central server.

Anyway, their current database consists of a “mere” 2300 scanned machines, but those add up to a total of 314,000 open source installations, with 768 different packages identified. The top ten products found are:

  1. firefox 84.4%
  2. zlib 65.75%
  3. xerces 61.24%
  4. wget 61.12%
  5. xalan 58.19%
  6. prototype 57.03%
  7. activation 53.01%
  8. javamail 50.15%
  9. openssl 46.45%
  10. docbook-xml 46.27%

Ok, as an open source hacker and a geek, there are two things we need to do here: 1) find out how our own projects rank among the others and 2) find out how the scanning is done and thus how good it is. Thankfully both are possible, since the entire data set can be downloaded for free and the client is fully open source.

find out how our own projects rank

“curl” was found on 18.19% of all computers. That makes it #81 on the list, just below virtualbox and wireshark, but immediately above jstl and busybox. This includes “All Versions” of each tool, and in curl’s case that meant 22 different versions!

I found no other project I do anything noticeable in. Subversion is at #44.

how the scanning is done

It’s quite simple: the client scans for files matching a set of file name patterns and then pattern matches the contents of those files, extracting version numbers with regexes. You can see the full set of patterns/rules in the XML file straight off their source code repository: project-rules.xml.
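To illustrate the general technique (this is only a minimal sketch, not their actual code or even their implementation language, and the “curl x.y.z” pattern and sample input are made up for the example), version extraction with a regex boils down to something like this:

  /* Minimal sketch of regex-based version detection, in the spirit of the
     census client but not taken from it. The pattern and the sample input
     below are made-up examples. */
  #include <regex.h>
  #include <stdio.h>
  #include <string.h>

  /* try to pull a version such as "7.18.2" out of a buffer read from a
     file whose name matched the package's file name pattern */
  static int extract_version(const char *buf, char *version, size_t vlen)
  {
    regex_t re;
    regmatch_t m[2];
    int found = 0;

    if(regcomp(&re, "curl ([0-9]+\\.[0-9]+\\.[0-9]+)", REG_EXTENDED))
      return 0;

    if(!regexec(&re, buf, 2, m, 0)) {
      size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);
      if(len < vlen) {
        memcpy(version, buf + m[1].rm_so, len);
        version[len] = '\0';
        found = 1;
      }
    }
    regfree(&re);
    return found;
  }

  int main(void)
  {
    char version[32];
    if(extract_version("curl 7.18.2 (x86_64-pc-linux-gnu) libcurl/7.18.2",
                       version, sizeof(version)))
      printf("detected version %s\n", version);
    return 0;
  }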

how good is it

With such specific patterns for binary contents, many versions of course need special human treatment, and that is of course error-prone. That could explain why the latest curl version (7.19.0) wasn’t reported at all. It will also leave renamed tools undetected.

In my particular case I would of course also like to know how much libcurl is used, but they don’t seem to check for that (I found several projects on the list besides the curl tool that I know use libcurl).

All this said, I didn’t actually try out the client myself so I haven’t verified it for real.

ohloh vs statcvs

I’ve played a bit with statcvs lately and generated reports for the curl repository. It turned out rather interesting (well, assuming you’re a statistics geek like me), especially in comparison to the data and stats ohloh.net presents for the same code:

Executive summary:

  • I’ve done 82% of all code changes.
  • We seem to grow at roughly the same pace (both number of code lines and number of files) over the last few years.
  • The lines-of-code-per-file count seems rather fixed.

Oh, that initial big bump in late 1999/early 2000 was due to a lot of “wrong” files, such as configure, config.guess etc, being committed and subsequently removed. It is a bit annoying to have there as it ruins the data somewhat, but I’ve not managed to fool statcvs into ignoring that part…

The NFSA 2008 went to…

The Nordic Free Software Award 2008 went to Mats Östling for programverket.org which is “a project operating with open software and open software development in the public sector. The purpose is to achieve more collaboration and more efficient IT application within the public sector“. Congratulations Mats!

The official FSCONS site (the award was handed out during that event) keeps up its tradition of being totally behind schedule and doesn’t even mention the winner yet…

I’m not sure two awards are enough to draw any conclusions from, but with Skolelinux last year and a public-sector open source project this year, it certainly gives a feeling for what the jury has prioritized so far.

Snaxx 19

In an attempt at doing something social, to actually meet up with real-life physical people but yet avoid common trivial subjects and only stay on-topic with technology, computing, work, beer and things related to that, we’re gathering at the next Snaxx on November 20th somewhere in Stockholm, Sweden. The exact location has yet to be decided.

If you’re into technology, open source, good ales, talking about work in your spare time, or possibly all of that at once – then you might just be one of us.

Welcome!

Estimated-Content-Length

Greg Dean posted an interesting idea on the ietf-http-wg mailing list, suggesting that a new response header be added to HTTP (Estimated-Content-Length:) to allow servers to indicate a rough estimate of the content length in situations where the server doesn’t actually know the exact size before it starts sending data.

In the current world, HTTP servers can either report the exact size to the client or no size at all, and in the latter case the client just has to deal with the response ending up being any size at all. It then has no way to know even roughly how large the data is or how long the transfer is going to take.
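As a rough sketch of the idea (the header name is from the proposal, but the exact syntax, the surrounding headers and the numbers here are just made up for illustration), a response could look something like this:

  HTTP/1.1 200 OK
  Content-Type: application/octet-stream
  Transfer-Encoding: chunked
  Estimated-Content-Length: 1500000

  [chunked body follows, roughly but not necessarily exactly 1500000 bytes]

The point being that the client could show a meaningful progress estimate even though no exact Content-Length is known up front.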

The discussions following Greg’s post seem mostly positive thus far from several people.

In the middle there is a man

The other day an interesting bug report was posted against the Firefox browser, and it caused some interesting discussions and blog posts on the subject of Man-In-The-Middle attacks and how current browsers etc make it (too?) easy to accept self-signed certificates, so that users are easily misled. (Peter Burkholder wrote a great piece on SSL MITMing already back in 2002, which goes into detail on how this can be done.)

The entire issue essentially boils down to this:

To be able to really know that you’re communicating with the true remote site (and not an impostor), you must have some kind of verification system.

In SSL land we have this system with CA certs for verifying certs, and it works pretty well in most cases (I think). However, many sites on the internet today use HTTPS without having the certificate signed by a party the browser already knows – most of them are so-called self-signed certificates, which means nobody else guarantees that they are who they claim to be; only they themselves do.

All current modern browsers want to give users easy access to HTTP sites and to HTTPS sites with valid, properly-signed certs, but also allow users to connect to and browse HTTPS sites with self-signed certs. And here comes the problem: how do we tell users that HTTPS with a self-signed cert is very insecure but still let them proceed? How do we tell them that they may proceed, but that if this is a known popular site they really should expect a true and valid certificate, as otherwise it is quite possibly a MITM they’re seeing?

People are so used to just accepting exceptions and clicking away nagging pop-ups that the warnings and alerts, both explicit and implied by the prompts you have to go through to accept the self-signed certificate, don’t seem to have much effect. As can be seen in this bug report, accepting an impostor’s certificate for a large well-known site is far too easy.

In SSH land, however, we don’t have the CA cert system and top-down trust hierarchy that SSL/TLS imposes. But does this matter? I’d say no: most if not all users still don’t reflect much on it when a server’s host key is reported as different from the one used before, and when connecting to a host for the first time they accept the host key without trying to verify it over a different channel. Thus they’re subject to pretty much the same MITM risk. The difference is perhaps that fewer “mere end users” use SSH this way.

Let me just put emphasis on this: SSL and SSH are secure. The insecurity here is not due to how the protocols work; rather, the flaws appear when we mix in real-world users, UIs and so on.

I don’t have any sensible solutions to these problems myself. I’m crap at designing things for mere humans and UIs etc and I make no claims of understanding end users.

It seems there’s a nice tool called ettercap that’s supposedly a fine thing to use when you want to run your own MITM attacks on your LAN! And on the other side: an interesting take on improving the “accept this certificate” UI is offered by the Perspectives plugin for Firefox, which basically also checks N other sources’ view of the certificate to help you decide whether to trust it.

I want to round off my rant with a little quote:

“I have little, and decreasing, desire to continue to invest in strong security for a product that discards that security for the masses” [*] / Nelson B Bolyard – prominent NSS hacker

strcasecmp in Turkish

A friendly user submitted the (lib)curl bug report #2154627 which identified a problem with our URL parser. It doesn’t treat “file://” as a known protocol if the locale in use is Turkish.

This was the beginning of a minor world-moving revelation for me. Of course this is already known to mankind and I’m just behind, but really: lots of my fellow hacker friends had no idea either.

So “file” and “FILE” are not the same word when compared case insensitively in a Turkish locale, because there ‘i’ is not the lowercase version of ‘I’ (the dotless ‘ı’ is).

Back to strcasecmp: POSIX pretty much makes the function useless by saying that “The results are unspecified in other locales [than POSIX]”.

I’m a bit annoyed by this fact, as now I have to introduce my own function (which thus cannot use tolower() or toupper(), since they too are affected by the locale) and use that instead, since the strings in our code are clearly “English” strings, so file and FILE truly are the same string when compared case insensitively…
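For the record, such a locale-independent comparison for plain ASCII strings boils down to something like this minimal sketch (illustrative only, not the exact function that went into the curl code, and the names are made up):

  /* Minimal sketch of a locale-independent, ASCII-only case insensitive
     comparison. The point is to do the case folding ourselves instead of
     calling tolower()/toupper(), which honor the current locale. */
  #include <stdio.h>

  static char raw_tolower(char c)
  {
    /* only fold A-Z; leave everything else, including 'I' under a
       Turkish locale, exactly as it is */
    return (c >= 'A' && c <= 'Z') ? (char)(c - 'A' + 'a') : c;
  }

  /* return 1 if the strings match case insensitively, 0 otherwise */
  static int raw_equal(const char *a, const char *b)
  {
    while(*a && *b) {
      if(raw_tolower(*a) != raw_tolower(*b))
        return 0;
      a++;
      b++;
    }
    return *a == *b; /* both strings must end at the same point */
  }

  int main(void)
  {
    printf("%d\n", raw_equal("file", "FILE")); /* prints 1 in any locale */
    return 0;
  }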

Copyleft and closed dual license ethics

There are a bunch of companies out there today that offer their products in a dual-license style, where you can download and use the GPL-licensed version or buy the proprietary-licensed version (often together with some kind of service deal) that you can then use without the “burden” of a GPL agreement. Well-known brands doing this include Trolltech/Qt (now Nokia), MySQL (now Sun), OO.o (Sun), Sleepycat (now Oracle) (Berkeley DB is not strictly GPL but still copyleft) and VirtualBox (now Sun), etc.

It’s perfectly legal for them to do this: since the company is the copyright holder of all the files, it can easily re-release everything under whatever license it wants, at its own discretion. The condition is of course that they are in fact copyright holders of everything, and that the parts they don’t hold the copyright for are either licensed under a sufficiently liberal license or that they can buy a similar relicense from the third-party lib authors.

It kills contributions from non-employees, since doing a large chunk of code for these guys means that you hand over the copyright to a company whose entire business idea is to convert that code to a proprietary license and make money from it, in a way you cannot do yourself since they can turn the GPL code into proprietary goods and you cannot. This may be a clue to why MySQL has fewer community contributors. The forced assignment of copyright to a company could very well also be a contributing factor to OO.o’s problems in attracting developers.

Companies “hide” the truth about this and try talking customers into the proprietary license. I’ve worked a bit with Qt, and the wording they have used has often given companies the impression that they have to pay for the proprietary-licensed version to be allowed to use the product in a commercial product. I’ve had to explain to several customers that as long as they adhere to the GPL they can use the free version just fine without paying anything. Trolltech also has this dubious condition tied to their commercial license: “The Commercial license does not allow the incorporation of code developed with the Open Source Edition of Qt into a commercial product.”[*] Needless to say, this will prevent companies from trying the open source licensed route first. I’m curious whether they even have the legal right to make that claim.

This of course keeps competitors at arm’s length, since no other company can take the code and conduct business the same way. That is part of the reason why they gladly adopt the GPL for this. Lots of actions by these companies make me feel that they aren’t real and true open source believers, but rather that they use the label a lot for marketing and for making sure competitors can’t do the same as they do.

The GPL version comes without support, in another push to drive customers to pay for the proprietary license instead of the GPL one. Of course, it being open source lets companies that go the GPL route fix their own problems, since they have the source and all, but the push towards the proprietary license also narrows down how many customers will actively contribute anything back, since there’s little chance they will do so in a project under a proprietary license. I honestly can’t see many other legitimate reasons why companies wouldn’t offer support for the GPL-licensed versions.

I’ve not personally worked in any of these projects under such proprietary licenses, but I would love to hear experiences from people that have!

Obviously none of this is a problem big enough to concern users. Quite possibly that is because these companies do a good enough job and keep the GPL versions of their software at a sufficiently good quality that no forked projects appear that take the GPL version and run with it in a different direction. Another explanation could be that there are good enough alternative projects to go with if you’re not happy with one of these dual-licensed ones.

A little related anecdote, told to me by a MySQL employee (whose name shall remain untold). He described how they still haven’t implemented a feature in MySQL that many people have requested, since according to him they don’t want to cram more stuff into the existing branch but are instead releasing it in the next major release (due for release in 4-6 months or so). In the next sentence he explained how they already have it implemented in the closed version for at least one paying customer… Any (other) true open source project would’ve made that change available as a patch/branch of the GPL version for the public.

I’m pretty sure that if I were to change any code in any of these products, I personally would release my patches as open source only. But yeah, that would mean they would never get incorporated into the “real” products…

Nordic Free Software Award Nominee 2008

It seems I’m again (as I was last year) nominated for the Nordic Free Software Award.

They list thirteen nominees, of which four are organizations/companies. I’m proud to be mentioned in such swell company.

Unfortunately I cannot be present at the FSCONS itself this year (where the award is being handed over), so all the partying and celebrating the award winner will have to be done without me! 🙂
