Tag Archives: source code

Copyright without years

Like so many other software projects the curl project has copyright mentions at the top of almost every file in the source code repository. Like

Copyright (C) 1998 - 2022, Daniel Stenberg ...

Over the years we have used a combination of scripts and manual edits to update the ending year in that copyright line to match the year of the latest update of that file.

As soon as we started a new year and someone updated a file, the copyright range needed update. Scripts and tools made it less uncomfortable, but it was always somewhat of a pain to remember and fix.

In 2023 this changed

When the year was again bumped and the first changes of the year were done to curl, we should then consequentially start updating years again to make ranges end with 2023.

Only this time someone asked me why? and it made me decide that what the heck, let’s completely rip them out instead! Doing it at the beginning of the year is also a very good moment.

Do we need the years?

The Berne Convention states that copyright “must be automatic; it is prohibited to require formal registration”.

The often-used copyright lines are not necessary to protect our rights. According to the Wikipedia page mentioned above, the Berne Convention has been ratified by 181 states out of 195 countries in the world.

They can still serve a purpose as they are informational and make the ownership question quite clear. The year ranges add questionable value though.

I have tried to find resources that argue for the importance of the copyright years to be stated and present, but I have not found any credible sources. Possibly because I haven’t figured out where to look.

Not alone

It turns out quite a few projects run by many different organizations or even huge companies have already dropped the years from their source code header copyright statements. Presumably at least some of those giant corporations have had their legal departments give a green light to the idea before they went ahead and published source code that way to the world.

Low risk

We own the copyrights no matter if the years are stated or not. The exact years the files were created or edited can still easily be figured out since we use version control, should anyone ever actually care about it. And we give away curl for free, under an extremely liberal license.

I don’t think we risk much by doing this move.

January 3, 2023

On this day I merged commit 2bc1d775f510, which updated 1856 files and removed copyright years from almost everywhere in the source code repository.

I decided to leave them in the main license file. Partly because this is a file that lots of companies include in their products and I have had some use of seeing the year ranges in there in the past!

Bliss

Now we can forget about copyright years in the project. It’s a relief!

Considering C99 for curl

tldr: we stick to C89 for now.

The curl project builds on foundations that started in late 1996 with the tool named httpget.

ANSI C became known as C89

In 1996 there were not too many good alternatives for making a small and efficient command line tool for doing Internet transfers. I am not saying that C was the only available language, but for me the choice was easy and frankly I did not even think about any other languages when this journey started. We called the C flavor “ANSI C” back then, as compared to the K&R “old style” C. The ANSI C version would later be renamed to C89 (confusingly enough it is also sometimes known as C90).

In the year 2000 we introduced libcurl, the library that provides Internet transfer super powers to whoever wants it. This made the choice of using C even better. C made it possible for us to provide a stable API/ABI without problems – something not even C++ could offer at the time. It was also a reasonably portable language that made it possible for us to bring curl and libcurl to virtually all modern operating systems.

As I wanted curl and libcurl to be system level options and I aimed for the widest possible adoption, they could not be written in any of the higher level languages like Perl, Python or similar. That would make them too big and require too much “extra baggage”.

I am convinced that the use of (conservative) C for curl is a key factor to its success and its ability to get used “everywhere”.

C99

C99 was published in (surprise!) 1999 but the adoption in compilers took a long time and it remained a blocker for adoption for us. We want curl available “everywhere” so as long some of the major compilers did not support C99 we did not even consider switching C flavor, as it would risk hamper curl adoption.

The slowest of the “big compilers” to adopt C99 was the Microsoft Visual C++ compiler, which did not adopt it properly until 2015 and added more compliance in 2019. A large number of our users/developers are still stuck on older MSVC versions so not even all users of this compiler suite can build C99 programs even today, in late 2022.

C11, C17 and beyond

Meanwhile, the ISO C Working Group continue to crank out updates to the C language. C11 shipped, C17 came and now they are working on the C2x pending version, presumed to end up called C23.

Bump the requirement for curl?

We are aware that other widely popular C projects are moving forward and have raised their requirements to C99 or beyond. Like the Linux kernel, the git project and more.

The discussion about bumping C flavor has been brought up on the libcurl mailing list as well, in particular as we are already planning a version 8 release to happen in the spring of 2023 so in theory it could be a good moment to make some changes like this.

What C99 features would improve a project like curl? The most interesting parts of C99 that could impact curl code that I could think of are:

  • // comments
  • __func__ predefined identifer
  • boolean type in <stdbool.h>
  • designated struct initializers
  • empty macro arguments
  • extended integer types in <inttypes.h> and <stdint.h>
  • flexible array members (zero size arrays)
  • inline functions
  • integer constant type rules
  • mixed declarations and code
  • the long long type and library functions
  • the snprintf() family of functions
  • trailing comma allowed in enum declaration
  • vararg macros
  • variable-length arrays

So sure, there are lots of cool things we could use. But do we need them?

For several of the features above, we already have decent and functional replacements. Several of the features don’t matter. The rest risk becoming distractions.

Opening up for C99 without conditions in curl code would risk opening the flood gates for people rewriting things, so we would have to go gently and open up for allowing new C99 features slowly. That is also how the git project does it. A challenge with that approach, is that it is hard to verify which features that are allowed vs used as existing tooling normally don’t have that resolution.

The question has also been asked that if we consider bumping the requirement, should we then not bump it to C11 at once instead of staying at C99?

Not now

Ultimately, not a single person has yet been able to clearly articulate what benefits such a C flavor requirement bump would provide for the curl project. We mostly see a risk that we all get caught in rather irrelevant discussions and changes that perhaps will not actually bring the project forward very much. Neither in features nor in quality/security.

I think there are still much better things to do and much more worthwhile efforts to spend our energy on that could actually improve the project and bring it forward.

Like improving the test suite, increasing test coverage, making sure more code is exercised by the fuzzers.

A minor requirement change

We have decided that starting with curl 8, we will require that the compiler supports a 64 bit data type. This is not something that existed in the original C89 version but was introduced in C99. However, there is no longer any modern compiler around that does not support this.

This is a way to allow us to stop caring about those odd platforms and write code and checks for when the large types are not very large. It is hard to verify that code nowadays since virtually nobody actually uses such compilers/systems.

Maybe this is the way we can continue to adapt to and use specific post C89 features going forward. By cherry-picking them one by one and adapting to them slowly over time.

It is not a no to C99 forever

I am sure we will bring up this topic for discussion again in the future. We have not closed the door forever or written anything in stone. We have only decided that for the moment we have not been persuaded to switch. Maybe we will in a future.

Other languages

We do not consider switching or rewriting curl into any other language.

Discussion

See reddit and hacker news.

connection filters in curl

In the curl project, one of the holiest and most sacred rules is:

we do not break the API or ABI

Everything else is a matter of discussion.

More features all the time

We keep adding features and we do improvements at a rather high pace. So much that we actually rarely do a release without introducing something new.

To be able to add features and to keep changing curl and making sure that it keeps up with the world around it and that it provides the features and the abilities that a world of Internet transfers needs, we need to make sure that the internals are written correctly. And by correctly, I mean in a way that allows us to extend and change curl when we want to – that doesn’t break the ABIs nor the tests.

Refactors

curl is old and choices sometimes need to be reconsidered. Over the years we have refactored and changed the curl internals and design quite drastically several times. Thanks to an extensive test suite and a library API that was designed from the start to hide most internal choices, this has been possible to do without being visible to users. The upside has been that the internals have become easier to maintain and to extend with more features.

Refactoring again

This time, we are again on a mission to extend the curl feature set as I blogged about recently, and this time we have Stefan Eissing on board to do it.

So, without changing any API or breaking the ABI and having the large set of test cases remain working in the many CI jobs we have, Stefan introduced a new internal concept for curl: connection filters.

Filters

We call them filters but they could also be seen as layers or maybe even domino pieces. Each filter is a piece of network logic and the idea is that we can chain them together at run-time to create protocol cakes (my word). curl can connect to a HTTP proxy, do TLS and speak HTTP/2 over that. That makes three separate filters put together.

Adding for example TLS to the proxy would just be inserting a filter in the right place in the chain, while using the filter-chain is done the same way no matter the filter chain length and independently of which exact filters it consists of.

The previous logic, before the filters, was a more like a vast number of conditional flag checks done in the right order. This new system reduces the amount of conditional checks and it also moves code for handling the different network filters into more localized and compartmentalized functions.

More protocol combos

In addition to the more localized code for specifics features, this new concept more notably makes it easier to build new protocol layer combinations. Adding support for HTTP/2 to the proxy for example, should now ideally be a matter of adding a filter the right way and the transfer pipeline should otherwise “just work”.

Not everything internally is yet converted to filters even if we have merged the first large pull request. Stefan now works on getting more curl code to use this concept before he can get into the actual protocol changes lined up for him.

Performance

The filters do not impact transfer performance, I/O works the same as before.

Details

If you long for more technical details and explanations about this, maybe to be able to dig into the curl source code yourself, then an excellent starting-point is the document in the curl source made for this purpose CONNECTION-FILTERS.md.

What goes into curl?

curl is a command line tool and library for doing Internet data transfers. It has been around for a loooong time (over 23 years) but there is still a flood of new things being added to it and development being made, to take it further and to keep it relevant today and in the future.

I’m the lead developer and head maintainer of the curl project.

How do we decide what goes into curl? And perhaps more importantly, what does not get accepted into curl?

Let’s look how this works in the curl factory.

Stick to our principles

curl has come this far by being reliable, trusted and familiar. We don’t rock the boat: curl does Internet transfers specified as URLs and it doesn’t parse or understand the content it transfers. That goes for libcurl too.

Whatever we add should stick to these constraints and core principles, at least. Then of course there are more things to consider.

A shortlist of things I personally want to see

I personally usually have a shortlist of a few features I personally want to work on in the coming months and maybe half year. Items I can grab when other things are slow or if I need a change or fun thing to work on a rainy day. These items are laid out in the ROADMAP document – which also tends to be updated a little too infrequently…

There’s also the TODO document that lists things we consider could be good to do and KNOWN_BUGS that lists known shortcomings we want to address.

Sponsored works have priority

I’m the lead developer of the curl project but I also offer commercial support and curl services to allow me to work on curl full-time. This means that paying customers can get a “priority lane” into landing new features or bug-fixes in future releases of curl. They still need to suit the project though, we don’t abandon our principles even for money. (Contact me to learn how I can help you get your proposed changes done!)

Keep up with where the world goes

All changes and improvements that help curl keep up with and follow where the Internet protocol community is moving, are considered good and necessary changes. The curl project has always been on the front-lines of protocols and that is where we want to remain. It takes a serious effort.

Asking the community

Every year around the May time frame we do a “user survey” that we try to get as many users as possible to respond to. It asks about user patterns, what’s missing and how things are working.

The results from that work provide good feedback on areas to improve and help us identify features our community think curl lacks etc. (The 2020 survey analysis)

Even outside of the annual survey, discussions on the mailing list is a good way for getting direct feedback on questions and ideas and users very often bring up their ideas and suggestions using those channels.

Ideas are easy, code is harder

Actually implementing and providing a feature is a lot harder than just providing an idea. We almost drown among all the good ideas people propose we might or could do one day. What someone might think is a great idea may therefore still not be implemented very soon. Because of the complexity of implementing it or because of lack of time or energy etc.

But at the same time: oftentimes, someone needs to bring the idea or crack the suggestion for it to happen.

It needs to exist to be considered

Related to the previous section. Code and changes that exist, that are provided are of course much more likely to actually end up in curl than abstract ideas. If a pull-request comes to curl and the change adheres to our standards and meet the requirements mentioned in this post, then the chances are very good that it will be accepted and merged.

As I am currently the only one working on curl professionally (ie I get paid to do it). I can rarely count on or assume work submissions from other team members. They usually show up more or less by surprise, which of course is awesome in itself but also makes such work and features very hard to plan for ahead of time. Sometimes people bring new features. Then we deal with them!

Half-baked is not good enough

A decent amount of all pull requests submitted to the project never get merged because they aren’t good enough and the person who submitted them doesn’t respond to feedback and improvement requests properly so that they never become good enough. Things like documentation and tests are typically just as important as the functionality itself.

Pull requests that are abandoned by the author can of course also get taken over by someone else but it cannot be expected or relied upon. A person giving up on the pull request is also a strong sign to the rest of us that obviously the desire to get that specific change landed wasn’t that big and that tells us something.

We don’t accept and merge partial changes that for example lack a crucial part like tests or documentation because we’ve learned the hard way many times over the years that it is just too common that the author then vanishes before completing the work – forcing others to do that work or we have to rip the change out again.

Standards and in-use are preferred properties

At times people suggest we support new protocols or experiments for new things. While that can be considered fun and useful, we typically want both the protocol and the associated URL syntax to already be in use and be somewhat established and preferably even standardized and properly documented in specifications. One of the fundamental core ideas with URLs is that they should mean the same thing for more than one application.

When no compass needle exists, maintain existing direction

Most changes are in line with what we already do and how the products work so no major considerations are necessary. Only once in a while do we get requests or suggestions that actually challenge the direction or forces us to consider what is the right and the wrong way.

If the reason and motivation provided is valid and holds up, then we might agree and go in that direction, If we don’t, we discuss the topic and see if we perhaps can change someone’s mind or “wiggle” the concepts and ideas to see whether we can change the suggestion or perhaps see it from n a different angle to reconsider. Sometimes we just have to decline and say no: that’s not something we think is in line with curl.

Who decides if its fine?

curl is not a democracy, we don’t vote about decisions or what to accept etc.

curl is also not a strict dictatorship where a single leader dictates all truths and facts from above for all subjects to accept and obey.

We’re somewhere in between. We discuss and try to find consensus of what and how to do things. The persons who bring the code or experience the actual problems of course will have more to say. Experienced and long-term maintainers’ opinions have more weight in discussions and they’re free and allowed to merge pull-requests they think are good.

I retain the right to veto stuff, but I very rarely exercise that right.

curl is still a small project. You’ll notice that you’ll quickly recognize the same handful of maintainers in all pull-requests and long tail of others chipping in here and there. There’s no massive crowd anywhere. That’s also the explanation why sometimes your pull-requests might not get reviewed instantly but you must rather wait for a while until you get someone’s attention.

If you’re curious to learn how the project is governed in more detail, then check out the governance docs.

How to land code in curl

I’ve done a previous presentation on how to work with the project get your code landed in curl. Check it out!

Your feedback helps!

Listening to what users want, miss and think are needed when going forward is very important to us. Even if it sometimes is hard to react immediately and often we have to bounce things a little back and forth before they can become “curl material”. So, please don’t expect us to immediately implement what you suggest, but please don’t let that stop you from bringing your grand ideas.

And bring your code. We love your code.

I am an 80 column purist

I write and prefer code that fits within 80 columns in curl and other projects – and there are reasons for it. I’m a little bored by the people who respond and say that they have 400 inch monitors already and they can use them.

I too have multiple large high resolution screens – but writing wide code is still a bad idea! So I decided I’ll write down my reasoning once and for all!

Narrower is easier to read

There’s a reason newspapers and magazines have used narrow texts for centuries and in fact even books aren’t using long lines. For most humans, it is simply easier on the eyes and brain to read texts that aren’t using really long lines. This has been known for a very long time.

Easy-to-read code is easier to follow and understand which leads to fewer bugs and faster debugging.

Side-by-side works better

I never run windows full sized on my screens for anything except watching movies. I frequently have two or more editor windows next to each other, sometimes also with one or two extra terminal/debugger windows next to those. To make this feasible and still have the code readable, it needs to fit “wrapless” in those windows.

Sometimes reading a code diff is easier side-by-side and then too it is important that the two can fit next to each other nicely.

Better diffs

Having code grow vertically rather than horizontally is beneficial for diff, git and other tools that work on changes to files. It reduces the risk of merge conflicts and it makes the merge conflicts that still happen easier to deal with.

It encourages shorter names

A side effect by strictly not allowing anything beyond column 80 is that it becomes really hard to use those terribly annoying 30+ letters java-style names on functions and identifiers. A function name, and especially local ones, should be short. Having long names make them really hard to read and makes it really hard to spot the difference between the other functions with similarly long names where just a sub-word within is changed.

I know especially Java people object to this as they’re trained in a different culture and say that a method name should rather include a lot of details of the functionality “to help the user”, but to me that’s a weak argument as all non-trivial functions will have more functionality than what can be expressed in the name and thus the user needs to know how the function works anyway.

I don’t mean 2-letter names. I mean long enough to make sense but not be ridiculous lengths. Usually within 15 letters or so.

Just a few spaces per indent level

To make this work, and yet allow a few indent levels, the code basically have to have small indent-levels, so I prefer to have it set to two spaces per level.

Many indent levels is wrong anyway

If you do a lot of indent levels it gets really hard to write code that still fits within the 80 column limit. That’s a subtle way to suggest that you should not write functions that needs or uses that many indent levels. It should then rather be split out into multiple smaller functions, where then each function won’t need that many levels!

Why exactly 80?

Once upon the time it was of course because terminals had that limit and these days the exact number 80 is not a must. I just happen to think that the limit has worked fine in the past and I haven’t found any compelling reason to change it since.

It also has to be a hard and fixed limit as if we allow a few places to go beyond the limit we end up on a slippery slope and code slowly grow wider over time – I’ve seen it happen in many projects with “soft enforcement” on code column limits.

Enforced by a tool

In curl, we have ‘checksrc’ which will yell errors at any user trying to build code with a too long line present. This is good because then we don’t have to “waste” human efforts to point this out to contributors who offer pull requests. The tool will point out such mistakes with ruthless accuracy.

Credits

Image by piotr kurpaska from Pixabay

keep finding old security problems

I decided to look closer at security problems and the age of the reported issues in the curl project.

One theory I had when I started to collect this data, was that we actually get security problems reported earlier and earlier over time. That bugs would be around in public release for shorter periods of time nowadays than what they did in the past.

My thinking would go like this: Logically, bugs that have been around for a long time have had a long time to get caught. The more eyes we’ve had on the code, the fewer old bugs should be left and going forward we should more often catch more recently added bugs.

The time from a bug’s introduction into the code until the day we get a security report about it, should logically decrease over time.

What if it doesn’t?

First, let’s take a look at the data at hand. In the curl project we have so far reported in total 68 security problems over the project’s life time. The first 4 were not recorded correctly so I’ll discard them from my data here, leaving 64 issues to check out.

The graph below shows the time distribution. The all time leader so far is the issue reported to us on March 10 this year (2017), which was present in the code since the version 6.5 release done on March 13 2000. 6,206 days, just three days away from 17 whole years.

There are no less than twelve additional issues that lingered from more than 5,000 days until reported. Only 20 (31%) of the reported issues had been public for less than 1,000 days. The fastest report was reported on the release day: 0 days.

The median time from release to report is a whopping 2541 days.

When we receive a report about a security problem, we want the issue fixed, responsibly announced to the world and ship a new release where the problem is gone. The median time to go through this procedure is 26.5 days, and the distribution looks like this:

What stands out here is the TLS session resumption bypass, which happened because we struggled with understanding it and how to address it properly. Otherwise the numbers look all reasonable to me as we typically do releases at least once every 8 weeks. We rarely ship a release with a known security issue outstanding.

Why are very old issues still found?

I think partly because the tools are gradually improving that aid people these days to find things much better, things that simply wasn’t found very often before. With new tools we can find problems that have been around for a long time.

Every year, the age of the oldest parts of the code get one year older. So the older the project gets, the older bugs can be found, while in the early days there was a smaller share of the code that was really old (if any at all).

What if we instead count age as a percentage of the project’s life time? Using this formula, a bug found at day 100 that was added at day 50 would be 50% but if it was added at day 80 it would be 20%. Maybe this would show a graph where the bars are shrinking over time?

But no. In fact it shows 17 (27%) of them having been present during 80% or more of the project’s life time! The median issue had been in there during 49% of the project’s life time!

It does however make another issue the worst offender, as one of the issues had been around during 91% of the project’s life time.

This counts on March 20 1998 being the birth day. Of course we got no reports the first few years since we basically had no users then!

Specific or generic?

Is this pattern something that is specific for the curl project or can we find it in other projects too? I don’t know. I have not seen this kind of data being presented by others and I don’t have the same insight on such details of projects with an enough amount of issues to be interesting.

What can we do to make the bars shrink?

Well, if there are old bugs left to find they won’t shrink, because for every such old security issue that’s still left there will be a tall bar. Hopefully though, by doing more tests, using more tools regularly (fuzzers, analyzers etc) and with more eyeballs on the code, we should iron out our security issues over time. Logically that should lead to a project where newly added security problems are detected sooner rather than later. We just don’t seem to be at that point yet…

Caveat

One fact that skews the numbers is that we are much more likely to record issues as security related these days. A decade ago when we got a report about a segfault or something we would often just consider it bad code and fix it, and neither us maintainers nor the reporter would think much about the potential security impact.

These days we’re at the other end of the spectrum where we people are much faster to jumping to a security issue suspicion or conclusion. Today people report bugs as security issues to a much higher degree than they did in the past. This is basically a good thing though, even if it makes it harder to draw conclusions over time.

Data sources

When you want to repeat the above graphs and verify my numbers:

  • vuln.pm – from the curl web site repository holds security issue meta data
  • releaselog – on the curl web site offers release meta data, even as a CSV download on the bottom of the page
  • report2release.pl – the perl script I used to calculate the report until release periods.

decent durable defect density displayed

Here’s an encouraging graph from our regular Coverity scans of the curl source code, showing that we’ve maintained a fairly low “defect density” over the last two years, staying way below the average density level.
defect density over timeClick the image to view it slightly larger.

Defect density is simply the number of found problems per 1,000 lines of code. As a little (and probably unfair) comparison, right now when curl is flat on 0, Firefox is at 0.47, c-ares at 0.12 and libssh2 at 0.21.

Coverity is still the primary static code analyzer for C code that I’m aware of. None of the flaws Coverity picked up in curl during the last two years were detected by clang-analyzer for example.

Coverity scan defect density: 0.00

A couple of days ago I decided to stop slacking and grab this long dangling item in my TODO list: run the coverity scan on a recent curl build again.

Among the static analyzers, coverity does in fact stand out as the very best one I can use. We run clang-analyzer against curl every night and it hasn’t report any problems at all in a while. This time I got almost 50 new issues reported by Coverity.

To put it shortly, a little less than half of them were issues done on purpose: for example we got several reports on ignored return codes we really don’t care about and there were several reports on dead code for code that are conditionally built on other platforms than the one I used to do this with.

But there were a whole range of legitimate issues. Nothing really major popped up but a range of tiny flaws that were good to polish away and smooth out. Clearly this is an exercise worth repeating every now and then.

End result

21 new curl commits that mention Coverity. Coverity now says “defect density: 0.00” for curl and libcurl since it doesn’t report any more flaws. (That’s the number of flaws found per thousand lines of source code.)

Want to see?

I can’t seem to make all the issues publicly accessible, but if you do want to check them out in person just click over to the curl project page at coverity and “request more access” and I’ll grant you view access, no questions asked.

internally, we’re all multi now!

libcurl internals suddenly become a lot cleaner and neater to work with when we made all code assume and work with the multi interface!

libcurl was initially created slightly after the birth of the curl tool. After the tool started to get some traction and use out in the world, requests and queries about a library with its powers started to drop in. Soon enough, in the year 2000 we shipped the first release of libcurl and it featured a synchronous API (the “easy” interface) that performs the complete operation and then returns. I think we can now say that the blocking easy interface was successful and its ease of use has been very popular and appreciated by many users.

During 2002 the need for a non-blocking API had been identified and we introduced the multi interface. The multi interface is kind of a super-set as it re-uses the same handles as is used with the easy interface, so it cleverly makes it fairly easy for a standard application to move from the easy interface to the multi.

Basically since that day, we’ve struggled in the source code structure to handle the fact that we have both a blocking and a non-blocking API. In lots of places we’ve had different code paths and choices done depending on which API that was used. It made the source code hard to follow and it occasionally introduced hard to track bugs which could lead to the multi and easy interface not behaving the same way to the underlying network or protocol behavior. It was clear very early on that it wasn’t an ideal design choice, but it was a design choice that was spread out among the code and it stuck.

During November 2012 I finally took on the code that we’ve had #ifdef’ed since around 2005 which makes the blocking easy interface operation a wrapper function around the non-blocking multi interface functions. Using this method, all internals should be considered non-blocking and there is no need left to treat things differently depending on which API that was used because everything is now multi interface == non-blocking.

On January 17th 2013 the big patch was committed. 400 added lines, 800 removed over 54 modified files…

cURL

Three static code analyzers compared

I’m a fan of static code analyzing. With the use of fancy scanner tools we can get detailed reports about source code mishaps and quite decently pinpoint what source code that is suspicious and may contain bugs. In the old days we used different lint versions but they were all annoying and very often just puked out far too many warnings and errors to be really useful.

Out of coincidence I ended up getting analyses done (by helpful volunteers) on the curl 7.26.0 source base with three different tools. An excellent opportunity for me to compare them all and to share the outcome and my insights of this with you, my friends. Perhaps I should add that the analyzed code base is 100% pure C89 compatible C code.

Some general observations

First out, each of the three tools detected several issues the other two didn’t spot. I would say this indicates that these tools still have a lot to improve and also that it actually is worth it to run multiple tools against the same source code for extra precaution.

Secondly, the libcurl source code has some known peculiarities that admittedly is hard for static analyzers to figure out and not alert with false positives. For example we have several macros that look like functions and on several platforms and build combinations they evaluate as nothing, which causes dead code to be generated. Another example is that we have several cases of vararg-style functions and these functions are documented to work in ways that the analyzers don’t always figure out (both clang-analyzer and Coverity show problems with these).

Thirdly, the same lesson we knew from the lint days is still true. Tools that generate too many false positives are really hard to work with since going through hundreds of issues that after analyses turn out to be nothing makes your eyes sore and your head hurt.

Fortify

The first report I got was done with Fortify. I had heard about this commercial tool before but I had never seen any results from a run but now I did. The report I got was a PDF containing 629 pages listing 1924 possible issues among the 130,000 lines of code in the project.

fortify-curl

Fortify claimed 843 possible buffer overflows. I quickly got bored trying to find even one that could lead to a problem. It turns out Fortify has a very short attention span and warns very easily on lots of places where a very quick glance by a human tells us there’s nothing to be worried about. Having hundreds and hundreds of these is really tedious and hard to work with.

If we’re kind we call them all false positives. But sometimes it is more than so, some of the alerts are plain bugs like when it warns on a buffer overflow on this line, warning that it may write beyond the buffer. All variables are ‘int’ and as we know sscanf() writes an integer to the passed in variable for each %d instance.

sscanf(ptr, "%d.%d.%d.%d", &int1, &int2, &int3, &int4);

I ended up finding and correcting two flaws detected with Fortify, both were cases where memory allocation failures weren’t handled properly.

LLVM, clang-analyzer

Given the exact same code base, clang-analyzer reported 62 potential issues. clang is an awesome and free tool. It really stands out in the way it clearly and very descriptive explains exactly how the code is executed and which code paths that are selected when it reaches the passage is thinks might be problematic.

clang-analyzer report - click for larger version

The reports from clang-analyzer are in HTML and there’s a single file for each issue and it generates a nice looking source code with embedded comments about which flow that was followed all the way down to the problem. A little snippet from a genuine issue in the curl code is shown in the screenshot I include above.

Coverity

Given the exact same code base, Coverity detected and reported 118 issues. In this case I got the report from a friend as a text file, which I’m sure is just one output version. Similar to Fortify, this is a proprietary tool.

coverity curl report - click for larger version

As you can see in the example screenshot, it does provide a rather fancy and descriptive analysis of the exact the code flow that leads to the problem it suggests exist in the code. The function referenced in this shot is a very large function with a state-machine featuring many states.

Out of the 118 issues, many of them were actually the same error but with different code paths leading to them. The report made me fix at least 4 accurate problems but they will probably silence over 20 warnings.

Coverity runs scans on open source code regularly, as I’ve mentioned before, including curl so I’ve appreciated their tool before as well.

Conclusion

From this test of a single source base, I rank them in this order:

  1. Coverity – very accurate reports and few false positives
  2. clang-analyzer – awesome reports, missed slightly too many issues and reported slightly too many false positives
  3. Fortify – the good parts drown in all those hundreds of false positives