How randomly skipping tests made them better!

In the curl project we produce and ship a rock solid and reliable library for the masses. It must never exit, leak memory or do anything else in an ungraceful manner. We must free all resources and error out nicely whatever problem we run into, at whatever moment in the process.

To help us stay true to this, we have a way of testing we call “torture tests”. They’re very effective error path tests. They work like this:

Torture tests

They require that the code is built with a “debug” option.

The debug option adds wrapper functions for a lot of common functions that allocate and free resources, such as malloc, fopen and socket: fallible functions provided by the system curl runs on.

Each such wrapper function logs what it does and either works just like the function it wraps or, if instructed, returns an error.
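
As a rough, simplified sketch of the concept (the names and details below are my own illustration, not the actual curl debug code), such a wrapper could look something like this:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical state set up by the test harness: log every fallible call
   and fail call number 'fail_at' (0 means never fail). */
long call_counter = 0;
long fail_at = 0;
FILE *torture_log = NULL;

void *torture_malloc(size_t size, const char *source, int line)
{
  call_counter++;
  if(torture_log)
    fprintf(torture_log, "MALLOC call %ld at %s:%d (%zu bytes)\n",
            call_counter, source, line, size);
  if(fail_at && call_counter == fail_at)
    return NULL; /* injected failure */
  return malloc(size);
}

/* In a debug build, a macro can redirect the library's malloc() calls
   to the wrapper: */
#define malloc(size) torture_malloc(size, __FILE__, __LINE__)
```

Equivalent wrappers exist for free, fopen, socket and the rest of the fallible calls.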

When running a torture test, the complete individual test case is first run once and the log of fallible function calls is analyzed to count how many such calls this specific test case made. Then the script reruns that same test case that number of times, and for each iteration it makes another one of the fallible functions return an error.

First make function call 1 return an error. Then make function call 2 return an error. Then 3, 4, 5 and so on, all the way through to the total number. Right now, a typical test case uses between 100 and 200 such function calls, but some use orders of magnitude more.

The test script that iterates over these failure points also verifies that none of these injected failures causes a memory leak or a crash.
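
Put together, the driving logic amounts to something like this sketch (the helper functions are made up for illustration; in reality the test script does this work):

```c
/* Hypothetical helpers, for illustration only. */
extern long run_test_and_count_fallible_calls(int testnum);
extern void set_fail_at(long callnum);  /* make the Nth fallible call fail */
extern int  run_test(int testnum);
extern int  crashed_or_leaked(void);    /* check the logs after the run */

/* Torture one test case: count the fallible calls, then re-run the test
   once per call, failing a different call each time. */
int torture_one_test(int testnum)
{
  int problems = 0;
  long total = run_test_and_count_fallible_calls(testnum);

  for(long i = 1; i <= total; i++) {
    set_fail_at(i);
    (void)run_test(testnum);
    if(crashed_or_leaked())
      problems++; /* a failing call must never cause a leak or crash */
  }
  return problems;
}
```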

Very slow

Running many torture tests takes a long time.

This test method is really effective and finds a lot of issues, but since we have thousands of tests and this iterative approach basically means each of them needs to run a few hundred times, completing a full torture test round takes many hours even on the fastest of machines.

In the CI, most systems don’t allow jobs to run for more than an hour.

The net result: the CI jobs only run torture tests on a few selected test cases, and virtually no human ever runs the full torture test round due to lack of patience. So most test cases end up never getting “tortured” and we therefore miss out on verifying their error paths even though we can and we have the tests for it!

But what if…

It struck me that when running these torture tests on a large number of tests, a lot of error paths are actually identical to error paths that were already tested and will just be tested again and again in subsequent tests.

If I could identify the full code paths that were already tested, we wouldn’t have to test them again. But getting that knowledge would require insights that our test script just doesn’t have, and it would be really hard to make portable to even a fraction of the platforms we run and test curl on. Not the most feasible idea.

I went with something much simpler.

I simply estimate that most test cases actually have many code paths in common with other test cases. By randomly skipping a few iterations on each test, those skipped code paths might still very well be tested in another test. As long as the skipping is random and we do a large number of tests, chances are we cover most paths anyway. I say most because it certainly will not be all.
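
To make the idea concrete, here is a small sketch (again my own illustration, not the actual runtests.pl logic) of how a harness could keep only a limited number of failure points per test case, picked pseudo-randomly but reproducibly from a seed:

```c
#include <stdio.h>
#include <stdlib.h>

/* Out of 'total' failure points, keep roughly 'limit' of them, chosen
   pseudo-randomly but reproducibly from 'seed'. The skipped points are
   hopefully exercised by other test cases instead. */
static void pick_failure_points(long total, long limit, unsigned int seed)
{
  srand(seed);
  for(long i = 1; i <= total; i++) {
    if(total <= limit || (rand() % total) < limit)
      printf("test failure point %ld\n", i);
  }
}

int main(void)
{
  /* e.g. a test case with 150 fallible calls, limited to about 40 */
  pick_failure_points(150, 40, 42);
  return 0;
}
```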

Random skips

In my first shot at this (after I had landed the change that allows me to control the torture tests this way) I limited the number of injected errors to 40 per test case. Suddenly the CI machines could actually blaze through the test cases at a much higher speed and, as a result, they ran torture tests on tests we hadn’t tortured in a long time.

I call this option to the runtests.pl script --shallow.

Already on this first attempt, I struck gold: the script highlighted code paths that would create memory leaks or even crashes!

As a direct result of more test cases being tortured, I found and fixed nine independent bugs in curl before I could even land this in the master branch, and more failures seem to pop up after the merge too! The randomness involved may of course delay the detection of some problems.

Room for polishing

The test script right now uses a fixed random seed so that repeated invocations work exactly the same, which is good when you want to reproduce a run elsewhere. It is bad in that each test case will have the exact same failure points skipped every test round, as long as the set of fallible function calls is unmodified.

The seed can be set by a command line argument, so I imagine a future improvement would be to set the random seed based on the git commit hash at the point where the tests are run, or something like that. That way, torture tests on subsequent commits would get a different random spread.
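
As an illustration of that idea, the commit hash string could be folded into a numeric seed, along these lines (a sketch, not actual curl code):

```c
#include <stdio.h>

/* Fold a git commit hash (a hex string) into an unsigned seed so that
   different commits give different, yet reproducible, random spreads. */
static unsigned int seed_from_commit(const char *commit)
{
  unsigned int seed = 5381;
  for(; *commit; commit++)
    seed = seed * 33 + (unsigned char)*commit; /* djb2-style fold */
  return seed;
}

int main(void)
{
  /* hypothetical commit hash, for illustration only */
  printf("seed: %u\n", seed_from_commit("3f2a9c1d"));
  return 0;
}
```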

Alternatively, I will let the CI systems use a truly random seed so that they test a different set every time, independent of git etc. When an error is detected, the informational output will still be enough for a user to reproduce the problem without needing the seed.
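
A truly random seed is easy to obtain and, if logged, still leaves a straightforward way to re-run the exact same selection. A minimal sketch, assuming /dev/urandom is available:

```c
#include <stdio.h>

/* Pick a random seed once and log it, so the whole test round uses one
   seed and a failing round can be repeated with that same value. */
static unsigned int pick_and_log_seed(void)
{
  unsigned int seed = 0;
  FILE *f = fopen("/dev/urandom", "rb");
  if(f) {
    if(fread(&seed, sizeof(seed), 1, f) != 1)
      seed = 0;
    fclose(f);
  }
  if(!seed)
    seed = 20210716; /* arbitrary fallback if /dev/urandom is unusable */
  printf("torture random seed: %u\n", seed);
  return seed;
}

int main(void)
{
  unsigned int seed = pick_and_log_seed();
  /* ... hand 'seed' to the test runner via its seed option ... */
  (void)seed;
  return 0;
}
```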

Further, I’ve started out running --shallow=40 (on the Ubuntu version), which is highly unscientific and arbitrary. I will experiment with altering this amount both up and down a bit to see what I learn from that.

Torture via strace?

Another idea has been brewing in my head for a while, but I haven’t yet actually attempted it.

The next level of torture testing is probably to run the tests with strace and use its error injection ability, since then we wouldn’t even need to build a debug version of our code or write wrapper code.

Credits

Dice image by Erik Stein from Pixabay

14 thoughts on “How randomly skipping tests made them better!”

  1. Would you entertain full tests between random testing? E.g. full, random, random, full, random, random.

    I use curl on a daily basis, thank you for all your hard work!

    1. @BP: as mentioned above, the problem with doing full torture tests is that they are tremendously slow so they don’t really work in the CI and I think the tests work absolutely best when run in there.

      We could possibly have some autobuilds run the full torture tests in a cronjob, but that makes us have to notice failures “out of commit”, so I rather think that just making sure to alter the random seed so that we cover tests over time is a more worthwhile effort.

      Of course we can also do both. I certainly won’t stop anyone from running more tests for us! You can too!

      1. Until a couple of years ago I used to run full torture autobuilds weekly. I had to split the test runs into two, since the first one running all but about the 30 slowest tests took all night, while the second running another dozen or so of those slowest tests took all of the next night. I couldn’t run the remaining slow tests at all since they wouldn’t finish before the next nightly snapshot was released.

  2. Do you save the seed for a run that fails, so you can try the same run (with the same seed) after a fix has been applied?

    1. @Björn: yes I will. Right now it runs with the same seed all the time so it is easy. Also, when one particular error path test fails, I typically run them all locally when debugging and verify they all work when I do a fix.

      1. Would it be feasible to generate a seed through an external process (something like /dev/urandom), log it, then use it in place of the fixed seed you’re currently using?

        You’d get a more random spread of skipped tests, without losing repeatability.

        Also: thanks for your work on curl. It’s literally part of my daily workflow.

        1. @Morgen: yes, that would be entirely feasible and easy to do. The thing that has made me not do that (yet) is that it would make re-runs of the CI for the same pull-request run with a new seed and thus run different tests, which may then miss that single one that caused a problem in the previous run in the same PR…

          Maybe I should rather base the seed on the current date or something? Or the name of the git branch…

  3. This reminds me of property based testing, where you state some invariant and then test it on a random sampling of inputs.

  4. Thanks for this article. I completely believe in this approach to test coverage and recommended it to my team. Reproducible randomness is also important, so you do want the whole system run from a single seed (per thread if distributed).

  5. Random _ordering_ of tests may also be useful, if you’re not already doing it… catching cases where one test is working because it’s unintentionally inheriting setup from an earlier one.

    1. @Simon: that’s indeed very useful and we have an option for the main test runner script to do that, but we don’t use that in our CI builds atm – mostly because I’m not entirely sure there isn’t some lingering order-assumptions in there… I suppose I should just try it and work out the kinks!

      1. Yes, and of course, unpredictable test failures are the worst to try and fix. But if you’re capturing the random seed for each run, you can presumably also re-run the same sequence to see if the failure is consistent for that seed.

        One thing, though. You’re avoiding it because you’re not sure about lingering order assumptions — but aren’t those assumptions already broken by the random element you’ve already introduced, if some cases are being randomly skipped?
