A little over a year ago, we ditched Travis CI as a service to use for the curl project.
Up until that day, it had been our preferred and favored CI service for many years. At most, we ran 34 CI jobs on it, for every pull-request and commit. It was the service that we leaned on when we transitioned the curl project into a CI-heavy user. Our use of CI really took off 2017 and has been increasing ever since.
A clean cut
We abruptly cut off over 30 jobs from the service just one day. At the time, that was a third of all our CI. The CI jobs that we rely on to verify our work and to keep things working and stable in the project.
More CI services
At the time of the amputation, we run 99 CI jobs distributed over 5 services, so even with one of them cut off we still ran jobs on AppVeyor, Azure Pipelines, GitHub Actions and Cirrus CI. We were not completely stranded.
New friends
Luckily for us, when one solution goes sour there are often alternatives out there that we can move over to and continue our never-ending path forward.
In our case, friendly people helped moving over almost all ex-Travis jobs to the (for us) new service Zuul CI. In July 2021, we had 29 jobs on it. We also added Circle CI to the mix and started running jobs there.
In July 2021 – a month after the cut – we counted 96 running jobs (a few old jobs were just dropped as we reconsidered their value). While the work involved a lot of adjusting scripts, pulling hair over yaml files and more, it did not cause any significant service loss over an extended period. We managed pretty good.
There was no noticeable glitch in quality or backed up “guilt” in the project because of the transition and small period of lesser CI either. Thanks to the other services still running, we were still in a good shape.
Why all the services
In the end of August 2022 we still use 6 different CI services and we now run 113 CI jobs on them, for every push to master and to pull-requests.
There are primarily three reasons why we still use a variety of services.
- load balancing: we get more parallelism by running jobs on many services as they all have a limited parallelism per service.
- We also get less problems when one of the services has some glitch or downtime, as then we still work with the others. The not all eggs in the same basket thing.
- The various services also have different features, offer different platforms and work slightly differently which for several jobs make them necessary to run on a specific service, or rather they cannot run on most of the other services.
It does not end
Over the year since the amputation, we have learned that our new friend Zuul CI has turned out to not work quite as reliable and convenient as we would like it. Since a few months back, we are now gradually moving away from this service. Slowly moving over jobs from there to run on one the other five instead.
Over time, our new most preferred CI service has turned out to be GitHub Actions. At the latest count, it now run 44 CI jobs for us. We still have 12 jobs on Zuul targeted for transition.
Services come and go. We have different ideas and our requirements and ambitions change. I am sure we will continue to service-jump when needed. It is just a natural development. A part of a software development life.
On flakiness
A big challenge and hurdle with our CI setup remains: to maintain the builds and keep them stable and functioning. With over a hundred jobs running on six services and our code and test suite being portable and things being networked and running on many platforms, it is job that we quite often fail at. It has turned out mighty difficult to avoid that at least a few of the jobs are constantly red, “permafailing”, at any single moment.
If this is stuff you like to tinker with, we could use your help!