Last week, The Pain That Is GitHub Actions was on the front page of Hacker News. It received 700 upvotes and 500+ comments. The pain is widespread, and opinions on how to solve it are highly varied.
I’ve spent 2+ years working on solving these problems. Based on my experience, I think the solutions are different from what most engineers propose.
The biggest pain point in the article stems from the inefficient development process when working on CI pipelines. Having to commit and push and wait for jobs to be picked up on remote infrastructure creates a painfully slow feedback loop.
The presumed solution to this problem is being able to run pipelines locally. Clearly, that would eliminate the need to commit and push and wait on remote infrastructure.
However, that's not the crux of the problem. And it shouldn’t be necessary – for example, engineers don’t need to be able to run a fly.io clone locally to productively implement automation around fly.io. Ironically, building an entire cloud infrastructure stack that can also be run locally requires certain constraints and tradeoffs which would make other aspects of the experience worse.
The problem with the feedback loop isn't that you can't run the entire CI/CD stack locally. It's simply that the time from making a change to knowing whether that change worked is far too slow. It also just feels bad to commit and push code which ends up being broken, including for things as simple as syntax errors.
Solving this problem is not hard, but it does require a few fundamental changes.
Rather than runs being triggered only by committing and pushing, CI platforms should also support starting a run from a local CLI using local run definitions. Decoupling CI runs from version control webhooks is the first step in accelerating the feedback loop.
Using a local CLI is easy enough, but the run definitions themselves also need to be decoupled from data that comes from version control events. Right now, the major CI platforms require using data attributes from push or pull request events.
This coupling is also easy to solve. If you define runs with an interface which allows parameters to be passed in, then you can configure those parameters in version control hooks while also passing them in when running in other contexts, such as via a local CLI.
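As a rough sketch of what that interface could look like (the `params` block, the interpolation syntax, and the `ci run` command below are all illustrative, not any particular platform's API):

```yaml
# Hypothetical run definition: parameters are declared up front with defaults,
# so the same file can be driven by a webhook or by a local CLI.
params:
  git-ref:
    default: main

tasks:
  - key: test
    run: |
      git clone --depth 1 --branch "${{ params.git-ref }}" https://github.com/example/repo.git .
      ./scripts/test.sh
```

A webhook handler fills in `git-ref` from the push event, while locally you'd run something like `ci run --file test.yaml --param git-ref=my-branch`.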
The CI/CD feedback loop is also too slow due to setup steps being repeated in each run. As you begin to make changes further downstream in a workflow, the feedback loop lengthens. Often, you're repeating the steps to install system packages, clone a git repository, install a language runtime, run a package manager, set up a database, etc. before you get to the point where your changes execute.
Often this is slow enough that you start context switching onto other things between runs, which then slows down the overall time to identify the solution even more.
This problem can be solved with a remote debugger. Without a debugger, you have to add debugging statements to a run definition or guess at fixes, only to realize you still don't have it quite right while stuck on a several-minute-long feedback loop.
However, if you can set a breakpoint which opens an interactive SSH session when hit, you can solve problems much faster. You only pay the setup cost once and can pause the execution and connect to the remote machine to figure out the right command to run. The feedback loop is reduced from minutes to seconds.
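Here's a sketch of what that could look like in a run definition (the `breakpoint` field is made up for illustration; the actual mechanism will vary by platform):

```yaml
tasks:
  - key: migrate-database
    # Hypothetical flag: pause here and open an SSH session into the runner,
    # so you can work out the right migration command interactively.
    breakpoint: true
    run: ./scripts/migrate.sh
```

Once you've found the command that works, you copy it back into the run definition and resume.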
While a local CLI along with a remote debugger solves the majority of the problems, we can take the solution one step further. Often when iterating, CI platforms are running a ton of duplicate executions: the exact same commands on the exact same files as previous runs. This is a problem for CI in general, but it's especially annoying when working on the pipeline definitions themselves.
The solution to this is automatic, content-based caching. If the same command is executed on the exact same files, CI platforms should cache the entire execution. This technique is popular in build tools like Bazel, but it can be achieved with a much simpler interface than Bazel requires.
If CI platforms offered this functionality, it'd greatly speed up the development feedback loop. Even without using the remote debugger, when iterating on a step several minutes into a pipeline, CI runs would produce cache hits up until the point of the change rather than having to re-execute previous steps.
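Here's a sketch of how that could look, assuming the platform derives a cache key from each step's command plus its declared input files (the `filter` field is illustrative, not any real platform's syntax):

```yaml
tasks:
  - key: install-deps
    run: npm ci
    # Hypothetical: only these files feed into the cache key, so changing a
    # later step or an unrelated source file still produces a cache hit here.
    filter:
      - package.json
      - package-lock.json
  - key: test
    run: npm test
```

Editing the `test` step leaves the `install-deps` cache key untouched, so re-runs skip straight to the part you actually changed.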
Another common complaint for CI workflows is YAML configuration. The presumed solution to this problem is avoiding YAML. However, this problem has been misdiagnosed.
Effectively, the specification for CI pipelines is going to be something that wraps shell scripts. The question then is what's the best way to wrap them?
The key is to minimize and simplify the logic and declarations that are outside of the bash scripts. Ironically, some of the attempts from tools to solve this problem by moving away from YAML make the underlying problem worse. Writing pipeline definitions in specialized SDKs encourages more complexity outside of the shell scripts.
Essentially, the problem is not YAML itself. YAML is simply a format for serializing a data structure. The problem is the overly proprietary interface for defining CI pipelines and the lack of elegant ways to handle complex workflows. Convoluted syntax and untestable expressions aren't YAML's fault – it's just bad API design.
Bash scripts defined in simple data structures serialized as YAML is the way to go. It's easy to write, and it's easy to read. The key is to keep the specification as simple as possible. CI workflows start to get more complex when the interface encourages specialized ways of doing things. If, instead, the interface encourages running shell scripts the same way that they're run in any other environment, then the implementation will be more understandable, more portable, and more reusable in other non-CI contexts.
Bash is the lingua franca of systems automation, and Bash in YAML can work well.
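To make that concrete, a definition in this style is just a thin wrapper around scripts you can also run straight from a terminal (the `tasks`/`key`/`run` shape is illustrative; the scripts are whatever your project already has):

```yaml
tasks:
  - key: lint
    run: ./scripts/lint.sh   # the same script an engineer runs locally
  - key: test
    run: ./scripts/test.sh
```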
The API design problem is evident in the major CI systems when looking at the interface for running containers. The best approach to running containers is to run them the same way that you would normally, whether it's Docker, Podman, or whatever you want.
You don't need a completely different syntax and interface for running services. CI platforms should just make it easy to run `docker compose up`.
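For instance, a step can just reuse the compose file you already have instead of a bespoke `services:` schema (the step shape is illustrative; the `docker compose` commands are the standard ones):

```yaml
tasks:
  - key: integration-tests
    run: |
      # Start the same services you'd start locally, from docker-compose.yml,
      # and wait for them to be ready before running the tests.
      docker compose up --detach --wait
      ./scripts/integration-tests.sh
```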
YAML also has a bad reputation stemming from scenarios where engineers shove a bunch of complexity into it. As workflows start to become more complex, most platforms support embedding expressions directly into the YAML. A little bit of logic is understandable, but avoid interspersing lots of code directly into the data structure that is your pipeline definition.
The way to avoid this complexity is to support dynamic definitions. If a CI run can generate more steps or jobs on the fly, then you can use a regular programming language to generate those additional jobs, which means you can more easily test the logic, execute the script locally to see if it's generating the desired output, and so on.
Rather than putting code into your YAML via expressions, you should use code to generate the YAML (or JSON if you prefer, since YAML is a superset of JSON). It’s easier to test code that generates YAML than it is to test YAML with embedded expressions.
Notably, you also don’t need specialized SDKs to do this. A simple data structure for run definitions means that it’s easy to generate it from a script with few dependencies.
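As a sketch, a generator doesn't need to be anything more than a small script. The file names and the `tasks:` shape below are assumptions, and how the generated definition gets fed back to the platform will vary:

```bash
#!/usr/bin/env bash
# Illustrative generator: emit one test task per package directory.
set -euo pipefail

{
  echo "tasks:"
  for pkg in packages/*/; do
    name=$(basename "$pkg")
    echo "  - key: test-$name"
    echo "    run: cd $pkg && npm test"
  done
} > generated-tasks.yaml
```

Because it's just a script, you can run it locally and eyeball the output, or assert on it in a test, before it ever touches CI.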
The article also mentioned the recent incident of a third-party GitHub Action being maliciously updated.
Supply chain security for software cuts two ways – you don't want your CI pipelines pulling in the latest version of dependencies on every run, because a compromised package will instantly make its way into your pipeline before anybody has a chance to notice and intervene. It’s also a bad idea for reliability.
However, you don't want to permanently pin to known versions of packages either, because then you'll never get security fixes.
The solution to this problem is to use semver with lock files. Lock files ensure that updates aren't going to be applied without your involvement. You may still choose to update dependencies on a cron schedule with a tool like dependabot, but they're not going to change out from underneath you.
In addition to being helpful for security, a manual update also gives you an opportunity to ensure that updates to third-party packages are not going to suddenly break your workflows, since they’ll need to go through a pull request before being applied.
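On GitHub, for example, pinning your actions and adding a Dependabot config gives you exactly this flow: nothing changes underneath you, and updates arrive on a schedule as pull requests you can review. This is standard `dependabot.yml` syntax:

```yaml
# .github/dependabot.yml
version: 2
updates:
  - package-ecosystem: "github-actions"
    directory: "/"
    schedule:
      interval: "weekly"
```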
The article also mentioned pain from configuring status checks when using a Merge Queue with GitHub Actions. This is more of a problem with GitHub's branch protection design than with CI/CD directly, but I did want to mention that content-based caching is also amazing with merge queues. If a pull request is up-to-date before being merged, then the CI platform is going to be running the exact same commands on the exact same files. That execution can be served entirely from cache, and CI should finish within seconds.
Solving these problems results in a substantially improved developer experience in CI/CD. I’m certain of this because the RWX CI/CD platform does it. I'm happy to chat anytime. Reach out at [email protected], find me on the RWX Discord, or video chat with me.