
How Coverage Maps Beat Build Graphs for Selective CI in a Dynamic Language

Source: lobsters

Stripe’s engineering team recently published a detailed breakdown of how they cut CI times for their 50-million-line Ruby monorepo by running only the tests that matter for a given change. The approach is not novel in concept, but the execution at that scale, in a dynamic language, with correctness as a hard constraint, surfaces tradeoffs that most write-ups on this subject gloss over.

This post is less about what Stripe built and more about the technical problem underneath it: how you figure out which tests to skip when your language doesn’t give you a static dependency graph, and what you give up to get there.

The Core Problem with Dynamic Languages

In a statically compiled language with explicit imports, building a dependency graph is tractable. A Go binary’s dependency tree is fully knowable from source alone. A Rust crate’s use declarations, a Java file’s import statements, and a TypeScript file’s import from lines all give a build tool like Bazel or Buck2 enough information to declare, at the file level, which source units a test depends on.

Ruby does not cooperate with this model. require can take a string expression. Autoloaders like Zeitwerk map constants to files based on naming conventions, resolving them lazily at runtime. Metaprogramming can define methods and classes that only exist after some other code runs. Even with Sorbet, Stripe’s own gradual type system for Ruby, the call graph is incomplete for untyped code, and the dependency graph derived from it carries uncertainty.
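A small sketch makes the problem concrete. The module, class, and method names below are hypothetical and chosen only for illustration; the point is that neither lookup appears as a literal reference a static scanner could follow.

```ruby
# Hypothetical names throughout; this is an illustration, not Stripe's code.
module Handlers
  class CardHandler
    def call
      "card"
    end
  end

  class BankHandler
    def call
      "bank"
    end
  end
end

# The constant name is assembled from a string at runtime, so the call site
# contains no literal reference to CardHandler or BankHandler.
def handler_for(kind)
  Handlers.const_get("#{kind.capitalize}Handler").new
end

# These methods exist only after the loop below has run.
class Report
  %w[daily weekly].each do |period|
    define_method("#{period}_summary") { "#{period} summary" }
  end
end

puts handler_for("card").call
puts Report.new.weekly_summary
```

A tool reading handler_for cannot tell which handler a test exercises without evaluating kind, and that is exactly the information a coverage run records.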

This is the fundamental reason Stripe, and most teams with large Ruby codebases, reach for coverage-based test impact analysis rather than a build graph. You cannot reliably derive what a test touches from reading it statically. But you can record what it actually touched when it ran.

How Coverage-Based Mapping Works

Ruby’s standard library includes a Coverage module that instruments loaded files and records execution. You enable it before loading your test framework, run a test, and collect the result:

require "coverage"

Coverage.start(lines: true)

# ... load and run a single test file ...

result = Coverage.result
# => {
#   "/app/lib/payment.rb" => { lines: [nil, 3, 1, 0, nil, 2, ...] },
#   "/app/lib/charge.rb"  => { lines: [nil, nil, 1, 1, nil, ...] },
#   ...
# }

Each key is a source file path. Because coverage was started with lines: true, each value is a hash whose :lines entry is an array aligned with the file’s lines: nil for non-executable lines, an integer for the execution count. For the purposes of test selection, you don’t need line-level precision. You need only the set of files touched:

# Keep only files where at least one line executed.
touched_files = result
  .select { |_, data| data[:lines].any? { |count| count&.positive? } }
  .keys

Run this across every test file in the suite, aggregate the results, and you get an inverted index: a map from source file to the set of test files that covered it. Store that map. On each subsequent PR, diff against the base branch, find the changed source files, look them up in the map, and run the union of their covering test sets.
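The inversion and lookup steps can be sketched in a few lines. This is a minimal illustration, not Stripe's implementation: the function names and the input shape (a hash from test file to the source files it touched) are assumptions made for the example.

```ruby
require "set"

# per_test_coverage: { test_file => [source files it touched] }
# (hypothetical input shape; in practice it comes from one coverage run per test file)
def build_inverted_index(per_test_coverage)
  index = Hash.new { |hash, key| hash[key] = Set.new }
  per_test_coverage.each do |test_file, touched_files|
    touched_files.each { |source_file| index[source_file] << test_file }
  end
  index
end

# Union of the covering test sets for every changed file.
# Hash#fetch bypasses the default block, so unknown files add nothing.
def select_tests(index, changed_files)
  changed_files.reduce(Set.new) do |selected, file|
    selected | index.fetch(file, Set.new)
  end
end

per_test_coverage = {
  "test/payment_test.rb" => ["lib/payment.rb", "lib/charge.rb"],
  "test/charge_test.rb"  => ["lib/charge.rb"],
  "test/refund_test.rb"  => ["lib/refund.rb"]
}

index = build_inverted_index(per_test_coverage)
puts select_tests(index, ["lib/charge.rb"]).sort
```

In this sketch a change to lib/charge.rb selects the payment and charge tests and leaves the refund test out. The list of changed files would typically come from something like git diff --name-only against the base branch.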

The algorithm is simple. The cost is in the data pipeline around it.

What It Costs to Build the Map

Coverage instrumentation in Ruby has measurable overhead. Even in the relatively low-cost lines: true mode, without branch or method tracking, MRI can run tests 20-40% slower depending on the suite. For Stripe’s scale, collecting fresh coverage across the full suite means paying that overhead on a job that was already expensive.

This is why coverage maps are typically not rebuilt on every commit to main. The practical pattern is a scheduled job, often nightly, that runs the full suite with coverage enabled, computes the inverted index, serializes it (commonly as a compressed JSON or MessagePack blob), and writes it to a shared store that CI workers can fetch at the start of each PR build.
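As a rough sketch of the serialization step, assuming the compressed-JSON option and an in-memory index of source file to test set, the round trip might look like this. The function names are hypothetical, and the shared store itself is left out.

```ruby
require "json"
require "set"
require "zlib"

# Serialize { source_file => Set[test files] } as gzip-compressed JSON.
# Sets become sorted arrays so the output is stable between runs.
def dump_coverage_map(index)
  payload = index.transform_values { |tests| tests.to_a.sort }
  Zlib.gzip(JSON.generate(payload))
end

# Inverse of dump_coverage_map: returns { source_file => Set[test files] }.
def load_coverage_map(blob)
  json = Zlib.gunzip(blob).force_encoding(Encoding::UTF_8)
  JSON.parse(json).transform_values { |tests| Set.new(tests) }
end

index = {
  "lib/payment.rb" => Set["test/payment_test.rb"],
  "lib/charge.rb"  => Set["test/charge_test.rb", "test/payment_test.rb"]
}

blob = dump_coverage_map(index)
restored = load_coverage_map(blob)
puts restored == index
```

A real pipeline would likely store the commit the map was built from alongside the blob, so that a CI worker can judge how stale it is.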

The tradeoff is that the map is always slightly stale. A file refactored this morning might have a different test coverage profile than what the nightly map reflects. This is the central correctness concern.

The Staleness Problem

A stale coverage map can cause two categories of error. The first is a false positive: a test gets included that no longer actually covers the changed file. This is safe but wasteful. The second is a false negative: a test that does cover the changed file is missing from the map because the dependency was introduced after the map was built. This is the failure mode that matters.

The standard mitigation is a combination of conservatism and fallback logic. On the conservative side, teams typically include tests that cover files in the transitive closure of changed modules, not just direct coverage hits. If payment.rb changed and charge.rb requires payment.rb, then tests covering charge.rb are also candidates. This expands the selected set but reduces the false negative rate.
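The transitive expansion can be sketched as a breadth-first walk over a reverse dependency graph. Where that graph comes from (require statements, autoloader metadata, or something else) is left open here; the reverse_deps hash and the function name are assumptions for illustration.

```ruby
require "set"

# reverse_deps: { file => [files that depend on it] }
# (hypothetical input; how this graph is produced is outside the sketch)
def expand_with_dependents(changed_files, reverse_deps)
  seen = Set.new(changed_files)
  queue = changed_files.dup
  until queue.empty?
    file = queue.shift
    reverse_deps.fetch(file, []).each do |dependent|
      # Set#add? returns nil when the element was already present,
      # which also keeps the walk from looping on cycles.
      queue << dependent if seen.add?(dependent)
    end
  end
  seen
end

reverse_deps = {
  "lib/payment.rb" => ["lib/charge.rb"],
  "lib/charge.rb"  => ["lib/invoice.rb"]
}

puts expand_with_dependents(["lib/payment.rb"], reverse_deps).sort
```

The expanded set, rather than the raw diff, is then looked up in the coverage map. The cost is a larger selection; the benefit is fewer missed tests when the map lags behind the code.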

The fallback handles cases where the map is too stale to trust: large diffs that touch many files, changes to foundational files that appear in the coverage of nearly every test (a base class, a configuration loader, a test helper), or changes to files not present in the coverage map at all, such as newly created files. In these situations, the selective execution system should fall back to running the full suite rather than taking a risk on an incomplete selection.

Calibrating this fallback threshold is where most of the engineering judgment lives. Set it too aggressively and you run full suites too often, defeating the purpose. Set it too permissively and you ship bugs that a missed test would have caught.
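A minimal sketch of such a fallback check follows. The function name, the return convention, and especially the default thresholds are placeholders invented for the example, not values taken from Stripe's system.

```ruby
require "set"

# Returns :full_suite or a sorted list of test files to run.
# The default thresholds are placeholders, not recommendations.
def selection_plan(index, changed_files, total_tests:, max_changed: 200, max_fanout: 0.5)
  # Large diffs: the map is less likely to describe the change accurately.
  return :full_suite if changed_files.size > max_changed

  # Files the map has never seen, such as newly created ones.
  return :full_suite unless changed_files.all? { |file| index.key?(file) }

  selected = changed_files.flat_map { |file| index[file].to_a }.uniq

  # Foundational files: the selection already approaches the whole suite.
  return :full_suite if selected.size > total_tests * max_fanout

  selected.sort
end

index = {
  "lib/base.rb"   => Set["test/a_test.rb", "test/b_test.rb", "test/c_test.rb", "test/d_test.rb"],
  "lib/charge.rb" => Set["test/a_test.rb"]
}

p selection_plan(index, ["lib/charge.rb"], total_tests: 4)
p selection_plan(index, ["lib/base.rb"], total_tests: 4)
p selection_plan(index, ["lib/new_file.rb"], total_tests: 4)
```

A fuller version would treat changed test files and non-code files separately instead of sending every unknown path to a full run; the sketch keeps the conservative default.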

How This Compares to Other Approaches

Build-graph tools like Bazel and Buck2 solve a related but different problem. They require every build target to declare its dependencies explicitly in BUILD files. When a source file changes, Bazel walks the reverse dependency graph to find affected targets and rebuilds and retests only those. This is precise and hermetic: a test cannot accidentally depend on a file it didn’t declare.

The catch is that this model requires you to have written all those BUILD files, and to keep them accurate as the codebase evolves. Migrating an existing large Ruby codebase to Bazel is a multi-year project, not a CI optimization. Stripe’s codebase grew organically to 50 million lines before this infrastructure existed. Coverage-based selection is a retrofit, and a pragmatic one.

On the lighter end, Jest’s --onlyChanged flag uses git to find modified files and then walks the static require/import graph to find affected test files. This works well for JavaScript and TypeScript because most imports in those languages are static declarations. It is much less reliable for Ruby for the reasons described above.

Python’s pytest-testmon takes the coverage-based approach directly. It uses coverage.py under the hood to record which code each test executes and stores the mapping in a local SQLite database. The idea is the same as what Stripe describes, though testmon tracks changes at a finer granularity than whole files, targets a different runtime, and has none of the distributed infrastructure requirements.

Microsoft’s Test Impact Analysis in Azure DevOps uses dynamic instrumentation at the process level, similar in spirit but operating at a lower layer of the stack.

The Role of Sorbet

Stripe’s position is unusual because they have Sorbet, a gradual type checker for Ruby that they open-sourced in 2019. Sorbet resolves constant references statically at every strictness level above # typed: ignore, and for files marked # typed: true or stricter it also checks method calls, which is enough to build a reasonably accurate call graph for that code. This opens the door to a hybrid approach: use static analysis for well-typed code and fall back to coverage data for untyped code.

A hybrid like this could tighten the map. If Sorbet can tell you statically that OrderService depends on PaymentValidator, you can include that edge in the dependency graph without needing a coverage run to discover it. This matters most for freshly added code, where the coverage map has not yet had a chance to observe the new dependency.
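One way such a hybrid could work, sketched under heavy assumptions: treat the statically known edges as an extra input and fold them into the coverage index. The static_dependents hash below stands in for whatever a static analysis step might export; it is not a real Sorbet API, and the file names are invented.

```ruby
require "set"

# coverage_index:    { source_file => Set[test files] } from the last coverage run
# static_dependents: { source_file => [files that reference it] }
#   (hypothetical input, assumed to come from some static analysis step)
def merge_static_edges(coverage_index, static_dependents)
  merged = coverage_index.transform_values(&:dup)
  static_dependents.each do |source_file, dependents|
    dependents.each do |dependent|
      # Tests known to cover a dependent are also candidates for the
      # file it depends on, even if no coverage run has seen that file.
      tests = coverage_index.fetch(dependent, Set.new)
      (merged[source_file] ||= Set.new).merge(tests)
    end
  end
  merged
end

coverage_index = {
  "lib/order_service.rb" => Set["test/order_service_test.rb"]
}
static_dependents = {
  "lib/payment_validator.rb" => ["lib/order_service.rb"]
}

merged = merge_static_edges(coverage_index, static_dependents)
puts merged["lib/payment_validator.rb"].to_a
```

Here lib/payment_validator.rb is absent from the coverage map, yet the static edge lets the selector pick the order service test for it. The sketch follows only one level of dependents; a real system would need to decide how far to walk.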

Whether Stripe uses Sorbet in this way for test selection is not fully documented, but the capability exists and it would be the natural use of the type information they have already invested in collecting.

What This Means in Practice

For a developer working in a 50-million-line codebase, the practical effect of this system is that a PR touching a handful of files in one domain triggers a few hundred tests instead of tens of thousands. CI feedback arrives in minutes rather than an hour. The feedback loop tightens without requiring any change to how developers write code or tests.

The correctness guarantee is probabilistic, not absolute. The nightly coverage map can lag, the fallback thresholds can be miscalibrated, and dynamic runtime behavior can always surprise a system that selects tests from a recording of past runs. Teams running selective test execution in production need monitoring to detect escapes: bugs that merged because the relevant test was not selected. In practice, well-tuned systems have low escape rates, but the rate is never zero.
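One possible way to measure escapes, offered as an assumption rather than a description of any particular team's monitoring, is to run the full suite on a sample of changes and compare its failures with what the selector chose. The sample format and function names below are hypothetical.

```ruby
require "set"

# Failures the full suite found that the selector would have skipped.
def escaped_failures(selected_tests, full_run_failures)
  Set.new(full_run_failures) - Set.new(selected_tests)
end

# samples: [{ selected: [...], full_failures: [...] }, ...]
# Fraction of sampled changes where at least one failing test was not selected.
def escape_rate(samples)
  return 0.0 if samples.empty?

  escaped = samples.count do |sample|
    !escaped_failures(sample[:selected], sample[:full_failures]).empty?
  end
  escaped.to_f / samples.size
end

samples = [
  { selected: ["test/a_test.rb"], full_failures: [] },
  { selected: ["test/a_test.rb"], full_failures: ["test/a_test.rb"] },
  { selected: ["test/a_test.rb"], full_failures: ["test/b_test.rb"] },
  { selected: ["test/c_test.rb"], full_failures: [] }
]

puts escape_rate(samples)
```

Only the third sample counts as an escape: the full run failed a test the selector had left out. Tracking this rate over time gives a signal for whether the fallback thresholds need tightening.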

The engineering work behind Stripe’s system is the accumulation of decisions around that gap: how stale is too stale, how large a diff triggers a full run, how to store and serve coverage maps at low latency for many concurrent PR builds, and how to measure whether the selection is actually safe over time. None of that fits in a blog post, but it is where the real work lives.

For anyone maintaining a large Ruby codebase today, the approach is worth stealing. The Coverage module is in stdlib, the inversion logic is straightforward, and the CI time savings at even moderate scale are substantial. The hard part is not building the map; it is deciding when to trust it.
