
Selective Test Execution at Scale: Why Ruby Makes It Harder Than It Looks

Source: lobsters

The mathematics of CI debt

When a codebase grows to 50 million lines, the naive approach to CI breaks down not gradually but catastrophically. A test suite that takes 10 minutes at 100,000 lines might take 8 hours at 50 million, even with generous parallelism across workers. Every developer’s pull request becomes a scheduling problem: block merges on a full suite run and absorb the cost in developer time and infrastructure spend, sample tests randomly and accept the risk of missed failures, or invest in something more principled.

Stripe’s engineering post describes their answer: a selective test execution system that analyzes which files changed in a pull request and runs only the tests that could plausibly be affected. The result is dramatically faster CI for the common case. The engineering behind it, though, reveals how much harder this problem is in Ruby than in a statically typed language, and why the interesting work is not the speed gains but the confidence story.

Static vs. dynamic dependency resolution

The core challenge in selective test execution is computing the dependency graph. Given a set of changed files, you need to determine which tests might exercise code in those files. In TypeScript or Java, this is tractable: imports are explicit, types are resolved at compile time, and build tools like Bazel or Nx can maintain a precise graph of what depends on what. Nx ships with nx affected --target=test built in; Bazel’s hermetic build model makes test selection nearly automatic for codebases that fully adopt it.

Ruby is different in ways that matter here. The language is dynamically typed, has open classes that allow any code anywhere to reopen and modify any existing class, supports method_missing for proxy patterns, and makes heavy use of DSLs that transform method calls into runtime behavior. When a file changes, the question of which tests exercise code that calls into that file cannot be answered by reading the source alone.

Consider a concrete case: a change to a utility method Util.format_amount. A static analysis pass can find direct callers by searching for format_amount in the codebase. But a proxy class using method_missing might route calls to it dynamically. A module included conditionally based on environment config might call it only in certain code paths. A metaprogramming macro might generate delegation methods at class load time. Static analysis gives you a lower bound on the affected tests, not the true set, and the gap between those two can be significant in a codebase with heavy Ruby idioms.
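A minimal sketch of the metaprogramming case, using made-up names (the Receipt class, Util.format_date, and the pretty_ prefix are illustrative assumptions, not anything from Stripe's codebase). The generated methods call Util.format_amount, yet no call site in the class spells that name out, so a text search for callers would not find this class:

```ruby
# Hypothetical utility module; only format_amount is named in the article.
module Util
  def self.format_amount(cents)
    format("$%.2f", cents / 100.0)
  end

  def self.format_date(date)
    date.to_s
  end
end

class Receipt
  # Generates pretty_amount and pretty_date when the class is loaded.
  # The target method name is assembled from a string at runtime, so
  # no call site here spells it out in full.
  %w[amount date].each do |field|
    define_method("pretty_#{field}") do |value|
      Util.public_send("format_#{field}", value)
    end
  end
end

puts Receipt.new.pretty_amount(1999)
```

A test that exercises Receipt#pretty_amount depends on Util.format_amount at runtime, but only executing the code reveals that dependency.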

This pushes selective execution systems for Ruby toward dynamic analysis: instrumenting actual test runs to record at runtime which tests exercise which files. Microsoft has described their Test Impact Analysis system for large .NET codebases using exactly this principle. Run all tests once with coverage instrumentation. Store the per-test, per-file mapping. For each subsequent change, look up which tests covered the changed files and run only those. The stored mapping reflects actual runtime behavior rather than a programmer’s model of it, which is the right property to have for a dynamic language.
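The lookup half of that procedure can be sketched in a few lines. This is an illustration under assumptions, not Stripe's or Microsoft's implementation: the map contents, file paths, and the invert and select_tests helpers are all hypothetical.

```ruby
require "set"

# Hypothetical stored mapping from one instrumented full run:
# test identifier => source files that test executed.
COVERAGE_MAP = {
  "spec/payments/charge_spec.rb" => ["payments/charge.rb", "util/format.rb"],
  "spec/payments/refund_spec.rb" => ["payments/refund.rb", "util/format.rb"],
  "spec/users/signup_spec.rb"    => ["users/signup.rb"],
}.freeze

# Invert to file => tests once, so each lookup is a single hash access.
def invert(coverage_map)
  index = Hash.new { |hash, key| hash[key] = Set.new }
  coverage_map.each do |test, files|
    files.each { |file| index[file] << test }
  end
  index
end

# Union of every test that executed at least one changed file.
def select_tests(index, changed_files)
  changed_files.reduce(Set.new) { |acc, file| acc | index.fetch(file, Set.new) }.sort
end

INDEX = invert(COVERAGE_MAP)
p select_tests(INDEX, ["payments/charge.rb"])
p select_tests(INDEX, ["util/format.rb"])
```

Note that a changed file missing from the map selects nothing in this sketch. A real system would need to treat that case as unknown rather than as safe.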

The staleness problem

Coverage-based selection introduces a new problem that static graph approaches do not have: the mapping goes stale. Code changes, tests are added and deleted, code paths are restructured. A test that covered payments/charge.rb last week might have been deleted, split into three tests, or restructured so it no longer exercises the same paths. The stored file-to-test mapping has to be refreshed regularly, which means paying the cost of full suite runs on some cadence.

The staleness window defines the safety margin. If you regenerate the coverage map once per day, a test added Sunday night might not appear in the map until the next regeneration. Practical systems handle this with layered strategies. One option is regenerating the map on every merge to the main branch; this is expensive but keeps the map synchronized with the current state of that branch. Another option is treating the map as a filter with a conservative fallback: for files modified heavily since the last map generation, or for files the map has low confidence about, run all tests that reference them regardless of what the map says.
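One way the conservative fallback could look, as a sketch only. Everything here is an assumption made for illustration: the tests_for helper, the idea of counting commits per file since the map was built, the threshold of 3, and the use of a text-search index as the wider net.

```ruby
# All inputs are hypothetical:
#   coverage_map    file => tests recorded in the last instrumented run
#   static_refs     file => tests that mention the file (e.g. from a text search)
#   edits_since_map file => commits touching the file since the map was built
def tests_for(file, coverage_map:, static_refs:, edits_since_map:, max_edits: 3)
  from_map = coverage_map.fetch(file, [])
  unknown  = !coverage_map.key?(file)
  churned  = edits_since_map.fetch(file, 0) > max_edits
  return from_map unless unknown || churned

  # Low confidence in the map: widen the selection instead of trusting it.
  (from_map | static_refs.fetch(file, [])).sort
end

MAP_AT_LAST_RUN = {
  "payments/charge.rb" => ["spec/payments/charge_spec.rb"],
}.freeze

TEXT_SEARCH_REFS = {
  "payments/charge.rb"  => ["spec/payments/charge_spec.rb", "spec/checkout/flow_spec.rb"],
  "payments/dispute.rb" => ["spec/payments/dispute_spec.rb"],
}.freeze

# Lightly edited file: trust the map.
p tests_for("payments/charge.rb", coverage_map: MAP_AT_LAST_RUN,
            static_refs: TEXT_SEARCH_REFS, edits_since_map: { "payments/charge.rb" => 1 })

# Heavily edited file: map plus every test that references it.
p tests_for("payments/charge.rb", coverage_map: MAP_AT_LAST_RUN,
            static_refs: TEXT_SEARCH_REFS, edits_since_map: { "payments/charge.rb" => 5 })
```

The fallback trades precision for safety: it runs more tests than the map alone would select, but only for the files where the map is least trustworthy.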

A third option, which is less commonly discussed but valuable, is running periodic verification builds. These builds run the full test suite and check whether selective execution, applied in retrospect to the same set of changes, would have missed any failures. If verification builds surface misses, the selection logic tightens. If they consistently find nothing, you accumulate evidence that the system is working correctly. This is how you build statistical confidence in a selection system without paying the full cost on every PR.
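The retrospective check reduces to a set difference per build. A minimal sketch, assuming each verification build records the tests the selector would have chosen and the tests that actually failed in the full run (the record shape and helper names are hypothetical):

```ruby
# Failures the full run caught that the selected subset would have skipped.
def missed_failures(selected, full_run_failures)
  full_run_failures - selected
end

# Fraction of verification builds in which selection hid at least one failure.
def miss_rate(verification_builds)
  return 0.0 if verification_builds.empty?

  misses = verification_builds.count do |build|
    missed_failures(build[:selected], build[:failures]).any?
  end
  misses.to_f / verification_builds.size
end

BUILDS = [
  { selected: ["a_spec.rb", "b_spec.rb"], failures: [] },
  { selected: ["a_spec.rb"],              failures: ["a_spec.rb"] },
  { selected: ["b_spec.rb"],              failures: ["c_spec.rb"] }, # a miss
  { selected: ["c_spec.rb"],              failures: [] },
].freeze

p missed_failures(BUILDS[2][:selected], BUILDS[2][:failures])
p miss_rate(BUILDS)
```

Tracking this rate over time is what turns "the selector seems fine" into a measured false negative rate.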

Blast radius classification

Not all changes have the same scope. A change to a narrowly used internal utility class might affect dozens of tests. A change to the base class that 80% of models inherit from might affect thousands. Selective execution systems need to handle both, which means classifying changes by their blast radius before deciding how aggressively to filter.

The practical approach is explicit thresholds: if a changed file is covered by more than some threshold number of tests in the coverage map, fall back to running the full suite for that pull request. This sounds like a failure mode but is often the right engineering decision. The value of selective execution comes from the common case of localized changes. When someone touches the ORM base class, the authentication middleware stack, or the HTTP client library, the conservative behavior is to run everything, because the blast radius is effectively the whole suite.
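A threshold check of this kind might be sketched as follows. The plan_run helper, the :full_suite sentinel, and the limit of 500 tests per file are assumptions chosen for illustration; the source does not state the thresholds Stripe uses.

```ruby
require "set"

FULL_SUITE = :full_suite

# index: file => tests that executed it (from the coverage map).
# Returns FULL_SUITE as soon as any changed file exceeds the threshold,
# otherwise the sorted union of the selected tests.
def plan_run(index, changed_files, max_tests_per_file: 500)
  selected = Set.new
  changed_files.each do |file|
    tests = index.fetch(file, [])
    return FULL_SUITE if tests.size > max_tests_per_file

    selected.merge(tests)
  end
  selected.to_a.sort
end

BLAST_INDEX = {
  "payments/charge.rb" => ["spec/payments/charge_spec.rb", "spec/checkout/flow_spec.rb"],
  "models/base.rb"     => (1..2000).map { |i| "spec/models/model_#{i}_spec.rb" },
}.freeze

p plan_run(BLAST_INDEX, ["payments/charge.rb"])
p plan_run(BLAST_INDEX, ["payments/charge.rb", "models/base.rb"])
```

Because the decision is derived from the map itself, nobody has to annotate a change as risky; touching a widely covered file is enough to trigger the full run.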

In a 50-million-line monorepo, both categories of change are common. A product engineer fixing a bug in a specific payment flow might touch two files that map to a few dozen tests. A platform engineer modifying shared infrastructure might touch one file that warrants running thousands of tests. The selection system has to distinguish these cases reliably, without requiring engineers to annotate their own changes or reason about blast radius manually.

The Ruby tooling landscape

Compared to the JavaScript ecosystem, Ruby’s tooling for this problem is sparse. SimpleCov provides line-by-line coverage tracking for Ruby and is the natural starting point for a coverage-based selection system. But SimpleCov’s standard output is per-run aggregate data. Building a per-test, per-file coverage map requires either patching SimpleCov to track coverage at the individual example level, or instrumenting RSpec directly to record which files each example loads and exercises during a run.

The latter is what a serious implementation requires. RSpec’s formatter API and around hooks make this feasible, but the instrumentation overhead is non-trivial at scale. Every test run that updates the coverage map adds measurement cost; the cadence of map updates has to be chosen carefully so that freshness is maintained without making full instrumented runs prohibitively expensive.
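The underlying mechanism can be shown without RSpec, using Ruby's standard Coverage library (the same library SimpleCov builds on). This is a sketch under assumptions, not a description of Stripe's instrumentation: the two throwaway source files, the files_exercised helper, and the lambdas standing in for test examples are all hypothetical. The idea is to snapshot hit counts before and after each test and keep the files whose counts changed.

```ruby
require "coverage"
require "tmpdir"

# Start before loading the code under test; Coverage only tracks files
# loaded after this call. With no arguments the result has the shape
# { path => [hit count per line] }.
Coverage.start

# Two throwaway source files stand in for application code.
dir = Dir.mktmpdir
File.write(File.join(dir, "charge.rb"), <<~RUBY)
  module Charge
    def self.total(cents)
      cents * 2
    end
  end
RUBY
File.write(File.join(dir, "refund.rb"), <<~RUBY)
  module Refund
    def self.total(cents)
      -cents
    end
  end
RUBY
require File.join(dir, "charge.rb")
require File.join(dir, "refund.rb")

# Runs the block and returns the basenames of tracked files whose
# per-line hit counts changed while it ran.
def files_exercised
  before = Coverage.peek_result
  yield
  after = Coverage.peek_result
  after.reject { |path, hits| before[path] == hits }
       .keys.map { |path| File.basename(path) }.sort
end

# Stand-ins for individual test examples.
TESTS = {
  "charges double the amount" => -> { Charge.total(100) },
  "refunds negate the amount" => -> { Refund.total(100) },
}.freeze

PER_TEST_MAP = TESTS.transform_values { |body| files_exercised { body.call } }
p PER_TEST_MAP
```

In an RSpec suite the same before-and-after snapshot could plausibly live in an around hook. Taking two snapshots per example is also where the measurement overhead comes from.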

Knapsack Pro addresses an adjacent, complementary problem: given a known set of tests, how do you distribute them across parallel CI workers to minimize total wall-clock time? That is about scheduling, not selection. The two approaches compose well: select 30% of the tests based on coverage data, then distribute those tests efficiently across workers. The speedups multiply rather than add: running 30% of the tests across 10 evenly loaded workers takes roughly 3% of the serial full-suite time.

Confidence, not just speed

The engineering tension in selective test execution is always between speed and correctness. Running fewer tests is faster; the risk is missing a failure that a full run would have caught. This makes the validation story as important as the selection algorithm.

Stripe’s framing is CI speed, and that is the visible outcome. The less visible work is maintaining the safety margin: measuring false negative rates, keeping the coverage map fresh, classifying blast radii correctly, and running periodic verification to confirm that selective execution is not drifting away from correctness over time.

For teams considering similar systems, the key insight is that selective execution is not a binary switch. The useful model is a spectrum: aggressive selection for small, localized changes; conservative selection for riskier ones; full runs for high-blast-radius or high-stakes changes. Building that spectrum with explicit thresholds and ongoing measurement is the actual engineering challenge. The selection algorithm is almost the easy part. Knowing when to trust it, and how to detect when you should not, is where the work is.

The specific techniques Stripe used to build a reliable file-to-test mapping in Ruby, and the tradeoffs they accepted in doing so, are worth close attention for any team sitting on a large test suite and watching CI times compound quarter over quarter.
