Coverage Maps, Type Graphs, and the Real Difficulty of Selective Test Execution in Ruby
Source: lobsters
Running every test on a 50-million-line codebase is not just slow; it’s economically untenable at any CI parallelism budget that makes sense. When the codebase is Ruby, the dependency analysis problem that underpins selective execution has additional layers that don’t exist in statically typed, explicitly modular languages. Stripe’s writeup on their selective test execution system is worth examining not just as an engineering case study but as a window into a category of problem that most teams encounter, tackle partially, and eventually outgrow.
The Two Fundamental Approaches
The academic literature on this problem goes back to the mid-1990s. Rothermel and Harrold’s 1994 paper on safe regression test selection formalized the core idea: a test T can be safely omitted after a code change if no execution path through T passes through any modified code. Getting this right requires knowing, for any given change, which tests could possibly exercise the changed code. The two main approaches are coverage-based and dependency graph-based, and they have meaningfully different properties.
Coverage-based selection works by recording, for each test, the exact set of source lines or methods it exercised during its last run. When a file changes, any test whose coverage record includes a line from that file must be run. The data is highly precise because it reflects actual runtime behavior rather than static approximation. The crystalball gem, maintained by Toptal, implements this approach for Ruby using MRI’s built-in Coverage module: it records per-test coverage maps and persists them to disk, then consults that map during commit-level analysis to determine which tests are relevant to a given diff.
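The selection step itself is small once the map exists. The following is a minimal sketch, not crystalball's actual API: the COVERAGE_MAP literal, the file names, and the tests_for_change helper are all hypothetical, and the map is kept at file granularity for brevity.

```ruby
require "set"

# Hypothetical coverage map: test file => set of source files it executed
# during its last recorded run. A real map would be recorded by a tool such
# as crystalball or Ruby's Coverage module; this literal is illustrative.
COVERAGE_MAP = {
  "spec/charge_spec.rb"  => Set["lib/charge.rb", "lib/money.rb"],
  "spec/refund_spec.rb"  => Set["lib/refund.rb", "lib/money.rb"],
  "spec/webhook_spec.rb" => Set["lib/webhook.rb"],
}.freeze

# Return every test whose recorded coverage intersects the changed files.
def tests_for_change(coverage_map, changed_files)
  changed = changed_files.to_set
  coverage_map.select { |_test, files| files.intersect?(changed) }.keys.sort
end

tests_for_change(COVERAGE_MAP, ["lib/money.rb"])
# => ["spec/charge_spec.rb", "spec/refund_spec.rb"]
```

A line-level or method-level map would use the same intersection, with finer-grained keys in each set.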
Dependency graph-based selection builds a graph of which files import which other files, then computes the transitive dependents of any changed file. It’s more conservative than coverage-based selection: any test file that depends, directly or transitively, on a changed file gets included, even if the specific changed lines are never reachable at runtime. It doesn’t require maintaining coverage maps, and it’s immune to the staleness problem that afflicts coverage data. The cost is false positives, tests that don’t need to run but get included anyway.
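Computing transitive dependents amounts to a walk over the reversed graph. Here is a rough sketch under stated assumptions: the REQUIRES hash stands in for a graph extracted from source, the "spec/" prefix is an assumed convention for identifying test files, and both helper names are made up for illustration.

```ruby
require "set"

# Hypothetical require graph: file => files it requires directly.
REQUIRES = {
  "spec/charge_spec.rb" => ["lib/charge.rb"],
  "spec/refund_spec.rb" => ["lib/refund.rb"],
  "lib/charge.rb"       => ["lib/money.rb"],
  "lib/refund.rb"       => ["lib/charge.rb"],
  "lib/money.rb"        => [],
}.freeze

# Invert the graph so each file maps to the files that require it.
def reverse_graph(requires)
  reversed = Hash.new { |hash, key| hash[key] = [] }
  requires.each do |file, deps|
    deps.each { |dep| reversed[dep] << file }
  end
  reversed
end

# Breadth-first walk over reverse edges: collects every file that depends on
# a changed file, directly or transitively, then keeps only the test files.
def affected_tests(requires, changed_files)
  dependents = reverse_graph(requires)
  seen = changed_files.to_set
  queue = changed_files.dup
  until queue.empty?
    file = queue.shift
    dependents.fetch(file, []).each do |dependent|
      queue << dependent if seen.add?(dependent)
    end
  end
  seen.select { |file| file.start_with?("spec/") }.sort
end

affected_tests(REQUIRES, ["lib/money.rb"])
# => ["spec/charge_spec.rb", "spec/refund_spec.rb"]
```

Note that refund_spec.rb is selected for a change to money.rb even if no refund test ever executes the changed lines. That is the false-positive cost described above.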
Why Ruby Makes Static Analysis Difficult
In a statically typed, explicitly modular language, building the dependency graph is tractable. Java has explicit import statements and a resolved class namespace; the compiler sees everything at compile time. Go has explicit package imports and no monkey patching. Rust’s crate and module system is declared in source and enforced by the compiler. For any of these languages, you can build a precise dependency graph from source alone, and a file-level change propagation analysis is both safe and reasonably complete.
Ruby provides none of these guarantees. A require call can take a dynamically constructed string. A class can be reopened in any file, at any time. A module can be included at runtime based on a condition. Method dispatch falls through to method_missing when a method is not found, which means that calling foo.bar might invoke code in a file that has no visible static connection to foo’s class definition. Metaprogramming patterns like define_method, class_eval, and const_get create dependencies that are invisible to any static analysis operating on syntax alone.
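A small, contrived illustration of these patterns follows. The CardHandler and Gateway classes are hypothetical; the point is that nothing in Gateway's source text names CardHandler, and two of its public methods never appear as a def.

```ruby
# Hypothetical classes, used only to show dependencies that a syntax-level
# scan of require statements and constant references would miss.
class CardHandler
  def process
    "card"
  end
end

class Gateway
  # The handler class is looked up from a string at runtime, so no source
  # line names CardHandler as a dependency of Gateway.
  def handler_for(kind)
    Object.const_get("#{kind.capitalize}Handler").new
  end

  # Methods generated in a loop never appear as `def` in the source.
  %w[authorize capture].each do |action|
    define_method("#{action}!") do |kind|
      "#{action}:#{handler_for(kind).process}"
    end
  end

  # Unknown methods prefixed with legacy_ are handled here instead of
  # raising NoMethodError; everything else falls through to super.
  def method_missing(name, *args)
    return "dynamic:#{name}" if name.to_s.start_with?("legacy_")
    super
  end

  def respond_to_missing?(name, include_private = false)
    name.to_s.start_with?("legacy_") || super
  end
end

gateway = Gateway.new
gateway.authorize!("card") # => "authorize:card"
gateway.legacy_refund      # => "dynamic:legacy_refund"
```

A change to CardHandler#process alters the result of authorize!, yet a grep for "CardHandler" in Gateway's file finds nothing.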
The practical consequence is that a static require-graph for a Ruby project overestimates dependencies badly in some places and misses them entirely in others. Overestimation means running more tests than necessary, which reduces the efficiency gain from selective execution. Underestimation is categorically worse: it means silently skipping tests that should have run, which is exactly the failure class that selective execution systems must never exhibit.
Where Sorbet Changes the Calculation
Stripe’s use of Sorbet is the part of this story that doesn’t generalize to most Ruby codebases but matters enormously for Stripe specifically. Sorbet is a gradual type system for Ruby that, at sufficiently high coverage, builds a precise model of the type of every expression in the program, the method resolution order for every class, and the set of methods defined by each module.
This type information is precisely what accurate static dependency analysis needs. Instead of a require graph, a Sorbet-typed codebase lets you build a call graph: when method A calls method B and Sorbet can resolve the type of the receiver at the call site, you know that B’s implementation is a dependency of any test that can reach A, so a change to B must trigger those tests. This operates at method granularity rather than file granularity, which is substantially more precise. It also respects type narrowing: if Sorbet knows that a variable has type CreditCard and not PaymentMethod, calls on that variable only depend on CreditCard’s methods, not every implementor of PaymentMethod.
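To make the method-granularity idea concrete, here is a sketch that assumes the call graph has already been resolved by a type checker. It does not use Sorbet's API and is not Stripe's implementation; the CALL_GRAPH literal, the method names, and both helpers are hypothetical.

```ruby
require "set"

# Hypothetical resolved call graph: caller => callees. Each edge is assumed
# to come from a call site where the receiver's type was known statically.
CALL_GRAPH = {
  "ChargeSpec#test_card"     => ["Checkout#pay_with_card"],
  "PayoutSpec#test_transfer" => ["Payout#send"],
  "Checkout#pay_with_card"   => ["CreditCard#charge"],
  "Payout#send"              => ["BankTransfer#charge"],
  "CreditCard#charge"        => [],
  "BankTransfer#charge"      => [],
}.freeze

# Depth-first walk: every method reachable from the given entry point.
def reachable_methods(call_graph, entry)
  seen = Set[entry]
  stack = [entry]
  until stack.empty?
    call_graph.fetch(stack.pop, []).each do |callee|
      stack << callee if seen.add?(callee)
    end
  end
  seen
end

# Tests whose reachable set contains the changed method.
def tests_reaching(call_graph, changed_method, test_entries)
  test_entries.select do |test|
    reachable_methods(call_graph, test).include?(changed_method)
  end
end

TEST_ENTRIES = ["ChargeSpec#test_card", "PayoutSpec#test_transfer"].freeze

tests_reaching(CALL_GRAPH, "CreditCard#charge", TEST_ENTRIES)
# => ["ChargeSpec#test_card"]
```

Because the receiver in Checkout#pay_with_card is resolved to CreditCard rather than to the wider PaymentMethod interface, a change to BankTransfer#charge does not select the card test.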
The parallel in statically-typed ecosystems is what TypeScript does with --incremental builds via its .tsbuildinfo files, or what the Rust compiler does with its dependency fingerprinting between crates. Type-aware dependency graphs let you say “this test depends on the concrete implementation of PaymentProcessor#charge” rather than “this test depends on everything in payment_processor.rb.”
The Google/Bazel Comparison
Google’s approach to this problem is instructive because they solved it from the opposite direction. Rather than inferring the dependency graph from code, Bazel requires developers to declare it explicitly in BUILD files. Every library, binary, and test target lists its dependencies. This makes the dependency graph always accurate and complete because the build system enforces correctness: if you omit a dependency, your build fails.
With explicit BUILD files, test impact analysis becomes a graph traversal problem solvable in milliseconds. A changed file belongs to a target; every target that depends on that target is computable from the graph; and the set of affected tests is exact with respect to the declared dependencies. Google’s internal test automation platform runs billions of tests per day partly because this dependency accuracy makes aggressive test selection safe.
The trade-off is maintenance cost. In a codebase like Stripe’s, where Ruby files and classes evolve constantly, requiring engineers to manually maintain dependency declarations for every file would create significant friction. Using Sorbet’s type graph as an automatically derived dependency oracle is the more ergonomic answer to the same underlying problem. It’s closer in spirit to what Gazelle, the BUILD file generator for Bazel, attempts for Go codebases: automatic BUILD file generation from source analysis.
The Staleness Problem
Coverage-based approaches have a failure mode that dependency graph-based approaches don’t: the coverage map can become stale. Suppose a test was last run against commit N. Since then, commit N+1 added a new require statement in a shared utility file, creating a new dependency between two modules. The coverage map from commit N doesn’t know about this dependency. If the new dependency routes through changed code, the test should run, but the stale coverage map says it shouldn’t.
Handling this correctly requires treating coverage data as having an expiry. If a test’s coverage map was recorded more than K commits ago, include the test unconditionally. Or regenerate coverage maps on every merge to the main branch and use those maps for all PRs that descend from that merge point. The freshness window becomes a tunable parameter that trades completeness (shorter window) against efficiency (longer window). Tests with no coverage data at all, because they’re new or were added after the last full coverage run, should always be included; there’s no safe alternative.
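The expiry rule can be sketched as follows. The assumptions are loud ones: commits are represented as increasing integer sequence numbers, CoverageRecord and select_with_expiry are hypothetical names, and max_age plays the role of K.

```ruby
require "set"

# Hypothetical record: the files a test covered, plus the sequence number of
# the commit at which that coverage was recorded.
CoverageRecord = Struct.new(:files, :recorded_at_commit)

# Select tests for a change, treating missing or expired coverage data as
# "must run" rather than "safe to skip".
def select_with_expiry(all_tests, records, changed_files, head_commit, max_age)
  changed = changed_files.to_set
  all_tests.select do |test|
    record = records[test]
    next true if record.nil?                                        # no data
    next true if head_commit - record.recorded_at_commit > max_age  # stale
    record.files.intersect?(changed)                                # fresh
  end
end

SAMPLE_TESTS = %w[a_spec.rb b_spec.rb c_spec.rb d_spec.rb].freeze
SAMPLE_RECORDS = {
  "a_spec.rb" => CoverageRecord.new(Set["lib/money.rb"], 110),   # fresh, hit
  "b_spec.rb" => CoverageRecord.new(Set["lib/webhook.rb"], 115), # fresh, miss
  "c_spec.rb" => CoverageRecord.new(Set["lib/webhook.rb"], 90),  # stale
  # d_spec.rb has no record at all
}.freeze

select_with_expiry(SAMPLE_TESTS, SAMPLE_RECORDS, ["lib/money.rb"], 120, 20)
# => ["a_spec.rb", "c_spec.rb", "d_spec.rb"]
```

Only b_spec.rb is skipped: it is the one test with fresh data that shows no overlap with the change.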
Microsoft’s Test Impact Analysis for Azure DevOps handles this using binary-level instrumentation to track coverage at method granularity rather than line granularity, and they regenerate the impact map on every test run rather than storing a persistent snapshot. The per-run overhead is higher but the staleness risk disappears. These are different points on the same trade-off curve.
Safety Valves and Correctness Guarantees
The correctness requirement for selective test execution is asymmetric. Running an unnecessary test wastes compute. Missing a test that should have run lets a regression into the codebase, potentially into production. A system that occasionally skips needed tests will erode confidence in CI, which is worse than having slow CI.
This asymmetry drives the need for explicit fallback behavior. Selective execution should fall back to running the full suite when coverage data is too old, when the change is too large or touches too many files, or when the changed files include infrastructure that the test suite itself depends on: test helpers, database setup code, configuration loaders, anything that virtually every test file requires implicitly. Knowing which files are load-bearing for the test infrastructure itself is a meta-dependency problem that requires care and usually manual annotation.
There’s also the question of confidence thresholds. If selective analysis determines that 90% of tests should run anyway, the savings from skipping the remaining 10% are small enough that falling back to the full suite may be the right call, especially if the coverage data is not fresh. Most production implementations of selective test execution include some form of this heuristic.
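These fallback rules combine naturally into a single predicate. The sketch below is illustrative only: the path patterns, the 200-file limit, and the 0.9 ratio are assumed values rather than numbers from any production system, and run_full_suite? is a hypothetical name.

```ruby
# Hypothetical, manually annotated list of load-bearing test infrastructure.
FULL_SUITE_PATTERNS = [
  %r{\Aspec/spec_helper\.rb\z},
  %r{\Aspec/support/},
  %r{\Aconfig/},
  %r{\Adb/schema\.rb\z},
].freeze

MAX_CHANGED_FILES  = 200 # assumed cutoff for "change is too large"
MAX_SELECTED_RATIO = 0.9 # assumed cutoff for "selection saves too little"

# Decide whether to abandon selective execution and run everything.
def run_full_suite?(changed_files:, selected_count:, total_count:, coverage_fresh:)
  return true unless coverage_fresh
  return true if changed_files.size > MAX_CHANGED_FILES
  return true if changed_files.any? do |file|
    FULL_SUITE_PATTERNS.any? { |pattern| pattern.match?(file) }
  end
  return true if total_count.positive? &&
                 selected_count.to_f / total_count >= MAX_SELECTED_RATIO
  false
end

run_full_suite?(changed_files: ["lib/charge.rb"], selected_count: 40,
                total_count: 1000, coverage_fresh: true)
# => false
```

Each early return errs toward running more tests, which matches the asymmetry described above: a wasted run costs compute, while a missed test costs correctness.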
What the Reduction Means in Practice
The practical value compounds across a CI system. In a merge queue model, where every PR waits for CI to pass before landing, a significant reduction in the tests run per PR translates to higher merge throughput at the same infrastructure spend. The wall-clock time for a single CI run may improve less dramatically than the total compute savings, because test runners are heavily parallelized, but queue pressure and resource cost both improve meaningfully.
There’s also a feedback loop. Shorter CI times encourage smaller, more focused commits. Smaller commits affect fewer files, trigger fewer tests, and complete CI faster. The system becomes self-reinforcing when the granularity of commits matches the granularity of test impact analysis.
Stripe’s investment in this problem, and their ability to use Sorbet’s type graph as a dependency oracle, puts them in an unusual position. The combination of scale (50M lines), language (Ruby), and tooling (Sorbet) makes their approach worth studying even for teams that will never operate at that size. The fundamental question, what does this change actually affect, is one every growing codebase eventually has to answer. In static languages, the answer often falls out of the type system for free. In Ruby, getting it right requires building the infrastructure to derive that answer from the type information you’ve chosen to add.