What Coverage Data Can Do That Static Analysis Cannot in a Ruby Monorepo

Stripe’s engineering blog recently described their selective test execution system — a way to run only the tests that could be affected by a given pull request, shaving CI times from 60-90 minutes down to roughly 10-15 minutes on typical PRs, skipping around 70-80% of tests. The high-level idea is simple enough: figure out what changed, figure out what tests cover those things, run only those tests.

But the gap between that sentence and a working system at 50 million lines of Ruby is where the interesting engineering lives. The central challenge is one that developers working in Go, Rust, or Java largely avoid: in dynamically-typed languages, the dependency graph between files is not fully knowable without running the code.

The Static Analysis Dead End

The first instinct for any test selection system is static analysis. Parse the source files, find all require and require_relative statements, build a directed graph where an edge means “this file loads that file,” then compute which tests transitively depend on any changed file. This is essentially what Bazel’s query language gives you for free in build systems with explicit dependency declarations:

bazel query "rdeps(//..., set(//path/to:changed_lib))"

In a language like Go, explicit imports make this graph essentially complete by construction. Every package dependency is declared, and the compiler rejects undeclared ones. You can build the import graph with a single parse pass.

Ruby is a different situation. The require statement can take any string expression, not just a literal:

require "#{platform}_adapter"
require config[:payment_provider]
PROVIDERS.each { |p| require "providers/#{p}" }

Beyond dynamic require, Rails’ Zeitwerk autoloader loads constants lazily on first reference. A class like PaymentIntent may never appear in any require statement; it is loaded automatically when the constant is first accessed. Metaprogramming adds more edges: const_get, define_method, respond_to_missing?, and method_missing can create runtime dependencies that have no syntactic expression you can grep for.

Static analysis of Ruby source finds the easy edges but misses enough of the real graph that you cannot trust it alone. Miss an edge in your dependency graph and you will skip a test that should run; skip that test and you will ship a regression. At any meaningful scale, that is not a trade-off you can accept.
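
To make the gap concrete, here is a minimal sketch of a line-based require scanner. It is an illustration only: scan_requires and both regexes are hypothetical, and a real tool would use a proper parser. It records an edge when the argument is a plain string literal and flags everything else as unresolved.

```ruby
# Hypothetical sketch: split `require` calls into edges a static graph can
# record (plain string literals) and calls whose target is only known at runtime.
CALL    = /\brequire(?:_relative)?[ (]/
LITERAL = /\brequire(?:_relative)?[ (]\s*(['"])([^'"#]*)\1/

def scan_requires(source)
  edges = []
  unresolved = []
  source.each_line do |line|
    next unless line.match?(CALL)
    if (m = line.match(LITERAL))
      edges << m[2]             # literal path: safe to add to the graph
    else
      unresolved << line.strip  # interpolation, variable, or method call
    end
  end
  { edges: edges, unresolved: unresolved }
end

sample = <<~'RUBY'
  require "json"
  require_relative "lib/payment_intent"
  require "#{platform}_adapter"
  require config[:payment_provider]
  PROVIDERS.each { |p| require "providers/#{p}" }
RUBY

report = scan_requires(sample)
puts "static edges: #{report[:edges].inspect}"
puts "unresolved calls: #{report[:unresolved].length}"
```

The scanner resolves the two literal requires and cannot say anything about the other three. Constants loaded by an autoloader never reach it at all, because they have no require call to scan.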

Runtime Coverage as a Dependency Graph

The alternative is to observe the dependency graph empirically by running the tests. Ruby’s built-in Coverage module records which files are loaded and which lines are executed during a test run. The key API for per-test granularity is Coverage.result(stop: false, clear: true); the stop and clear keyword arguments were introduced in Ruby 2.6:

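require "coverage"

# Start before application code is loaded so its files are tracked.
Coverage.start(lines: true)

RSpec.configure do |config|
  config.around(:each) do |example|
    example.run
    # Snapshot the counters without stopping, then reset them for the next example.
    result = Coverage.result(stop: false, clear: true)
    # The result lists every tracked file, so keep only files with executed lines.
    touched = result.select { |_path, cov| cov[:lines].any? { |hits| hits.to_i > 0 } }.keys
    TestCoverageStore.record(example.id, touched)
  end
end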

The stop: false argument keeps the Coverage module active across multiple tests; clear: true resets the accumulated data after each snapshot. Without these keyword arguments (added in Ruby 2.6), you would have to stop and restart Coverage around each test or diff successive Coverage.peek_result snapshots, both of which add cost at scale.

After a full test run, you have a mapping from every test to every file it loaded or executed. This is the forward coverage map: test → set of files. You invert it to get the map you actually need for selection: file → set of tests that covered it.

When a PR changes files A, B, and C, you look each up in the inverted map and take the union of their test sets. That union is the subset of the test suite that needs to run. According to the recorded map, no other test loaded a changed file, so no other test should fail because of those changes, provided the map is still current.
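
The inversion and lookup can be sketched in a few lines of Ruby. This is a minimal illustration, not Stripe’s implementation: invert_coverage and select_tests are hypothetical names, and the forward map is assumed to be a plain Hash from test ID to the file paths that test executed.

```ruby
require "set"

# forward: { test_id => [file, ...] }  ->  inverted: { file => Set[test_id, ...] }
def invert_coverage(forward)
  inverted = Hash.new { |hash, file| hash[file] = Set.new }
  forward.each do |test_id, files|
    files.each { |file| inverted[file] << test_id }
  end
  inverted
end

# Union of the test sets of every changed file. Files missing from the map
# contribute nothing here; a real system would treat them as a reason to
# fall back to a full run.
def select_tests(inverted, changed_files)
  changed_files.reduce(Set.new) do |selected, file|
    selected | inverted.fetch(file, Set.new)
  end
end

forward = {
  "spec/charge_spec.rb[1:1]"  => ["lib/charge.rb", "lib/money.rb"],
  "spec/refund_spec.rb[1:1]"  => ["lib/refund.rb", "lib/money.rb"],
  "spec/invoice_spec.rb[1:1]" => ["lib/invoice.rb"],
}

inverted = invert_coverage(forward)
selected = select_tests(inverted, ["lib/money.rb"])
puts selected.to_a.sort
```

Changing lib/money.rb selects the charge and refund examples and leaves the invoice example out, because the invoice example never executed that file.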

The coverage data needs to live somewhere persistent and accessible from CI. A shared object store works naturally: after a full run, serialize the inverted map to a MessagePack or JSON blob and upload it; at the start of each PR’s CI run, download and deserialize it. The serialized map for a large monorepo can be substantial, but it compresses well and a single fetch at CI startup is manageable.
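
As a rough sketch of the serialization step, the following uses only Ruby’s standard library (JSON plus gzip). The names dump_map and load_map are hypothetical, and the upload and download against the object store are deliberately left out because they depend on the storage client in use.

```ruby
require "json"
require "set"
require "zlib"

# Turn { file => Set[test_id] } into a gzip-compressed JSON blob.
def dump_map(inverted)
  plain = inverted.transform_values { |tests| tests.to_a.sort }
  Zlib.gzip(JSON.generate(plain))
end

# Reverse of dump_map: decompress, parse, and rebuild the Sets.
def load_map(blob)
  JSON.parse(Zlib.gunzip(blob)).transform_values { |tests| Set.new(tests) }
end

inverted = {
  "lib/money.rb"   => Set["spec/charge_spec.rb[1:1]", "spec/refund_spec.rb[1:1]"],
  "lib/invoice.rb" => Set["spec/invoice_spec.rb[1:1]"],
}

blob = dump_map(inverted)
restored = load_map(blob)
puts "blob size: #{blob.bytesize} bytes"
puts "round trip ok: #{restored == inverted}"
```

Sorting the test IDs before writing keeps the blob deterministic for a given map, which makes it easier to compare or cache successive uploads.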

Comparison: When the Build System Gives You the Graph

It is worth noting what Stripe had to build compared to what Google and Meta get from their build systems. Google’s Test Automation Platform (TAP) runs roughly 150 million test cases per day. Selective execution is central to making that possible. But because Google’s monorepo uses Blaze with explicit BUILD files, every dependency is declared:

# BUILD
cc_library(
    name = "payment_processor",
    srcs = ["payment_processor.cc"],
    deps = [
        "//stripe/crypto:aes",
        "//stripe/db:client",
    ],
)

The dependency graph is not inferred or measured — it is enforced by the build system. rdeps(//..., //stripe/payment_processor) returns the exact set of targets that transitively depend on it. No coverage data is needed; no risk of missed edges from dynamic loading. The build system’s correctness guarantees subsume the test selection problem.
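
Conceptually, an rdeps query is a reverse reachability search over the declared edges. The sketch below shows that idea on a tiny hand-written graph. It is not how Bazel or Blaze implement it, and the target labels are made up for the example.

```ruby
require "set"

# deps maps each target to the targets it directly depends on.
# Returns the changed target plus everything that transitively depends on it.
def rdeps(deps, changed)
  reverse = Hash.new { |hash, target| hash[target] = [] }
  deps.each do |target, direct_deps|
    direct_deps.each { |dep| reverse[dep] << target }
  end

  seen = Set[changed]
  queue = [changed]
  until queue.empty?
    current = queue.shift
    reverse.fetch(current, []).each do |dependent|
      queue << dependent if seen.add?(dependent) # add? is nil when already seen
    end
  end
  seen
end

deps = {
  "//payments:processor"      => ["//crypto:aes", "//db:client"],
  "//payments:processor_test" => ["//payments:processor"],
  "//billing:invoice"         => ["//db:client"],
  "//crypto:aes"              => [],
  "//db:client"               => [],
}

affected = rdeps(deps, "//db:client")
puts affected.to_a.sort
```

Because every edge is declared, the search is exhaustive: a change to the database client reaches the processor, its test, and the invoice target, while a change to the crypto target would not reach the invoice target.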

Meta’s approach with Buck2 is structurally identical for the static layer, though they add a coverage layer on top for dynamic dependencies that TARGETS files do not capture — the same hybrid strategy Stripe uses, but starting from a stronger static foundation.

For a pure Ruby codebase without a Bazel-style build system, that static foundation does not exist. You build the dependency graph from coverage observations, accept that it is an approximation, and then engineer around the approximation carefully.

Safety Valves and the Staleness Problem

Coverage-based test selection has one structural problem: the coverage map is a snapshot of a past state. If file B starts loading file A after coverage was collected but before the PR under test was created, your map will not record that dependency. A change to A will not trigger B’s tests. You will miss the failure.

Production systems address this with several safety mechanisms:

Age-based full runs. If coverage data is more than N commits or N days old, run the full suite regardless. Azure DevOps’ Test Impact Analysis defaults to running everything every 50 builds. Stripe uses a similar configurable threshold.
Changed-too-much fallback. If the diff touches more than X% of the codebase, run everything. The selected set approaches the full suite at that point anyway, and the risk that the map is stale for some changed file is higher.
Always-run lists. Flaky tests, integration smoke tests, and security-critical test paths run unconditionally. They are maintained manually and kept small.
Unknown file fallback. Any file not in the coverage database — newly created, recently renamed — causes all tests to run. When in doubt, over-test.

The combination of these valves means coverage-based test selection in practice runs a safe superset of the affected tests, where the superset is usually small. The system is conservative by design, not exact. That framing matters: the goal is never to run the minimum possible test set; it is to run a small, correct superset of it.
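
The valves above compose naturally into a single decision function that runs before selection. The following is a sketch under stated assumptions: Policy, plan_run, and the threshold values are hypothetical, and the order of the checks is one reasonable choice rather than a description of any production system.

```ruby
require "set"

# Hypothetical policy knobs; the numbers used below are placeholders.
Policy = Struct.new(:max_map_age_commits, :max_changed_fraction, :always_run,
                    keyword_init: true)

# Returns [:full, reason] or [:selective, Set of test IDs].
def plan_run(inverted:, changed_files:, map_age_commits:, total_files:, policy:)
  if map_age_commits > policy.max_map_age_commits
    return [:full, "coverage map is stale"]
  end
  if changed_files.length.to_f / total_files > policy.max_changed_fraction
    return [:full, "diff touches too much of the codebase"]
  end
  unknown = changed_files.reject { |file| inverted.key?(file) }
  return [:full, "file not in coverage map: #{unknown.first}"] unless unknown.empty?

  # Start from the always-run list, then add the tests covering each change.
  selected = changed_files.reduce(Set.new(policy.always_run)) do |acc, file|
    acc | inverted[file]
  end
  [:selective, selected]
end

inverted = {
  "lib/charge.rb" => Set["spec/charge_spec.rb"],
  "lib/refund.rb" => Set["spec/refund_spec.rb"],
}
policy = Policy.new(max_map_age_commits: 200, max_changed_fraction: 0.2,
                    always_run: ["spec/smoke_spec.rb"])

mode, tests = plan_run(inverted: inverted, changed_files: ["lib/charge.rb"],
                       map_age_commits: 10, total_files: 100, policy: policy)
puts "#{mode}: #{tests.to_a.sort.join(', ')}"
```

Every early return widens the run to the full suite, so any mistake in the checks errs toward over-testing rather than skipping a test that should have run.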

Tools Doing This at Smaller Scale

The same architecture appears in open-source tools for Python and JavaScript. pytest-testmon stores a SQLite database mapping each test to the files it covered (recorded via coverage.py’s C tracer), along with content checksums for each file. At the start of a test run, it computes which files have changed, queries the database for affected tests, and deselects everything else via pytest’s pytest_collection_modifyitems hook:

-- Simplified schema
CREATE TABLE node (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE file (id INTEGER PRIMARY KEY, path TEXT UNIQUE, checksum TEXT);
CREATE TABLE node_file (node_id INTEGER, file_id INTEGER);
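
The checksum column is what lets such a tool decide which files changed without consulting version control. The sketch below shows the idea in Ruby, to stay in one language with the rest of this article. It is not pytest-testmon’s code: changed_paths and deleted_paths are hypothetical helpers, SHA-256 is an assumed hash choice, and file contents are passed in as strings to keep the example self-contained.

```ruby
require "digest"

# stored_checksums: { path => hex digest recorded at the last run }
# current_sources:  { path => current file contents }
# A path counts as changed when it is new or its digest differs.
def changed_paths(stored_checksums, current_sources)
  current_sources.select { |path, source|
    stored_checksums[path] != Digest::SHA256.hexdigest(source)
  }.keys
end

# Paths recorded at the last run that no longer exist.
def deleted_paths(stored_checksums, current_sources)
  stored_checksums.keys - current_sources.keys
end

stored = {
  "lib/charge.rb" => Digest::SHA256.hexdigest("def charge; end\n"),
  "lib/refund.rb" => Digest::SHA256.hexdigest("def refund; end\n"),
  "lib/legacy.rb" => Digest::SHA256.hexdigest("def legacy; end\n"),
}
current = {
  "lib/charge.rb" => "def charge; end\n",          # unchanged
  "lib/refund.rb" => "def refund(amount); end\n",  # edited
  "lib/payout.rb" => "def payout; end\n",          # new file
}

changed = changed_paths(stored, current)
deleted = deleted_paths(stored, current)
puts "changed: #{changed.inspect}"
puts "deleted: #{deleted.inspect}"
```

The changed and deleted paths would then be looked up in the test-to-file table to decide which tests to keep.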

Jest’s --onlyChanged flag does something structurally different: it uses a statically-derived dependency graph, built by parsing require and import statements with a fast regex extractor rather than full AST parsing. This works reasonably well for JavaScript because require(variable) is uncommon in well-structured codebases. It fails at the same place Ruby’s static analysis fails: dynamic imports, plugin registries, and code that builds module paths at runtime.

The Stripe system sits between these two approaches and the Google/Meta approach: more rigorous than Jest’s static graph because it uses actual runtime coverage, less infrastructure-dependent than Bazel because it requires no BUILD file convention, and engineered for a scale that pytest-testmon was never designed to reach.

What This Means for Dynamic Languages at Scale

Dynamic languages accumulate a specific kind of technical debt as codebases grow: the dependency graph becomes progressively harder to reason about without running the code. The same properties that make Ruby expressive — autoloading, metaprogramming, open classes — are what make static analysis incomplete as a basis for test selection.

At small scale this does not matter much; you run all the tests and it takes a few minutes. At 50 million lines, it determines whether your CI is a 15-minute feedback loop or a 90-minute one. The answer Stripe found — build the dependency graph from empirical coverage observations, store it centrally, query it at PR time — is the right architecture for this class of problem.

Coverage data is frequently treated as a reporting artifact: a number that tells you how much of your code is tested. In a large dynamic-language codebase, it is something more useful than that. It is a runtime-accurate model of your codebase’s dependency structure, and Stripe’s system is a concrete demonstration of how far that model can take you.