Selective Test Execution in Dynamic Languages: Lessons from Stripe's 50M-Line Ruby Monorepo
Source: lobsters
Every large engineering organization eventually hits the same wall: the test suite grows faster than the hardware budget. At some point, running every test on every commit stops being a reasonable policy, and the question becomes how to run fewer tests without missing regressions.
Stripe recently published a detailed account of how they tackled this for their main Ruby monorepo, a codebase exceeding 50 million lines. The result is faster CI through selective test execution: running only the tests most likely affected by a given change, rather than the entire suite. The engineering behind it is considerably more interesting than the summary suggests, particularly because Ruby makes test impact analysis harder than it first appears.
What Test Selection Actually Requires
The core idea is simple: build a mapping from source files to the tests that exercise them, then on each commit, consult that mapping to identify which tests are relevant to the changed files.
The mapping is the hard part.
There are two broad strategies for building it. Static analysis reads source code and traces import chains, call graphs, or explicit dependency declarations to determine which code loads which. Dynamic analysis runs the tests with coverage instrumentation and records which lines each test actually exercises. Each approach has different strengths, and the right choice depends heavily on the language.
Why Ruby Resists Static Analysis
Ruby’s dynamism is both its most useful property and its most difficult one for tooling. Several features break static dependency analysis in ways that are hard to work around.
method_missing intercepts undefined method calls and can dispatch them anywhere at runtime. Dynamic method definition via define_method or class_eval creates methods whose names are not statically visible. send lets callers invoke methods by string or symbol, routing around any call graph a static tool could construct. Zeitwerk, the modern Rails autoloader, defers file loading until a constant is first referenced, so the full dependency graph only materializes at runtime.
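To make the problem concrete, here is a minimal sketch of the kinds of call paths described above. The Ledger class and its method names are hypothetical, invented for illustration; the point is only that neither call at the bottom can be resolved by reading the source text.

```ruby
# Hypothetical class: none of these names come from a real codebase.
class Ledger
  # define_method: the method names exist only after this loop has run.
  %w[usd eur].each do |currency|
    define_method("balance_#{currency}") { "#{currency}:0" }
  end

  # method_missing: any find_by_* call is handled here at runtime.
  def method_missing(name, *args)
    key = name.to_s
    return "lookup:#{key.delete_prefix('find_by_')}" if key.start_with?('find_by_')

    super
  end

  def respond_to_missing?(name, include_private = false)
    name.to_s.start_with?('find_by_') || super
  end
end

ledger = Ledger.new
currency = 'eur' # in real code this might come from config or a request
via_send = ledger.send("balance_#{currency}") # "balance_eur" never appears literally
via_missing = ledger.find_by_customer         # there is no "def find_by_customer"

puts via_send, via_missing
```

A static tool scanning for the string balance_eur or a definition of find_by_customer finds nothing, yet both calls succeed.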
ActiveRecord multiplies these problems. A single has_many :orders declaration generates orders, orders=, order_ids, and order_ids=, and the collection it returns responds to build, create, and more. The set of methods a class responds to at runtime is not readable from the class file; it depends on associations, concerns, callbacks, and any number of gems that reopen core classes.
This is why teams working with large Ruby codebases tend to lean toward coverage-based approaches. Ruby’s standard library has shipped a Coverage module since 1.9, and it has grown more capable in later releases, with branch and method coverage arriving in 2.5:
# Ruby 1.9+: basic line coverage
require 'coverage'
Coverage.start
require_relative 'app'
# ... run a test
result = Coverage.result
# { "/path/to/file.rb" => [nil, 1, 0, 3, ...] }
# Ruby 2.5+: line, branch, and method coverage
Coverage.start(lines: true, branches: true, methods: true)
The simplecov gem has built on this for years to give teams readable coverage reports. For selective test execution, the goal is different: instead of a report, you want a reverse mapping from files to tests.
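One way to get per-test data out of the Coverage module is to reset its counters before each test and inspect what was executed afterwards. The sketch below assumes Ruby 2.6 or later (for the stop: and clear: keywords on Coverage.result) and uses two throwaway files and lambdas as stand-ins for real source files and a real test runner; files_touched_per_test is a hypothetical helper, not part of any library.

```ruby
require 'coverage'
require 'tmpdir'

# Runs each test in turn and records which files executed at least one line.
# `tests` is a hypothetical { name => callable } hash standing in for a real runner.
def files_touched_per_test(tests)
  tests.each_with_object({}) do |(name, body), touched|
    # Zero the counters left over from file loading or from earlier tests.
    Coverage.result(stop: false, clear: true)
    body.call
    touched[name] = Coverage.peek_result.select do |_path, data|
      data[:lines].any? { |count| count && count > 0 }
    end.keys
  end
end

touched = Dir.mktmpdir do |dir|
  File.write(File.join(dir, 'invoice.rb'), "def invoice_total(items)\n  items.sum\nend\n")
  File.write(File.join(dir, 'refund.rb'), "def refund_amount(total)\n  -total\nend\n")

  Coverage.start(lines: true) # must run before the files under test are loaded
  require File.join(dir, 'invoice.rb')
  require File.join(dir, 'refund.rb')

  result = files_touched_per_test({
    'invoice test' => -> { invoice_total([1, 2]) },
    'refund test'  => -> { refund_amount(5) }
  })
  Coverage.result # stop collecting
  result.transform_values { |paths| paths.map { |path| File.basename(path) } }
end

p touched
```

Clearing the counters matters because loading a file already executes its def lines; without the reset, every test would appear to touch every loaded file.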
The Coverage-Based Approach
The mechanics work roughly like this. During a baseline run, or continuously over time as tests execute in CI, each test is run with coverage tracking enabled. The coverage data is stored as a mapping: for each test, which source files it executed at least one line in. Invert that structure and you get a reverse index: for each source file, which tests need to run when that file changes.
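The inversion step is small enough to sketch directly. This assumes the collected data is shaped as a hash from test file to the source files it executed; build_reverse_index and the paths are hypothetical names used for illustration.

```ruby
require 'set'

# Turns { test => [files it executed] } into { file => Set[tests that executed it] }.
def build_reverse_index(files_by_test)
  files_by_test.each_with_object({}) do |(test, files), index|
    files.each { |file| (index[file] ||= Set.new) << test }
  end
end

# Hypothetical paths, for illustration only.
files_by_test = {
  'spec/models/order_spec.rb'   => ['app/models/order.rb', 'app/models/customer.rb'],
  'spec/models/invoice_spec.rb' => ['app/models/invoice.rb', 'app/models/order.rb']
}

reverse_index = build_reverse_index(files_by_test)
reverse_index.each { |file, tests| puts "#{file}: #{tests.to_a.join(', ')}" }
```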
When a developer opens a pull request, the CI system diffs the commit against the base branch, collects the changed file set, and queries the reverse index to identify the relevant tests. Only those tests are scheduled.
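The query side might look like the following sketch. It assumes a reverse index shaped as a hash from source file to a set of test files, and a list of changed paths such as the one git diff --name-only prints; tests_for_changes is a hypothetical name.

```ruby
require 'set'

# Unions the tests recorded for every changed file. Files the index has
# never seen contribute nothing here, which is why a real system needs a fallback.
def tests_for_changes(reverse_index, changed_files)
  changed_files.each_with_object(Set.new) do |file, selected|
    selected.merge(reverse_index.fetch(file, []))
  end
end

# Hypothetical index and change set.
reverse_index = {
  'app/models/order.rb'   => Set['spec/models/order_spec.rb', 'spec/models/invoice_spec.rb'],
  'app/models/invoice.rb' => Set['spec/models/invoice_spec.rb']
}
# In CI this list could come from `git diff --name-only <base>...HEAD`.
changed_files = ['app/models/invoice.rb', 'README.md']

selected = tests_for_changes(reverse_index, changed_files)
puts selected.to_a
```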
The precision of this approach is substantially better than static analysis for Ruby because it captures actual runtime behavior. When ActiveRecord generates has_many accessors, and a test calls one of them, the coverage data records that the test touched the model file. No static analyzer needs to understand the metaprogramming machinery that generated the method.
There is a category of case that deserves special handling: shared test helpers, factories, and support files. If a factory definition changes, a coverage-based system might conclude that only the tests whose recorded coverage includes that file need to run. In practice, the change could affect the behavior of any test that uses what the factory produces. These cases typically require explicit tagging or a conservative fallback that runs every test that loads the relevant support file.
The Freshness Problem
Coverage data has a shelf life. Code evolves, tests get added and removed, and a mapping built from coverage collected weeks ago may not reflect current dependencies. A file that was lightly tested before may now be exercised by many new tests. More dangerously, a file that has quietly become infrastructure for a large part of the codebase may not show up that way in stale coverage data, leading the system to under-select tests when it changes.
This creates a tension between coverage data freshness and the cost of regenerating it. Full coverage collection requires running the entire test suite, which is exactly what selective execution is trying to avoid. The practical resolution is a combination of strategies: periodic full runs on a schedule or on merge to the main branch, incremental updates from each individual test run, and a conservative fallback when the mapping is uncertain or the changed files fall outside what the coverage data covers.
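The incremental-update strategy can be sketched as replacing everything the index knows about one test whenever that test runs with coverage enabled. The index shape (source file to set of tests) and the refresh_test! helper are assumptions made for illustration.

```ruby
require 'set'

# Replaces the index entries for a single test with its latest coverage.
def refresh_test!(reverse_index, test, touched_files)
  # Forget the stale associations first.
  reverse_index.each_value { |tests| tests.delete(test) }
  # Drop files no test touches anymore, so they read as "unknown"
  # rather than as "known to need no tests".
  reverse_index.delete_if { |_file, tests| tests.empty? }
  # Record what the test touched this time.
  touched_files.each { |file| (reverse_index[file] ||= Set.new) << test }
  reverse_index
end

# Hypothetical state before and after one test run.
reverse_index = {
  'app/models/order.rb'  => Set['order_spec.rb', 'invoice_spec.rb'],
  'app/models/legacy.rb' => Set['order_spec.rb']
}
refresh_test!(reverse_index, 'order_spec.rb', ['app/models/order.rb', 'app/models/cart.rb'])
p reverse_index
```

Dropping emptied entries is a deliberate choice in this sketch: a file with no recorded tests is treated as uncertain, which pairs naturally with a run-everything fallback.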
The correctness guarantee matters more than the speed gain. When coverage data is ambiguous, or when a changed file touches infrastructure that plausibly affects everything (initializers, configuration loaders, base classes), running more tests is the right call. A regression that escapes to the main branch because a test was incorrectly skipped costs far more than a few extra minutes of CI time.
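A conservative selector along these lines might treat two situations as uncertain: a changed file under a path that plausibly affects everything, and a changed file the index has never seen. The prefixes, the index shape, and select_with_fallback are all hypothetical; a real system would tune the list and probably exclude documentation-only changes before reaching this step.

```ruby
require 'set'

# Hypothetical prefixes for files that plausibly affect every test.
GLOBAL_PREFIXES = ['config/', 'lib/core_ext/', 'spec/support/', 'spec/factories/'].freeze

def select_with_fallback(reverse_index, changed_files, all_tests)
  uncertain = changed_files.any? do |file|
    file.start_with?(*GLOBAL_PREFIXES) || !reverse_index.key?(file)
  end
  # When in doubt, run everything.
  return Set.new(all_tests) if uncertain

  changed_files.reduce(Set.new) { |selected, file| selected.merge(reverse_index[file]) }
end

# Hypothetical data.
all_tests = ['order_spec.rb', 'invoice_spec.rb', 'refund_spec.rb']
reverse_index = { 'app/models/order.rb' => Set['order_spec.rb'] }

narrow  = select_with_fallback(reverse_index, ['app/models/order.rb'], all_tests)
infra   = select_with_fallback(reverse_index, ['config/initializers/payments.rb'], all_tests)
unknown = select_with_fallback(reverse_index, ['app/models/brand_new.rb'], all_tests)

puts narrow.size, infra.size, unknown.size
```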
How Other Ecosystems Handle This
Comparing Ruby with ecosystems where the dependency graph is explicit shows why test selection is harder here than elsewhere.
In Go, the package import graph is explicit and complete. go list -deps ./... reports every package a build depends on, and go list -json exposes each package’s direct imports, while go mod graph works one level up, at module granularity. Tooling built around go test can therefore implement affected-package detection without relying on runtime data, because the import graph is reliable.
JavaScript tooling for monorepos, particularly Nx and Turborepo, builds project graphs from explicit imports and package.json dependency declarations. Jest’s --onlyChanged flag uses git to identify changed files and traverses the module dependency graph. This works well for TypeScript code, where imports are almost always static, though dynamic require() and import() calls still create gaps.
Bazel and Pants take a different approach entirely: they require explicit BUILD files that declare dependencies, making the dependency graph a first-class artifact that build and test tooling can consume directly. This demands more maintenance but gives the most precise and reliable selective execution. Google’s internal build system, Blaze, from which Bazel was open-sourced, runs tests at a scale that dwarfs most other monorepos, using this kind of explicit graph as the foundation.
Stripe occupies an unusual position in the Ruby ecosystem because they also maintain Sorbet, a static type checker for Ruby. Sorbet’s type signatures and its understanding of method definitions give static analysis tools more to work with than bare Ruby. Having Sorbet across a codebase of this size means some of the dynamism that defeats purely static approaches is partially tamed, and the combination of Sorbet’s static information with runtime coverage data is likely more powerful than either approach alone.
Infrastructure at This Scale
At 50 million lines and the test suite size that implies, the machinery surrounding selective execution is substantial. You need persistent storage for the coverage mapping, keyed in a way that survives file renames and refactors. You need a query service that CI can call during the test setup phase to retrieve the relevant test set quickly. You need logic to handle edge cases: deleted files, moved files, new tests with no coverage history, and changes to files that are loaded by nearly every test.
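Some of these edge cases concern the tests themselves rather than the code they exercise. The sketch below adjusts an already-selected set using status and path pairs in the style of git diff --name-status output (A, M, D, and R followed by a similarity score). It assumes tests are identified by file path and that test files follow a _spec.rb or _test.rb naming convention; both helpers are hypothetical.

```ruby
require 'set'

# Hypothetical naming convention for test files.
def test_file?(path)
  path.end_with?('_spec.rb', '_test.rb')
end

# `changes` holds [status, path] pairs, or [status, old_path, new_path] for renames.
def adjust_for_test_changes(selected, changes)
  changes.each_with_object(selected.dup) do |(status, path, new_path), result|
    case status[0]
    when 'A', 'M'
      # New or edited test: its coverage history is missing or stale, so always run it.
      result << path if test_file?(path)
    when 'D'
      # Deleted test: it can no longer be scheduled.
      result.delete(path)
    when 'R'
      # Renamed test: schedule it under its new path.
      result.delete(path)
      result << new_path if test_file?(new_path)
    end
  end
end

# Hypothetical selection and change list.
selected = Set['spec/order_spec.rb', 'spec/old_name_spec.rb', 'spec/removed_spec.rb']
changes = [
  ['A', 'spec/brand_new_spec.rb'],
  ['M', 'app/models/order.rb'],
  ['D', 'spec/removed_spec.rb'],
  ['R100', 'spec/old_name_spec.rb', 'spec/new_name_spec.rb']
]

adjusted = adjust_for_test_changes(selected, changes)
puts adjusted.to_a.sort
```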
You also need observability into the system itself. When a regression escapes, you need to know whether selective execution was responsible and, if so, why the relevant test was not selected. This requires logging the selection decisions and making them auditable after the fact.
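An auditable decision can be as simple as one structured log line per CI run. The following sketch shows one possible shape; every field name is an assumption, as is the idea of recording, per changed file, the tests the index attributed to it at decision time.

```ruby
require 'json'
require 'set'
require 'time'

# Builds one audit record per CI run. All field names are hypothetical.
def selection_record(commit:, changed_files:, reverse_index:, selected:, fallback:)
  {
    commit: commit,
    decided_at: Time.now.utc.iso8601,
    fallback: fallback, # true when the run-everything path was taken
    # For each changed file, the tests the index attributed to it.
    matched: changed_files.to_h { |file| [file, reverse_index.fetch(file, []).to_a.sort] },
    selected: selected.to_a.sort
  }
end

reverse_index = { 'app/models/order.rb' => Set['spec/models/order_spec.rb'] }
record = selection_record(
  commit: 'abc123', # placeholder identifier
  changed_files: ['app/models/order.rb', 'app/models/refund.rb'],
  reverse_index: reverse_index,
  selected: Set['spec/models/order_spec.rb'],
  fallback: false
)

line = JSON.generate(record)
puts line
```

With a record like this, an empty list under matched for a changed file is an immediate clue when investigating an escaped regression.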
The economics are compelling at scale. If a test suite that takes 45 minutes to run in full drops to under 10 minutes with accurate selective execution, the gain is not just a CI cost reduction; it compresses the entire feedback loop for every engineer on the platform. Code review moves faster when the checks finish sooner. Broken builds are caught and fixed before they block others. The aggregate effect on team velocity is larger than the raw time numbers suggest.
What Compounds Over Time
Once a reliable file-to-test mapping exists, it becomes useful for more than just skipping tests. The same data can estimate the blast radius of a proposed change before it merges, helping reviewers prioritize attention. It can order test execution so the tests most likely to fail on a given diff run first, reducing the time to first signal. It can feed into pull request summaries, highlighting which components a change touches based on what tests cover them.
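Two of these secondary uses fall out of the same index with very little code. In the sketch below, blast radius is approximated as the number of tests recorded per changed file, and tests are ordered by how many changed files they touch. That overlap count is only a stand-in heuristic for "most likely to fail", chosen for illustration; both function names are hypothetical.

```ruby
require 'set'

# Number of tests the index associates with each changed file.
def blast_radius(reverse_index, changed_files)
  changed_files.to_h { |file| [file, reverse_index.fetch(file, Set.new).size] }
end

# Orders tests so that those touching the most changed files run first.
# Ties are broken alphabetically to keep the order deterministic.
def prioritize(reverse_index, changed_files)
  overlap = Hash.new(0)
  changed_files.each do |file|
    reverse_index.fetch(file, []).each { |test| overlap[test] += 1 }
  end
  overlap.sort_by { |test, count| [-count, test] }.map(&:first)
end

# Hypothetical index and change set.
reverse_index = {
  'app/models/order.rb'   => Set['order_spec.rb', 'checkout_spec.rb'],
  'app/models/invoice.rb' => Set['checkout_spec.rb', 'invoice_spec.rb']
}
changed_files = ['app/models/order.rb', 'app/models/invoice.rb']

radius = blast_radius(reverse_index, changed_files)
ordered = prioritize(reverse_index, changed_files)
p radius, ordered
```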
The coverage pipeline, in other words, is infrastructure that pays for itself multiple times. The initial investment in collection, storage, and querying is significant, but the data it produces is general enough to support a range of tooling on top.
What Stripe’s work illustrates, across its full scope, is that monorepo tooling at this scale is its own engineering discipline, not a configuration problem. The right mapping strategy depends on the language’s static analysis surface, the quality and freshness of coverage data, and the acceptable tradeoff between precision and safety. For Ruby, coverage data is the foundation because static analysis cannot be trusted to be complete, and building that foundation reliably at 50 million lines is the actual engineering challenge.