What Stripe's 50M-Line Ruby Monorepo Teaches About Selective Test Execution
Source: lobsters
At some point in the life of a large engineering organization, CI stops being a speed bump and becomes a wall. Stripe crossed that threshold a long time ago. Their monorepo has grown to over 50 million lines of Ruby, and running every test on every commit is no longer viable. The Stripe engineering team recently published a detailed account of how they solved this with selective test execution: a system that figures out which tests actually need to run based on what changed, and only runs those.
The concept is not new. What makes Stripe’s implementation interesting is the specific challenge of doing this in a dynamically typed language with a complex autoloading system, at a scale where even the analysis step has to be fast.
The Core Idea
The fundamental insight behind selective test execution is that most code changes are local. If you change a helper used only by the billing subsystem, there is no reason to run payment routing tests. The system needs a way to map from “files that changed” to “tests that are affected,” and that map is a dependency graph.
Building this graph is where most of the complexity lives. There are two main approaches: static analysis and dynamic tracing.
Static analysis parses source files, follows imports and requires, and constructs a dependency graph without running any code. It is fast and can run at build time, but it struggles with dynamic dispatch, metaprogramming, and runtime-constructed class names. These are not edge cases in Ruby; they are common patterns.
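To make the limitation concrete, here is a minimal sketch of the static approach. It assumes a plain regex scan for literal require_relative calls, where a real analyzer would parse the syntax tree. The file names and contents are hypothetical and are not taken from Stripe's codebase.

```ruby
# Hypothetical source files, keyed by path. Real tooling would read these
# from disk and parse them properly rather than pattern-matching text.
SOURCES = {
  "billing/invoice.rb" => <<~RUBY,
    require_relative "tax"
    class Invoice; end
  RUBY
  "billing/tax.rb" => <<~RUBY,
    class Tax; end
  RUBY
  "billing/report.rb" => <<~RUBY,
    # The dependency on Invoice exists only at runtime.
    klass = Object.const_get("Invo" + "ice")
  RUBY
}.freeze

# Matches only literal, string-argument require_relative calls.
REQUIRE_RELATIVE = /^\s*require_relative\s+["']([^"']+)["']/

# Returns { file => [files it statically requires] } without running any code.
def static_dependencies(sources)
  sources.to_h do |path, code|
    deps = code.scan(REQUIRE_RELATIVE).flatten.map do |target|
      File.join(File.dirname(path), "#{target}.rb")
    end
    [path, deps]
  end
end

STATIC_GRAPH = static_dependencies(SOURCES)
pp STATIC_GRAPH
```

The scan finds the edge from billing/invoice.rb to billing/tax.rb, but it reports no dependencies for billing/report.rb, even though that file resolves Invoice when it runs. That missing edge is the kind of gap dynamic tracing is meant to close.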
Dynamic tracing runs tests under instrumentation, typically coverage tools, and records which source files are loaded or executed during each test. This captures reality accurately, including all the metaprogramming, but it requires running tests to build the map in the first place, and the map needs to stay current as code evolves.
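A rough sketch of the tracing idea follows. To keep it short it uses Ruby's built-in TracePoint in place of a coverage tool, which is a substitution made here for illustration and not something the writeup describes. The test names, file names, and helper methods are all hypothetical.

```ruby
require "tmpdir"

# Runs the block and returns the basenames of the source files whose
# methods were called while it ran.
def files_touched
  touched = {}
  trace = TracePoint.new(:call) { |tp| touched[File.basename(tp.path)] = true }
  trace.enable { yield }
  touched.keys.sort
end

# Builds { test name => [files touched] } for a set of hypothetical tests.
def build_test_map(tests)
  tests.to_h { |name, body| [name, files_touched(&body)] }
end

# Inverts the map into { file => [tests that touched it] }.
def invert(test_map)
  test_map.each_with_object(Hash.new { |h, k| h[k] = [] }) do |(test, files), index|
    files.each { |file| index[file] << test }
  end
end

def demo_test_map
  Dir.mktmpdir do |dir|
    # Two hypothetical source files, written to disk so each has its own path.
    File.write(File.join(dir, "tax.rb"), "def demo_tax(amount)\n  amount / 10\nend\n")
    File.write(File.join(dir, "invoice.rb"),
               "def demo_invoice_total(amount)\n  amount + demo_tax(amount)\nend\n")
    load File.join(dir, "tax.rb")
    load File.join(dir, "invoice.rb")

    build_test_map({
      "tax_test" => -> { demo_tax(100) },
      "invoice_test" => -> { demo_invoice_total(100) },
    })
  end
end

TEST_MAP = demo_test_map
pp TEST_MAP
pp invert(TEST_MAP)
```

The inverted index is the part a selection step would query: given a changed file, it lists the tests that exercised it the last time they ran. Note that the map is only as good as the most recent traced run.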
Production systems at this scale commonly use a hybrid: dynamic tracing to build the initial map, static analysis to augment and extend it for new or changed files, and periodic full runs to keep the data fresh.
Why Ruby Makes This Hard
Ruby’s dynamic nature is the central challenge. The language treats code loading as a runtime operation. require and require_relative can appear anywhere in a file: inside conditionals, inside methods, triggered by metaprogramming. Tools like Zeitwerk, the autoloader used by Rails and widely adopted across the Ruby ecosystem, load constants lazily based on file naming conventions rather than explicit require statements. This means the dependency between two files might never appear as a static relationship in the source; it materializes only at runtime when a constant is first referenced.
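The lazy-loading behavior can be seen with Ruby's built-in Module#autoload, the mechanism Zeitwerk builds on. The sketch below is a small illustration, not Zeitwerk itself; the constant LazyInvoiceDemo and its file are hypothetical.

```ruby
require "tmpdir"

# Returns [pending_before, pending_after]: whether the autoload for the
# hypothetical LazyInvoiceDemo constant is still pending before and after
# the constant is first referenced.
def autoload_demo
  Dir.mktmpdir do |dir|
    path = File.join(dir, "lazy_invoice_demo.rb")
    File.write(path, "class LazyInvoiceDemo\nend\n")

    # Register the constant. The file is not read at this point.
    Object.autoload(:LazyInvoiceDemo, path)
    pending_before = !Object.autoload?(:LazyInvoiceDemo).nil?

    # The first reference to the constant is what triggers the load.
    klass = LazyInvoiceDemo

    pending_after = !Object.autoload?(:LazyInvoiceDemo).nil?
    puts "loaded #{klass.name} on first reference"
    [pending_before, pending_after]
  end
end

AUTOLOAD_DEMO = autoload_demo
pp AUTOLOAD_DEMO
```

Nothing in the calling code names the file that defines the constant. The link between the two exists only in the autoload registration and at the moment of first reference, which is why a scan for require statements cannot see it.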
Stripe has invested heavily in Sorbet, their gradual type checker for Ruby. Sorbet’s type information provides a more reliable map of constant references than raw text analysis can. If Sorbet knows that Billing::Invoice is defined in app/models/billing/invoice.rb and a given file references Billing::Invoice, that relationship can be extracted statically with reasonable confidence. The combination of Sorbet’s type graph and dynamic coverage data gives Stripe something better than either approach alone.
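As a sketch of how resolved constant references could become file-level edges, assume two indexes are available: one mapping each constant to its defining file, and one listing the constants each file references. Both indexes below are hypothetical hand-written data. This is not Sorbet's API or output format, only the shape of the join such data would allow.

```ruby
# Hypothetical index: fully qualified constant => file that defines it.
DEFINITIONS = {
  "Billing::Invoice" => "app/models/billing/invoice.rb",
  "Billing::Tax" => "app/models/billing/tax.rb",
}.freeze

# Hypothetical index: file => constants referenced in that file.
REFERENCES = {
  "app/services/billing/charge.rb" => ["Billing::Invoice", "Billing::Tax"],
  "app/models/billing/invoice.rb" => ["Billing::Invoice", "Billing::Tax"],
  "app/models/billing/tax.rb" => ["Unknown::Constant"],
}.freeze

# Joins the two indexes into { file => [files it depends on] }.
# Unresolved constants and self-references are dropped.
def file_edges(definitions, references)
  references.to_h do |file, constants|
    targets = constants.filter_map { |name| definitions[name] }.uniq - [file]
    [file, targets]
  end
end

FILE_EDGES = file_edges(DEFINITIONS, REFERENCES)
pp FILE_EDGES
```

Constants that cannot be resolved produce no edge here, so a graph built this way still benefits from being cross-checked against runtime data.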
This is worth noting because Sorbet’s adoption was itself a massive engineering investment. One of the less obvious returns on that investment is exactly this kind of static analysis tooling becoming tractable. Coverage data becomes more trustworthy when it can be cross-referenced against a typed dependency graph.
What the System Actually Does
At a high level, the pipeline works like this:
- A CI run begins. The system computes the set of files changed relative to the base branch using git.
- The dependency graph is queried: for each changed file, which test files transitively depend on it?
- The union of those test files is the selected set. Only those tests run.
- Results are collected. If any test fails, engineers iterate on that smaller set.
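The selection step in this pipeline can be sketched as a reverse traversal of the dependency graph. The graph, the file paths, and the _test.rb naming rule below are assumptions made for illustration; the writeup does not specify Stripe's data model.

```ruby
require "set"

# Hypothetical graph: file => files it depends on.
DEPENDS_ON = {
  "test/billing/invoice_test.rb" => ["app/billing/invoice.rb"],
  "test/billing/tax_test.rb" => ["app/billing/tax.rb"],
  "test/routing/router_test.rb" => ["app/routing/router.rb"],
  "app/billing/invoice.rb" => ["app/billing/tax.rb"],
}.freeze

# Inverts the graph so each file points at the files that depend on it.
def reverse_edges(depends_on)
  depends_on.each_with_object(Hash.new { |h, k| h[k] = [] }) do |(file, deps), reversed|
    deps.each { |dep| reversed[dep] << file }
  end
end

# Walks dependents breadth-first from the changed files and keeps the
# test files reached along the way. Changed test files select themselves.
def select_tests(changed_files, depends_on)
  dependents = reverse_edges(depends_on)
  seen = Set.new(changed_files)
  queue = changed_files.dup
  until queue.empty?
    file = queue.shift
    dependents.fetch(file, []).each do |dependent|
      queue << dependent if seen.add?(dependent)
    end
  end
  seen.select { |file| file.end_with?("_test.rb") }.sort
end

# In CI the changed list might come from something like
# `git diff --name-only <base>...HEAD`; it is hard-coded here.
SELECTED = select_tests(["app/billing/tax.rb"], DEPENDS_ON)
puts SELECTED
```

A change to app/billing/tax.rb selects the tax test directly and the invoice test transitively, while the routing test is skipped.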
The dependency graph is stored externally, not recomputed per run. It is updated continuously as tests execute in CI, with coverage data fed back into a store that gets queried at selection time. This means the latency of graph construction does not add to the critical path of individual CI runs.
The trickiest part of this architecture is staleness. If file A depends on file B, but that dependency was last recorded during a test run three weeks ago, and file B has since been moved or the dependency has changed, the graph may no longer reflect the code and can miss tests that should run. This is a false negative, the worst failure mode, because the test that would have caught the regression simply does not run.
Stripe mitigates this with periodic full runs, executing the entire test suite on schedule regardless of what changed. This keeps the coverage database current and catches any drift between the graph and reality.
Industry Comparisons
Google handles this at a different level entirely. Their internal build system, Blaze, which is open-sourced as Bazel, requires explicit declaration of all dependencies in BUILD files. If you do not declare the dependency, the build fails. This makes selective test execution almost trivial: the dependency graph is always accurate because engineers are required to maintain it. The cost is the overhead of keeping BUILD files current, which at Google’s scale requires its own tooling and cultural enforcement.
Microsoft uses a predictive test selection system in Azure DevOps that incorporates machine learning over historical test failure data. Rather than pure static or dynamic analysis, it learns which tests have historically failed given specific file changes and weights selections accordingly. This adds a probabilistic layer that can surface tests the dependency graph might not connect, but it introduces a dependency on historical data and can be slow to adapt to new code paths.
Meta’s system for their mobile monorepos uses a combination of Buck’s explicit dependency graph and a heuristic layer that tracks test-to-source file correlations over time. Like Stripe, they run periodic full sweeps to correct for staleness.
For Python teams, pytest-testmon provides a lightweight version of this at the single-project level. It records coverage data to a local SQLite database and reruns only affected tests on subsequent runs. It works well for development workflows but does not address the distributed, high-frequency scale that Stripe is operating at.
For Ruby specifically, test-prof from Evil Martians offers profiling and grouping capabilities for RSpec suites, which can identify slow tests and optimize execution order. It is not a selective execution system, but it addresses the adjacent problem of test suite performance.
The Real Cost Is False Negatives
The engineering challenge with selective test execution is not building the system; it is calibrating trust in it. Every optimization that skips tests introduces a surface area for missed failures. Teams that deploy this kind of system have to track the false negative rate explicitly, meaning they need to compare what the selective system ran against what a full run would have caught, and measure how often the selective system skips a test that would have failed.
This requires shadow validation: running full suites in parallel with selective runs on a sample of commits, then comparing outcomes. Without that measurement, the system operates on faith, which eventually breaks down as the codebase evolves in ways the dependency graph does not fully capture.
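One way to turn shadow validation into a number is sketched below. It assumes a per-commit definition of the false negative rate: the share of sampled commits where the full run failed at least one test the selective run skipped. Other definitions are possible, and the sample data is hypothetical.

```ruby
# Tests that failed in the full run but were never selected.
def missed_failures(selected, full_failures)
  full_failures - selected
end

# Fraction of sampled commits where at least one failing test was skipped.
def false_negative_rate(samples)
  return 0.0 if samples.empty?

  missed = samples.count do |sample|
    missed_failures(sample[:selected], sample[:full_failures]).any?
  end
  missed.to_f / samples.size
end

# Hypothetical shadow-run results for four sampled commits.
SAMPLES = [
  { commit: "c1", selected: ["a_test", "b_test"], full_failures: [] },
  { commit: "c2", selected: ["a_test"], full_failures: ["a_test"] },
  { commit: "c3", selected: ["b_test"], full_failures: ["c_test"] },
  { commit: "c4", selected: ["c_test"], full_failures: [] },
].freeze

FALSE_NEGATIVE_RATE = false_negative_rate(SAMPLES)
puts FALSE_NEGATIVE_RATE
```

Only the third commit counts as a miss: the full run failed c_test, which the selective run never executed. The second commit failed too, but the selective run included the failing test, so the signal was preserved.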
There is also a cultural dimension. Engineers who know not all tests run on every commit develop different intuitions about CI signal. A green build means something subtly different when it represents 15% of the suite rather than 100%. Teams that adopt selective execution need clear documentation about what the CI signal guarantees and under what conditions a full suite run is required, such as before a production deploy or when modifying shared infrastructure.
What Smaller Teams Can Take From This
Most teams do not have 50 million lines of Ruby. But the underlying problem, CI that takes too long to run full suites, appears at much smaller scales, and the approach scales down reasonably.
For teams already on Bazel or Buck, the explicit dependency model means selective execution comes nearly for free. The investment is in adopting those build systems in the first place, which carries significant upfront overhead.
The middle path between running everything and maintaining a full dependency graph, which is where most teams land, is convention-based heuristics: use git to identify changed files, map those to test files or directories through naming conventions, and run the tests that match. It is less precise than a graph-based system but captures the majority of the benefit with far less infrastructure.
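A minimal sketch of such a heuristic, assuming an app/ and spec/ layout where app/models/invoice.rb is tested by spec/models/invoice_spec.rb. The paths and the mapping rule are illustrative, and the changed-file list would normally come from git.

```ruby
# Maps a changed path to the spec file that conventionally covers it.
# Returns nil for files the convention says nothing about.
def spec_for(source_path)
  return source_path if source_path.end_with?("_spec.rb")
  return nil unless source_path.start_with?("app/") && source_path.end_with?(".rb")

  source_path.sub(%r{\Aapp/}, "spec/").sub(/\.rb\z/, "_spec.rb")
end

# Keeps only the mapped specs that actually exist in the repository.
def specs_for_changes(changed_files, existing_specs)
  changed_files
    .filter_map { |path| spec_for(path) }
    .uniq
    .select { |spec| existing_specs.include?(spec) }
end

# Hypothetical inputs.
CHANGED = [
  "app/models/invoice.rb",
  "spec/models/tax_spec.rb",
  "README.md",
  "app/models/orphan.rb",
].freeze
EXISTING_SPECS = ["spec/models/invoice_spec.rb", "spec/models/tax_spec.rb"].freeze

SPECS_TO_RUN = specs_for_changes(CHANGED, EXISTING_SPECS)
puts SPECS_TO_RUN
```

In this sketch a changed file with no matching spec, such as app/models/orphan.rb, is silently dropped. A more cautious setup might treat any unmapped change as a reason to fall back to the full suite.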
Stripe’s detailed writeup is worth reading for anyone thinking seriously about CI performance at scale. The specific choices they made around Ruby’s dynamic loading are instructive even if your stack is different. The problem of mapping changes to affected tests is universal; the implementation details change by language and build system, but the dependency graph at the center of it does not.