How Stripe Runs Only the Tests That Matter in a 50-Million-Line Ruby Codebase
Source: lobsters
Running your full test suite on every commit is a reasonable default until it stops being reasonable. For most teams that point arrives somewhere around a few hundred thousand lines of code and a few thousand tests. For Stripe, it arrived at a scale most engineering teams will never face: 50 million lines of Ruby in a single monorepo.
Stripe’s engineering blog recently described how they built selective test execution into their CI pipeline, running only the tests that are relevant to a given change. The result is CI feedback that arrives in minutes instead of hours. The mechanism is interesting, but so is the problem they had to solve first: figuring out which tests are “relevant” in a language that was not designed with static analysis in mind.
The Scale Problem Is Real
Fifty million lines of Ruby is not a number most people have intuition for. For context, the Linux kernel is around 27 million lines of C. The Ruby codebase at Stripe dwarfs it, and unlike a kernel, it is all application code: payment processing, billing, fraud detection, infrastructure tooling, internal APIs, and everything in between.
A monorepo at this scale has real CI physics. Even if each test takes 100 milliseconds, a test suite with hundreds of thousands of tests will take hours to run serially. Parallelism helps, but parallelism has costs too: machine time, queue time, result aggregation, flake surface area. The industry-standard response to this is sharding, running the suite across N workers simultaneously. Stripe almost certainly does this. But sharding still runs every test on every PR. Selective execution is a fundamentally different strategy: run fewer tests, not the same tests faster.
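A rough back-of-the-envelope calculation makes the difference between the two strategies concrete. The suite size, shard count, and selection ratio below are illustrative assumptions, not Stripe's figures; only the 100 ms per test comes from the paragraph above.

```python
def wall_clock_hours(num_tests: int, seconds_per_test: float, workers: int = 1) -> float:
    """Idealized wall-clock time: ignores queueing, startup, and uneven shards."""
    return num_tests * seconds_per_test / workers / 3600


def machine_hours(num_tests: int, seconds_per_test: float) -> float:
    """Total compute consumed, which sharding does not reduce."""
    return num_tests * seconds_per_test / 3600


NUM_TESTS = 300_000      # hypothetical suite size
SECONDS_PER_TEST = 0.1   # the 100 ms figure from the text
WORKERS = 100            # hypothetical shard count
SELECTED = 3_000         # hypothetical: 1% of the suite is relevant to a change

serial = wall_clock_hours(NUM_TESTS, SECONDS_PER_TEST)
sharded = wall_clock_hours(NUM_TESTS, SECONDS_PER_TEST, workers=WORKERS)
selective = wall_clock_hours(SELECTED, SECONDS_PER_TEST)

full_cost = machine_hours(NUM_TESTS, SECONDS_PER_TEST)
selective_cost = machine_hours(SELECTED, SECONDS_PER_TEST)

print(f"serial:    {serial:.2f} h wall clock, {full_cost:.2f} machine-hours")
print(f"sharded:   {sharded:.2f} h wall clock, {full_cost:.2f} machine-hours")
print(f"selective: {selective:.2f} h wall clock, {selective_cost:.2f} machine-hours")
```

Under these assumptions, 100 shards and a 1% selection both bring the wall-clock time down to about five minutes, but sharding still spends roughly 8.3 machine-hours per run while selection spends under a tenth of one.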
The challenge is knowing which tests to skip.
How Dependency Graphs Power Test Selection
The core idea behind selective test execution, also called regression test selection (RTS) in academic literature, is constructing a mapping from source files to the tests that exercise them. When a file changes, you run the tests that depend on it, and nothing else.
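The mapping can be pictured as an inverted index. The sketch below is a minimal illustration in Python, assuming per-test dependency records are already available; the file names, test names, and the helpers build_index and select_tests are hypothetical and not taken from Stripe's system.

```python
from collections import defaultdict

# Hypothetical records: test name -> source files that test depends on.
TEST_DEPS = {
    "test_charge": {"lib/charge.rb", "lib/money.rb"},
    "test_invoice": {"lib/invoice.rb", "lib/money.rb"},
    "test_fraud": {"lib/fraud.rb"},
}


def build_index(test_deps):
    """Invert test -> files into file -> tests."""
    file_to_tests = defaultdict(set)
    for test, files in test_deps.items():
        for path in files:
            file_to_tests[path].add(test)
    return file_to_tests


def select_tests(changed_files, file_to_tests):
    """Union of the tests mapped to each changed file."""
    selected = set()
    for path in changed_files:
        selected |= file_to_tests.get(path, set())
    return selected


index = build_index(TEST_DEPS)
print(sorted(select_tests(["lib/money.rb"], index)))
print(sorted(select_tests(["lib/fraud.rb"], index)))
```

A change to the shared money file selects two tests, while a change to the fraud file selects one. Everything difficult about the approach is hidden inside how TEST_DEPS gets populated.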
This sounds straightforward, but the difficulty lies almost entirely in how you build the dependency graph. There are two broad approaches: dynamic analysis and static analysis.
Dynamic analysis means running tests with coverage instrumentation and recording which source files each test touches. This is the approach pytest-testmon takes for Python, and it works well: every test run teaches the system more about dependencies. The downside is that the graph is only as complete as your test runs. New code with no coverage history has unknown dependencies, and you have to fall back to running more tests conservatively.
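To show the recording step in miniature, the sketch below uses Python's standard sys.settrace hook to note which functions run during a test. It is a simplified stand-in for coverage instrumentation, not how pytest-testmon is implemented. The helper record_touched and the sample functions are hypothetical.

```python
import sys


def record_touched(test_fn):
    """Run test_fn and return the (filename, function name) pairs it executed."""
    touched = set()

    def tracer(frame, event, arg):
        if event == "call":
            code = frame.f_code
            touched.add((code.co_filename, code.co_name))
        return None  # no per-line tracing needed inside the frame

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        test_fn()
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
    return touched


# Hypothetical code under test and two tests.
def parse_amount(text):
    return int(text)


def format_amount(cents):
    return "$" + str(cents)


def test_parse():
    assert parse_amount("42") == 42


def test_format():
    assert format_amount(5) == "$5"


parse_names = {name for _, name in record_touched(test_parse)}
format_names = {name for _, name in record_touched(test_format)}
print(sorted(parse_names))
print(sorted(format_names))
```

Each test's record contains only the functions it actually reached, which is exactly why code that no test has yet executed has no entry and forces a conservative fallback.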
Static analysis means parsing the code and tracing imports, method calls, and class hierarchies without executing anything. This is faster and does not require a prior test run, but for dynamically typed languages it is substantially harder. Ruby’s metaprogramming features (method_missing, const_get, send, and similar patterns) can route calls to methods that a static analyzer cannot resolve at parse time.
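The same blind spot can be shown with a small Python analogue, assuming the standard ast module as the parser. The sample source and the helper static_imports are hypothetical; the dynamic importlib.import_module call plays the role that const_get or send plays in Ruby.

```python
import ast


def static_imports(source: str) -> set:
    """Module names found in import statements, without executing the code."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                found.add(alias.name)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module)
    return found


SAMPLE = '''
import billing.invoice
from payments import charge
import importlib

def load(name):
    # The target module is only known at runtime, so the parser cannot see it.
    return importlib.import_module("plugins." + name)
'''

deps = static_imports(SAMPLE)
print(sorted(deps))
```

The analyzer finds the three declared imports but has no way to know which plugins module load will pull in, so any test that depends on that path is invisible to a purely syntactic graph.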
Sorbet Changes the Equation for Ruby
Stripe built Sorbet, a gradual type checker for Ruby, starting around 2017 and open-sourced it in 2019. Sorbet’s primary purpose is type safety, catching type errors before runtime. But the type annotations and the call graph that Sorbet constructs are valuable infrastructure for a great deal more than type checking.
A type-annotated Ruby codebase is, in important ways, closer to a statically-typed language for the purposes of analysis. When a method signature is known, a call to that method can be resolved at analysis time. The class hierarchy is explicit. Module inclusions are traceable. This is what makes precise static dependency analysis tractable for Stripe in a way it would not be for an unannotated Ruby codebase.
Stripe’s selective test execution system almost certainly leans on Sorbet’s analysis infrastructure to build the file-to-test dependency map. This is not just a practical convenience; it is architecturally significant. The investment in type checking paid a second dividend.
What Other Ecosystems Learned First
Stripe is solving a problem that other ecosystems encountered earlier, and the prior art is instructive.
Google’s internal test system, TAP (Test Automation Platform), has done affected-test selection at massive scale for many years. Google’s advantage is its build system, Blaze, which was open-sourced as Bazel: explicit BUILD file dependency declarations make the dependency graph a first-class artifact. Every java_library and cc_library target declares its dependencies, so computing the transitive closure of what a change affects is a graph traversal over known edges. Bazel’s remote caching extends this further: if none of a test target’s transitive inputs have changed, its previous result is served from cache and the test does not run again.
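The caching idea can be sketched as content hashing over declared inputs. This is a deliberately simplified illustration and not Bazel's actual key computation. The target labels, file contents, and the helpers cache_key and keys_for are hypothetical.

```python
import hashlib


def cache_key(name, source_blobs, dep_keys):
    """Hash a target's name, its own sources, and the keys of its dependencies."""
    h = hashlib.sha256()
    h.update(name.encode() + b"\0")
    for path in sorted(source_blobs):
        h.update(path.encode() + b"\0" + source_blobs[path] + b"\0")
    for key in sorted(dep_keys):
        h.update(key.encode() + b"\0")
    return h.hexdigest()


def keys_for(files):
    """Hypothetical graph: charge depends on money; fraud is independent."""
    money = cache_key("//lib:money", {"money.rb": files["money.rb"]}, [])
    charge = cache_key("//lib:charge", {"charge.rb": files["charge.rb"]}, [money])
    fraud = cache_key("//lib:fraud", {"fraud.rb": files["fraud.rb"]}, [])
    return {"money": money, "charge": charge, "fraud": fraud}


before = keys_for({"money.rb": b"v1", "charge.rb": b"v1", "fraud.rb": b"v1"})
after = keys_for({"money.rb": b"v2", "charge.rb": b"v1", "fraud.rb": b"v1"})

stale = sorted(name for name in before if before[name] != after[name])
print(stale)
```

Because a dependency's key feeds into its dependents' keys, editing the money source invalidates both money and charge, while fraud keeps its key and can be served from cache.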
Microsoft Research published work on predictive test selection that takes a different approach: using historical test failure data and file change patterns to predict which tests are likely to fail, without constructing an explicit dependency graph. This is more robust to metaprogramming but has different failure modes: the model can be wrong in both directions, missing failures or flagging unnecessary tests.
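A toy version of the idea, assuming nothing more than frequency counts over past commits, looks like the sketch below. Published systems use trained models with many more signals; the history, the threshold, and the helpers train and predict are hypothetical and serve only to show where both kinds of error come from.

```python
from collections import Counter

# Hypothetical history: (files changed in a commit, tests that failed on it).
HISTORY = [
    ({"billing.rb"}, {"test_invoice"}),
    ({"billing.rb", "money.rb"}, {"test_invoice", "test_money"}),
    ({"billing.rb"}, set()),
    ({"fraud.rb"}, {"test_fraud"}),
]
ALL_TESTS = {"test_invoice", "test_money", "test_fraud"}


def train(history):
    """Count how often each file changed and how often each test failed with it."""
    changes = Counter()
    failures = Counter()
    for files, failed in history:
        for path in files:
            changes[path] += 1
            for test in failed:
                failures[(path, test)] += 1
    return changes, failures


def predict(changed, changes, failures, all_tests, threshold=0.5):
    """Select tests whose observed failure rate for any changed file meets the threshold."""
    selected = set()
    for test in all_tests:
        rates = [failures[(p, test)] / changes[p] for p in changed if changes[p]]
        if rates and max(rates) >= threshold:
            selected.add(test)
    return selected


changes, failures = train(HISTORY)
print(sorted(predict({"billing.rb"}, changes, failures, ALL_TESTS)))
print(sorted(predict({"brand_new.rb"}, changes, failures, ALL_TESTS)))
```

The billing change selects test_invoice, which failed in two of three past billing commits, and skips test_money, which failed in only one. A file with no history selects nothing at all, which is the missed-failure case, so a real deployment would need a fallback for unseen files.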
The JavaScript ecosystem landed somewhere in between. Jest ships with --onlyChanged and --changedSince flags, which run only the tests related to files changed in version control, and its dependency graph is built from the module system. Because Node.js modules declare most of their dependencies explicitly through require and import, the dependency graph is relatively cheap to construct accurately. Nx and Turborepo extend this to the monorepo level, computing affected packages from the project dependency graph and the git diff.
The Correctness Problem
Any selective test execution system faces an asymmetric risk. Running a test that did not need to run wastes time but causes no harm. Skipping a test that should have run and missing a regression is a serious failure.
This asymmetry means the safe design is conservative: when uncertain, run the test. The engineering work in a system like Stripe’s is mostly in the uncertainty reduction, getting the dependency graph accurate enough that the conservative fallback rarely triggers.
Several categories of change require running everything regardless. Changes to shared infrastructure like test helpers, configuration files, or build tooling can affect any test. Changes to language runtime versions or gem dependencies are similarly global. Stripe almost certainly maintains an explicit list of such “universal” paths where any modification bypasses selection entirely.
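A conservative planner along these lines might look like the following sketch. The pattern list, paths, and the helper plan are hypothetical; the article only speculates that Stripe keeps such a list, so this shows the shape of the policy rather than any real configuration.

```python
from fnmatch import fnmatch

# Hypothetical "universal" paths: any change here bypasses selection.
UNIVERSAL_PATTERNS = ["Gemfile", "Gemfile.lock", ".ruby-version", "spec/support/*", "config/*"]

# Hypothetical dependency data: source file -> tests that depend on it.
FILE_TO_TESTS = {
    "lib/charge.rb": {"test_charge"},
    "lib/money.rb": {"test_charge", "test_invoice"},
}
ALL_TESTS = {"test_charge", "test_invoice", "test_fraud"}


def plan(changed_files, file_to_tests, all_tests):
    """Return (tests to run, reason). Falls back to everything when unsure."""
    for path in changed_files:
        if any(fnmatch(path, pattern) for pattern in UNIVERSAL_PATTERNS):
            return set(all_tests), "universal path changed: " + path
        if path not in file_to_tests:
            return set(all_tests), "no dependency data for: " + path
    selected = set()
    for path in changed_files:
        selected |= file_to_tests[path]
    return selected, "selected from dependency graph"


for change in (["lib/charge.rb"], ["lib/charge.rb", "Gemfile.lock"], ["lib/new_thing.rb"]):
    tests, reason = plan(change, FILE_TO_TESTS, ALL_TESTS)
    print(change, "->", sorted(tests), "(" + reason + ")")
```

Both fallbacks err toward running too much: a lockfile change and a file the graph has never seen each trigger the full suite, and only a change confined to known files gets the narrow selection.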
There is also the question of transitive dependencies. If file A is changed, and file B depends on A, and test T depends on B but not directly on A, the system must trace the full transitive closure to find T. This is a graph reachability problem, and at 50 million lines the graph can be large. The practical implementation needs to be efficient: precomputed adjacency lists, incremental updates when files change, possibly a persistent graph store that is updated as part of the CI pipeline setup.
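The A, B, T example can be written as a breadth-first search over reversed dependency edges. This is a minimal in-memory sketch with hypothetical file names and helpers; a graph at Stripe's scale would presumably need the precomputation and persistence described above.

```python
from collections import defaultdict, deque

# Hypothetical forward edges: file -> files it depends on.
DEPENDS_ON = {
    "b.rb": ["a.rb"],
    "t_test.rb": ["b.rb"],
    "u_test.rb": ["c.rb"],
}


def reverse_edges(depends_on):
    """Build file -> files that depend on it (the adjacency lists to precompute)."""
    dependents = defaultdict(set)
    for source, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(source)
    return dependents


def affected(changed, dependents):
    """Everything reachable from the changed files by following dependents."""
    seen = set(changed)
    queue = deque(changed)
    while queue:
        node = queue.popleft()
        for dependent in dependents.get(node, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


def affected_tests(changed, dependents):
    return {path for path in affected(changed, dependents) if path.endswith("_test.rb")}


DEPENDENTS = reverse_edges(DEPENDS_ON)
print(sorted(affected_tests(["a.rb"], DEPENDENTS)))
print(sorted(affected_tests(["c.rb"], DEPENDENTS)))
```

Changing a.rb reaches t_test.rb through b.rb even though the test never references a.rb directly, and the seen set keeps the traversal linear in the number of edges visited.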
What This Costs to Build and Maintain
Selective test execution is not a feature you drop in over a weekend. Stripe’s system represents years of investment in type checking infrastructure, dependency graph construction, CI tooling integration, and ongoing maintenance as the codebase evolves.
The maintenance cost is often underestimated. The dependency graph must stay accurate as the codebase changes. New metaprogramming patterns that the static analyzer cannot follow must be identified and either restricted or handled specially. Test helpers that get added to the shared stack must be recognized as universal dependencies. Gem updates must be treated conservatively.
For teams not at Stripe’s scale, tools like pytest-testmon, Jest’s built-in affected detection, and build systems like Bazel or Pants offer selective execution without building the infrastructure from scratch. The right tool depends heavily on the language, the build system, and the degree of metaprogramming in the codebase.
For Ruby specifically, the path to selective execution goes through type coverage. The more of the codebase that Sorbet can analyze, the more precise the dependency graph becomes. That is probably the most transferable lesson from what Stripe has built: investing in static analysis infrastructure pays off in ways that compound over time, and CI speed is one of the less obvious dividends.