
The Dependency Graph Ruby Won't Give You: Selective Testing at Stripe's Scale

Source: lobsters

The core idea behind selective test execution is simple: when a pull request changes three files, running the full test suite is wasteful. Run only the tests that depend, directly or transitively, on those three files. Stripe published the details of how they built exactly this for their 50-million-line Ruby monorepo, reporting roughly 80-90% fewer tests per PR and CI feedback that compresses from hours to minutes. The engineering problem underneath that result is almost entirely a consequence of Ruby being dynamically typed.

What Statically Typed Languages Get for Free

In Go, selective test execution is relatively tractable. The compiler parses every import statement and builds an exact dependency graph. Change payments/ledger.go, query go list with the appropriate flags, and you have the precise set of packages that import it, directly or transitively. No instrumentation, no historical data, no approximations.

TypeScript works similarly. Every import is explicit and resolved at build time. Jest’s --onlyChanged flag traverses the import graph from changed files to find affected test suites, and it works reliably because the module graph has well-defined semantics.

In Java and Kotlin, the situation is the same: package declarations, explicit imports, and a compiler that resolves every dependency. Bazel takes this further with explicit BUILD files; every library target declares its dependencies, so the reverse dependency graph is always available without any dynamic analysis.

In statically typed languages, the dependency graph is a byproduct of compilation. You largely get it for free. Ruby offers no equivalent.

Why Ruby Makes the Dependency Graph Hard to Recover

Ruby’s dynamism is pervasive, and most of it directly undermines static dependency analysis.

require is a regular method call, not a compile-time directive. It can take an interpolated string, appear inside a conditional, or loop over an array of module names. Zeitwerk, the autoloader used in Rails, maps constant names to file paths by convention and loads files lazily at runtime when constants are first referenced. These dependencies never appear in any require chain. Any file can reopen any class: class User defined in five different files is legal Ruby. ActiveRecord’s has_many :orders generates a dozen methods that belong to User but live in association code, with no static link connecting them. method_missing, define_method, const_get, send, and eval create and route method calls in ways that cannot be resolved from syntax alone.
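A short illustration of these mechanisms may help. The class and attribute names below are hypothetical and not taken from Stripe's codebase; the point is only that none of these dependencies appears as a literal require or a literal constant reference that a syntax-only scan could follow.

```ruby
# Hypothetical names throughout; a sketch, not code from any real project.

# 1. require is a method call, so its argument can be computed at runtime.
%w[json set].each { |lib| require lib }

# 2. Any file can reopen a class. Imagine these two definitions living in
#    different files, with no require linking them.
class User
  def initialize(name)
    @name = name
  end
end

class User
  # define_method creates methods whose names exist only at runtime.
  %w[orders invoices].each do |assoc|
    define_method("#{assoc}_count") { 0 }
  end
end

# 3. Constant and method names can be assembled from data.
klass = Object.const_get(%w[Us er].join) # resolves to User at runtime
assoc = "orders"
user  = klass.new("ada")
count = user.send("#{assoc}_count")      # dispatches to orders_count
```

A scanner looking for the token User or orders_count in the last four lines finds neither, yet both are exercised when the code runs.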

A static analysis pass over a large Ruby codebase produces a dependency graph with significant gaps. It can tell you what a file explicitly requires; it cannot tell you what will actually be loaded when that file executes.

The Coverage-Based Solution

Stripe’s core mechanism is Ruby’s built-in Coverage module, specifically in oneshot_lines mode, which was added in Ruby 2.6:

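require "coverage"

# Start before the code under test is loaded; only files loaded afterwards are tracked.
Coverage.start(oneshot_lines: true)
# ... load and run a test ...
result = Coverage.result
# => { "lib/payment/stripe_charge.rb" => { oneshot_lines: [1, 5, 12, 23] }, ... }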

The oneshot_lines mode records only whether each line ran: the first hit is logged, the hook for that line is then removed, and later hits cost nothing. Default line-counting mode carries 20-40% overhead on large suites; oneshot_lines reduces that cost enough to make nightly fully instrumented runs feasible.

The process has three steps: run every test in the suite with coverage instrumentation, record which source files each test executed, and store the result as an inverted index mapping source files to the tests that touch them. A nightly job rebuilds this index on a dedicated worker. When a PR opens, CI fetches the compressed index, computes the union of tests covering each changed file, and runs only those tests.
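The index and the lookup can be sketched in a few lines. This is a minimal illustration, not Stripe's implementation: the method names, file paths, and the shape of the per-test data are all assumptions, and a real pipeline would derive the per-test file lists from Coverage.result rather than a literal hash.

```ruby
require "set"

# per_test_files: test file => source files it executed (hypothetical shape).
# Returns the inverted index: source file => set of tests that touched it.
def build_index(per_test_files)
  index = Hash.new { |hash, key| hash[key] = Set.new }
  per_test_files.each do |test, sources|
    sources.each { |source| index[source] << test }
  end
  index
end

# Union of the tests recorded against each changed file.
def select_tests(index, changed_files)
  changed_files.reduce(Set.new) do |selected, file|
    # fetch avoids triggering the default block, so lookups never mutate the index.
    selected | index.fetch(file, Set.new)
  end
end

per_test = {
  "test/charge_test.rb" => ["lib/payment_processor.rb", "lib/ledger.rb"],
  "test/refund_test.rb" => ["lib/ledger.rb"],
  "test/user_test.rb"   => ["lib/user.rb"]
}

index    = build_index(per_test)
selected = select_tests(index, ["lib/ledger.rb"])
# selected contains charge_test.rb and refund_test.rb, but not user_test.rb
```

Note that a changed file with no index entry contributes no tests in this sketch. A production system cannot accept that silently and has to treat the missing entry as unknown.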

The transitivity problem largely solves itself. When payment_processor.rb requires ledger.rb, the Ruby runtime loads ledger.rb as part of loading payment_processor.rb, before any test code runs. As long as coverage was started before those files were loaded, Coverage.result includes ledger.rb automatically; the runtime’s own module loading performs the transitive closure computation.
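This behavior is easy to observe in isolation. The sketch below writes two throwaway files to a temporary directory (the module names and the method name files_executed_during_load are hypothetical), loads only the outer one, and reports which files show up in the coverage result.

```ruby
require "coverage"
require "tmpdir"

# Returns the basenames of every file that appears in the coverage result
# after loading payment_processor.rb and calling into it once.
def files_executed_during_load
  Dir.mktmpdir do |dir|
    File.write(File.join(dir, "ledger.rb"), <<~RUBY)
      module Ledger
        def self.record(amount)
          amount
        end
      end
    RUBY

    File.write(File.join(dir, "payment_processor.rb"), <<~RUBY)
      require_relative "ledger"

      module PaymentProcessor
        def self.charge(amount)
          Ledger.record(amount)
        end
      end
    RUBY

    # Coverage has to start before the files are loaded to track them.
    Coverage.start(oneshot_lines: true)
    require File.join(dir, "payment_processor.rb")
    PaymentProcessor.charge(100) # stands in for the body of a test

    Coverage.result.keys.map { |path| File.basename(path) }
  end
end

executed = files_executed_during_load
# executed includes both "payment_processor.rb" and "ledger.rb",
# even though only payment_processor.rb was required explicitly.
```

The caller never names ledger.rb; it appears because the runtime loaded and executed it on the way to satisfying the explicit require.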

This pattern has a documented history outside Stripe. pytest-testmon applies the identical mechanism in Python using coverage.py. Crystalball from Toptal does this for RSpec with per-example-group coverage. Microsoft’s Test Impact Analysis in Azure DevOps uses binary-level instrumentation for .NET and reports 40-80% reductions in similar configurations. The theoretical foundation goes back to Rothermel and Harrold’s 1994 paper on safe regression test selection, which formalized the core reachability argument. The coverage-based approach is well-understood; the engineering challenge at Stripe’s scale is maintaining accuracy and freshness as the codebase evolves.

The Gap Coverage Leaves, and Where Sorbet Fills It

Coverage-based mapping is empirically correct for code paths that tests actually exercise. It breaks down for paths no test has run recently, for new code with no coverage history, and for lazily-loaded code a test never triggers.

Sorbet, Stripe’s gradual type system for Ruby, developed in-house from around 2017 and open-sourced in 2019, fills part of this gap. As a byproduct of type checking, Sorbet builds a complete cross-reference index: every constant reference, method call, and module inclusion is resolved to a source location. For typed code, this produces a method-level dependency graph that is precise and does not depend on runtime behavior.

Where coverage says “this test depends on everything in payment_processor.rb,” Sorbet can say “this test depends on the concrete implementation of PaymentProcessor#charge.” That precision matters when a change touches only one method in a large file. Coverage would select every test that ever loaded that file; Sorbet selects only tests that call the changed method.

The hybrid works as follows: typed code gets analyzed by Sorbet’s call graph, giving high precision with no staleness risk; untyped code falls back to coverage tracing, which is empirically correct for exercised paths; new code with no history runs conservatively.
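The three-way fallback can be written as a small dispatch. Everything here is a hypothetical stand-in: the hash shapes, the typed flag, and the dir_tests callable are assumptions made for illustration and do not reflect Sorbet's index format or Stripe's code.

```ruby
require "set"

# Hypothetical inputs:
#   call_graph:     changed method => tests that statically reach it
#   coverage_index: source file    => tests that executed it
#   dir_tests:      callable returning every test for a file's directory
def tests_for_change(change, call_graph:, coverage_index:, dir_tests:)
  if change[:typed] && call_graph.key?(change[:method])
    call_graph[change[:method]]       # typed code: method-level precision
  elsif coverage_index.key?(change[:file])
    coverage_index[change[:file]]     # untyped code: file-level coverage
  else
    dir_tests.call(change[:file])     # no history: run conservatively
  end
end

sources = {
  call_graph: { "PaymentProcessor#charge" => Set["test/charge_test.rb"] },
  coverage_index: {
    "lib/payment_processor.rb" => Set["test/charge_test.rb", "test/refund_test.rb"]
  },
  dir_tests: ->(file) { Set["test/#{File.basename(file, '.rb')}_test.rb"] }
}

typed_change = {
  file: "lib/payment_processor.rb",
  method: "PaymentProcessor#charge",
  typed: true
}

precise = tests_for_change(typed_change, **sources)
coarse  = tests_for_change(typed_change.merge(typed: false), **sources)
# precise holds one test; coarse holds every test that loaded the file.
```

The same one-method change selects one test through the typed path and two through the coverage path, which is the precision difference described above.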

A key enabler is Tapioca, developed at Shopify, which generates RBI stub files describing gem public APIs in Sorbet-readable form. This allows Sorbet to reason across gem boundaries without needing gem source, meaning a gem version bump propagates correctly through the affected-file computation rather than triggering a conservative full run.

Staleness, Fallbacks, and the Correctness Asymmetry

Coverage-based systems have a fundamental asymmetry: running an unnecessary test wastes compute; skipping a test that should have run allows a regression through. The system must be conservative, biased toward false positives rather than false negatives.

Stripe handles this with several mechanisms. An infrastructure paths allowlist covers files loaded by nearly every test: initializers, shared factories, Gemfile.lock. A change to any listed file triggers the full suite. Coverage entries older than a freshness threshold are treated as unknown, meaning the corresponding tests always run. New files with no coverage history always run their full directory. If a PR changes enough files that the selected set approaches 90% of the suite, CI falls back to a full run; the savings are marginal and the coverage gaps are not worth the risk.
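A sketch of how these guards might compose is shown below. The method name plan, the example infrastructure paths, and the seven-day freshness window are assumptions; only the roughly 90% cutoff comes from the description above. The sketch also simplifies: it escalates stale and missing entries to a full run, which is coarser than running only the affected tests or directory.

```ruby
require "set"

# index: source file => { tests: Set, recorded_at: Time } (hypothetical shape).
# Returns :full_suite or the Set of tests to run.
def plan(changed, index, all_tests,
         infra_paths: Set["Gemfile.lock", "config/initializers/boot.rb"],
         max_age: 7 * 24 * 3600, # assumed freshness window, in seconds
         full_run_ratio: 0.9,
         now: Time.now)
  # Files loaded by nearly every test invalidate everything.
  return :full_suite if changed.any? { |file| infra_paths.include?(file) }

  selected = Set.new
  changed.each do |file|
    entry = index[file]
    # Missing or stale entries are unknown, so escalate rather than guess.
    return :full_suite if entry.nil? || now - entry[:recorded_at] > max_age

    selected |= entry[:tests]
  end

  # Near-total selections are not worth the risk of a gap.
  selected.size >= full_run_ratio * all_tests.size ? :full_suite : selected
end

now       = Time.now
all_tests = Set["t1", "t2", "t3", "t4"]
index = {
  "lib/a.rb"   => { tests: Set["t1"], recorded_at: now - 60 },
  "lib/old.rb" => { tests: Set["t2"], recorded_at: now - 30 * 24 * 3600 }
}

fresh_result = plan(["lib/a.rb"], index, all_tests, now: now)
stale_result = plan(["lib/old.rb"], index, all_tests, now: now)
# fresh_result is the set containing "t1"; stale_result is :full_suite
```

Every branch that returns early errs toward running more tests, which matches the asymmetry: a wasted run costs compute, a skipped run costs a regression.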

Merges to main always trigger a full suite run regardless of the selective computation. This backstop catches anything that slips through a stale map. Microsoft TIA applies a similar periodic full-run backstop for .NET projects, and it is the right call: the correctness guarantee comes from full runs on the main branch, not from the precision of the selection algorithm alone.

The Compounding Return on Type Annotations

The deeper observation in Stripe’s work is that Sorbet was adopted for type correctness and developer tooling, and it compounded into CI infrastructure that was not the original motivation.

The same dependency graph Sorbet builds for type checking drives selective test execution. That same graph could drive automated blast-radius estimation for proposed changes, test execution ordering by failure likelihood (which shortens the time to the first failure signal), PR summaries of which components a change touches, and large-scale migration tooling that locates every call site for a method being refactored. The typing investment creates a map of the codebase that multiple tools can share.

Teams evaluating Sorbet typically weigh its migration cost against correctness and IDE tooling improvements. The CI argument belongs in that calculation. As more of the codebase gains type annotations, the dependency graph tightens from file-level to method-level, and the test selection becomes more precise. The return is not a one-time gain at adoption; it compounds as type coverage increases.

For teams not ready for Sorbet, the Crystalball gem gives any RSpec project the coverage-based foundation at low adoption cost. It does not have the method-level precision of a type-aware graph, but it captures the 80% case: tests that do not load any changed file at all are safe to skip, and coverage tracing finds them reliably. The Sorbet layer is where Stripe’s implementation becomes specific to their investment, and also where the long-term returns are largest.
