
The Dependency Graph Problem Behind Fast CI in a Large Ruby Monorepo

Source: lobsters

Stripe recently published a detailed write-up on how they built selective test execution for their Ruby monorepo, which has grown to around 50 million lines of code. The core idea is familiar: rather than running every test on every pull request, figure out which tests are actually affected by a given set of changes and run only those. At the scale Stripe operates, the difference between “all tests” and “relevant tests” can be the difference between a CI run that takes an hour and one that takes a few minutes.

The engineering involved in getting there is less obvious than the concept.

The Graph Problem

At its core, selective test execution is a graph reachability problem. You have a directed graph where nodes are files (or modules, or build targets, depending on how fine-grained you get), and edges represent dependency relationships: file A depends on file B if A’s behavior can be affected by changes to B. When a pull request modifies a set of files, you walk the reverse edges from those files to find every test that transitively depends on them.
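As a minimal sketch of that reverse walk, assuming the forward dependency graph is already available as a hash (the file names and the "test/" prefix convention are hypothetical, chosen only for illustration):

```ruby
require "set"

# Invert forward edges (file => files it depends on) into
# reverse edges (file => files that depend on it).
def reverse_edges(deps)
  reversed = Hash.new { |hash, key| hash[key] = [] }
  deps.each do |file, targets|
    targets.each { |target| reversed[target] << file }
  end
  reversed
end

# Breadth-first walk over reverse edges, starting from the changed files.
# Returns every reachable file that looks like a test (hypothetical "test/" prefix).
def affected_tests(deps, changed)
  reversed = reverse_edges(deps)
  seen = Set.new(changed)
  queue = changed.dup
  until queue.empty?
    file = queue.shift
    reversed[file].each do |dependent|
      queue << dependent if seen.add?(dependent) # add? returns nil if already seen
    end
  end
  seen.select { |file| file.start_with?("test/") }.sort
end

deps = {
  "test/charge_test.rb" => ["app/charge.rb"],
  "test/refund_test.rb" => ["app/refund.rb"],
  "test/mailer_test.rb" => ["app/mailer.rb"],
  "app/charge.rb"       => ["lib/money.rb"],
  "app/refund.rb"       => ["app/charge.rb"],
}
p affected_tests(deps, ["lib/money.rb"])
```

Changing lib/money.rb selects the charge and refund tests but leaves the mailer test out, because nothing on the mailer's dependency path reaches the changed file. The walk itself is cheap; the hard part, as the rest of the article argues, is trusting the edges.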

In ecosystems with explicit build files, this graph is already known. Bazel and Buck require developers to declare dependencies in BUILD files, so the graph is a first-class artifact of the build system. Google’s internal CI (TAP) and Meta’s similar infrastructure both lean on this explicit graph heavily. When you know the graph with certainty, test selection is straightforward and essentially free.

Ruby does not give you this.

Why Ruby Is Specifically Hard

Ruby’s runtime model makes static dependency analysis genuinely difficult, not just inconvenient. A few concrete reasons:

require can be dynamic. In Python, import statements are at least syntactically distinct and usually at the top of a file. In Ruby, require is just a method call, and its argument can be any expression:

["payments", "subscriptions", "invoices"].each do |mod|
  require_relative "services/#{mod}"
end

A static analyzer looking at this file has no way to know which files will actually be loaded without evaluating the runtime value of mod.

Zeitwerk and Rails autoloading. Modern Rails applications use Zeitwerk for autoloading, which maps filesystem paths to constant names and loads files on demand. This means many files in a large Rails codebase are never explicitly required anywhere; they’re loaded when their corresponding constant is first referenced. Tracing these relationships statically requires understanding Zeitwerk’s resolution algorithm and the project’s load path configuration.
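To give a feel for the path-to-constant mapping a static analyzer has to reproduce, here is a deliberately simplified sketch. It approximates the default snake_case to CamelCase convention only; real Zeitwerk also supports custom inflections, collapsed directories, and other configuration that this ignores. The function name and the autoload roots are hypothetical.

```ruby
# Map a file path to the constant name an autoloader would expect it to define,
# given a list of autoload root directories. Returns nil if the file is not
# under any root. Simplified: no acronym handling, no collapsed directories.
def constant_for(path, roots)
  root = roots.find { |candidate| path.start_with?("#{candidate}/") }
  return nil unless root

  relative = path.delete_prefix("#{root}/").delete_suffix(".rb")
  relative.split("/").map do |segment|
    segment.split("_").map(&:capitalize).join # payment_method => PaymentMethod
  end.join("::")
end

roots = ["app/models", "app/services"]
p constant_for("app/models/payment_method.rb", roots)
p constant_for("app/services/billing/invoice_builder.rb", roots)
```

With a table like this, an analyzer that sees a reference to Billing::InvoiceBuilder in one file can add an edge to app/services/billing/invoice_builder.rb even though no require connects them.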

Metaprogramming and open classes. Ruby allows any class to be reopened and modified from any file. A concern mixed into User from app/models/concerns/billable.rb creates a behavioral dependency between billable.rb and every test that exercises User, but nothing in the file structure or require graph makes this explicit. The concern might be included with a single include Billable line that itself gets evaluated only under certain conditions.

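method_missing and respond_to_missing?. A method served by method_missing has no definition anywhere in the source, so a parser has nothing to link the call site to. Together with define_method and send, this means call graph analysis, which works reasonably well in statically typed languages, largely breaks down in Ruby. You cannot reliably determine at parse time which method calls will resolve to which method definitions.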
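A small illustration of the problem, using a hypothetical Settings class: the call settings.currency works at runtime, yet no method named currency is defined anywhere for a static tool to find.

```ruby
# Hypothetical class that answers any method whose name matches a stored key.
class Settings
  def initialize(values)
    @values = values
  end

  # Called by Ruby when no ordinary method matches the call.
  def method_missing(name, *args)
    key = name.to_s
    return @values[key] if @values.key?(key)

    super # fall through to the normal NoMethodError
  end

  # Keeps respond_to? consistent with what method_missing will accept.
  def respond_to_missing?(name, include_private = false)
    @values.key?(name.to_s) || super
  end
end

settings = Settings.new("currency" => "usd")
p settings.currency                                    # resolved at runtime
p Settings.instance_methods(false).include?(:currency) # no such method is defined
```

Which calls succeed depends on the data passed to the constructor, so the answer is not available to anything that only reads the source.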

Two Approaches: Static Analysis vs. Coverage Tracing

Given these constraints, there are two broad strategies for building the dependency graph.

The first is static analysis: parse all Ruby files, extract require and require_relative calls where the argument is a string literal, resolve Zeitwerk autoload paths, and construct a best-effort dependency graph. This approach is fast to compute and doesn’t require running code, but it will miss dynamically constructed require paths and implicit autoload dependencies. The graph will be incomplete, which means tests might be incorrectly excluded.
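A rough sketch of the literal-extraction step, using Ruby's standard-library Ripper lexer. This is an illustration under simplifying assumptions, not Stripe's implementation: the function name is hypothetical, and it treats any identifier named require or require_relative as a call without checking the receiver.

```ruby
require "ripper"

# Scan Ruby source and split require / require_relative calls into those with a
# plain string literal argument (statically resolvable) and everything else.
def scan_requires(source)
  tokens = Ripper.lex(source).map { |token| [token[1], token[2]] } # [type, text]
  literal = []
  dynamic = 0

  tokens.each_with_index do |(type, text), index|
    next unless type == :on_ident && %w[require require_relative].include?(text)

    # Skip whitespace and an optional opening parenthesis after the method name.
    rest = tokens[(index + 1)..].drop_while { |kind, _| kind == :on_sp || kind == :on_lparen }
    kinds = rest.first(3).map(&:first)

    # A plain literal lexes as: string start, content, string end.
    # Interpolation or a variable argument produces a different sequence.
    if kinds == [:on_tstring_beg, :on_tstring_content, :on_tstring_end]
      literal << [text, rest[1][1]]
    else
      dynamic += 1
    end
  end

  { literal: literal, dynamic: dynamic }
end

source = <<~'RUBY'
  require "json"
  ["payments", "subscriptions"].each do |mod|
    require_relative "services/#{mod}"
  end
RUBY
p scan_requires(source)
```

The dynamic count is the useful signal here: each one is an edge the static graph cannot see, which is why a static-only graph risks excluding tests it should have run.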

The second approach is coverage-based tracing: instrument the test suite to record which files are loaded or executed during each test run, then persist this mapping. When files change, look up which tests previously touched those files. Because it records what actually ran rather than what a parser could infer, this approach captures dynamic requires and autoloaded constants for any test that has been run at least once. It has its own limits: it requires an initial full run to bootstrap the mapping, it only sees Ruby files that were executed (not data or configuration files a test reads), and the mapping becomes stale as the codebase evolves.

Stripe’s system, based on their write-up, combines both: static analysis provides a baseline dependency graph, while coverage data from prior runs refines and corrects it. Ruby’s built-in Coverage module, which SimpleCov builds on, records per-file execution data. Attributing that data to individual tests means snapshotting the counters around each test, and doing this efficiently at scale requires careful instrumentation to avoid turning the coverage collection itself into a bottleneck.
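The per-test attribution can be sketched with the standard-library Coverage module alone. This is a toy version under stated assumptions: the two source files, the test names, and the helper names are hypothetical, and the "tests" are plain lambdas rather than a real test framework. The idea is to clear the counters, run one test, and see which files have non-zero line counts.

```ruby
require "coverage"
require "tmpdir"

# Run each test body and return { test_name => [basenames of files under root it executed] }.
# Assumes Coverage is running and the files under root were loaded after it started.
def files_touched_per_test(tests, root)
  Coverage.result(stop: false, clear: true) # discard hits recorded at load time
  tests.to_h do |name, body|
    body.call
    snapshot = Coverage.result(stop: false, clear: true) # read, then reset counters
    hit = snapshot.select do |path, data|
      path.start_with?(root) && data[:lines].any? { |count| count.to_i > 0 }
    end
    [name, hit.keys.map { |path| File.basename(path) }.sort]
  end
end

# Hypothetical two-file codebase written to a temp dir so the sketch is runnable.
def demo_mapping
  root = File.realpath(Dir.mktmpdir)
  File.write(File.join(root, "billing.rb"), <<~'RUBY')
    module Billing
      def self.total(items)
        items.sum
      end
    end
  RUBY
  File.write(File.join(root, "greeting.rb"), <<~'RUBY')
    module Greeting
      def self.hello(name)
        "hi #{name}"
      end
    end
  RUBY

  Coverage.start(lines: true) # must start before the files under test are loaded
  Dir[File.join(root, "*.rb")].sort.each { |file| load file }
  tests = {
    "billing_test"  => -> { Billing.total([1, 2]) },
    "greeting_test" => -> { Greeting.hello("x") },
  }
  files_touched_per_test(tests, root)
ensure
  Coverage.result # stop collecting
end

p demo_mapping
```

Inverting the resulting hash gives the file-to-tests lookup used at selection time. Reading and clearing counters after every test is exactly the overhead that has to be engineered down in a suite with a very large number of tests.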

A key engineering detail in any coverage-based system is handling the staleness problem. If file B changes, the coverage data mapping tests to file B was recorded against the old version of B. The mapping is probably still useful (the same tests that used to touch B likely still do), but you need a policy for when to invalidate and rebuild mappings. Stripe’s approach likely uses a combination of file modification times, content hashes, and periodic full runs to keep the mapping accurate.
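One possible invalidation policy, sketched with content hashes only. The data shape and function names here are assumptions made for illustration; the write-up does not specify how Stripe stores or refreshes its mappings.

```ruby
require "digest"

# Fingerprint a file's contents so later edits can be detected.
def fingerprint(source)
  Digest::SHA256.hexdigest(source)
end

# recorded: { path => { digest: String, tests: [String] } }  (hypothetical shape)
# current:  { path => current source text }
# Returns the tests whose recorded coverage refers to content that has since
# changed or been deleted, and so should have their mapping re-recorded.
def tests_needing_refresh(recorded, current)
  recorded.flat_map do |path, entry|
    source = current[path]
    unchanged = source && fingerprint(source) == entry[:digest]
    unchanged ? [] : entry[:tests]
  end.uniq.sort
end

recorded = {
  "app/charge.rb" => { digest: fingerprint("v1"), tests: ["charge_test", "refund_test"] },
  "app/mailer.rb" => { digest: fingerprint("v1"), tests: ["mailer_test"] },
}
current = { "app/charge.rb" => "v2", "app/mailer.rb" => "v1" }
p tests_needing_refresh(recorded, current)
```

In this scheme the stale tests are still selected for the run, since the file they touched has changed; the point of flagging them is that the same run can re-record their coverage against the new content.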

The Sorbet Connection

Stripe is also the team behind Sorbet, a gradual type checker for Ruby. This is relevant to test selection because Sorbet’s type information provides a significantly more accurate view of dependency relationships than raw require analysis alone. When Sorbet knows that class PaymentMethod is referenced in subscription_service.rb, and it has type signatures for both, it can build a more reliable call graph and identify behavioral dependencies that would be invisible to a pure require-tracing approach.

The fraction of Stripe’s codebase covered by Sorbet types is not public, but they’ve been investing in typed coverage for years. A monorepo that’s even 60-70% typed gets substantial benefits for dependency analysis: the typed portions can be analyzed statically with high confidence, reducing reliance on the noisier coverage-based approach.

Comparison with Other Ecosystems

This problem has different shapes in different languages.

In TypeScript, the compiler’s module resolution is deterministic and explicit (relative imports, node_modules resolution). Jest’s --findRelatedTests flag and more sophisticated tools like Nx’s affected command can construct dependency graphs from import statements with high confidence. The dynamic import case (import()) adds some complexity, but it’s a much smaller fraction of dependencies than in Ruby.

In Java and the JVM ecosystem, build tools like Maven and Gradle have explicit dependency declarations. The JVM test impact analysis problem still exists at the method level, but module-level dependency graphs are accurate by construction.

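Go’s toolchain makes this comparatively easy. Imports are static string literals, so go list -f '{{.Deps}}' ./... reports each package’s exact transitive dependency set, and go list -json emits the same information as structured output (test-only imports are listed separately under TestImports). Selecting tests then reduces to finding the packages whose dependency set includes a changed package and passing that package list to go test.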

Ruby sits on the difficult end of this spectrum. The dynamism that makes Rails development productive is the same property that makes accurate static analysis of a Rails application hard.

Correctness vs. Coverage

There’s an engineering tension that any selective test execution system has to resolve: do you optimize for never skipping a relevant test (high recall, run more tests), or do you optimize for skipping as many irrelevant tests as possible (high precision, run fewer tests)?

For a production payment infrastructure company, the cost of a false negative (skipping a test that would have caught a bug) is extremely high. Stripe’s system almost certainly errs on the side of recall, which means their “selected” test set is probably a superset of the strictly necessary tests. The dependency graph has conservative fallback rules: if a file’s dependency information is uncertain, include all tests that might plausibly touch it.
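A conservative fallback rule of this kind might look like the following sketch. The patterns, the reverse index shape, and the function name are all hypothetical; the point is only that uncertainty widens the selection rather than narrowing it.

```ruby
# changed:       paths modified in the pull request
# reverse_index: { source path => tests known to depend on it } (precomputed)
# all_tests:     every test file in the repository
# Falls back to running everything when a change cannot be traced with confidence.
def select_tests(changed, reverse_index, all_tests,
                 global_patterns: [/\AGemfile(\.lock)?\z/, %r{\Aconfig/}])
  # Rule 1: some files affect everything (dependencies, global configuration).
  touches_global = changed.any? do |path|
    global_patterns.any? { |pattern| pattern.match?(path) }
  end
  return all_tests.sort if touches_global

  # Rule 2: a changed file the graph has never seen has unknown dependents.
  unknown = changed.any? do |path|
    !reverse_index.key?(path) && !all_tests.include?(path)
  end
  return all_tests.sort if unknown

  # Otherwise: tests that depend on a changed file, plus changed tests themselves.
  dependents = changed.flat_map { |path| reverse_index.fetch(path, []) }
  (dependents + (changed & all_tests)).uniq.sort
end

all_tests = ["test/charge_test.rb", "test/mailer_test.rb", "test/refund_test.rb"]
index = { "app/charge.rb" => ["test/charge_test.rb", "test/refund_test.rb"] }
p select_tests(["app/charge.rb"], index, all_tests)
p select_tests(["Gemfile.lock"], index, all_tests)
```

Each fallback costs CI time on the pull requests that trigger it, so the practical tuning question is how often the rules fire, not whether to have them.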

This conservatism is the right call. The goal of selective execution is not to find the minimal test set; it’s to find a test set that’s small enough to run quickly while being large enough to maintain confidence. A 90% reduction in test count with 99.9% recall is a strong result. A 95% reduction with 95% recall introduces risk that’s hard to reason about across a codebase at this scale.
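To make the terms concrete, here is how recall, precision, and reduction could be computed for a single pull request. The numbers are invented for illustration, and the function name is hypothetical; in practice the set of truly relevant tests is not known up front and can only be estimated, for example by comparing selected runs against periodic full runs.

```ruby
# all_tests: every test id; relevant: tests truly affected by the change;
# selected: tests the system chose to run.
# Assumes relevant and selected are non-empty. Uses Rational to avoid float noise.
def selection_metrics(all_tests, relevant, selected)
  caught = (selected & relevant).size
  {
    recall:    Rational(caught, relevant.size),             # share of relevant tests that ran
    precision: Rational(caught, selected.size),             # share of run tests that were relevant
    reduction: 1 - Rational(selected.size, all_tests.size), # share of the suite skipped
  }
end

all_tests = (1..1000).to_a
relevant  = (1..40).to_a  # 40 tests truly affected
selected  = (2..101).to_a # 100 tests run, one relevant test missed
metrics = selection_metrics(all_tests, relevant, selected)
metrics.each { |name, value| puts "#{name}: #{value.to_f}" }
```

In this made-up case the suite shrinks by 90% but one relevant test in forty is skipped, which is the kind of gap the conservative fallback rules exist to close.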

What This Means for the Broader Ruby Ecosystem

Stripe’s system is bespoke infrastructure built over years, but the underlying techniques are applicable to any large Ruby codebase. The test-prof gem from Evil Martians has popularized several profiling and optimization approaches for Ruby test suites. RSpec’s built-in --bisect flag addresses a related problem in the test tooling space: finding the minimal set of examples that reproduces an order-dependent failure.

The more interesting longer-term development is whether Sorbet and RBS type annotations become precise enough to make static dependency analysis viable for typical Ruby applications, not just at Stripe’s investment level. If the Ruby tooling ecosystem converges on typed-by-default conventions, the graph problem gets substantially easier, and selective test execution becomes accessible without coverage instrumentation infrastructure.

For now, Stripe’s write-up is a useful look at what it takes to make CI fast when your language’s dynamic properties work against you at every step of the analysis.
