
How You Build a Test Dependency Graph for 50 Million Lines of Ruby

Source: lobsters

The bottleneck in most CI pipelines at scale is not compilation, not deployment, not flakiness management. It’s that the full test suite takes too long to run on every commit, so you either wait for it or ship without confidence. Stripe’s engineering post on their selective test execution system for a 50-million-line Ruby monorepo addresses this directly, but the interesting part is not the selection logic itself. It’s the dependency graph that selection relies on.

Why a Dependency Graph

The premise of selective test execution is straightforward. If a code change only touches files A and B, you only need to run the tests that depend, directly or transitively, on A or B. Every other test is assumed to produce the same result it did on its last run. This sounds almost too obvious, but it can cut a suite that takes hours down to one that takes minutes, and as long as the dependency information is complete, it does so without weakening what gets verified.

To do this correctly, you need a mapping: for each source file or module, which tests could possibly fail if that file changes? Computing this mapping is the dependency graph problem. Build tools like Bazel and Buck2 solve it by requiring developers to declare dependencies explicitly in BUILD files. When you change a library, Bazel knows exactly which targets depend on it and can limit test execution to just those targets. Google has been running this approach at enormous scale for over a decade, and it works cleanly because the graph is maintained by the developers themselves as part of the build definition.
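The mapping described above can be sketched as a walk over reversed dependency edges. This is a minimal illustration, not Stripe’s or Bazel’s implementation: the file names, the DEPENDS_ON table, and the convention that tests end in _spec.rb are all assumptions made for the example.

```ruby
require "set"

# Hypothetical forward edges: each file lists the files it depends on.
DEPENDS_ON = {
  "app/models/user.rb"            => [],
  "app/models/invoice.rb"         => ["app/models/user.rb"],
  "app/services/billing.rb"       => ["app/models/invoice.rb"],
  "lib/money.rb"                  => [],
  "spec/models/user_spec.rb"      => ["app/models/user.rb"],
  "spec/models/invoice_spec.rb"   => ["app/models/invoice.rb"],
  "spec/services/billing_spec.rb" => ["app/services/billing.rb"],
  "spec/lib/money_spec.rb"        => ["lib/money.rb"]
}.freeze

# Invert the edges so we can ask "which files depend on this one?".
def reverse_edges(depends_on)
  reverse = {}
  depends_on.each do |file, deps|
    deps.each { |dep| (reverse[dep] ||= []) << file }
  end
  reverse
end

# Breadth-first walk over the reversed edges. Everything reachable from a
# changed file is a transitive dependent; we keep only the test files.
def affected_tests(changed_files, depends_on)
  reverse = reverse_edges(depends_on)
  seen = Set.new(changed_files)
  queue = changed_files.dup
  until queue.empty?
    current = queue.shift
    reverse.fetch(current, []).each do |dependent|
      queue << dependent if seen.add?(dependent)
    end
  end
  seen.select { |file| file.end_with?("_spec.rb") }.sort
end

puts affected_tests(["app/models/invoice.rb"], DEPENDS_ON)
```

In this toy graph, a change to the invoice model selects the invoice and billing specs and leaves the user and money specs alone, while a change to the user model fans out to three of the four specs.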

The tradeoff with BUILD files is maintenance overhead. Developers must keep them accurate, and the build system enforces correctness through hermeticity. Starting a greenfield project with this discipline is manageable. Retrofitting it onto an existing large codebase is a substantial engineering investment.

The Ruby Problem

Ruby makes dependency tracking hard in ways that languages with explicit module systems do not. The language is dynamic in a way that directly undermines static analysis. Methods can be defined at runtime. Classes can be reopened anywhere. Calls to require can take strings computed at runtime. The autoloader, whether classic Rails autoloading or the newer Zeitwerk, maps file paths to constants by naming convention, so a constant reference in code implies a file load without an explicit require statement visible at parse time.

This is not a corner case. It is how idiomatic Ruby code works. A file that calls User.find(id) depends on the definition of User, which lives in app/models/user.rb under Zeitwerk conventions. Tracking that dependency requires either understanding the autoloader’s rules or actually running the code.
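To make the naming convention concrete, here is a rough sketch of how a tool might guess which file defines a constant. Zeitwerk itself works in the other direction, deriving expected constant names from file names, so this inversion is only an approximation. The helper names and the list of autoload roots are hypothetical, and the underscore function is a simplified stand-in that ignores custom inflection rules.

```ruby
# Simplified snake_case conversion: "InvoiceLineItem" becomes "invoice_line_item".
# Real projects may configure custom inflections that this does not model.
def underscore(name)
  name
    .gsub(/([A-Z]+)([A-Z][a-z])/, '\1_\2')
    .gsub(/([a-z\d])([A-Z])/, '\1_\2')
    .downcase
end

# For a constant like Billing::InvoiceLineItem, list the paths where a
# convention-following codebase could define it, one per autoload root.
def candidate_paths(constant_name, autoload_roots)
  relative = constant_name.split("::").map { |part| underscore(part) }.join("/") + ".rb"
  autoload_roots.map { |root| File.join(root, relative) }
end

# Hypothetical autoload roots for a Rails-style layout.
ROOTS = ["app/models", "app/services"].freeze

puts candidate_paths("User", ROOTS)
puts candidate_paths("Billing::InvoiceLineItem", ROOTS)
```

A dependency tracker would then check which candidate actually exists on disk. That still only covers constants that appear literally in the source, which is the limitation the next sections turn on.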

Static analysis on Ruby is possible but requires significant investment. Sorbet, which Stripe developed and open-sourced, is arguably the most sophisticated static type checker available for Ruby. It understands enough of the language’s structure to resolve constant references and method calls across a large codebase in a way that basic Ruby parsing cannot. This is not incidental to Stripe’s selective test execution work. Sorbet gives them a foundation for understanding what calls what that most Ruby teams simply do not have.

Packwerk, originally from Shopify and now widely adopted, adds another layer. It enforces package boundaries, making dependencies between domains explicit and recorded in package configuration files. Where Zeitwerk answers “what file does this constant come from,” Packwerk answers “which package boundary does this dependency cross.” Together, they give teams a layered view of the codebase’s structure that makes selective test execution tractable.

Static Versus Dynamic Approaches

There are two fundamentally different ways to build the file-to-test mapping, and both involve real tradeoffs.

Static analysis parses source code and constructs the dependency graph without executing anything. It’s fast to compute and cheap to update incrementally. The risk is incompleteness: dynamic patterns that static analysis cannot resolve produce gaps where real dependencies go untracked. A false negative here means a broken build makes it past CI. For Ruby without Sorbet, this risk is substantial. For a Sorbet-typed codebase, it becomes much more manageable because the type checker has already resolved the ambiguities that would otherwise be opaque.
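The incompleteness risk is easy to demonstrate with a deliberately naive static pass. The sketch below uses Ruby’s bundled Ripper lexer to collect constant names that appear literally in a source string. It is far cruder than what Sorbet does and is not meant to represent it; the sample sources and the helper name are invented for the illustration.

```ruby
require "ripper"

# Collect constant names that appear literally in the source text.
# Ripper.lex returns tokens shaped like [[line, column], type, text, state].
def referenced_constants(source)
  Ripper.lex(source)
        .select { |_position, type, _text, _state| type == :on_const }
        .map { |_position, _type, text, _state| text }
        .uniq
end

# The constants are written out, so a static pass can see them.
STATIC_SOURCE = <<~RUBY
  def owner_of(invoice_id)
    invoice = Invoice.find(invoice_id)
    User.find(invoice.user_id)
  end
RUBY

# The class is looked up from a string at runtime, so the real
# dependency never appears as a constant token.
DYNAMIC_SOURCE = <<~RUBY
  def load_record(kind, id)
    Object.const_get(kind.capitalize).find(id)
  end
RUBY

p referenced_constants(STATIC_SOURCE)
p referenced_constants(DYNAMIC_SOURCE)
```

For the first snippet the pass finds Invoice and User. For the second it finds only Object, so a graph built this way would record no edge to whichever model load_record ends up touching. That missing edge is exactly the kind of false negative described above.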

Dynamic tracing instruments actual test runs and records which source files each test touched, typically through coverage tooling: SimpleCov in Ruby, Coverage.py in Python, or lower-level interpreter hooks. This approach captures everything static analysis misses. The cost is that you need to run all tests at least once to build the map, and the map must be maintained continuously as the codebase evolves. This is essentially what pytest-testmon does in Python: it records a coverage map per test case and uses that to determine which tests are affected by a given file change. It works well for teams that can afford the initial instrumentation run and keep the mapping database current.
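Once per-test coverage has been recorded, selection reduces to inverting that record. The sketch below assumes the recordings already exist as a hash from RSpec-style example ids to the files each example executed. The data is made up, and the step that produces it (for instance by diffing coverage results around each example) is left out.

```ruby
# Hypothetical recordings: example id => source files the example executed.
RECORDED_COVERAGE = {
  "spec/models/user_spec.rb[1:1]"    => ["app/models/user.rb"],
  "spec/models/user_spec.rb[1:2]"    => ["app/models/user.rb", "lib/email_validator.rb"],
  "spec/models/invoice_spec.rb[1:1]" => ["app/models/invoice.rb", "app/models/user.rb"],
  "spec/lib/money_spec.rb[1:1]"      => ["lib/money.rb"]
}.freeze

# Invert the recordings into source file => examples that touched it.
def build_file_index(recorded)
  index = {}
  recorded.each do |example_id, files|
    files.each { |file| (index[file] ||= []) << example_id }
  end
  index
end

# Look up every example that touched any of the changed files.
def examples_for_change(changed_files, index)
  changed_files.flat_map { |file| index.fetch(file, []) }.uniq.sort
end

INDEX = build_file_index(RECORDED_COVERAGE)
puts examples_for_change(["lib/email_validator.rb"], INDEX)
```

Note what happens for a file the map has never seen: the lookup returns nothing. A real system has to treat that as "unknown" rather than "no tests needed", which is why the map has to be kept current.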

Hybrid approaches use static analysis as the primary method and fall back to broader coverage when the graph is uncertain. You get most of the speedup from static analysis while limiting false negatives by erring toward inclusion when a dependency cannot be resolved statically.
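A hybrid policy of this kind might look like the following sketch. It assumes a precomputed static graph mapping each source file to the tests that depend on it, plus a set of files the analyzer has flagged as unresolvable. Both structures, and every name in the code, are hypothetical.

```ruby
require "set"

# Result of a selection decision: :selective or :full, plus the tests to run.
Selection = Struct.new(:mode, :tests, :reason)

# Use the static graph when it can account for every changed file.
# Fall back to the full suite when any changed file is flagged as
# unresolved or is missing from the graph entirely.
def select_tests(changed_files, static_graph:, unresolved:, all_tests:)
  opaque = changed_files.select do |file|
    unresolved.include?(file) || !static_graph.key?(file)
  end
  unless opaque.empty?
    return Selection.new(:full, all_tests.sort, "cannot resolve: #{opaque.join(', ')}")
  end

  tests = changed_files.flat_map { |file| static_graph.fetch(file) }.uniq.sort
  Selection.new(:selective, tests, "static graph covers all changed files")
end

# Hypothetical inputs for illustration.
STATIC_GRAPH = {
  "app/models/user.rb"   => ["spec/models/user_spec.rb"],
  "lib/money.rb"         => ["spec/lib/money_spec.rb"],
  "lib/plugin_loader.rb" => []
}.freeze
UNRESOLVED = Set["lib/plugin_loader.rb"].freeze
ALL_TESTS = ["spec/models/user_spec.rb", "spec/lib/money_spec.rb"].freeze

result = select_tests(["app/models/user.rb"],
                      static_graph: STATIC_GRAPH, unresolved: UNRESOLVED, all_tests: ALL_TESTS)
puts "#{result.mode}: #{result.tests.join(', ')}"
```

Treating a file that is absent from the graph the same as a flagged one is a deliberate choice here: a brand-new file has no recorded dependents, and "no dependents known" is not the same as "no dependents".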

The Safety Margin Problem

Any selective test system must answer a core question: when in doubt, skip or include? The conservative answer is always include. This preserves correctness but limits the speedup on ambiguous changes. The aggressive answer is skip, which maximizes speed but introduces the risk of missed failures.

Most production implementations add a confidence tier or a set of files designated to trigger a broader run regardless of their apparent footprint. Changes to widely imported utility modules often cascade to a large fraction of the suite anyway, and some files, like test configuration or CI infrastructure code, are reasonable candidates for always triggering a full run. Getting this tiering right requires knowing your codebase’s structure, which is yet another reason that Packwerk’s package graph is useful: it provides a natural grouping for reasoning about blast radius.
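One way to express such a tiering is a small policy function that runs before selection is trusted. The file list, the directory prefixes, and the 50 percent threshold below are illustrative assumptions, not values taken from Stripe’s post.

```ruby
# Hypothetical files and directories whose changes always force a full run.
FULL_RUN_FILES = ["Gemfile.lock", "spec/spec_helper.rb"].freeze
FULL_RUN_PREFIXES = ["spec/support/", ".ci/"].freeze

# Assumed cutoff: past this fraction of the suite, selection is not worth it.
MAX_SELECTED_FRACTION = 0.5

def full_run_trigger?(path)
  FULL_RUN_FILES.include?(path) ||
    FULL_RUN_PREFIXES.any? { |prefix| path.start_with?(prefix) }
end

# Decide between a full and a selective run for one change.
def plan_run(changed_files, selected_tests, total_test_count)
  return :full if changed_files.any? { |path| full_run_trigger?(path) }
  return :full if selected_tests.size > total_test_count * MAX_SELECTED_FRACTION

  :selective
end

puts plan_run(["app/models/user.rb"], ["spec/models/user_spec.rb"], 100)
puts plan_run(["Gemfile.lock"], [], 100)
```

The second check covers the widely imported utility case: if selection already picks more than half the suite, the sketch simply runs everything rather than relying on the graph being right about the remainder.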

There is also the question of granularity. If your dependency tracking works at the file level but a test file contains 200 test cases, you might run all 200 when only three are actually affected. Test-level coverage mapping, which records which individual examples touch which files rather than treating the test file as an atomic unit, is more precise but more expensive to maintain.

What Other Ecosystems Look Like

For comparison, Nx in the JavaScript and TypeScript ecosystem handles monorepo test selection by combining an explicit project dependency graph with git-diff analysis. Running nx affected --target=test computes which projects contain changes or depend on changed projects and runs tests only for those. It works cleanly because TypeScript imports are, for the most part, statically analyzable through the module graph. There is no autoloader or computed require hiding dependencies from the parser.

Turborepo takes a similar approach with task graphs and remote caching. Both tools benefit from the fact that JavaScript module resolution, while complex in its own ways, is largely amenable to static analysis.

The Ruby case is harder, which is why Stripe’s work is more interesting than comparable work in TypeScript-land. Getting selective test execution right in a dynamic language with 50 million lines of code, without missing real failures or running so broadly that selection provides no benefit, requires sustained tooling investment. The Sorbet and Packwerk infrastructure Stripe has built over years is not separate from this selective test execution work. It is the prerequisite that makes it feasible.

The Broader Lesson

Teams often treat CI speed as a tooling problem: faster machines, better parallelization, smarter queue management. Selective test execution is a different kind of intervention. It changes how much work needs to happen at all, not how fast the existing work runs.

The catch is that it requires an accurate model of your codebase’s dependency structure. For teams using a declarative build system, that model is maintained as part of the build definition. For teams in dynamic languages without that infrastructure, building the model is itself a significant project, one that compounds with every poorly-typed module and every metaprogramming pattern that hides its dependencies from analysis.

Stripe’s post is evidence of that investment paying off at real scale. It is also an honest picture of what the foundation for this kind of system looks like, and how much work came before the test selection part was even possible to build.
