How Stripe's Type Checker Became Its CI Infrastructure

The algorithm underneath Stripe’s selective test execution system is not new. Rothermel and Harrold published the foundational theory of regression test selection in 1994, and the core logic has been stable since: when a set of files changes, compute the transitive reverse dependency closure and run only the tests within it. The problem was never the algorithm. For dynamic languages, the problem was always the dependency graph.

Stripe’s write-up on selective test execution describes a system that skips roughly 80 to 90 percent of tests on a typical pull request against a 50-million-line Ruby monorepo. The implementation centers on Sorbet, their Ruby type checker, which turns out to produce exactly the dependency graph this kind of system requires.

Why Dynamic Languages Made This Hard

In a statically typed language, the dependency graph is a byproduct of compilation. The compiler resolves which files reference which symbols from which other files; extracting that information for test selection is a tooling problem, not a theoretical one. Ruby’s runtime dynamism removes that foundation.

A Ruby file can load another file using a string computed at runtime. It can reopen any class anywhere, adding methods that other code will call with no visible reference at the call site. Rails autoloading via Zeitwerk maps constant names to file paths by convention, so the dependency from referencing Order to loading app/models/order.rb is never written in the source. A naive require-graph analysis misses most of the real dependencies in a large Rails application.
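To make the gap concrete, here is a minimal sketch of a naive require-graph scanner. The file names, the PAYMENT_PLUGIN variable, and the helper literal_requires are all hypothetical, and the regex is deliberately simplistic. It recovers only dependencies written as string literals and sees none of the runtime-computed, autoloaded, or reopened-class edges described above.

```ruby
# Hypothetical in-memory "files" standing in for a small codebase.
# Single-quoted heredocs keep the embedded #{...} from being interpolated here.
SOURCES = {
  "app/checkout.rb" => <<~'RUBY',
    require "json"
    require_relative "models/cart"

    plugin = ENV.fetch("PAYMENT_PLUGIN", "card")
    require "plugins/#{plugin}"        # path computed at runtime

    Order.create!(cart: Cart.current)  # constants resolved by the autoloader
  RUBY
  "app/patches/order_ext.rb" => <<~'RUBY'
    class Order                        # reopens a class defined elsewhere
      def refundable?; true; end
    end
  RUBY
}.freeze

# Matches only require/require_relative calls whose argument is a plain
# string literal with no interpolation.
LITERAL_REQUIRE = /^\s*require(?:_relative)?\s+["']([^"'#]+)["']\s*(?:#.*)?$/

# Returns the literal require targets found in a source string.
def literal_requires(source)
  source.each_line.filter_map { |line| line[LITERAL_REQUIRE, 1] }
end

SOURCES.each do |path, source|
  puts "#{path}: #{literal_requires(source).inspect}"
end
```

The scanner reports two edges for the first file and none for the second, even though both files depend on wherever Order is defined.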

Before type checkers entered the picture, selective test execution for Ruby was either impractical or deeply conservative. You could restrict the analysis to a well-understood scope and accept incomplete edges, which sacrifices safety. Or you could add conservative edges whenever uncertain, which expands the affected set toward the full suite quickly enough to erase the savings. Neither approach produces an 80 to 90 percent reduction in tests run.

What Sorbet’s Analysis Provides

Sorbet is a static type checker for Ruby, developed at Stripe and open-sourced in 2019. Its usual justification is correctness: catch type errors before runtime, enable reliable editor tooling, make large-scale refactors safer to execute. The selective test execution system illustrates a different kind of value. When Sorbet analyzes a Ruby codebase, it resolves constant references, method calls, and module inclusions to their source locations. It tracks which file defines which class, which file extends which module, which file reopens which class. The output is, in effect, a file-scoped cross-reference index, complete to the extent that the code can be resolved statically.

For test selection, inverting that index gives the reverse dependency graph: given any file, Sorbet can identify every other file in the codebase that depends on it. When a pull request changes a set of files, the system traverses this graph to compute the affected set and runs only the tests that fall within it.
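The inversion and traversal can be sketched in a few lines. This is an illustration under stated assumptions, not Stripe’s implementation: the forward index FORWARD_DEPS, the file names, and the convention that tests live under test/ are all hypothetical, and a real index would come from the type checker rather than a hand-written hash.

```ruby
require "set"

# Hypothetical forward index: each file mapped to the files it references.
FORWARD_DEPS = {
  "lib/money.rb"         => [],
  "lib/ledger.rb"        => ["lib/money.rb"],
  "lib/invoice.rb"       => ["lib/ledger.rb"],
  "lib/mailer.rb"        => [],
  "test/money_test.rb"   => ["lib/money.rb"],
  "test/invoice_test.rb" => ["lib/invoice.rb"],
  "test/mailer_test.rb"  => ["lib/mailer.rb"],
}.freeze

# Invert "file -> what it depends on" into "file -> what depends on it".
def invert(forward)
  reverse = Hash.new { |hash, key| hash[key] = Set.new }
  forward.each do |file, deps|
    deps.each { |dep| reverse[dep] << file }
  end
  reverse
end

# Breadth-first walk over reverse edges: the transitive reverse
# dependency closure of the changed files (changed files included).
def affected_files(reverse, changed)
  seen = Set.new(changed)
  queue = changed.dup
  until queue.empty?
    file = queue.shift
    reverse.fetch(file, []).each do |dependent|
      queue << dependent if seen.add?(dependent)
    end
  end
  seen
end

# Keep only the test files that fall inside the affected set.
def tests_to_run(forward, changed)
  affected_files(invert(forward), changed)
    .select { |file| file.start_with?("test/") }
    .sort
end

p tests_to_run(FORWARD_DEPS, ["lib/money.rb"])
p tests_to_run(FORWARD_DEPS, ["lib/mailer.rb"])
```

A change to lib/money.rb reaches the invoice test through two intermediate files, while a change to lib/mailer.rb selects only the mailer test.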

This means Sorbet’s type coverage directly determines the accuracy of test selection. Where coverage is high, the dependency graph is precise and the affected set is tight. Where coverage is low, the system must add conservative edges, widening the selection. There is a measurable relationship between typing discipline and CI cost: more coverage, fewer unnecessary test runs.

Closing the Gem Boundary

Gems, Ruby’s external libraries, introduce a specific challenge. Sorbet cannot analyze gem internals without type information, and most gems are not typed in Sorbet’s format. Tapioca, a gem developed by Shopify for the Sorbet ecosystem, addresses this by generating RBI files (Ruby Interface files) that describe gem public APIs as Sorbet-readable stubs. These stubs let Sorbet reason about gem boundaries without analyzing gem source, allowing the dependency graph to track when application code depends on a gem interface.

A change to a gem version, or to a gem’s generated stub, propagates through the affected-file computation correctly. Without this layer, gem boundaries would be invisible to the selection system, and any gem update would require a conservative full run or risk missing affected tests.
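One way to picture that propagation, as a sketch only: treat a gem version bump as a change to the gem’s stub file, then feed those stub paths into the same affected-file computation as any other changed file. The helper changed_gem_stubs and the version hashes are hypothetical. The path layout assumes Tapioca’s convention of writing gem stubs to sorbet/rbi/gems/ as name@version.rbi, and the source does not say how Stripe represents gem changes internally.

```ruby
# Given resolved gem versions before and after a change, list the stub
# files that were added, removed, or replaced. Assumes stubs are named
# sorbet/rbi/gems/<name>@<version>.rbi.
def changed_gem_stubs(before, after)
  (before.keys | after.keys).sort.flat_map do |gem_name|
    old_version = before[gem_name]
    new_version = after[gem_name]
    next [] if old_version == new_version

    [old_version, new_version].compact.map do |version|
      "sorbet/rbi/gems/#{gem_name}@#{version}.rbi"
    end
  end
end

# Hypothetical versions on the base branch and on the pull request.
before = { "rack" => "2.2.8", "json" => "2.7.1" }
after  = { "rack" => "3.0.9", "json" => "2.7.1" }

puts changed_gem_stubs(before, after)
```

Any application file whose resolved references land in one of the listed stubs would then be reachable through the reverse dependency graph.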

Safety Guarantees and Conservative Edges

Rothermel and Harrold’s framework defines a safe regression test selection algorithm as one that never excludes a test that could detect a regression. Achieving that safety guarantee requires the dependency graph to be complete, or at minimum conservative: when a relationship is uncertain, include an edge rather than omit one.

Ruby has patterns that resist static resolution even under Sorbet. Dynamic dispatch via send, constant lookup via const_get, and eval-based metaprogramming cannot be resolved to specific call targets statically. Stripe’s system handles these by adding conservative edges for code that exhibits such patterns. The cost is false positives (running tests that were not strictly affected by the change), but the alternative is false negatives.
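As a rough illustration of how such code might be flagged, the sketch below uses Ruby’s standard-library Ripper lexer to find identifier tokens named send, const_get, or eval, ignoring occurrences inside comments and string literals. The helper names and sample sources are hypothetical, and token matching is far cruder than a type checker’s resolution: it would also flag an unrelated user-defined method called send. The source does not describe how Stripe detects these patterns or which edges it adds.

```ruby
require "ripper"

# The dynamic constructs named in the text above.
DYNAMIC_CALLS = %w[send const_get eval].freeze

# Returns the dynamic call names that appear as identifier tokens.
# Ripper.lex yields [position, token_type, token_text, state] entries.
def dynamic_calls_in(source)
  Ripper.lex(source)
        .select { |token| token[1] == :on_ident }
        .map { |token| token[2] }
        .select { |name| DYNAMIC_CALLS.include?(name) }
        .uniq
end

# Paths of files containing at least one statically unresolvable call.
def unresolvable_files(sources)
  sources.select { |_path, source| dynamic_calls_in(source).any? }.keys
end

# Hypothetical sources; the single-quoted heredoc avoids interpolation here.
sources = {
  "lib/dispatcher.rb" => <<~'RUBY',
    # a comment mentioning eval should not count
    handler = Object.const_get("#{kind.capitalize}Handler")
    handler.new.send(action, payload)
    puts "send receipts"
  RUBY
  "lib/money.rb" => "Money = Struct.new(:cents)\n",
}

p dynamic_calls_in(sources["lib/dispatcher.rb"])
p unresolvable_files(sources)
```

One conservative policy, assumed here purely for illustration, is to treat every flagged file as a dependent of every other file, so that it and its tests are always part of the affected set.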

A second safety layer is the full run. Selective execution applies to pull requests, where fast feedback is the goal. On merges to the main branch, the full test suite runs regardless of what the selection algorithm would have chosen. Microsoft takes a similar approach with Test Impact Analysis for .NET in Azure DevOps, where nightly builds run all tests as a correctness backstop. These full runs catch any regression that slipped through a gap in the dependency model, keeping the selective system honest.

How Other Approaches Compare

Bazel handles dependency tracking at the build system level: BUILD files declare explicit dependencies for every target, so the reverse dependency graph is always complete and accurate. Google’s CI systems have used this for over a decade to avoid running unaffected tests across their internal monorepo. The tradeoff is that BUILD files require explicit maintenance and are best suited to codebases designed around this tooling from the start.

Microsoft’s Test Impact Analysis for .NET takes a dynamic approach rather than a static one. During a full test run, it records which source files each test actually executed, building a test-to-source coverage map. On subsequent runs, it queries this map to find tests whose coverage overlaps the changed files. The system captures runtime-dynamic dependencies that static analysis would miss, but it requires a full run to initialize and the map grows stale as the codebase evolves between full runs. Microsoft reports 40 to 80 percent test reduction in typical .NET projects using this method.
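The lookup side of a coverage-map approach is simple to sketch. The example below is written in Ruby for consistency with the rest of this article rather than .NET, and the map contents, file names, and helper impacted_tests are hypothetical. It also assumes, as one possible policy, that tests with no recorded coverage (for example, tests added since the last full run) are always selected.

```ruby
require "set"

# Hypothetical map recorded during the last full run:
# each test mapped to the source files it executed.
COVERAGE_MAP = {
  "test/cart_test.rb"    => Set["lib/cart.rb", "lib/money.rb"],
  "test/invoice_test.rb" => Set["lib/invoice.rb", "lib/money.rb"],
  "test/mailer_test.rb"  => Set["lib/mailer.rb"],
}.freeze

# Select tests whose recorded coverage overlaps the changed files,
# plus any test that has no recorded coverage yet.
def impacted_tests(coverage_map, changed_files, all_tests: coverage_map.keys)
  changed = changed_files.to_set
  unmapped = all_tests - coverage_map.keys
  hit = coverage_map.select { |_test, files| files.intersect?(changed) }.keys
  (hit + unmapped).sort
end

p impacted_tests(COVERAGE_MAP, ["lib/money.rb"])
p impacted_tests(
  COVERAGE_MAP,
  ["lib/mailer.rb"],
  all_tests: COVERAGE_MAP.keys + ["test/new_test.rb"]
)
```

The staleness problem is visible in the data structure itself: the map describes what each test executed at recording time, so an edge introduced since the last full run is absent until the next one.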

ML-based tools like Launchable and Gradle Enterprise’s predictive test selection train on historical CI data to predict which tests are likely to fail given a set of changed files. These require no dependency graph, work across any language, and tolerate dynamic patterns naturally, but they produce probabilistic estimates rather than exact answers and need substantial historical failure data to reach useful accuracy.

Stripe’s approach sits between the build-system model and the dynamic tracing model. Sorbet provides a static graph derived from actual type information rather than explicit declarations, capturing real dependencies without requiring developers to maintain a separate metadata layer. The accuracy of the graph scales with type coverage rather than with a build specification that must be kept in sync independently of the source.

The Compounding Return on Typing Investment

The conventional case for typing a dynamic language codebase centers on correctness and developer experience: catch errors earlier, enable better tooling, make refactors safer to execute at scale. What Stripe’s CI system adds to that argument is less obvious but concrete: static types make a codebase mechanically legible to a broader class of automated tools.

Test selection is one instance of this. The same dependency graph that drives selective CI could drive automated impact analysis for infrastructure changes, large-scale migration tooling, IDE cross-reference features, and documentation systems. Each of these tools has a higher accuracy ceiling on a typed codebase because the dependency model is more complete. The type checker is not doing additional work for these tools; it is reusing an analysis it had to perform anyway.

Stripe’s Sorbet coverage built up over years under a policy of progressive adoption. The original justification was type safety. The CI speed improvement is a return on that same investment, one that was not necessarily the primary motivation when the typing work began. For teams evaluating whether to adopt Sorbet in a large Ruby codebase, this belongs in the calculation alongside the correctness and tooling arguments. A type checker adopted for one reason becomes infrastructure that other systems depend on, and those secondary returns accumulate quietly until they show up in the CI bill.