Stripe’s write-up on selective test execution describes a system that reduces CI time from hours to minutes on a 50-million-line Ruby monorepo. The concept is familiar: figure out which tests are actually affected by a given change, skip the rest. What the post leaves underspecified is the low-level machinery that makes it work. The critical enabling primitive is a single Ruby method and a flag most developers have never used.
The Coverage Module and Why Most People Use It Wrong
Ruby’s standard library ships a Coverage module that records which lines of code execute during a program run. Most teams encounter it through tools like SimpleCov, which wraps it to produce HTML reports for humans. That use case measures execution counts: how many times did this line run? The default mode updates a counter on every line execution, which is expensive.
For test impact analysis, you do not care how many times a line ran. You care whether it ran at all. Ruby 2.6 added oneshot_lines: true, a mode that records only whether each line was hit, not how many times, and stops tracking it once it has been hit. The runtime overhead drops substantially because the interpreter no longer updates a counter on every execution: it records a line the first time it runs, removes the hook for that line, and pays nothing for subsequent hits.
# Default mode: meaningful overhead on large suites, records execution counts
Coverage.start
# oneshot_lines mode: significantly lower overhead, records only hit/not-hit
Coverage.start(oneshot_lines: true)
The difference matters at scale. Coverage instrumentation in default mode can add 20-40% overhead to a test run. On a two-hour suite, that is roughly 24 to 48 minutes of extra CI time to collect the data you need to make the next run faster. The oneshot mode brings that overhead to a range where a periodic full-coverage rebuild becomes operationally feasible as a scheduled job rather than a per-commit tax.
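To see the hit/not-hit distinction concretely, the sketch below loads a throwaway file under oneshot_lines and inspects what Coverage.result returns. The file name greeter.rb, its contents, and the helper oneshot_lines_for are invented for illustration; the result shape, a hash per file whose :oneshot_lines array lists executed line numbers, is the one described in Ruby’s Coverage documentation.

```ruby
require 'coverage'
require 'tmpdir'

# Hypothetical source file used only for this demonstration.
GREETER_SOURCE = <<~'RUBY'
  def greet(name)
    "hello #{name}"
  end

  3.times do
    greet("world")
  end

  def never_called
    :unused
  end
RUBY

# Loads the given source under oneshot_lines coverage and returns the
# line numbers that Ruby reports as executed.
def oneshot_lines_for(source)
  Dir.mktmpdir do |dir|
    path = File.join(dir, 'greeter.rb')
    File.write(path, source)

    Coverage.start(oneshot_lines: true) # must start before the file is loaded
    load path
    result = Coverage.result

    # Match on the basename so symlinked temp directories do not matter.
    _file, data = result.find { |file, _data| File.basename(file) == 'greeter.rb' }
    data[:oneshot_lines]
  end
end

p oneshot_lines_for(GREETER_SOURCE).sort
```

Line 2 (the body of greet) runs three times but is reported once, and line 10 (the body of never_called) is absent because it never executed. That is all a test selector needs to know.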
Building a Per-Test Coverage Collector
The fundamental data structure in any coverage-based test selector is a map from test file to source files touched. You build it by running each test in isolation with coverage enabled and recording what it loaded.
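# collector.rb
require 'coverage'
require 'json'

# Builds a map of test file => source files executed while that test ran.
class TestCoverageCollector
  def initialize(output_path)
    @output_path = output_path
    @map = {}
  end

  # Intended to be called once per process. Coverage only instruments files
  # compiled after Coverage.start, so a source file already loaded by an
  # earlier test in the same process would be missing from later results.
  def run_test(test_file)
    test_path = File.expand_path(test_file)
    Coverage.start(oneshot_lines: true)
    begin
      load test_path
    ensure
      result = Coverage.result
      # With oneshot_lines, result looks like:
      #   { '/abs/path/to/source.rb' => { oneshot_lines: [1, 2, 5] } }
      # where the array lists the line numbers that executed at least once.
      touched_files = result
        .select { |_file, data| data[:oneshot_lines].any? }
        .keys
        .reject { |file| file == test_path }
      @map[test_file] = touched_files
    end
  end

  # Writes the collected map to disk as JSON.
  def save
    File.write(@output_path, JSON.generate(@map))
  end
end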
This is essentially how Crystalball, Toptal’s open-source Ruby test impact analysis gem, works at its core. The gem adds RSpec integration, smarter file path normalization, and persistent storage, but the Coverage module call is the center of gravity.
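Running each test “in isolation” matters more than it might seem, because Coverage only sees files that are compiled after it starts. One way to get that isolation, sketched here under assumptions, is to spawn a fresh Ruby interpreter per test file. The names ISOLATED_COLLECTOR_SCRIPT, touched_files_in_fresh_process, and build_coverage_map are hypothetical, and the sketch assumes plain Ruby test files that run when loaded; an RSpec suite would need the runner invoked instead of load.

```ruby
require 'json'
require 'open3'
require 'rbconfig'
require 'tempfile'

# Script run inside each child interpreter. ARGV[0] is the test file and
# ARGV[1] is where the child writes its list of touched files as JSON.
ISOLATED_COLLECTOR_SCRIPT = <<~'RUBY'
  require 'coverage'
  require 'json'

  test_file = File.expand_path(ARGV.fetch(0))
  out_path = ARGV.fetch(1)

  Coverage.start(oneshot_lines: true)
  begin
    load test_file
  ensure
    touched = Coverage.result
                      .select { |_file, data| data[:oneshot_lines].any? }
                      .keys
                      .reject { |file| file == test_file }
    File.write(out_path, JSON.generate(touched))
  end
RUBY

# Runs one test file in a fresh Ruby process and returns the files it touched.
def touched_files_in_fresh_process(test_file)
  Tempfile.create('touched-files') do |out|
    _log, status = Open3.capture2e(RbConfig.ruby, '-e', ISOLATED_COLLECTOR_SCRIPT,
                                   test_file, out.path)
    warn "#{test_file} exited with status #{status.exitstatus}" unless status.success?
    payload = File.read(out.path)
    payload.empty? ? [] : JSON.parse(payload)
  end
end

# Builds the test => touched files map, one process per test file.
def build_coverage_map(test_files)
  test_files.to_h { |test_file| [test_file, touched_files_in_fresh_process(test_file)] }
end
```

Process startup per test is slow, which is one more reason the full map is rebuilt on a schedule rather than on every commit.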
The resulting serialized map, once built across the full test suite, maps each test to the set of source files it loaded. A typical entry might look like this:
{
  "spec/services/payment_processor_spec.rb": [
    "app/services/payment_processor.rb",
    "app/models/payment_method.rb",
    "lib/stripe/client.rb",
    "app/models/concerns/auditable.rb"
  ]
}
Inverting the Index
The map above answers the question “what files does test T touch?” For test selection, you need the inverse: “what tests touch file F?” That requires a second pass to invert the relationship.
def build_inverted_index(coverage_map)
  inverted = Hash.new { |h, k| h[k] = [] }
  coverage_map.each do |test_file, source_files|
    source_files.each do |source_file|
      inverted[source_file] << test_file
    end
  end
  inverted
end
With the inverted index available, test selection for a given git diff is a lookup, not a scan:
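# Uses fetch with a default so that files missing from the index (for example
# when it was loaded from JSON as a plain Hash) contribute no tests, not nil.
def tests_for_diff(changed_files, inverted_index)
  changed_files
    .flat_map { |f| inverted_index.fetch(f, []) }
    .uniq
end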
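One practical wrinkle is that Coverage reports absolute paths while git reports paths relative to the repository root, so a lookup only works after both sides are normalized. The following is a minimal sketch of that normalization combined with the inversion and lookup steps; build_relative_index, select_tests, and the /repo root are hypothetical names chosen for the example.

```ruby
# Inverts a test => sources map into source => tests, stripping the
# repository root so keys match the relative paths that git reports.
def build_relative_index(coverage_map, repo_root)
  prefix = "#{repo_root.chomp('/')}/"
  index = Hash.new { |hash, key| hash[key] = [] }
  coverage_map.each do |test_file, source_files|
    source_files.each do |source_file|
      index[source_file.delete_prefix(prefix)] << test_file.delete_prefix(prefix)
    end
  end
  index.default_proc = nil # later lookups of unknown files must not add keys
  index
end

# Returns the tests mapped to any of the changed files.
def select_tests(changed_files, index)
  changed_files.flat_map { |file| index.fetch(file, []) }.uniq
end

coverage_map = {
  '/repo/spec/services/payment_processor_spec.rb' => [
    '/repo/app/services/payment_processor.rb',
    '/repo/app/models/payment_method.rb'
  ],
  '/repo/spec/models/payment_method_spec.rb' => [
    '/repo/app/models/payment_method.rb'
  ]
}

index = build_relative_index(coverage_map, '/repo')
p select_tests(['app/models/payment_method.rb'], index)
p select_tests(['app/models/brand_new.rb'], index) # no history, so nothing is selected
```

The second lookup returns an empty list, which is exactly the case the fallback rules later in the article exist to catch.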
The inverted index is the artifact that CI workers fetch at the start of each PR build. At Stripe’s scale it covers millions of (source file, test file) pairs and is stored compressed in shared cache, rebuilt nightly rather than on every commit. The rebuild amortizes the collection overhead across time: you pay the instrumentation cost once per day on a dedicated worker, and every PR benefits from the result.
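The write-up does not say which serialization or compression format is used, so the following is only one plausible shape for the artifact: gzip-compressed JSON written by the nightly job and read back by each CI worker. The helper names dump_index and load_index are assumptions made for this sketch.

```ruby
require 'json'
require 'zlib'

# Nightly job: serialize the inverted index as gzip-compressed JSON.
def dump_index(index, path)
  File.binwrite(path, Zlib.gzip(JSON.generate(index)))
end

# CI worker: read the artifact back into a plain Hash of String => Array.
def load_index(path)
  JSON.parse(Zlib.gunzip(File.binread(path)))
end
```

An index loaded this way is a plain Hash with no default block, so lookups for unknown files should go through fetch with an explicit default.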
Transitivity Comes for Free
One underappreciated property of coverage-based collection is that it handles transitive dependencies automatically. When a test loads payment_processor.rb, and payment_processor.rb requires ledger.rb at the top of the file, the Ruby runtime loads ledger.rb before any test code runs. Coverage.result will include ledger.rb in the test’s coverage record even though the test never directly referenced it.
This means the inverted index entry for ledger.rb will contain all tests that transitively depend on it through any require chain, without any explicit graph traversal code on your part. The runtime’s module loading is doing the transitive closure computation for you.
The limitation is autoloading. Zeitwerk, the autoloader used in modern Rails applications, loads files lazily when their corresponding constant is first referenced. If a test never exercises the code path that triggers a particular constant lookup, the corresponding file never loads, and the coverage map has no entry for it. A test that exercises only the happy path of PaymentProcessor#charge will miss dependencies that only load on the error path.
This is the precise gap that Stripe’s Sorbet integration closes. Sorbet’s type-aware call graph can resolve method calls in typed code at analysis time without executing them, so it captures method-level dependencies that coverage collection would miss because the relevant code path was never exercised. Coverage handles the untyped portions; Sorbet handles the typed ones with higher precision. The two approaches are complementary rather than redundant.
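The contrast between eager requires and lazy autoloading can be reproduced in a small experiment. The sketch below uses Ruby’s built-in Kernel#autoload as a stand-in for Zeitwerk (an assumption made to keep the example dependency-free), and the file contents and the helper touched_basenames_on_happy_path are invented for illustration.

```ruby
require 'coverage'
require 'tmpdir'

# Writes three hypothetical files, exercises only the happy path, and returns
# the basenames of the files that coverage saw executing.
def touched_basenames_on_happy_path
  Dir.mktmpdir do |dir|
    File.write(File.join(dir, 'ledger.rb'), <<~'RUBY')
      module Ledger
        def self.record(amount)
          amount
        end
      end
    RUBY

    File.write(File.join(dir, 'failure_handler.rb'), <<~'RUBY')
      module FailureHandler
        def self.handle(error)
          error.message
        end
      end
    RUBY

    handler_path = File.join(dir, 'failure_handler.rb')
    File.write(File.join(dir, 'payment_processor.rb'), <<~RUBY)
      require_relative 'ledger'                          # eager: loads immediately
      autoload :FailureHandler, #{handler_path.inspect}  # lazy: loads on first reference

      class PaymentProcessor
        def charge(amount)
          raise ArgumentError, 'amount must be positive' unless amount.positive?
          Ledger.record(amount)
        rescue ArgumentError => e
          FailureHandler.handle(e)
        end
      end
    RUBY

    Coverage.start(oneshot_lines: true)
    load File.join(dir, 'payment_processor.rb')
    PaymentProcessor.new.charge(100) # happy path only; the rescue branch never runs
    Coverage.result
            .select { |_file, data| data[:oneshot_lines].any? }
            .keys
            .map { |file| File.basename(file) }
  end
end

puts touched_basenames_on_happy_path.sort
```

On the happy path the result contains payment_processor.rb and ledger.rb but not failure_handler.rb, so a later change to the failure handler would not select this test.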
The Staleness Engineering
The coverage map is correct at the moment it was recorded. It is potentially wrong from the moment the codebase changes. How wrong depends on how much the code has changed and whether those changes altered dependency relationships.
Three categories of change create specific problems:
New source files have no entries in the inverted index. A file that didn’t exist during the last coverage rebuild won’t appear as a dependency of any test, so a lookup for it selects nothing. The safe policy is to treat any new source file as a trigger for the full suite or, less conservatively, to fall back to running every test associated with the new file’s directory.
Changes to shared infrastructure are the most dangerous. Files loaded by virtually every test (shared helpers, database setup, factory definitions, Rails initializers) will appear in nearly every test’s coverage record. If one of these files changes, the inverted index correctly says “run every test,” which defeats the point of selection. The practical fix is to maintain an explicit list of paths where any change unconditionally triggers the full suite, bypassing the index entirely. These lists require periodic auditing as the codebase evolves.
File moves and renames break the index silently. The inverted index keys on file paths; a rename creates a new path with no history and leaves the old path pointing at a file that no longer exists. The safe policy treats renamed files as new files and applies the same fallback until the next rebuild captures coverage under the new path.
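New and renamed files can be detected directly from the diff. With rename detection enabled, git diff --name-status -M prints one line per change: a status letter, a tab, and the path, with renames shown as R plus a similarity score followed by the old and new paths. The parser below is a sketch built on that output format; DiffSummary, parse_name_status, and paths_without_history are hypothetical names.

```ruby
# Buckets for the paths reported by `git diff --name-status -M`.
DiffSummary = Struct.new(:modified, :added, :renamed, :deleted, keyword_init: true)

# Parses name-status output into a DiffSummary.
def parse_name_status(diff_output)
  summary = DiffSummary.new(modified: [], added: [], renamed: [], deleted: [])
  diff_output.each_line(chomp: true) do |line|
    status, *paths = line.split("\t")
    next if status.nil? || paths.empty?

    case status[0]
    when 'A', 'C' then summary.added << paths.last # copies are new paths too
    when 'D'      then summary.deleted << paths.first
    when 'R'      then summary.renamed << { from: paths.first, to: paths.last }
    else               summary.modified << paths.first
    end
  end
  summary
end

# Paths the inverted index cannot know about yet: added files and rename targets.
def paths_without_history(summary)
  summary.added + summary.renamed.map { |rename| rename[:to] }
end

sample = "M\tapp/models/payment_method.rb\n" \
         "A\tapp/services/refund.rb\n" \
         "R100\tlib/ledger.rb\tlib/accounting/ledger.rb\n"

p paths_without_history(parse_name_status(sample))
# => ["app/services/refund.rb", "lib/accounting/ledger.rb"]
```

Any non-empty result from paths_without_history is a signal to apply the conservative policy rather than trust the index.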
A freshness window is the simplest staleness control: if a test’s coverage map entry is older than N commits or T hours, include the test unconditionally. Setting this to the time between nightly rebuilds means PR CI is never working with data more than 24 hours stale. Tighter windows require more frequent rebuilds; looser windows accept more risk of a stale entry missing a real dependency.
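A time-based version of the freshness window fits in a few lines. This sketch assumes the collector also stores a recorded-at timestamp per test, which the map format shown earlier does not include, and it covers only the T-hours variant, not the N-commits one. The names stale_tests and with_stale_tests are hypothetical.

```ruby
SECONDS_PER_HOUR = 3600

# Returns the tests whose coverage entry is older than the freshness window.
def stale_tests(recorded_at_by_test, now: Time.now, max_age_hours: 24)
  cutoff = now - (max_age_hours * SECONDS_PER_HOUR)
  recorded_at_by_test.select { |_test, recorded_at| recorded_at < cutoff }.keys
end

# Adds every stale test to the selection so it runs unconditionally.
def with_stale_tests(selected, recorded_at_by_test, **options)
  (selected + stale_tests(recorded_at_by_test, **options)).uniq
end
```

Passing now: explicitly keeps the check deterministic in tests; in CI the default of Time.now is usually what you want.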
Safety Valves and Fallback Logic
The correctness guarantee of selective test execution is asymmetric. Running a test that didn’t need to run wastes compute and time. Skipping a test that should have run allows a regression through. Production implementations lean heavily toward conservative fallback to avoid the second failure mode.
A well-designed system falls back to running the full suite when:
- Any changed file appears on the infrastructure paths list (shared helpers, Gemfile, Gemfile.lock, initializers)
- The coverage map for a changed file is past the freshness window
- The total number of changed files exceeds a configured threshold, suggesting a large refactor that may have dependencies the coverage map underrepresents
- A changed file has no coverage data at all, meaning it was added or significantly restructured since the last full rebuild
- The fraction of tests selected by the index exceeds a high threshold, say 90%, where the compute savings are small enough that a full run is worth the safety margin
This last heuristic is worth noting. If selective analysis says 92% of tests should run, you have paid the selection cost and have very little to show for it. Running 100% removes the tail risk of that 8% containing something the coverage map missed, at a cost of running a few extra percent of tests.
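The fallback rules above can be collapsed into a single decision function. What follows is a sketch, not a description of Stripe’s implementation: the INFRASTRUCTURE_PATHS entries, the 200-file limit, the _spec.rb naming convention, and the plan_run name are all assumptions, and the freshness check is left out for brevity. Only the 90% threshold comes from the text.

```ruby
# Hypothetical list of paths where any change forces the full suite.
INFRASTRUCTURE_PATHS = %w[
  Gemfile Gemfile.lock config/initializers spec/support spec/spec_helper.rb
].freeze

# True when the path is a listed file or lives under a listed directory.
def infrastructure?(path, entries = INFRASTRUCTURE_PATHS)
  entries.any? { |entry| path == entry || path.start_with?("#{entry}/") }
end

# Returns [:full, reason] or [:selective, tests] for a set of changed files.
def plan_run(changed_files, index, total_tests:, max_changed: 200, max_fraction: 0.9)
  return [:full, 'infrastructure path changed'] if changed_files.any? { |f| infrastructure?(f) }
  return [:full, 'too many changed files'] if changed_files.size > max_changed

  # Changed test files are selected directly; they are not keys in the index.
  changed_tests, changed_sources = changed_files.partition { |f| f.end_with?('_spec.rb') }
  unless changed_sources.all? { |f| index.key?(f) }
    return [:full, 'changed file has no coverage data']
  end

  selected = (changed_tests + changed_sources.flat_map { |f| index.fetch(f) }).uniq
  return [:full, 'selection covers most of the suite'] if selected.size > max_fraction * total_tests

  [:selective, selected]
end
```

Returning a reason alongside the decision makes it easy to log why a build fell back, which is the data you need when auditing how often each rule fires.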
What Crystalball Gives You Today
For Ruby teams that want to apply this approach without building distribution infrastructure from scratch, Crystalball packages the core mechanism. It integrates with RSpec via a custom runner, records per-example-group coverage using the Coverage module, and serializes maps to disk. After a full run to build the map, subsequent invocations filter specs by changed files:
# In spec_helper.rb
require 'crystalball'

Crystalball::MapGenerator.start! do |config|
  config.register Crystalball::MapGenerator::CoverageStrategy.new
end
# Then, from the shell:
bundle exec crystalball --diff HEAD~1
Crystalball does not include distributed cache storage, CI pipeline integration, Sorbet-enhanced call graph analysis, or the staleness management infrastructure. Those require engineering in your specific environment. But the gem demonstrates that the fundamental mechanism works for typical Ruby and Rails projects, not just at Stripe’s scale.
For Python teams solving the same problem, pytest-testmon implements a comparable approach using coverage.py with a local SQLite backing store. The core idea is the same; the implementation uses Python’s coverage infrastructure rather than Ruby’s.
The Coverage Module Is the Easy Part
The gap between “this works in a demo” and “this works in production CI for a 50-million-line codebase” is almost entirely operational. Collecting coverage per test takes an afternoon. Building and serializing the inverted index takes another afternoon. Handling staleness, infrastructure file invalidation, file renames, new files, and autoloading edge cases such as Zeitwerk-loaded constants takes months of observation and refinement on a production codebase.
The Stripe write-up is useful precisely because it reveals which parts of the problem are solved by the Coverage module, which are solved by the Sorbet type graph, and which require ongoing operational maintenance. The oneshot_lines flag is the entry point. The rest of the system is built on top of understanding exactly when and how coverage data becomes wrong.