
A Slow Test Suite Is a Coupling Report

Source: lobsters

The lesson that gets the least attention in most monolith scaling advice is also the most diagnostic: test suite speed is a structural metric. It appears in Isaac Lyman’s 113-lesson distillation of scaling a Rails monolith to one million lines, alongside the more prominent themes of module boundary enforcement and database migration discipline. It deserves its own examination because it connects code structure, developer behavior, and long-term architectural health in a way that most other metrics do not.

The Visible Symptom

A test suite running 3 minutes at 10,000 lines of code often runs 40 minutes at 500,000. The standard interpretation is that volume grew, so duration grew. Add more parallel runners, buy faster CI machines, shard the test database. Teams do this, and it helps wall-clock time. The suite still performs 40 minutes of work.

The wrong interpretation is that this is a CI infrastructure problem. The right interpretation is that 40 minutes reflects how much of the system each test implicitly depends on. A test covering a discount calculation should not need a web server, a real database, three background job workers, and a Redis connection. If it does, that is because the discount calculation code is woven through enough layers that none of them can be exercised in isolation.

Parallel runners make a 40-minute suite run in 8 minutes. That is a real improvement. But the test at minute 38 that exercises one API endpoint while loading eight unrelated models, sending a webhook, and writing an audit log is still there. Parallelization distributes the symptoms rather than addressing their source.

The Design Pattern Behind Fast Tests

Gary Bernhardt’s “Boundaries” talk from SCNA 2012 named the structural pattern that produces fast tests: functional core, imperative shell. The core domain logic lives in pure functions or objects that take values and return values, with no I/O, no database calls, no network dependencies. The shell handles the wiring, the persistence, the external service calls. The shell is thin and mostly tested through integration tests at genuine boundaries; the core is dense with behavior and fully testable without any infrastructure.

# Slow: requires database, AR model load, complex fixture state
RSpec.describe OrderController, type: :controller do
  it "applies loyalty discount" do
    user = create(:user, loyalty_tier: :gold)
    product = create(:product, price: 100)
    post :checkout, params: { user_id: user.id, product_id: product.id }
    expect(response.parsed_body["total"]).to eq(80)
  end
end

# Fast: pure function, no infrastructure required
RSpec.describe Pricing::LoyaltyDiscount do
  it "applies 20% discount for gold tier" do
    result = described_class.apply(base_price: 100, tier: :gold)
    expect(result).to eq(80)
  end
end

The second test runs in under a millisecond. Hundreds or thousands of tests like it can run in the time a single controller spec takes. It is also more focused: if pricing logic breaks, it fails. If the database is slow or a factory is misconfigured, it does not fail, because it never touched those systems.
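For reference, a class that would satisfy the fast spec could be as small as the sketch below. The source only shows the spec, so the implementation is an assumption: the gold rate of 20% comes from the example, while the `PERCENT_OFF` table, the 0% default for other tiers, and the rounding are hypothetical choices.

```ruby
# Hypothetical implementation of the class exercised by the fast spec.
module Pricing
  class LoyaltyDiscount
    # Percent off per tier. Only the gold rate (20%) comes from the spec;
    # treating every other tier as 0% is an assumption.
    PERCENT_OFF = { gold: 20 }.freeze

    # Takes plain values and returns a plain value: no I/O, no database.
    def self.apply(base_price:, tier:)
      percent = PERCENT_OFF.fetch(tier, 0)
      # Rational arithmetic avoids float error; round back to an Integer.
      (base_price * (100 - percent) / 100r).round
    end
  end
end

puts Pricing::LoyaltyDiscount.apply(base_price: 100, tier: :gold)
```

Nothing in this class knows about users, products, or requests, which is why its spec needs no factories or database.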

The problem is that most code in a growing monolith does not look like the second example, because writing code that way requires a deliberate architectural decision about where business logic lives. ActiveRecord models that do computation inside callbacks, controllers that inline business rules, service objects that query the database partway through a calculation: all of these make the functional core unreachable without the infrastructure.

The Behavioral Feedback Loop

Test speed is an architectural metric rather than just an inconvenience because of what it does to developer behavior over time.

When tests run in 90 seconds, engineers run them constantly: before committing, after a refactoring, while debugging. The feedback is tight. Refactorings get validated quickly. Small improvements to code structure are cheap to verify.

When tests take 30 minutes, engineers stop running them locally. They push to CI and check something else while they wait. Code review happens before CI feedback arrives. The feedback loop has expanded from a minute or two to the better part of an hour, often longer once context switching is counted. Engineers stop refactoring as frequently, because refactoring without fast verification is risky. They become conservative around code they did not write. They add to the pile rather than improving the structure underneath.

This is the compounding problem. Slow tests reduce refactoring. Reduced refactoring means the coupling that caused the slowness does not get cleaned up. More code gets added on top of existing coupled structures, extending the problem. The test suite gets slower. The cycle tightens.

At 50,000 lines, this is a minor inconvenience. At 500,000 lines, with three years of conservative development accumulated on top of coupled code, extracting the functional core is a multi-month project.

What Tooling Can and Cannot Do

There are genuine tooling improvements worth making. parallel_tests for Ruby distributes test files across multiple processes, each with its own test database. pytest-xdist provides similar distribution for Python. Gradle’s --parallel flag lets decoupled projects in a multi-project Java build execute concurrently. These reduce wall-clock time and are worth deploying.

For Rails specifically, there are tools that target the underlying coupling problem rather than just distributing it. Bullet detects N+1 query patterns and can be enabled in the test environment, surfacing the database calls that accumulate from lazy association loading. ActiveRecord’s strict_loading configuration, available since Rails 6.1, raises an error when an association is loaded lazily rather than explicitly, forcing you to be deliberate about what data a test requires.

# config/application.rb
config.active_record.strict_loading_by_default = true

This breaks tests that relied on lazy loading, which is the point. Each breakage surfaces an implicit database dependency that was previously invisible.

Mutant for Ruby and Stryker for JavaScript and TypeScript provide mutation testing: automated modification of source code to verify that your assertions catch real behavioral changes. A test suite with 95% line coverage that passes when you flip a comparison operator in a pricing function is not testing that function in any meaningful sense. Mutation testing makes this visible by generating hundreds of small code mutations and checking which ones your test suite detects. As a rough rule of thumb, a mutation survival rate above 20% in core business logic is an architectural warning sign.
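The idea can be illustrated by hand, without any mutation testing tool. The sketch below is not Mutant's or Stryker's API; the pricing rule, the mutant, and both tests are hypothetical and written manually to show what "a mutant survives" means.

```ruby
# Hypothetical pricing rule: orders of 10 or more units get the bulk price.
def bulk_price?(quantity)
  quantity >= 10
end

# Hand-written mutant: the comparison is flipped from >= to >.
def bulk_price_mutant?(quantity)
  quantity > 10
end

# Weak test: executes the rule (full line coverage) but never checks the boundary.
def weak_test(rule)
  rule.call(100) == true && rule.call(1) == false
end

# Stronger test: also pins the behavior at the boundary value.
def strong_test(rule)
  weak_test(rule) && rule.call(10) == true
end

original = method(:bulk_price?)
mutant   = method(:bulk_price_mutant?)

# A mutant "survives" a test when the test still passes against it.
puts "weak test on original:   #{weak_test(original) ? 'pass' : 'fail'}"
puts "weak test on mutant:     #{weak_test(mutant) ? 'pass (mutant survives)' : 'fail (mutant killed)'}"
puts "strong test on original: #{strong_test(original) ? 'pass' : 'fail'}"
puts "strong test on mutant:   #{strong_test(mutant) ? 'pass (mutant survives)' : 'fail (mutant killed)'}"
```

Both tests give the rule the same line coverage. Only the one that asserts on the boundary value notices the flipped operator.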

None of these tools fix the fundamental problem, which is business logic embedded in layers that require infrastructure. That fix is design, done incrementally: identify the pure computations inside a model or controller, extract them into a standalone class, write unit tests for that class directly. Repeat. Over time, the core/shell boundary clarifies, and the fast test suite follows from the structure rather than from parallelization.
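One extraction step might look like the following sketch. Every name in it (`OrderTotals`, `FinalizeOrder`, the hash-based store) is hypothetical, and the plain Hash stands in for whatever persistence layer the real code uses. The point is the shape: the shell gathers values, the core computes, the shell persists.

```ruby
# All names are hypothetical; the store is an in-memory stand-in for a database.

# Functional core: values in, value out.
module OrderTotals
  # line_items: array of { unit_price_cents:, quantity: }; tax_percent: Integer
  def self.compute(line_items:, tax_percent:)
    subtotal = line_items.sum { |item| item[:unit_price_cents] * item[:quantity] }
    tax = (subtotal * tax_percent / 100r).round
    { subtotal_cents: subtotal, tax_cents: tax, total_cents: subtotal + tax }
  end
end

# Imperative shell: load values, call the core, persist the result.
class FinalizeOrder
  def initialize(store)
    @store = store
  end

  def call(order_id)
    order = @store.fetch(order_id)
    totals = OrderTotals.compute(line_items: order[:line_items],
                                 tax_percent: order[:tax_percent])
    @store[order_id] = order.merge(totals)
  end
end

store = { 1 => { line_items: [{ unit_price_cents: 1_000, quantity: 2 }], tax_percent: 10 } }
FinalizeOrder.new(store).call(1)
puts store[1][:total_cents]
```

With this split, the arithmetic can be covered by many fast unit tests against `OrderTotals.compute`, while `FinalizeOrder` needs only a small number of integration tests to confirm the wiring.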

The Coverage Metric That Misleads

Lyman’s account of the journey to 1M LOC contains a specific observation worth taking seriously: 100% line coverage is nearly useless at scale. It tells you every line was executed during the test run, not whether the assertions caught real bugs.

A test that loads a full user profile, submits a form, and checks for a 200 response has executed every line in the path. It would pass if the discount calculation returned the wrong value, as long as the response status code was correct. The coverage metric reports 100% and lies about what that means.

The metric that carries actual information is behavioral coverage: whether a change to the code’s behavior causes a test failure. Mutation testing approximates this. Line coverage does not. Teams that optimize for line coverage at scale often end up with large test suites that take a long time to run and catch fewer bugs than the number suggests, because the tests execute code rather than verifying behavior.

Treating Duration as a Hard Constraint

The practical implication of treating test suite duration as an architecture metric is that it should have a budget, enforced in CI, that fails the build when exceeded. This is not a common practice. Most teams track test suite duration informally and address it when it becomes painful enough. By that point, the coupling that caused the slowness has usually been present for years and accumulated additional dependent code.

The threshold has to be realistic for the codebase’s current state, but the principle is that adding a feature should not be allowed to increase test suite duration by 15% without investigation. If a PR causes a meaningful duration increase, that is a signal about coupling in the new code, and reviewing it before the pattern propagates is far cheaper than addressing it later.
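A budget check of this kind does not need much machinery. The sketch below is one possible shape, assuming CI can supply the current suite duration and a baseline from the main branch. The method name, its parameters, and the example numbers are all hypothetical; only the 15% figure comes from the text above.

```ruby
# Hypothetical budget check; the numbers passed in are placeholders.
# Returns a list of reasons the build should fail (empty means within budget).
def check_suite_duration(current_s:, baseline_s:, budget_s:, max_increase: 0.15)
  reasons = []
  if current_s > budget_s
    reasons << "suite took #{current_s}s, budget is #{budget_s}s"
  end
  if baseline_s.positive?
    increase = (current_s - baseline_s).fdiv(baseline_s)
    if increase > max_increase
      reasons << format("duration grew %.0f%% over baseline", increase * 100)
    end
  end
  reasons
end

reasons = check_suite_duration(current_s: 70, baseline_s: 58, budget_s: 60)
reasons.each { |reason| warn reason }
# In CI, a script like this would exit non-zero when reasons is non-empty.
```

Reporting the relative increase separately from the absolute budget means a PR can be flagged for investigation well before the suite actually crosses the hard limit.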

This discipline is easy to establish early and hard to retrofit. At 10,000 lines, setting a 60-second budget for the unit test suite is straightforward. At 1,000,000 lines, enforcing that budget requires the structural work to have already happened.

The Broader Signal

The test suite is a special architectural signal because it is continuously visible. Module boundary violations require a separate static analysis pass to surface. Database migration risks require domain familiarity to identify. Test suite duration shows up on every CI run, in a number every developer sees on every PR.

For teams earlier in the growth curve, interpreting that number as structural feedback, rather than an infrastructure problem to throw compute at, is among the highest-leverage disciplines available. The question worth asking when tests are slow is not how to run them faster, but why any given test requires so much infrastructure to exercise so little behavior. The answer to that question points at the coupling that will limit the codebase’s long-term maintainability at any scale.
