The claim in Avery Pennarun’s recent post is uncomfortable but mathematically grounded: every layer of review you add to a process does not just add overhead; it multiplies total latency. Not 1.5x, not 2x. Something closer to 10x per layer, if your reviewers are busy people.
The math is not complicated, but it is easy to miss because humans think about delays additively. You have a two-hour review here, a four-hour review there; add them up and you have six hours of delay. That intuition is wrong.
Queues compound, not add
The right model for a sequential review process is a series of M/M/1 queues. The key insight from queueing theory is that waiting time in a queue is a function not just of how long the reviewer takes, but of how busy the reviewer is.
For a single M/M/1 queue, the average time an item spends in the system, waiting plus service, is:

W = 1 / (μ - λ)

and the average time spent purely waiting in the queue is:

W_q = ρ / (μ - λ)

where μ is the service rate (reviews per hour), λ is the arrival rate (items per hour), and ρ = λ/μ is utilization. As ρ approaches 1, both quantities grow without bound. A reviewer at 90% utilization introduces nine times more queue waiting than a reviewer at 50% utilization (W_q rises from 1/μ to 9/μ), even if they spend exactly the same time on each review.
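The utilization sensitivity is easy to tabulate. A minimal sketch, assuming a service rate of one review per hour and using the M/M/1 queue-wait formula W_q = ρ / (μ - λ):

```python
# M/M/1 queue wait (time spent waiting, excluding the review itself):
#   W_q = rho / (mu - lam), where rho = lam / mu is utilization
mu = 1.0  # service rate: the reviewer finishes 1 review per hour
for rho in (0.5, 0.7, 0.9, 0.95):
    lam = rho * mu                # arrival rate implied by this utilization
    wq = rho / (mu - lam)        # average hours an item sits in the queue
    print(f"utilization {rho:.0%}: average queue wait = {wq:.1f} hours")
```

At 50% utilization the queue wait equals one service time; at 90% it is nine; at 95% it is nineteen.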
Now chain two reviewers together. The expected wait for the full pipeline is the sum of the waits at each stage, and in the idealized M/M/1 case that is all it is. But real review times are far burstier than the exponential ideal, and once they are, output variance from the first queue creates extra burstiness in the arrivals at the second. This is variability propagation, a concept formalized in Factory Physics for manufacturing systems. In practice, two sequential queues of 90%-utilized reviewers do not produce 2x the wait of one. They produce considerably more.
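One way to see the compounding is the Kingman (VUT) approximation for a G/G/1 stage together with the Factory Physics linking equation for departure variability. A sketch with illustrative numbers; the squared-coefficient-of-variation values here are assumptions, not measurements:

```python
# Two sequential review stages, both 90% utilized, modeled with the
# Kingman (VUT) approximation for a G/G/1 queue:
#   Wq ~ ((ca2 + cs2) / 2) * (rho / (1 - rho)) * (1 / mu)
# and the Factory Physics linking equation for departure variability:
#   cd2 ~ rho^2 * cs2 + (1 - rho^2) * ca2

def stage_wait(ca2, cs2, rho, mu):
    """Approximate mean queue wait at one G/G/1 stage (hours)."""
    return ((ca2 + cs2) / 2) * (rho / (1 - rho)) * (1 / mu)

def departure_scv(ca2, cs2, rho):
    """Squared coefficient of variation of departures, fed to the next stage."""
    return rho ** 2 * cs2 + (1 - rho ** 2) * ca2

mu, rho = 1.0, 0.9        # each reviewer: 1 review/hour, 90% utilized
ca2, cs2 = 1.0, 4.0       # Poisson-like arrivals; bursty review times (SCV = 4)

w1 = stage_wait(ca2, cs2, rho, mu)        # wait at the first reviewer
ca2_out = departure_scv(ca2, cs2, rho)    # burstiness handed downstream
w2 = stage_wait(ca2_out, cs2, rho, mu)    # wait at the second reviewer

print(f"stage 1 wait = {w1:.1f} h, stage 2 wait = {w2:.1f} h")
```

With these numbers the second stage's wait comes out roughly 50% longer than the first's, because the first queue's departures are burstier than its arrivals; two stages cost noticeably more than twice one.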
The 10x figure in Pennarun’s title is not rhetorical. It is the rough empirical result when you model realistic reviewer utilization and account for context-switching overhead on both sides of the review.
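The per-layer multiplier falls out of the same model: time in system divided by hands-on review time is 1/(1 - ρ). A minimal check, assuming one-hour reviews:

```python
# Latency multiplier of one M/M/1 review layer:
#   W / service_time = (1 / (mu - lam)) / (1 / mu) = 1 / (1 - rho)
mu = 1.0                          # one-hour reviews
for rho in (0.5, 0.8, 0.9):
    lam = rho * mu                # arrival rate at this utilization
    W = 1 / (mu - lam)            # total hours in the system
    multiplier = W * mu           # vs. one hands-on review hour
    print(f"utilization {rho:.0%}: layer costs {multiplier:.0f}x its review time")
```

At 90% utilization each layer costs ten times its hands-on review time before any context-switching overhead is counted, which is where the 10x comes from.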
What each layer is actually protecting
Review layers do not appear out of nowhere. Each one was added because something went wrong, or because someone believed something might go wrong. Understanding what each layer protects matters, because the alternatives depend on the failure mode.
Correctness review catches logic errors before they ship. Static analysis and automated testing cover a large fraction of the space, but not all of it. A second set of eyes on a subtle concurrency bug or an intricate algorithm has genuine value.
Security review is different. Most organizations treat it as a gating function, but the actual purpose is threat modeling, not line-by-line correctness. A security engineer reviewing ten pull requests a day for correctness is doing it wrong. The better model is threat modeling at design time and automated scanning at commit time, with security review reserved for architecture changes and threat model updates.
Knowledge dissemination is perhaps the most overloaded purpose assigned to code review. The theory is that reviewers learn the code as they review it, spreading knowledge across the team. In practice, reviewers under time pressure skim changes rather than study them. The SmartBear research on code review found that beyond 400 lines of changed code, defect detection rates drop sharply even as review time increases. Reviewers are pattern-matching against their mental model of the system, not deeply reading new code. Knowledge sharing is better served by architecture decision records, documentation, and structured pairing.
The DORA research confirms the direction
The DevOps Research and Assessment program, whose findings are compiled in Accelerate by Nicole Forsgren, Jez Humble, and Gene Kim, tracked engineering teams across thousands of organizations over several years. Elite performers, defined by deployment frequency, lead time, change failure rate, and recovery time, share a consistent structural property: they review smaller changes with fewer people in shorter cycles, not more.
Elite teams deploy multiple times per day with lead times measured in hours. High-burden review cultures cluster in the medium and low performer categories, not because review is inherently harmful, but because their review processes catch errors late rather than prevent them early.
The research also found that trunk-based development, where engineers commit to the main branch directly or use extremely short-lived feature branches, correlates strongly with elite performance. Smaller, more frequent changes are easier to review correctly, faster to integrate, and produce less merge debt. The review burden goes down because the unit of review shrinks.
The Google case study
Pennarun worked at Google, which is worth keeping in mind when reading his argument. Google requires code review for every change. This sounds like a contradiction of his thesis, but it is not, because the constraint at Google is not the existence of review but the structure of it.
At Google, code review is well-documented and deliberately constrained: one reviewer, one author, one approval, fast turnaround. The reviewer is typically a peer, not a manager or a specialist gatekeeper. The emphasis is on small, focused changes and rapid feedback cycles.
What Google does not have is the multi-stage review process common at large enterprises, where a pull request needs sign-off from a tech lead, then a senior engineer, then a platform team owner, then a security team member, potentially in different timezones with different priorities. That structure produces the 10x slowdown. One-reviewer code review, done with discipline, does not.
What the alternatives actually require
Reducing review layers is not the same as eliminating review. The question is what structure of review delivers error-catching value with less queuing overhead.
Smaller changes are the most powerful lever available. A 50-line diff reviewed by one person in fifteen minutes contains less accumulated risk than a 500-line diff reviewed by four people over a week. The former is mergeable the day it is written. The latter has integration risk, merge conflicts, and cognitive load spread across four reviewers. Shipping in smaller increments requires discipline, but it pays compounding returns on cycle time.
Pair programming converts async review into synchronous review with zero queue time. It is not universally faster on a per-feature basis, but it eliminates review latency entirely. For teams where review bottlenecks dominate cycle time, the tradeoff deserves serious examination.
Pre-merge automation handles a significant portion of what reviewers actually catch. Linters, static analyzers, type checkers, security scanners, test coverage requirements, and fuzzing address the deterministic errors. What remains for human review is genuinely ambiguous judgment, which tends to be a much smaller surface area than the full diff.
Trust and autonomy is the uncomfortable one. Many review chains exist because the organization does not trust the engineer to make good decisions alone. That is sometimes appropriate, in regulated industries or safety-critical systems. But it is often organizational scar tissue from a bad decision made years ago by someone who no longer works there. The review layer stayed; the original justification did not.
The organizational dynamics of accumulation
Review layers accumulate in one direction. When something goes wrong, the response is almost always to add a review step. When things go well, nobody removes a step. Over years, this produces processes where a configuration change requires five approvals from people spread across three timezones.
This connects directly to Goodhart’s Law: when review sign-off becomes the metric for quality, reviewers optimize for completing reviews, not for the quality outcomes the review was meant to produce. Fast, superficial approvals are rational behavior for a reviewer facing a full inbox. The review layer exists on paper; the scrutiny does not.
The 10x figure is provocative but grounded. Queuing theory supports it. The empirical research on delivery performance supports it. The question for any team is not whether the overhead is real, but whether the specific errors caught by each review layer are worth the accumulated latency cost of that layer. For most teams, some layers are worth it and some are not. Almost no teams have done the analysis to know which is which.