The Math Behind Why Your Third Review Stage Is Your Most Expensive

The math of software delivery pipelines is counterintuitive in a specific way. Engineers think of review stages as additive: one review costs two hours, two reviews cost four, three reviews cost six. Avery Pennarun’s recent post argues something darker: each layer of review multiplies your slowdown rather than adding to it, and the multiplier is roughly 10x per layer. That claim sounds like an exaggeration until you look at how queues behave under load.

The Queue Math

Little’s Law, formalized by John Little in 1961, states that the average number of items in a stable system equals the arrival rate multiplied by the average time each item spends in the system: L = λW. This is useful for steady-state analysis, but on its own it does not say how W changes as utilization increases.

The M/M/1 queue model does. For a single-server queue with random arrivals and exponentially distributed service times, the expected wait time before service begins is:

W_q = ρ / (μ(1 - ρ))

where ρ is server utilization (arrival rate divided by service rate) and μ is the service rate, so the mean service time is 1/μ and the wait measured in service times is ρ / (1 - ρ). At 50% utilization, wait time equals service time. At 80% utilization, wait time is four times service time. At 90%, it is nine times. The curve is hyperbolic, not linear, and it approaches infinity as utilization approaches 100%.
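The multipliers quoted above can be checked with a few lines of Python. This is a minimal sketch of the M/M/1 formula only; the function name wait_multiplier is an illustrative choice, not something from the original post.

```python
def wait_multiplier(rho: float) -> float:
    """Expected M/M/1 queue wait, in multiples of the mean service time.

    W_q = rho / (mu * (1 - rho)) and the mean service time is 1 / mu,
    so W_q divided by the service time is rho / (1 - rho).
    """
    if not 0 <= rho < 1:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return rho / (1 - rho)


for rho in (0.5, 0.8, 0.9, 0.95):
    print(f"utilization {rho:.0%}: wait is {wait_multiplier(rho):.1f}x service time")
```

Going from 90% to 95% utilization roughly doubles the wait again, which is the steep end of the curve that the rest of the argument leans on.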

Apply this to a senior engineer who splits time between writing code and reviewing it. If they spend 80% of their time producing and only 20% reviewing, an incoming review request finds them busy with something else most of the time, which puts the review queue in the regime where wait times grow rapidly. A 30-minute review task generates hours of queue wait, not minutes, because the reviewer is rarely available immediately.

Add a second required review stage and you are not adding another 30-minute wait. You are adding a second queue with its own utilization profile and its own hyperbolic wait curve. The stages do not compose linearly; they compound.
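To make the gap between the additive intuition and the queue-aware estimate concrete, here is a rough sketch. It assumes each stage behaves as an independent M/M/1 queue, so the expected time in a stage (wait plus the review itself) is the service time divided by (1 - ρ). The 80% and 90% utilization figures and the function names are hypothetical, chosen only for illustration.

```python
def stage_time(service_hours: float, rho: float) -> float:
    """Expected hours in one M/M/1 stage: queue wait plus the review itself."""
    if not 0 <= rho < 1:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return service_hours / (1 - rho)


def pipeline_time(stages: list[tuple[float, float]]) -> float:
    """Sum of expected stage times for (service_hours, utilization) pairs."""
    return sum(stage_time(hours, rho) for hours, rho in stages)


# Hypothetical pipeline: two 30-minute reviews, reviewers at 80% and 90% utilization.
stages = [(0.5, 0.8), (0.5, 0.9)]
naive_hours = sum(hours for hours, _ in stages)
queued_hours = pipeline_time(stages)

print(f"additive estimate:    {naive_hours:.1f} h")
print(f"queue-aware estimate: {queued_hours:.1f} h")
```

Under these assumptions the second stage contributes five hours rather than thirty minutes. The hours are reviewer working hours, so the calendar delay would be longer still.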

Rework Cycles Make It Worse

The M/M/1 model covers a single queue and assumes each item passes through cleanly. Code review is worse because failed reviews trigger rework cycles. A PR that comes back with change requests re-enters the queue, meaning it incurs the queue wait multiple times.

Consider a three-stage pipeline: peer review, tech lead sign-off, security review. If each stage has a 30% probability of requesting changes, the probability a PR makes it through all three without a return trip is roughly:

0.7 × 0.7 × 0.7 ≈ 0.34

About two-thirds of PRs cycle back through at least one stage. Each cycle means re-entering a queue, plus context-switching costs for the author, who has likely moved on to other work in the meantime. Reconstructing the mental state needed to address review feedback is not free, and it scales poorly with the time elapsed since the original PR was written.

Multiply several rework cycles through several queues and the 10x figure starts looking conservative for pipelines operating at high utilization.
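The arithmetic can be sketched as follows. Several things here are assumptions rather than claims from the source: a rejected PR rejoins the queue of the stage that rejected it, every attempt independently has the same 30% chance of a change request, each review takes 30 minutes, and the utilization figure already includes the rework traffic. All function names are illustrative.

```python
def p_clean_pass(p_rework: float, n_stages: int) -> float:
    """Probability a PR clears every stage with no change request."""
    return (1 - p_rework) ** n_stages


def expected_queue_entries(p_rework: float, n_stages: int) -> float:
    """Expected number of times a PR joins a review queue.

    Attempts at a stage are geometric with success probability
    (1 - p_rework), so the mean is 1 / (1 - p_rework) per stage.
    """
    return n_stages / (1 - p_rework)


def expected_pipeline_hours(
    p_rework: float, n_stages: int, review_hours: float, rho: float
) -> float:
    """Expected hours, charging review_hours / (1 - rho) per queue entry (M/M/1)."""
    return expected_queue_entries(p_rework, n_stages) * review_hours / (1 - rho)


P_REWORK, N_STAGES = 0.3, 3
REVIEW_HOURS = 0.5  # assumed 30-minute review per attempt

print(f"clean pass probability: {p_clean_pass(P_REWORK, N_STAGES):.2f}")
print(f"expected queue entries: {expected_queue_entries(P_REWORK, N_STAGES):.2f}")
for rho in (0.8, 0.9):
    hours = expected_pipeline_hours(P_REWORK, N_STAGES, REVIEW_HOURS, rho)
    ratio = hours / (N_STAGES * REVIEW_HOURS)
    print(f"utilization {rho:.0%}: {hours:.1f} h, {ratio:.1f}x the additive estimate")
```

With these inputs a PR joins a queue a little over four times on average instead of three, and the total comes out at about 7x the additive estimate at 80% utilization and about 14x at 90%.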

Where the Layers Come From

Where these review layers come from matters, because removing them requires understanding how they accumulated.

The mechanism is almost always the same: an incident occurs, a post-mortem identifies a gap in review coverage, and a new required sign-off is added. Nobody removes an old requirement. Review requirements accumulate because adding one is a locally sensible response to failure, while removing one requires admitting that the associated risk was acceptable all along, which is politically difficult.

This creates a ratchet structure. Organizations trend toward more review gates over time regardless of whether each gate delivers proportional value, because the incentives are asymmetric: the person who adds a gate gets credit for caution; the person who removes one takes on personal liability for any subsequent incident in that area.

The DORA State of DevOps research, which Forsgren, Humble, and Kim synthesized in Accelerate, has been tracking this dynamic for years. Elite-performing teams consistently have both lower change failure rates and higher deployment frequency. The two do not trade off the way the review-as-safety intuition implies. More gates tend to produce slower delivery with roughly the same defect rate, and the failure mode shifts from bugs reaching production to changes never shipping at all.

The Bystander Problem

There is a second mechanism that makes multi-reviewer configurations worse than the queue math alone predicts. When a PR requires approval from multiple reviewers, each reviewer’s sense of responsibility diffuses across the group.

This is a well-studied phenomenon. Darley and Latané’s 1968 research on diffusion of responsibility showed that the more bystanders observe an emergency, the less likely any individual is to respond, because each assumes someone else will act. The same dynamic applies to PR queues: a review request sent to three people will often wait longer than one sent to a single assigned reviewer, because each of the three assumes the others will handle it.

Google’s engineering practices documentation addresses this directly by recommending a single primary reviewer in most cases, with the expectation that the review begins within one business day of submission. That target becomes structurally harder to hit as reviewer counts increase, regardless of how much combined reviewer availability exists in the group.

Throughput Hides the Problem

Engineering organizations typically measure velocity in throughput terms: story points per sprint, features shipped per quarter, PRs merged per week. These metrics are insensitive to the latency component that review layers introduce.

A team shipping 20 features per month with a mean PR cycle time of three days is not equivalent to a team shipping 20 features per month with a mean cycle time of one day. The second team can respond to production incidents, experimental results, and customer feedback roughly three times faster. The compounding advantage of lower cycle time over a year is substantial and entirely invisible to throughput metrics.
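Little’s Law (L = λW) shows where the difference between these two teams actually lands. The sketch below assumes a 30-day month and one PR per feature, neither of which is stated in the text; the function name is illustrative.

```python
def work_in_flight(throughput_per_day: float, cycle_time_days: float) -> float:
    """Little's Law: average items in the system L = lambda * W."""
    return throughput_per_day * cycle_time_days


DAYS_PER_MONTH = 30                 # assumption for illustration
throughput = 20 / DAYS_PER_MONTH    # features per day, identical for both teams

for cycle_time in (3, 1):
    in_flight = work_in_flight(throughput, cycle_time)
    print(f"cycle time {cycle_time} d: throughput {throughput:.2f}/day, "
          f"{in_flight:.2f} features in flight")
```

The throughput column is identical for both teams. The cycle-time difference shows up only as work sitting in flight, which is why a dashboard built on throughput cannot see it.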

This is why the Theory of Constraints, as Goldratt described it, applies cleanly here. The constraint in a heavily reviewed pipeline is rarely the writing of code; it is the queue wait at each stage. Improving the non-bottleneck, whether that is coding speed, tooling, or automation, has no effect on cycle time when queue wait is the actual bottleneck. Only removing or reducing the bottleneck stage itself produces meaningful improvement.

What Review Is Actually For

The implicit model behind heavy review processes is that reviewers catch bugs before they reach production. The empirical evidence is more complicated. Studies of code review effectiveness, including research conducted at Microsoft, find that review primarily catches style issues, knowledge-sharing gaps, and straightforward logic errors. It rarely prevents the deep concurrency bugs, data races, or subtle security vulnerabilities that cause significant incidents. Those failure modes require fuzzing, formal analysis, runtime monitoring, or staged rollouts, none of which are code review.

This does not mean review is worthless. It means the marginal value of a second or third review stage is low relative to the marginal cost, because cost grows hyperbolically with utilization while the benefit of additional coverage is at best roughly linear in the number of eyes on a diff.

The practical conclusion from the queuing math is that removing a review layer is worth far more than the time saved at that stage suggests. If your pipeline is operating at high utilization, reducing the number of review stages shifts reviewers’ time back to the remaining stages, lowering their utilization, and potentially moving them from the steep part of the hyperbolic curve to the flat part. That nonlinear improvement in queue wait time is why each review stage costs more than the one before it, and why eliminating one stage produces returns that look disproportionate to the apparent cost saved.
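A rough sketch of that effect, under loudly simplified assumptions: one shared pool of reviewers staffs every stage, the pool is modeled as a single M/M/1 server, and the workload numbers (12 PRs per day, 30-minute reviews, 20 reviewer-hours per day) are hypothetical values picked so that three stages land at 90% utilization.

```python
def pool_utilization(
    prs_per_day: float, n_stages: int, review_hours: float, pool_hours_per_day: float
) -> float:
    """Fraction of available reviewer time consumed by required reviews."""
    return prs_per_day * n_stages * review_hours / pool_hours_per_day


def queue_wait_multiplier(rho: float) -> float:
    """M/M/1 queue wait in multiples of the mean service time: rho / (1 - rho)."""
    if not 0 <= rho < 1:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return rho / (1 - rho)


# Hypothetical workload, not taken from the source.
PRS_PER_DAY, REVIEW_HOURS, POOL_HOURS = 12, 0.5, 20

for n_stages in (3, 2):
    rho = pool_utilization(PRS_PER_DAY, n_stages, REVIEW_HOURS, POOL_HOURS)
    print(f"{n_stages} stages: utilization {rho:.0%}, "
          f"wait {queue_wait_multiplier(rho):.1f}x service time per stage")
```

In this setup, dropping from three stages to two cuts review load by a third but cuts the per-stage wait from nine service times to one and a half. A multi-server model would change the exact numbers, not the shape of the curve.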

The 10x claim is not a rhetorical number. It is a rough description of what the M/M/1 curve does to queue times in the utilization range where most review-heavy pipelines operate. The teams that understand this tend to arrive at the same set of practices: single primary reviewer, fast turnaround target, automated checks handling the broad coverage that humans handle poorly anyway, and an explicit commitment to measuring cycle time rather than just throughput. The math does not leave much room for a different conclusion.