Code Review Is a Queue, and Queues Have Physics

The claim in Avery Pennarun’s post is deceptively simple: each layer of review doesn’t slow you down by a fixed cost, it multiplies the existing cost by a factor. Run enough layers and the math becomes brutal. This isn’t organizational psychology or a management opinion. It’s queuing theory, and queuing theory is as indifferent to your process philosophy as gravity.

The math reviewers don’t talk about

A single reviewer is a service node. They receive requests, process them, and release them downstream. This maps directly onto an M/M/1 queue from operations research: arrivals follow a Poisson process, service times are exponentially distributed, one server. The average time a job spends waiting in that queue is:

W_q = ρ / (μ(1 - ρ))

Where ρ is utilization (the fraction of time the reviewer is busy) and μ is the service rate, so 1/μ is the average time one review takes. Measured in units of that service time, the wait is ρ / (1 - ρ). At 50% utilization, your wait equals the service time. At 80%, your wait is 4x the service time. At 90%, it’s 9x. At 95%, it’s 19x. These aren’t worst-case scenarios; they’re the expected values under steady-state load.
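The multipliers above can be reproduced with a few lines of Python. This is a minimal sketch of the M/M/1 formula only; the function name `wait_multiplier` is made up for illustration.

```python
def wait_multiplier(rho: float) -> float:
    """Expected M/M/1 queue wait, in units of mean service time: rho / (1 - rho)."""
    if not 0.0 <= rho < 1.0:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return rho / (1.0 - rho)


if __name__ == "__main__":
    for rho in (0.50, 0.80, 0.90, 0.95):
        # Multiply by the mean review time (1/mu) to get wall-clock wait.
        print(f"{rho:.0%} utilization: wait is about {wait_multiplier(rho):.0f}x the service time")
```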

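Now chain two reviewers together, each at 80% utilization. If the stages were independent, the expected waits would simply add: 4x + 4x = 8x the service time for two reviewers, 12x for three, 16x for four. That is the floor, not the estimate. It assumes each new layer leaves everyone’s utilization untouched, and a new required approval rarely does: it lands on people who are already busy. Because the wait grows as ρ / (1 - ρ), moving a reviewer from 80% to 90% utilization more than doubles their queue (4x to 9x), and moving them to 95% doubles it again (19x). The compounding comes from utilization, not from the stage count alone, and it is why the “10x per layer” framing stops sounding like hyperbole when reviewers are running anywhere near full capacity, which engineers at most organizations are.

This is why adding a required approver to a PR template has a nonlinear effect on lead time. You’re not adding a fixed cost; you’re adding a queue, and the cost of a queue depends on how loaded the person behind it already is.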
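To see the nonlinearity concretely, here is a rough sketch of what happens when a new required approval routes a little extra work to an approver who is already busy. The rates (5 reviews per day of capacity, 4 per day of existing load) are invented for illustration, and `mean_queue_wait` is a hypothetical helper that applies the same M/M/1 formula.

```python
import math


def mean_queue_wait(arrival_rate: float, service_rate: float) -> float:
    """Expected M/M/1 wait in queue: rho / (mu * (1 - rho)), in the time unit of the rates."""
    rho = arrival_rate / service_rate
    if rho >= 1.0:
        return math.inf  # arrivals outpace service; the queue grows without bound
    return rho / (service_rate * (1.0 - rho))


if __name__ == "__main__":
    service_rate = 5.0   # assumed: approver can finish 5 reviews per day
    baseline_load = 4.0  # assumed: 4 reviews per day already arrive (80% utilization)
    for extra in (0.0, 0.25, 0.5, 0.75):
        load = baseline_load + extra
        wait = mean_queue_wait(load, service_rate)
        print(f"+{extra:.2f} reviews/day: utilization {load / service_rate:.0%}, "
              f"expected wait {wait:.1f} days")
```

Under these assumed numbers, half an extra review per day moves the approver from 80% to 90% utilization and the expected wait from 0.8 days to 1.8 days: a 12.5% increase in load more than doubles the delay.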

Where the empirical research lands

The DORA State of DevOps research has been tracking software delivery performance since 2014. One of its most durable findings is that elite performers, those who deploy multiple times per day with low change failure rates, share a structural property: they have dramatically shorter review cycles than low performers. Not faster reviewers. Shorter review pipelines.

The four key DORA metrics are lead time for changes, deployment frequency, change failure rate, and time to restore service. The research consistently shows that lead time is a stronger predictor of organizational health than failure rate. Teams that optimize for keeping lead time short tend to develop low failure rates as a byproduct. Teams that optimize for approval gates tend to develop both high lead times and high failure rates, because the delays force large batches, large batches create integration complexity, and integration complexity creates failures.

This is the result that organizations find hardest to accept. The intuition behind review gates is that more checking produces fewer errors. The data says the opposite holds at the system level. You get more errors, not fewer, when code accumulates behind gates.

Why gates accumulate anyway

Review requirements don’t appear randomly. Each new gate is a response to a specific past failure. Someone pushed broken code to production; now there’s a required CI check. A security bug merged without review; now there’s mandatory security sign-off for anything touching authentication. An untested service call caused a cascade; now architecture review is required for any new service dependency.

Each gate is individually rational. The incident that prompted it was real. But organizations have no mechanism for removing gates once added, because removal requires trusting that the original problem won’t recur, and the people who could authorize removal are often the same people who were burned by the original failure. The asymmetry of outcomes ensures accumulation: adding a gate has low immediate cost and visible risk reduction, while removing a gate has low immediate benefit and visible risk.

This is what makes review requirements a one-way ratchet. The incentive structure strongly favors addition and strongly resists removal, independent of whether any given gate is still delivering value proportional to its queuing cost.

The variability problem

The pure queuing math understates the practical impact because it assumes steady-state conditions. Reviewer availability isn’t steady-state. When a required approver is out sick, or deep in a critical incident, or context-switched into a quarterly planning cycle, your PR waits. The utilization ρ fluctuates unpredictably from the perspective of the person waiting.

This means the median wait time significantly understates what developers actually experience. The median is fine; it’s the 90th percentile that erodes velocity. A developer who submits ten PRs in a week and hits one that waits three days for reviewer availability has lost three of five working days on everything that depends on the stuck change, despite working at full capacity the entire time.
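The gap between median and tail is visible even in the idealized model, before any reviewer absences are layered on. For a first-come, first-served M/M/1 queue the waiting-time distribution has the closed form P(W > t) = ρ · exp(-μ(1 - ρ)t), which can be inverted to get percentiles. The sketch below does that; `wait_percentile` is a name chosen for illustration, and μ defaults to 1 so results are in units of service time.

```python
import math


def wait_percentile(p: float, rho: float, mu: float = 1.0) -> float:
    """p-th quantile of M/M/1 FCFS queue wait, from P(W > t) = rho * exp(-mu * (1 - rho) * t)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if not 0.0 < rho < 1.0:
        raise ValueError("utilization must be strictly between 0 and 1")
    if p <= 1.0 - rho:
        return 0.0  # this fraction of requests finds the reviewer idle and waits nothing
    return math.log(rho / (1.0 - p)) / (mu * (1.0 - rho))


if __name__ == "__main__":
    rho = 0.8
    mean = rho / (1.0 - rho)
    print(f"mean wait:   {mean:.1f}x service time")
    for p in (0.50, 0.90, 0.99):
        print(f"p{round(p * 100)} wait:    {wait_percentile(p, rho):.1f}x service time")
```

At 80% utilization the median comes out well below the 4x mean, while the 90th percentile is more than double it. Real reviewer availability is burstier than this model assumes, so treat these tails as optimistic.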

Variability creates secondary costs. A PR waiting for review accumulates merge conflicts with main. A PR finally reviewed after two days may require significant rework; by then the author has lost context on why they made the choices they made. A stack of dependent changes has its total wait time multiplied by the chain length. These are real costs that don’t show up in anyone’s utilization numbers but dominate the lived experience of engineers trying to ship.

What the alternatives actually look like

The research-supported alternative isn’t no review. It’s shifting review to happen continuously rather than at a gate.

Pair programming is the oldest version of this. Review happens in real time, at the cost of some individual throughput but with queue depth reduced to zero. Code reaching main has already been reviewed. There’s no approval wait, no context loss, no merge conflict accumulation.

Feature flags decouple deployment from release. Code merges to main continuously, behind a flag, without needing to be production-ready in the sense of being user-visible. This removes the psychological weight from each review: the reviewer is no longer the last line of defense, because the change isn’t live. Review becomes faster and more focused when reviewers aren’t implicitly asked to bear full responsibility for correctness.
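A minimal sketch of the pattern, assuming flags are read from environment variables. The `FLAG_` prefix, the `new_checkout` flag, and both pricing functions are hypothetical; real teams typically use a flag service rather than raw environment variables.

```python
import os


def flag_enabled(name: str, default: bool = False) -> bool:
    """Read a flag from an env var such as FLAG_NEW_CHECKOUT=1 (hypothetical convention)."""
    raw = os.environ.get(f"FLAG_{name.upper()}")
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "on", "yes"}


def legacy_total(prices_cents: list[int]) -> int:
    """Existing behavior: flat 500-cent shipping on every order."""
    return sum(prices_cents) + 500


def new_total(prices_cents: list[int]) -> int:
    """Merged but unreleased behavior: shipping waived at 5000 cents and above."""
    subtotal = sum(prices_cents)
    return subtotal if subtotal >= 5000 else subtotal + 500


def checkout_total(prices_cents: list[int]) -> int:
    # The new path is deployed with the rest of main but stays dark until the flag flips.
    if flag_enabled("new_checkout"):
        return new_total(prices_cents)
    return legacy_total(prices_cents)
```

With the flag off by default, the new code can merge and deploy without changing what users see, which is what lets the reviewer focus on approach rather than on being the final safety check.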

Automated test coverage addresses the “what if this breaks something” fear that drives review accumulation. Teams with high automated test coverage still review code, but their reviews shift from “is this correct” to “is this the right approach,” which is both faster and more valuable. Correctness checking by humans at review time is slow and unreliable compared to correctness checking by a test suite that runs in three minutes.

Smaller change sizes reduce review cost per review while reducing the blast radius of any individual change. The lean manufacturing insight applied to software: smaller batches flow faster. A 50-line change takes 10 minutes to review with high confidence; a 500-line change takes an hour with lower confidence. Smaller changes with fast review cycles outperform large changes with careful review on every metric DORA tracks.

The trust substrate

Underneath the queuing math is an organizational trust problem. Review gates accumulate where trust is absent. When a team trusts each other’s judgment, code moves fast. When trust is low, gates substitute for trust.

This creates a feedback loop that runs in the wrong direction. Teams with low trust get buried in process, which reduces throughput, which increases batch sizes, which increases the risk and frequency of incidents, which further erodes trust, which prompts more gates. The process meant to address a trust deficit tends to deepen it over time.

The teams that escape this are the ones that invest in making trust warranted rather than substituting process for it. Code style is enforced by formatters. Correctness is enforced by tests. Architectural conformance is enforced by linters and dependency rules. What’s left for human review is judgment about whether this is the right change to be making at all, and about the design decisions that no tool can evaluate. That’s a legitimate and valuable use of review bandwidth. Spending it on function naming conventions is not.
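As one illustration of moving a review concern into tooling, here is a rough sketch of a dependency rule that could run in CI. The rule itself (nothing may import from a `db` package directly) and the function name are hypothetical; only Python’s standard `ast` module is used.

```python
import ast

# Hypothetical architectural rule: this layer must not import the "db" package directly.
FORBIDDEN_PREFIXES = ("db",)


def forbidden_imports(source: str, forbidden: tuple[str, ...] = FORBIDDEN_PREFIXES) -> list[str]:
    """Return the imported module names in `source` that violate the rule."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # module is None for "from . import x"
        else:
            continue
        for name in names:
            # Match the package itself or any submodule, but not lookalikes such as "dbutils".
            if any(name == prefix or name.startswith(prefix + ".") for prefix in forbidden):
                found.append(name)
    return found
```

A check like this fails the build before a human ever opens the PR, so the reviewer never has to spend attention on it.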

The 10x claim may feel like rhetorical rounding. The queuing math says it isn’t: a single reviewer at 90% utilization already makes you wait 9x the service time, and at 95% it’s 19x.