Avery Pennarun’s recent post argues that each layer of review in a software process makes you roughly ten times slower. The claim sounds provocative, but the underlying mechanism is real and well-studied. The interesting part isn’t whether the number is precisely ten. The interesting part is that the relationship is multiplicative, not additive, and most engineering culture treats it as additive.
The Queue Nobody Draws on the Whiteboard
When teams diagram their development process, they usually draw boxes and arrows: write code, open PR, get review, merge, deploy. The arrows look lightweight. What the diagram omits is that each arrow is actually a queue, and queues have wait times that swamp the processing time.
Little’s Law, from queueing theory, states that the average number of items in a stable system equals the average arrival rate multiplied by the average time an item spends in the system: L = λW. Rearranging, W = L/λ: the average time in the system is the average queue length divided by the throughput. When you add a review stage, you don’t just add the reviewer’s reading time. You add the time your change spends waiting for a reviewer to become available, plus any back-and-forth cycles, plus the time the reviewer spends context-switching into your codebase.
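As a back-of-the-envelope sketch (the queue length and arrival rate below are made-up illustrative numbers, not measurements from any real team), Little’s Law turns two observable quantities into an expected wait:

```python
def avg_time_in_system(avg_queue_length: float, arrival_rate: float) -> float:
    """Little's Law rearranged: W = L / lambda.

    avg_queue_length (L): average number of changes waiting in or being
    processed by the stage.
    arrival_rate (lambda): average arrivals per day, which equals
    throughput when the system is stable.
    """
    return avg_queue_length / arrival_rate

# Hypothetical review stage: 6 PRs sitting in the queue on average,
# 4 new PRs arriving per day.
wait = avg_time_in_system(avg_queue_length=6, arrival_rate=4)
print(wait)  # 1.5 days per change in this stage, regardless of reading time
```

The point of the sketch: the wait is determined by queue length and throughput, not by how long the reviewer actually spends reading.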
A 2019 Microsoft Research study measuring actual code review at Microsoft found that the median time from PR creation to first response was over six hours, and the median time to merge was over a day and a half. That’s for a single review layer. Add a second required review, or a separate security sign-off, and the delays compound: each layer has its own largely independent queue, revision cycles can send a change back through stages it already cleared, and nothing ships until every layer signs off.
The 10x framing is a useful heuristic, not a precise constant. The actual multiplier depends on reviewer availability, the parallelism of your queues, and how often revisions cycle back to earlier stages. But the direction is right: each layer multiplies rather than adds.
Why Organizations Keep Adding Layers
This is the organizational puzzle. If review layers slow things down exponentially, why do processes at most companies only ever add more of them?
The answer is that review layers are added in response to specific, visible failures. A security vulnerability ships; a security review layer gets added. A poorly-designed API gets merged; an architecture review gets added. Each addition is locally rational. The organization saw a real problem and responded with a checkpoint to catch it next time.
What doesn’t get measured is the compounding cost. The security incident is visible and logged. The features that took twice as long to ship because of the new review layer are invisible; they just look like normal delivery time. There’s no incident report for “we were 30% slower this quarter because of process overhead.” The cost is diffuse and counterfactual.
This asymmetry between visible failures and invisible costs means approval chains are almost always ratchets. They click forward with each incident but rarely click back.
The Trust Inversion
Many review layers exist because of a trust deficit. A mandatory security review on every PR implicitly says that the engineers writing the code cannot be trusted to think about security themselves. An architecture review gate says that individual contributors cannot be trusted to make structural decisions.
There’s a coherent argument for this in certain contexts. Highly regulated industries, safety-critical systems, and organizations where the consequences of mistakes are severe enough to justify the overhead all have reasonable cases for mandatory checkpoints. The problem is when the same patterns get applied uniformly across all work regardless of risk level.
Google’s internal code review culture is often cited as evidence that review doesn’t necessarily destroy velocity. And there’s something to that. But what makes Google’s code review function at scale is heavy investment in tooling. Critique, their internal review tool, is deeply integrated with automated analysis, test results, and change history. The review itself is informed by systems that catch the mechanical stuff, so human reviewers focus on higher-order concerns. The review layer is there, but it’s been engineered to minimize its queue time and reduce revision cycles.
Most organizations adopt code review without that investment. They get the overhead without the tooling that makes the overhead manageable.
What Actually Catches Bugs
Code review does catch bugs. Studies consistently show it catches defects that automated testing misses, particularly logic errors and design-level issues. A study by Capers Jones found that formal inspection processes remove roughly 60% of defects, which is higher than most testing phases alone.
But there’s a difference between “review catches bugs” and “every change needs multiple layers of human review before shipping.” The question is whether the review is targeted at the kinds of defects that actually occur and that other mechanisms can’t catch.
Pair programming, practiced seriously, is an alternative that front-loads the review cost. Rather than a change sitting in a queue for 24 hours, two people review it as it’s written. The total review effort is similar, but the feedback cycle is instantaneous and there’s no queue. Kent Beck’s original writing on Extreme Programming was partly a response to this problem, though few shops practice pairing as he intended.
Automated static analysis, type systems, fuzz testing, and property-based testing all reduce the defect surface that human review needs to cover. A codebase with strong type discipline and comprehensive automated tests is a different review target than one without. Investing in those tools shifts the review calculus before you ever discuss how many human layers to require.
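To illustrate the property-based idea with a toy example (the `slugify` function and its properties are hypothetical; real frameworks such as Hypothesis generate and shrink inputs far more systematically than this stdlib-only loop):

```python
import random

def slugify(s: str) -> str:
    """Toy function under test: trim, lowercase, spaces become hyphens."""
    return s.strip().lower().replace(" ", "-")

# Instead of a handful of hand-picked test cases, assert properties that
# must hold for *any* input, then throw many random inputs at them.
random.seed(0)
for _ in range(1000):
    s = "".join(random.choice("ab C") for _ in range(random.randint(0, 20)))
    out = slugify(s)
    assert out == out.lower()      # output is always lowercase
    assert " " not in out          # no spaces survive
    assert slugify(out) == out     # applying it twice changes nothing
```

Checks like these shrink the space of mechanical defects a human reviewer has to hunt for, which is exactly the shift in review calculus described above.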
The Threshold Question
The practical engineering question isn’t “should we have review” but “at what granularity and for what classes of changes.”
Chrome’s Gerrit-based workflow differentiates between owners and reviewers, and change complexity determines how many are required. The Linux kernel’s maintainer hierarchy routes patches based on subsystem and risk. Both of these are attempts to match review depth to actual risk rather than applying uniform policy.
Feature flags decouple code review from deployment risk. If a half-baked feature is hidden behind a flag that’s off for all users, the blast radius of merging it is near zero. This lets the review cycle focus on code quality and correctness in a lower-stakes environment, and reduces the pressure to make review thorough enough to gate deployment.
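A minimal sketch of the pattern (the flag store and checkout functions are hypothetical; production systems back flags with a config service so they can flip without a deploy):

```python
# Hypothetical in-process flag store. In production this would be read
# from a config service so flags can change without redeploying.
FLAGS = {"new-checkout-flow": False}  # merged and deployed, off for everyone

def flag_enabled(name: str) -> bool:
    return FLAGS.get(name, False)

def legacy_checkout(cart: list) -> str:
    return "legacy"

def new_checkout(cart: list) -> str:
    return "new"  # half-baked code can ship dark behind the flag

def checkout(cart: list) -> str:
    # The unfinished path is merged and deployed but unreachable for
    # users until the flag flips, so merging carries near-zero risk.
    if flag_enabled("new-checkout-flow"):
        return new_checkout(cart)
    return legacy_checkout(cart)

print(checkout(["book"]))  # "legacy" while the flag is off
```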
Threshold-based deployment systems like canary releases and percentage rollouts serve the same function at the deployment layer. They let you ship to 1% of users before 100%, which catches integration problems that review doesn’t, without requiring the review layer to do impossible things.
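The bucketing behind a percentage rollout fits in a few lines (the function name and hashing scheme here are illustrative, not any particular system’s API):

```python
import hashlib

def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Deterministically assign each user a bucket in [0, 100).

    Hashing user and feature together keeps buckets stable, so widening
    a rollout from 1% to 100% only ever adds users, never removes them.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent

# At percent=1, roughly one user in a hundred sees the feature; the rest
# get the old path while you watch error rates and metrics.
```

Because the bucket depends on the feature name as well as the user, different features get independent 1% populations rather than burdening the same unlucky users every time.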
The Bureaucracy Attractor
Organizations under pressure tend toward more process, not less. When something goes wrong, adding a review layer is easy, visible, and defensible. Removing one requires someone to own the risk that whatever the layer was catching might slip through.
This is partly why Pennarun’s framing matters. Most arguments against heavyweight review talk about developer happiness or velocity in vague terms. The multiplicative math makes the cost legible. If you have four independent review stages and each one doubles the cycle time (optimistically), a change that would take two days without review takes 32 days with all four. That number is recoverable from actual ticket data. You can measure it.
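The arithmetic from that example is short enough to check directly:

```python
base_days = 2    # cycle time with no review layers
multiplier = 2   # optimistic: each stage merely doubles cycle time
stages = 4       # independent review stages

with_review = base_days * multiplier ** stages
print(with_review)  # 32 days

# Additive intuition expects four small delays tacked onto 2 days;
# the multiplicative reality is 16x the base.
```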
Measuring it is what creates the leverage to have a real conversation about which layers are earning their cost. Some will be. Security reviews that catch critical vulnerabilities in sensitive systems earn their overhead. Architecture reviews that prevent five years of technical debt earn theirs. But the review layers that mostly generate rubber-stamp approvals, that get bypassed in emergencies anyway, that apply the same overhead to a CSS change as to a payment system rewrite, those have a legible cost and a negligible benefit.
The math doesn’t say reviews are bad. It says that each one needs to justify its multiplier, and most organizations have never done that accounting.