
Review Doesn't Scale: The Compounding Math Behind Slow-Shipping Teams


Avery Pennarun has been writing about engineering process and organizational dysfunction for years, and his posts tend to age well. His latest, “Every layer of review makes you 10x slower,” is making the rounds on Hacker News with good reason. The core claim is that review layers don’t add latency linearly; they multiply it. Each gate compounds the previous one, and by the time you have four approval stages on a change, you’re not four times slower, you’re something like a thousand times slower.

It’s a provocative framing, but the underlying math is sound enough to take seriously.

Why Latency Compounds

The intuition most teams have is additive: one reviewer adds one day, two reviewers add two days, three reviewers add three days. This is wrong in almost every real-world case.

Review queues don’t drain in parallel by default. Each stage typically starts after the previous one completes. If your security team won’t look at a PR until the architecture review is signed off, and the architecture review waits until the initial code review is done, you have a serial pipeline. The latency of a serial pipeline is the sum of each stage’s wait time, with each stage’s wait multiplied by the expected number of passes through it. That number rises with the probability of a revision cycle: if each pass bounces independently, a stage that sends back half of its changes is visited twice on average.
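To make the serial-pipeline arithmetic concrete, here is a minimal sketch. It assumes a deliberately simple model that the article does not spell out: a bounced change repeats only the stage that bounced it, and every pass bounces independently with the same probability. The stage names, wait times, and probabilities are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One review gate in a serial pipeline (all values are illustrative)."""
    name: str
    wait_days: float    # average time a change waits for this stage, per pass
    rework_prob: float  # chance a pass ends in "please revise" (assumed independent)


def expected_passes(rework_prob: float) -> float:
    """Expected number of passes through a stage under a geometric rework model."""
    if not 0.0 <= rework_prob < 1.0:
        raise ValueError("rework_prob must be in [0, 1)")
    return 1.0 / (1.0 - rework_prob)


def naive_latency(stages: list[Stage]) -> float:
    """The additive intuition: just sum the per-stage waits."""
    return sum(s.wait_days for s in stages)


def expected_latency(stages: list[Stage]) -> float:
    """Sum of per-stage waits, each scaled by its expected number of passes."""
    return sum(s.wait_days * expected_passes(s.rework_prob) for s in stages)


if __name__ == "__main__":
    pipeline = [
        Stage("code review", wait_days=1.0, rework_prob=0.3),
        Stage("architecture review", wait_days=2.0, rework_prob=0.3),
        Stage("security review", wait_days=3.0, rework_prob=0.3),
    ]
    print(f"naive:    {naive_latency(pipeline):.2f} days")
    print(f"expected: {expected_latency(pipeline):.2f} days")
```

If a rejection instead sends the change back to the first stage, which is common when a late reviewer asks for a redesign, the expected latency grows faster than this sketch suggests.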

Queueing theory makes this worse. Little’s Law tells us that average latency in a queue equals average queue length divided by throughput. When a reviewer is 80% utilized (a common and seemingly reasonable target), queueing effects become severe. In the simplest single-server model (M/M/1), time spent waiting in the queue scales with utilization divided by idle capacity, so a reviewer at 80% utilization produces average wait times four times longer than a reviewer at 50% utilization. Add a second stage at similar utilization and a change pays that penalty again, and pays it once more for every revision cycle that sends it back through the queue. The 10x claim in the title isn’t hyperbole; it’s what falls out of the math when you model even modestly loaded review pipelines with two or three stages.
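The utilization numbers can be checked with a few lines of code. This sketch assumes an M/M/1 queue (one reviewer, random arrivals, exponentially distributed review times), which is a textbook simplification rather than anything the original post specifies. The function names are made up for illustration.

```python
def littles_law_latency(avg_queue_length: float, throughput: float) -> float:
    """Little's Law rearranged: average latency = average queue length / throughput."""
    if throughput <= 0:
        raise ValueError("throughput must be positive")
    return avg_queue_length / throughput


def mm1_queue_wait(utilization: float, service_time: float = 1.0) -> float:
    """Average time waiting in queue (excluding the review itself) for M/M/1.

    Wq = rho / (1 - rho) * service_time, valid only for 0 <= rho < 1.
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return utilization / (1.0 - utilization) * service_time


if __name__ == "__main__":
    for rho in (0.5, 0.8, 0.9, 0.95):
        # Wait is expressed in multiples of the average review time.
        print(f"utilization {rho:.0%}: wait = {mm1_queue_wait(rho):.1f}x review time")
    ratio = mm1_queue_wait(0.8) / mm1_queue_wait(0.5)
    print(f"80% vs 50% utilization: {ratio:.1f}x longer wait")
```

The curve is steep near full utilization: under the same model, going from 80% to 90% more than doubles the wait again.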

Donald Reinertsen’s Principles of Product Development Flow covers this rigorously in the context of lean product development, and the conclusions map directly to software review. High utilization of shared resources is one of the biggest sources of latency in any development process, and review bottlenecks are shared resources by definition.

What Review Actually Catches

The usual justification for adding review layers is quality. The empirical picture on this is messier than people assume.

Caitlin Sadowski and colleagues’ 2018 ICSE study of code review at Google is one of the most thorough examinations of what reviewers actually do. The top reasons developers said review was useful: catching logic errors, ensuring the code is understandable, and transferring knowledge. These are real benefits. But the study also found that most comments in practice are about code style, naming, and minor refactoring, not about catching bugs that would reach production.

SmartBear’s earlier study, Best Kept Secrets of Peer Code Review, found that inspection rates faster than about 500 lines per hour produced dramatically worse defect detection. Reviews that go quickly are mostly not catching the bugs they’re supposed to catch. They’re producing a feeling of oversight without much of the substance.

Researchers at Microsoft have reported similar patterns. Reviewers working asynchronously on large diffs often focus on surface-level concerns because the cognitive overhead of understanding deep logic across a large changeset is too high. The result is that the most serious bugs, the ones that affect system behavior in subtle ways, tend to slip through review and get caught in production or by automated tests.

None of this means review is useless. It means the assumptions used to justify stacking review layers are often not grounded in what review actually accomplishes.

The Organizational Ratchet

Review layers almost never get added through deliberate design. They accumulate through incident response.

Something goes wrong in production. The postmortem identifies that a particular change wasn’t reviewed by the security team, or the database team, or legal. A new review requirement is added. The next incident leads to another. Three years later, a simple config change requires six sign-offs and takes two weeks to ship.

No one ever removes a review stage. Removing it feels like accepting risk. Adding one feels like prudent risk management. The asymmetry means the ratchet only clicks in one direction, and teams that have been around long enough end up with process debt that looks a lot like technical debt: accumulated decisions that individually seemed reasonable but collectively make the system dysfunctional.

Pennarun’s piece is partly a critique of this dynamic. The argument isn’t that review is bad; it’s that organizations don’t account for the cost of adding review requirements the way they account for other costs. No one says “this new review stage will increase average time-to-ship by 30% and reduce the team’s throughput by 20%.” They say “we need to make sure this doesn’t happen again.”

What High-Performing Teams Do Instead

The teams that ship quickly without sacrificing quality tend to use different mechanisms than adding more human review stages.

Feature flags decouple deployment from release. You can ship code to production continuously and enable features gradually, giving you the ability to roll back instantly without a review process in the rollback path. This addresses a large class of incidents that review is trying to prevent.
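As a rough illustration of how a flag separates deployment from release, here is a minimal in-memory sketch. The `FlagStore` class, the flag name, and the hash-bucket rollout scheme are all assumptions made for this example; real flag services differ in storage, targeting rules, and auditing.

```python
import hashlib


class FlagStore:
    """Toy feature-flag store with percentage rollouts (illustrative only)."""

    def __init__(self) -> None:
        self._rollout: dict[str, int] = {}  # flag name -> percent of users enabled

    def set_rollout(self, flag: str, percent: int) -> None:
        """Release to `percent` of users; setting 0 is the instant rollback."""
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self._rollout[flag] = percent

    def is_enabled(self, flag: str, user_id: str) -> bool:
        """Unknown flags are off, so deployed code stays dark until released."""
        percent = self._rollout.get(flag, 0)
        # Stable bucket in [0, 100): the same user always lands in the same
        # bucket for a given flag, so widening the rollout never turns the
        # feature off for someone who already had it.
        digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < percent


if __name__ == "__main__":
    flags = FlagStore()
    flags.set_rollout("new-checkout", 10)   # code is already deployed; release to 10%
    exposed = sum(flags.is_enabled("new-checkout", f"user-{i}") for i in range(1000))
    print(f"{exposed} of 1000 users see the new path")
    flags.set_rollout("new-checkout", 0)    # rollback: no deploy, no review queue
```

The point of the design is that turning the flag down is a data change, not a code change, so it does not have to wait in the same review pipeline as the code it controls.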

Automated testing, particularly integration and end-to-end tests that run on every commit, catches more of the logic errors that review theoretically catches but often misses. The advantage of automation is that it’s consistent; it doesn’t get fatigued, doesn’t miss things because the diff was too large, and doesn’t add queueing latency.

Small batch sizes reduce the cost of each individual change and the severity of mistakes. A PR that changes 50 lines gets better review than one that changes 500, and it’s easier to roll back when something goes wrong. Google’s own internal guidance, despite having formal review requirements, strongly emphasizes keeping changes small precisely because review quality degrades with size.
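One lightweight way to encourage small batches is an automated size check that runs before a human is asked to look. The sketch below parses text in the format produced by `git diff --numstat` (added lines, deleted lines, and path, separated by tabs, with `-` for binary files). The 200-line threshold and the function names are hypothetical, not a recommendation from any of the sources above.

```python
def changed_lines(numstat_output: str) -> int:
    """Total added + deleted lines from `git diff --numstat` style text."""
    total = 0
    for line in numstat_output.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # blank or malformed line
        added, deleted = parts[0], parts[1]
        if added == "-" or deleted == "-":
            continue  # binary files report "-" instead of line counts
        total += int(added) + int(deleted)
    return total


def is_small_change(numstat_output: str, max_lines: int = 200) -> bool:
    """True if the change fits under the (assumed) size budget."""
    return changed_lines(numstat_output) <= max_lines


if __name__ == "__main__":
    sample = "10\t5\tsrc/app.py\n-\t-\tassets/logo.png\n3\t0\tREADME.md\n"
    print(changed_lines(sample), is_small_change(sample))
```

A check like this works better as a nudge than as a hard gate, since some changes (generated files, large renames) are legitimately big.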

Pair programming is often dismissed as inefficient but is actually close to the opposite: studies of pairing have generally reported lower defect rates, and the review adds no queueing delay because it happens during writing. For high-stakes code changes, pairing is often faster end-to-end than writing, then waiting for review, then responding to comments, then waiting again.

Netflix’s approach, documented in their culture deck and various engineering blog posts, centers on context over control. Rather than adding approval gates, they invest in making sure engineers understand the business and technical constraints well enough to make good decisions independently. This doesn’t work without strong hiring and substantial investment in knowledge sharing, but it produces teams that can ship without the latency of serial review chains.

Where Review Is Worth It

Pennarun’s argument, and my reading of the evidence, isn’t that review should be eliminated. It’s that review should be reserved for situations where its costs are justified by its benefits.

Security-sensitive code, cryptographic primitives, authentication flows, and anything that touches financial transactions deserves careful review by someone with relevant expertise. The blast radius of a mistake is high enough that the latency cost is justified. Legal and compliance review for externally visible changes has real value for similar reasons.

Initial architecture review before a significant new system is built can catch design mistakes that would be expensive to fix later. This is different from requiring approval on every subsequent change to that system.

Knowledge transfer review, where the goal is explicitly to spread understanding of a codebase rather than to catch bugs, is worth doing but should be acknowledged as a knowledge-transfer activity, not a quality gate. The expectations and the process should match the actual goal.

The problem isn’t any single review stage. It’s treating review as a general-purpose solution to shipping risk, stacking stages until the latency becomes untenable, and never auditing whether the stages that exist are still earning their cost. Most teams could remove two-thirds of their review requirements, replace them with better automated testing and smaller batch sizes, and end up with both faster shipping and better quality. The review layers feel like safety. In many cases they’re mostly overhead that has accumulated faster than anyone noticed.

The 10x claim is the kind of number that’s meant to provoke, but it’s close enough to what queueing theory and organizational dynamics actually produce that dismissing it as hyperbole would be a mistake.
