The Compounding Cost of Review: Why Approval Layers Don't Add Up, They Multiply

Source: hackernews

The claim in apenwarr’s recent post is simple and uncomfortable: each layer of code review doesn’t add a fixed cost to your shipping cycle, it multiplies the total cost. The title puts the multiplier at 10x per layer. The exact number varies by organization, but the multiplicative shape of the relationship follows directly from queueing theory, and the implications compound quickly.

Most teams frame review overhead additively. A PR takes two days to write, one day for review, one day to address feedback: four days total. A second required review layer makes it five. This framing misses the structure of the problem entirely.

Why the Math Isn’t Additive

The right model comes from queueing theory, specifically the M/M/1 queue. A reviewer is a server. Pull requests are jobs arriving at that server. The relationship between utilization (how busy the reviewer is) and wait time is hyperbolic, not linear. At 50% utilization, average wait time equals average service time. At 90% utilization, wait time is nine times service time. At 95%, nineteen times. The curve approaches infinity as utilization approaches 100%.
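The figures above follow from the standard M/M/1 result that mean time spent waiting in the queue is ρ / (1 − ρ) times mean service time, where ρ is utilization. A minimal sketch, assuming a reviewer really does behave like a single M/M/1 server (the function name is illustrative, not from the original post):

```python
def queue_wait_multiplier(utilization: float) -> float:
    """Mean M/M/1 queue wait, expressed as a multiple of mean service time.

    Uses W_q = rho / (1 - rho) * S, where rho is utilization and S is the
    mean service time. The return value is the rho / (1 - rho) factor.
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return utilization / (1.0 - utilization)


if __name__ == "__main__":
    # Print the multiplier at a few utilization levels to show the curve.
    for rho in (0.50, 0.80, 0.90, 0.95, 0.99):
        multiplier = queue_wait_multiplier(rho)
        print(f"{rho:.0%} busy -> wait is {multiplier:.1f}x service time")
```

Real reviewers batch work, context-switch, and take days off, so this is a lower bound on the pain, not a prediction.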

No reviewer works exclusively on your PRs. Senior engineers, the most common choices for required approvers, are typically the most heavily scheduled people in an organization. Their review utilization is high. When you add a mandatory second reviewer, you’ve placed your PR in two separate queues, in serial, each with its own utilization curve.

The resulting latency is worse than the additive estimate suggests. If reviewer A has mean review latency L_A and reviewer B has mean latency L_B, the serial system’s expected completion time is L_A + L_B, where each term is already inflated by that reviewer’s utilization, and the variance of the total picks up a covariance term that depends on how correlated their availability is. More critically, the tail widens: your p95 and p99 delivery times grow by more than the mean does. Predictability collapses before throughput does, which means estimates become useless before the slowdown is obvious to management.
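The role of correlation in the spread can be sketched with the textbook identity Var(A + B) = Var(A) + Var(B) + 2·Cov(A, B). The function below is an illustration under that identity only; its name and parameters are assumptions, and it says nothing about the shape of the tail beyond the standard deviation.

```python
import math


def serial_latency_std(std_a: float, std_b: float, correlation: float) -> float:
    """Standard deviation of total latency for two review stages in series.

    Var(A + B) = Var(A) + Var(B) + 2 * corr * std_a * std_b.
    `correlation` is the correlation between the two reviewers' latencies:
    0.0 means independent, 1.0 means they are always slow at the same time.
    """
    if not -1.0 <= correlation <= 1.0:
        raise ValueError("correlation must be in [-1, 1]")
    variance = std_a**2 + std_b**2 + 2.0 * correlation * std_a * std_b
    # Guard against tiny negative values from floating-point rounding.
    return math.sqrt(max(0.0, variance))


if __name__ == "__main__":
    # Two reviewers, each with a one-day standard deviation in latency.
    for corr in (0.0, 0.5, 1.0):
        print(f"correlation {corr}: total std = {serial_latency_std(1.0, 1.0, corr):.2f} days")
```

Reviewers on the same team tend to be busy at the same time (release crunches, incidents, holidays), so positive correlation is the realistic case, and it is the case where the spread is widest.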

The DORA State of DevOps reports have tracked this pattern across thousands of engineering organizations. Elite teams, by DORA’s classification, show pull request review times measured in hours. Low performers measure in days to weeks. Deployment frequency correlates with review latency more strongly than almost any other single process variable. The gap isn’t about how good the engineers are; it’s about how the queues are structured.

The Organizational Failure Modes

The purely mathematical picture is the optimistic case. Real approval chains introduce failure modes that queueing models don’t fully capture.

Consider what happens when a required reviewer is on vacation. The queue doesn’t drain slowly; it stops entirely unless someone has built careful delegation, which most organizations haven’t bothered to formalize. The PR ages. The author moves to other work. Context decays. When review finally happens, the author has to re-read their own code to rebuild understanding. Meanwhile, the rest of the codebase has moved on, and merge conflicts accumulate. What was a one-day review becomes a multi-day rework cycle.

With two required reviewers, the probability that at least one of them is a bottleneck at any given time roughly doubles, assuming independent availability and that each reviewer is unavailable only a small fraction of the time. With three, you’re waiting for all three to have overlapping windows, and the joint probability compounds against you with each addition.
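The arithmetic behind that claim is short. If each reviewer is independently unavailable with probability p, the chance that at least one of n required reviewers is unavailable is 1 − (1 − p)^n. A minimal sketch, where the independence assumption and the function name are both illustrative:

```python
def prob_any_unavailable(p_unavailable: float, reviewers: int) -> float:
    """Probability that at least one of `reviewers` required approvers is
    unavailable, assuming each is independently unavailable with the same
    probability `p_unavailable`.
    """
    if not 0.0 <= p_unavailable <= 1.0:
        raise ValueError("p_unavailable must be in [0, 1]")
    if reviewers < 0:
        raise ValueError("reviewers must be non-negative")
    return 1.0 - (1.0 - p_unavailable) ** reviewers


if __name__ == "__main__":
    # Assume each reviewer is unavailable 10% of the time (a made-up figure).
    for n in (1, 2, 3, 4):
        print(f"{n} required reviewer(s): {prob_any_unavailable(0.10, n):.1%} chance of a blocked PR")
```

For small p the result is close to n·p, which is where "roughly doubles" comes from; as p grows the increase per reviewer shrinks, but only because the PR is already blocked most of the time.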

This is the pattern that appears in organizations that grew review requirements incrementally. Each individual requirement looked reasonable at the time. Someone shipped a bad deploy, so a domain expert became a required reviewer. A security incident led to mandatory security sign-off. An infrastructure outage produced a new ops approval requirement. No single decision was wrong in isolation. The accumulation is quietly catastrophic.

What Google’s Practices Reveal

Google’s engineering culture is instructive here, and Avery Pennarun (apenwarr) worked there before co-founding Tailscale, so the perspective in his post carries direct experience at scale.

Google uses a monorepo and an internal code review tool called Critique. Their model has two tiers: readability (a one-time certification that you write idiomatic code in a given language) and per-change approval by an owner of the affected code. The design calls for single approvers per change, not sequential sign-off chains. Ownership is granular enough that the relevant approver typically works closely with that code and can turn reviews around promptly.

The contrast with a typical enterprise process, security review followed by architecture review followed by manager approval followed by QA sign-off, all in sequence, is stark. Google addressed quality through test infrastructure, ownership clarity, and static analysis, not through serial human gates. The quality bar is high; the process overhead is low by design.

Tailscale, where Pennarun currently works, is a small team shipping infrastructure software that a significant portion of the industry relies on. Small team constraints naturally limit how many review layers are possible, which forces discipline about what each review is actually supposed to accomplish.

What Review Is Actually For

The case against layered review is distinct from a rejection of review itself. The question is whether serial approval chains accomplish what teams believe they accomplish.

Code review serves real purposes: catching bugs, spreading knowledge, maintaining architectural consistency, and ensuring code is understandable to someone other than its author. Most of these goals are served well by a single thoughtful reviewer. Sequential approval from multiple parties, each focusing on a different slice of concern, adds only marginal value against those goals.

The bug-catching case is weaker than most teams assume. Software metrics researchers, including Capers Jones, have reported that code inspection finds roughly 60-70% of defects. That figure sounds meaningful until you compare it with testing: unit, integration, and end-to-end test suites find comparable percentages, and combining human review with automated testing finds more than either approach alone. A third or fourth human reader yields strongly diminishing returns in defect detection; additional reviewers tend to catch the same issues as the first reviewer, or shift their attention toward style and formatting rather than logic.

Automated analysis closes most of the remaining gap, cheaply. Modern static analysis tools, type-aware linters, property-based testing, and fuzzing catch whole categories of bugs before any human reads the code. These run in seconds inside CI pipelines rather than days in a review queue. Investing in that infrastructure scales better than adding required reviewers.

Stacked PRs as a Partial Mitigation

One technique that has gained traction in PR-heavy workflows is stacked pull requests. Tools like Graphite and ghstack let you build smaller PRs layered on top of each other, continuing development against unmerged work while each individual unit moves through review independently.

This partially addresses the queueing problem by reducing service time per review unit, which improves throughput without requiring a process overhaul. It doesn’t fix the structural issue of serial required reviewers, but it brings trunk-based development patterns within reach for teams that work primarily through GitHub pull requests. A 500-line PR that blocks for three days becomes three PRs of roughly 170 lines that each spend a few hours in review.

The more fundamental change is harder to sell internally: reduce required approver count to one for most changes, redirect the saved organizational bandwidth into better automated testing, and reserve multi-reviewer processes for genuinely high-risk changes such as security-critical code, public API additions, and database migrations. This is the approach that actually reflects where the quality leverage is.
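As a rough illustration of that policy, the rule "one approver by default, two for high-risk paths" fits in a few lines. Everything specific here is hypothetical: the path patterns, the function name, and the choice of two approvers are assumptions for the sketch, not something the original post prescribes.

```python
from collections.abc import Iterable
from fnmatch import fnmatch

# Hypothetical path patterns for high-risk areas; a real repository
# would define its own list.
HIGH_RISK_PATTERNS = (
    "security/*",       # security-critical code
    "api/public/*",     # public API surface
    "db/migrations/*",  # database migrations
)


def required_approvals(changed_paths: Iterable[str]) -> int:
    """Return how many approvals a change needs under the sketched policy.

    One approval by default; two if any changed path matches a high-risk
    pattern. Note that fnmatch's `*` also matches `/`, so nested paths
    under a listed directory are covered.
    """
    for path in changed_paths:
        if any(fnmatch(path, pattern) for pattern in HIGH_RISK_PATTERNS):
            return 2
    return 1


if __name__ == "__main__":
    print(required_approvals(["src/app/views.py"]))
    print(required_approvals(["src/app/views.py", "db/migrations/0042_add_index.sql"]))
```

Keeping the high-risk list short and explicit also gives the organization something concrete to prune later, instead of an implicit chain of people who were added one incident at a time.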

The One-Way Ratchet

The hardest part of fixing layered review isn’t technical. Review requirements accumulate without a natural removal mechanism. Once someone becomes a required reviewer following an incident, removing them later reads as a statement that their oversight wasn’t valuable, which is politically fraught. The person added as a consequence of an outage is now invested in that role in the process. Removing them surfaces the uncomfortable implication that the incident could have been prevented some other way.

This creates a one-way ratchet. Approval chains expand. Delivery slows. Teams respond by submitting larger, less frequent PRs to amortize review cost, which makes each review harder and slower, which occasionally produces incidents, which generates new review requirements, feeding the same cycle again.

Breaking the ratchet requires treating the compounding math as a structural constraint rather than a policy preference. Organizations can choose to pay the cost knowingly, with a clear understanding of what it purchases in terms of quality assurance and what it costs in terms of delivery speed and predictability. They cannot opt out of paying it, and pretending the cost is additive rather than multiplicative is how teams end up surprised when their shipping cycle is measured in weeks rather than days.