
The Trust Substitute: Why Review Gates Accumulate and What They Cost

Source: hackernews

Avery Pennarun’s recent post makes a deceptively simple claim: every layer of review multiplies your slowdown rather than adding to it. The post gathered over 500 upvotes and nearly 300 comments on Hacker News, most of which took the form of “yes, but what about regulatory requirements” or “our team had an incident and review saved us.” The math in the post is mostly right. The more interesting question is why organizations keep adding review layers despite the cost, and why they almost never remove them.

The Queue Mathematics

Review processes are queues, and queues behave in ways that feel counterintuitive until you look at the theory. The Pollaczek-Khinchine formula gives the expected wait time for M/G/1 queues; when service times are exponentially distributed (the M/M/1 case), it reduces to:

W_q = (ρ × E[S]) / (1 - ρ)

Where ρ is utilization (the fraction of time the reviewer is busy), E[S] is average service time per review, and W_q is average wait time. At 50% utilization, wait time equals service time. At 80% utilization, wait time is 4× service time. At 90%, it is 9×. At 95%, 19×.
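The multipliers quoted above follow directly from the formula. A minimal sketch, assuming the simplified single-reviewer model shown here; the function name `wait_multiplier` is illustrative and not taken from the original post:

```python
def wait_multiplier(utilization: float) -> float:
    """Expected queue wait as a multiple of service time: rho / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return utilization / (1 - utilization)


# Reproduce the utilization levels discussed in the text.
for rho in (0.50, 0.80, 0.90, 0.95):
    print(f"{rho:.0%} busy -> wait is {wait_multiplier(rho):.0f}x service time")
```

The function rejects a utilization of 1.0 or more because the model has no steady state there: the queue grows without limit.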

The key observation is that reviewers are almost never dedicated to review. They have their own work, meetings, and interruptions. In practice, effective reviewer utilization for any given incoming review sits near the upper end of that curve, because everyone’s calendar is already full. That is the region where the hyperbolic growth in wait time starts to feel practically unbounded.

Now add a second review layer. Each layer has its own queue. If both reviewers are at 80% utilization and each review takes an hour of actual work:

Layer 1 wait: (0.8 × 1h) / (1 - 0.8) = 4h
Layer 2 wait: (0.8 × 1h) / (1 - 0.8) = 4h
Total elapsed: ~10h for 2h of actual review work

Without review, you ship after 1 hour of implementation work. Two review layers at 80% reviewer utilization cost you 10 hours of clock time. A third layer pushes it to 15 hours, since each layer adds 4 hours of waiting plus 1 hour of review. The compounding is real, though the “10x per layer” framing is a heuristic rather than a precise formula. The actual multiplier depends heavily on reviewer utilization, which varies by team and context.
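The same arithmetic can be sketched for any number of layers. This assumes sequential layers that are independent and identical (same utilization, same service time), which is a simplification; `elapsed_hours` is a hypothetical helper name:

```python
def elapsed_hours(layers: int, utilization: float = 0.8,
                  service_hours: float = 1.0) -> float:
    """Clock time spent in review: each layer costs queue wait plus service time."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    wait = utilization * service_hours / (1 - utilization)
    return layers * (wait + service_hours)


# Compare how the total changes with reviewer utilization and layer count.
for rho in (0.5, 0.8, 0.9):
    totals = [round(elapsed_hours(n, rho), 1) for n in (1, 2, 3)]
    print(f"utilization {rho:.0%}: 1-3 layers take {totals} hours")
```

Under these assumptions the per-layer cost is constant for a fixed utilization, so the steep part of the curve comes from how busy the reviewers are rather than from the layer count alone.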

The DORA Data

This is not just theoretical. The DORA State of DevOps research, which has tracked software delivery performance across thousands of teams since 2014, consistently finds dramatic differences in lead time between high-performing and low-performing organizations. Elite performers deploy on-demand with lead times under an hour. Low performers report lead times of one to six months.

The variable that correlates most strongly with lead time is not test coverage, language choice, or team size. The DORA research found that throughput (deployment frequency and lead time) and stability (change failure rate and mean time to restore) improve together, which is the opposite of the trade-off most people assume. Teams that deploy frequently have lower failure rates, not higher ones. Shipping small, fast changes produces quality; reviewing changes harder before they ship does not reliably do the same.

Nicole Forsgren’s analysis in Accelerate documents the statistical models behind this. High-performing teams achieve better outcomes on all four key metrics simultaneously. Review overhead does not trade speed for quality; it degrades both, because slow feedback loops mean defects compound before they surface.

Microsoft Research’s study on code review found a related pattern: review thoroughness drops significantly above 200 to 400 lines of change. Large PRs sit in queues longer and receive shallower reviews. The process designed to catch defects becomes less effective precisely when it is most needed.

The Incident-Policy Ratchet

The queuing math explains why review layers are expensive. What it does not explain is why organizations accumulate them over time. The mechanism is straightforward: review policies are driven by incidents, and the ratchet only turns one way.

When a production database gets dropped, the postmortem identifies a missing control. When a secret gets committed to a repository, a secrets scanner goes into CI. When a PR merges without adequate testing, the required approvals count increases from one to two. Each gate is individually defensible. Each was added in response to a real problem. Together they produce a system where shipping a one-line configuration change requires navigating four asynchronous approval queues across three time zones.

Nobody runs a postmortem on “we shipped slowly and lost a customer to a competitor” and concludes that an approval step should be removed. The asymmetry is structural. The engineer who caused the dropped database faces direct, attributable consequences. The team that shipped slowly and missed a market window faces diffuse costs that never land on any individual’s performance review. So the incentive is always to add gates after incidents, and there is no symmetric incentive to remove them.

Google’s Site Reliability Engineering book addresses this under the concept of toil: work that is manual, repetitive, automatable, and grows without bound unless actively managed. The prescription is to budget explicitly for toil reduction and treat accumulated process debt like technical debt. In practice, organizations rarely apply this to review processes, because the toil is invisible to anyone who does not experience it directly, and the people who experience it are usually not the ones who added the gates.

The Trust Deficit

Underneath the policy accumulation lies a trust problem. Each review layer represents an organization’s working answer to the question of whether it trusts this person to make this decision without supervision.

In high-functioning teams, the answer to most routine questions is yes. This is not naivety; it is the product of investment in hiring, onboarding, and shared context. When engineers have spent months working closely together and have built shared understanding of the system, trust can be calibrated. You know where a colleague’s judgment is reliable and where it warrants a second look.

In dysfunctional review cultures, the answer to most questions is no, not because individuals are incompetent, but because the organization has never built the infrastructure for calibrated trust. Documentation is sparse. Architectural decisions are tribal knowledge. The only way to know if a change is safe is to ask someone who already knows, and that person is usually the same one who is already in every other approval chain.

Review, in this model, is a substitute for shared context. It is expensive because shared context is expensive to build, and organizations often choose to pay the review tax indefinitely rather than make the upfront investment in knowledge transfer, documentation, and pairing.

What Fast-Shipping Teams Do Differently

Teams that ship quickly without sacrificing stability tend to invest in a few specific things.

Automated checks handle the categories of error that review gates were designed to prevent. Static analysis, type checking, comprehensive test suites, and infrastructure-as-code validation catch bad merges without the queue time. A linter that runs in 30 seconds does more to prevent a class of defects than an async review that might happen in three days, and it does it at every commit rather than on a sample.

They keep changes small. A 10-line PR gets reviewed in minutes because the reviewer can hold the whole thing in their head. A 500-line PR sits in the queue because nobody wants to engage with it, and when someone does, the review is shallower than it looks. Change size has a compounding relationship with both queue time and review quality.

They distinguish between categories of change. Architectural decisions involving new patterns, new dependencies, or new operational requirements warrant synchronous discussion and careful review. Routine changes to well-understood systems warrant fast async approval or no review at all. Treating all changes identically is what drives policy accumulation, because the controls designed for high-risk changes get applied to everything.
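One way to make that distinction explicit is a small routing rule. The sketch below is purely illustrative: the field names, the tiers, and the 10-line threshold are assumptions, not a policy described in the original post.

```python
from dataclasses import dataclass
from enum import Enum


class ReviewTier(Enum):
    NONE = "ship without review"
    ASYNC = "fast async approval"
    SYNC = "synchronous discussion and careful review"


@dataclass
class Change:
    lines_changed: int
    adds_dependency: bool = False
    introduces_new_pattern: bool = False
    changes_operational_requirements: bool = False
    system_is_well_understood: bool = True


def review_tier(change: Change, trivial_line_limit: int = 10) -> ReviewTier:
    """Route a change to a review tier based on its risk category."""
    # Architectural changes always get the heavyweight path.
    if (change.adds_dependency or change.introduces_new_pattern
            or change.changes_operational_requirements):
        return ReviewTier.SYNC
    # Tiny routine changes to well-understood systems skip the queue
    # (the line limit is an assumed threshold, not a recommendation).
    if change.system_is_well_understood and change.lines_changed <= trivial_line_limit:
        return ReviewTier.NONE
    return ReviewTier.ASYNC


print(review_tier(Change(lines_changed=3)).value)
print(review_tier(Change(lines_changed=40, adds_dependency=True)).value)
```

The point of writing the rule down is that heavyweight controls stay attached to the changes that need them instead of spreading to everything.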

They measure review overhead explicitly. This is rare. Most teams measure defect rates and incident frequency. They do not measure review queue time, approval chain depth, or the cumulative delay introduced by process layers. Without measurement, there is no feedback loop. Gates accumulate because the cost is invisible until someone computes it, and the people positioned to compute it are usually not the ones adding the gates.
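The metrics named above can be computed from timestamps most code-hosting tools already record. A minimal sketch, assuming a hypothetical `ReviewRecord` shape; real field names depend on whatever system the data is exported from:

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class ReviewRecord:
    requested_at: datetime       # when review was requested
    first_response_at: datetime  # when a reviewer first responded
    approved_at: datetime        # when the final approval landed
    approvals_required: int      # depth of the approval chain


def hours_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 3600


def review_overhead(records: list[ReviewRecord]) -> dict[str, float]:
    """Summarize queue time, time to approval, and approval chain depth."""
    if not records:
        raise ValueError("need at least one record")
    queue = [hours_between(r.requested_at, r.first_response_at) for r in records]
    total = [hours_between(r.requested_at, r.approved_at) for r in records]
    depth = [r.approvals_required for r in records]
    return {
        "median_queue_hours": median(queue),
        "median_hours_to_approval": median(total),
        "mean_chain_depth": sum(depth) / len(depth),
    }
```

Medians are used for the time metrics because a few changes that sit for weeks would otherwise dominate an average.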

The Measurement Gap

Pennarun’s “10x per layer” framing will generate debate about whether it is precisely accurate. That debate is somewhat beside the point. The more important observation is that most organizations have no idea what their review overhead actually costs them in throughput, because they are not measuring it.

The teams that do measure, even informally using engineering-analytics tools like LinearB or Swarmia, tend to discover that the overhead is much larger than anyone estimated. They also tend to find that most of the delay comes from a small number of bottlenecks, usually one or two engineers who are in every approval chain because they are the only ones who understand the relevant part of the system.
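Even without a dedicated tool, that kind of bottleneck shows up in a simple count of who approves what. A rough sketch, assuming approvals are available as hypothetical `(change_id, reviewer)` pairs:

```python
from collections import Counter


def approval_share(approvals: list[tuple[str, str]]) -> dict[str, float]:
    """Fraction of all changes that each reviewer approved, highest first."""
    pairs = set(approvals)  # ignore repeat approvals on the same change
    changes = {change_id for change_id, _ in pairs}
    if not changes:
        return {}
    counts = Counter(reviewer for _, reviewer in pairs)
    return {reviewer: n / len(changes) for reviewer, n in counts.most_common()}


# Hypothetical data: one reviewer appears on every change.
sample = [("c1", "dana"), ("c2", "dana"), ("c3", "dana"), ("c3", "lee")]
for reviewer, share in approval_share(sample).items():
    print(f"{reviewer}: {share:.0%} of changes")
```

A reviewer whose share approaches 100% is in nearly every approval chain, and every change waits on that one person's queue.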

The fix for that is not another review gate. It is documentation, knowledge transfer, and building redundant expertise across the team. It is expensive in the short term, invisible until it starts working, and produces no incident report to justify the investment. That combination of factors is exactly why it does not happen, and why the ratchet keeps turning.
