
What a Long Approval Queue Reveals About Your Test Coverage

Source: hackernews

Avery Pennarun’s recent post argues that each review layer multiplies your cycle time rather than adding to it. The queueing math is sound: when reviewers are at high utilization, wait times scale nonlinearly with load, and chaining independent queues compounds that nonlinearity. If you have three required approval layers and your reviewers are 85% utilized, the total wait is not the sum of three individual delays. It is the product, and the product at that utilization can turn a two-hour change into a two-week wait.
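The compounding claim can be made concrete with a toy model. Assume each review layer behaves like a single queue where the ratio of elapsed time to hands-on time is 1 / (1 − utilization), and assume, per the multiplicative framing, that chained layers compound those ratios rather than add them. The specific numbers are illustrative only:

```python
# Toy model of review-layer delay, assuming each layer is a simple
# queue whose elapsed-time multiplier is 1 / (1 - utilization), and
# that chained layers compound multiplicatively (the article's framing).
# All numbers are illustrative, not measurements.

def slowdown(rho: float) -> float:
    """Elapsed-time multiplier for one queue at utilization rho."""
    return 1.0 / (1.0 - rho)

hands_on_hours = 2.0
rho = 0.85
layers = 3

per_layer = slowdown(rho)         # ~6.7x at 85% utilization
compounded = per_layer ** layers  # ~296x across three chained layers
print(f"{hands_on_hours * compounded:.0f} hours elapsed")
```

At 85% utilization a single layer already turns two hours of work into most of a working week of elapsed time; three compounding layers put the result on the order of weeks.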

The Hacker News response to the post, nearly five hundred points and almost three hundred comments, suggests it touches something real. Most of the discussion has focused on the cost side of the equation: how to make review faster, whether the 10x number is precise, which industries have mandatory review for good reasons. The question that gets less attention is what the review is actually catching, and what the answer to that says about the quality infrastructure surrounding the review process.

A Taxonomy of What Review Finds

Code review catches several distinct categories of problem, and each category responds very differently to non-review alternatives.

Style and convention issues. Variable naming, import ordering, formatting, doc comment format. A reviewer who surfaces these is doing work that a linter could do faster, more consistently, and without consuming a human’s attention. Tools like ESLint, Prettier, ruff, rustfmt, and gofmt handle this category mechanically. When review cycles include substantial style feedback, that is linter work routed through a human queue.

Logic errors and edge cases. The off-by-one in the loop counter, the missing null check, the incorrect assumption about input range. This is the category most people believe review is for, and human review does catch these. But unit tests, integration tests, and property-based tests can formalize these assertions permanently rather than catching them once. A property test that verifies an invariant holds for arbitrary inputs will find more edge cases than any reviewer reading the code once. A type system that makes illegal states unrepresentable eliminates entire categories of logic error before the change is even submitted.
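To make the property-test point concrete, here is a minimal property-style check using only the standard library; in practice a framework like Hypothesis would generate, minimize, and replay the inputs. The `paginate` function is a hypothetical example, chosen because slicing is classic off-by-one territory:

```python
# A property-style check with stdlib randomness; Hypothesis or a
# similar framework would normally handle generation and shrinking.
# `paginate` is a hypothetical function used for illustration.
import random

def paginate(items, page_size):
    """Split items into consecutive pages of at most page_size."""
    return [items[i:i + page_size] for i in range(0, len(items), page_size)]

def check_pagination_properties(trials: int = 1000) -> None:
    rng = random.Random(0)  # fixed seed so any failure is reproducible
    for _ in range(trials):
        items = [rng.randint(-100, 100) for _ in range(rng.randint(0, 50))]
        size = rng.randint(1, 10)
        pages = paginate(items, size)
        # Property 1: no page exceeds the requested size.
        assert all(len(p) <= size for p in pages)
        # Property 2: concatenating the pages reproduces the input
        # exactly, which rules out off-by-one slicing errors a reviewer
        # skimming the diff might miss.
        assert [x for p in pages for x in p] == items

check_pagination_properties()
```

The two asserted properties run against a thousand generated inputs on every commit, where a reviewer checks a handful of cases once.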

Security vulnerabilities. Common injection patterns, insecure dependencies, misconfigured permissions. Static analysis tools run on every commit, catch the same class of issue every time, and do not require a security-minded reviewer to happen to be assigned. SAST tooling is imperfect, but so are humans, and the tooling does not get distracted or fatigued across a queue of fifteen open PRs.

Architectural and design concerns. Whether a new module belongs in the service layer or the data layer, whether a new API surface is consistent with the existing conventions, whether a change will create a maintenance burden in six months. These judgments require understanding of the system’s history and intent in ways that no current tooling replaces. Human review has genuine, hard-to-replicate value here.

The pattern that falls out of this taxonomy is that the first three categories have better tools available, and only the fourth genuinely requires experienced human judgment. When approval chains are primarily catching style, logic, and common security issues, they are compensating for gaps in tooling and test coverage, not providing something qualitatively different.

Why the Tooling Gaps Persist

Building a strong automated quality layer takes investment. A comprehensive test suite takes time to write. Property tests require explicitly thinking about invariants. A well-configured SAST pipeline takes time to tune so the false positive rate is low enough to be actionable. Organizations under delivery pressure frequently defer this investment because it does not produce visible features, and compensate instead with review requirements that produce a visible quality signal.

The DORA research across thousands of engineering organizations has consistently found that elite performers have both high deployment frequency and short lead times. The short lead times come in part from automated quality gates that run in minutes rather than days. Organizations with extensive manual approval chains tend to appear in the mid and low performer cohorts, not because review is bad in principle, but because the review requirements correlate with weak automated quality infrastructure. The causation runs in both directions: weak tooling necessitates more review, and a culture accustomed to review as the primary quality gate invests less in tooling.

Decoupling Deploy from Release

A significant portion of review overhead comes from the pressure to get code exactly right before it touches users. Feature flags reduce that pressure by separating the moment of deployment from the moment of exposure. A change can deploy to production, run dormant behind a flag, be enabled for one percent of users, and only then be rolled out fully. The rollback surface for any individual merge shrinks considerably.
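The mechanics of that one-percent rollout are simple enough to sketch. The version below buckets users deterministically by hashing the flag name and user id; the flag name and bucketing scheme are illustrative assumptions, not any particular flag product's API:

```python
# Minimal sketch of percentage-based flag rollout. Real systems add
# targeting rules, persistence, and kill switches; the flag name and
# hashing scheme here are illustrative, not a specific product's API.
import hashlib

def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically bucket a user into [0, 100) for this flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent

# The same user always lands in the same bucket, so ramping the
# percentage from 1 toward 100 exposes a stable, growing cohort.
if is_enabled("new-checkout-flow", "user-42", 1):
    pass  # new code path: deployed, but dark for ~99% of users
```

Because bucketing is deterministic, raising the percentage never flips a user off the new path, and dropping it to zero is an instant rollback without a deploy.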

This does not make review unnecessary, but it changes what review is blocking. It is blocking a deploy, not a release. If the deploy reveals a problem before the flag is fully enabled, rollback is straightforward. The review cycle is no longer the last line of defense, and the consequence of something slipping through drops proportionally. Teams that have adopted this approach consistently report that their per-PR review requirements decrease over time because the downstream safety net is stronger.

How Google Structures the Problem

Google’s engineering practices documentation describes a model where mandatory automated checks must pass before a human reviewer is assigned. The automation handles style, test failures, and build breakage. Human review is then reserved for the fourth category: design, consistency, and architectural judgment. Reviewers spend their attention on things where human judgment adds value because the mechanical categories are already filtered out.

The result is that human review is faster because it is narrower. A reviewer who is not asked to check formatting or spot obvious null pointer issues can give real attention to the module boundary decisions and the API surface questions. The review is also higher quality, because reviewers are not context-switching between mechanical checklist items and the conceptual questions that actually require thinking.

The One Thing Review Cannot Be Replaced By

Knowledge transfer. Code review is a knowledge distribution mechanism as much as a quality mechanism. The reviewer learns what changed and why; the author gets feedback from someone who approaches the code differently. On teams that grow and rotate, this compounds: review builds shared mental models of the codebase that are difficult to build through documentation alone.

This use case is real and it argues for review. But it does not argue for blocking review, where the change cannot merge until every reviewer has signed off. A practice of non-blocking review, where a PR can merge once automated checks pass but assigned reviewers still receive notifications and can comment, preserves the knowledge transfer value without inserting the merge into a human queue. The change ships; the conversation about it continues. Both things are true simultaneously.
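The policy difference is small enough to express directly. A sketch of the non-blocking variant, with hypothetical data shapes rather than any real forge's API:

```python
# Sketch of a non-blocking merge policy: automated checks are the hard
# gate; reviewers are notified and can comment after merge. The data
# shapes are hypothetical, not a real code-hosting platform's API.
from dataclasses import dataclass, field

@dataclass
class PullRequest:
    checks_passed: bool
    reviewers: list[str]
    notifications: list[str] = field(default_factory=list)

def try_merge(pr: PullRequest) -> bool:
    if not pr.checks_passed:
        return False  # automation remains a strict precondition
    # Reviewers are pinged, but their sign-off is not required to merge.
    pr.notifications.extend(f"review-requested:{r}" for r in pr.reviewers)
    return True

pr = PullRequest(checks_passed=True, reviewers=["alice", "bob"])
assert try_merge(pr)  # merges immediately; review continues async
```

The blocking variant differs by one condition: merge also waits on every reviewer's approval, which is exactly the line that puts the change into a human queue.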

What the Queue Length Is Telling You

Four required approvals before anything merges is a data point. It is probably not evidence that the team is unusually careful. More likely it means that the automated quality layer has gaps, that trust in individual engineer judgment is low, and that review has become the primary mechanism for maintaining confidence in what ships. That is a reasonable response to those conditions, but the conditions are worth examining.

Pennarun’s 10x per layer framing is useful because it attaches a cost to something that normally goes unmeasured. The natural response is to ask how to make the queue move faster: smaller PRs, async review tools, reviewer rotation policies. Those are worth pursuing. But before optimizing the queue, it is worth auditing what is in it. If three of the four required approvers are primarily catching things that a well-maintained test suite and linter could catch automatically, the intervention is not faster reviewers. It is fewer required reviewers, with better tooling carrying the load they were compensating for.
