Code Review Is a Queue, and AI Coding Tools Are Making It Longer

The pitch for every AI coding tool on the market is the same: you will write code faster. GitHub’s own research on Copilot claims developers complete tasks 55% faster. Cursor markets around uninterrupted flow state. Every new model release leads with coding benchmarks. The assumption threading through all of it is that writing code is the bottleneck on software delivery, and that speeding it up is the lever worth pulling.

That assumption does not hold up when you look at where developer time actually goes.

Research from McKinsey in 2021 estimated that developers spend roughly 35% of their time on activities that directly produce code: writing, local debugging, and testing. The remaining 65% goes to meetings, requirements gathering, reviewing others’ code, waiting on CI pipelines, resolving review comments, and coordination. Stripe’s 2018 report, The Developer Coefficient, found that roughly 42% of developer time goes to technical debt and maintenance rather than new development. The Stack Overflow and GitLab DevSecOps surveys cluster around similar numbers: active code writing accounts for somewhere between a quarter and a third of the working week.

If code writing occupies 30% of your time and a tool cuts that time by 55%, you have saved roughly 16% of your total working hours. The 70% of your week that was never code writing is untouched. And that 70% includes the thing that is almost certainly your actual constraint.
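The arithmetic can be checked in a few lines. This is a minimal sketch, not a measurement: the function names are hypothetical, and the first function reads “55% faster” as “tasks take 55% less time”, which is the reading the paragraph uses. The second function shows the less generous reading, where speed rises by a factor of 1.55.

```python
def week_saved_if_time_cut(coding_share: float, time_reduction: float) -> float:
    """Share of the whole week saved when coding time shrinks by `time_reduction`."""
    return coding_share * time_reduction


def week_saved_if_speed_up(coding_share: float, speed_gain: float) -> float:
    """Share of the whole week saved when coding speed rises by `speed_gain`.

    A 55% speed gain means the same work takes 1 / 1.55 of the original time.
    """
    return coding_share * (1.0 - 1.0 / (1.0 + speed_gain))


if __name__ == "__main__":
    coding_share = 0.30  # assumption taken from the paragraph above
    print(f"time cut by 55%:    {week_saved_if_time_cut(coding_share, 0.55):.1%} of the week")
    print(f"speed up by 55%:    {week_saved_if_speed_up(coding_share, 0.55):.1%} of the week")
```

Either way, the saving is bounded by the share of the week that coding occupied in the first place.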

The Queue Math Is Damning

Code review behaves like a queue. There is a steady arrival rate of pull requests and a finite service capacity from reviewers. Queuing theory, specifically the M/M/1 model, gives you the expected wait time as:

W_q = ρ / (μ(1 - ρ))

where ρ is reviewer utilization (how busy they are as a fraction of capacity) and μ is the service rate (reviews completed per unit of time, so 1/μ is the average service time). The relationship is nonlinear and brutal. At 50% utilization, wait time equals the service time. At 80%, it is four times the service time. At 90%, nine times. At 95%, nineteen times.
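Those multiples fall straight out of the formula. The sketch below expresses W_q in units of the mean service time 1/μ, which reduces the expression to ρ / (1 - ρ); the function name is hypothetical.

```python
def wait_in_service_times(rho: float) -> float:
    """Expected M/M/1 queue wait as a multiple of the mean service time (1/mu)."""
    if not 0.0 <= rho < 1.0:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return rho / (1.0 - rho)


if __name__ == "__main__":
    for rho in (0.50, 0.80, 0.90, 0.95):
        multiple = wait_in_service_times(rho)
        print(f"{rho:.0%} utilization -> wait is about {multiple:.0f}x the service time")
```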

These are not worst-case estimates. They are expected values under steady-state load. A senior engineer who has code review as roughly 20% of their job, but is also attending meetings and doing their own work, can easily be running at 80% or 90% effective utilization for review. In that range, small increases in PR arrival rate produce large increases in wait time.
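To see how sensitive that range is, the sketch below scales the arrival rate while holding reviewer capacity fixed. It assumes the same M/M/1 model, and the 10% growth figure is an illustrative assumption, not a measurement. Once utilization reaches 100% the queue has no steady state, which the function reports as infinity.

```python
def wait_multiple(rho: float) -> float:
    """M/M/1 expected wait as a multiple of the mean service time."""
    return rho / (1.0 - rho)


def wait_after_growth(rho: float, arrival_growth: float) -> float:
    """Wait multiple after PR arrivals grow by `arrival_growth` (0.10 = +10%)."""
    new_rho = rho * (1.0 + arrival_growth)
    if new_rho >= 1.0:
        return float("inf")  # arrivals exceed capacity: the queue never settles
    return wait_multiple(new_rho)


if __name__ == "__main__":
    for rho in (0.50, 0.80, 0.90, 0.95):
        before = wait_multiple(rho)
        after = wait_after_growth(rho, 0.10)
        print(f"{rho:.0%} busy: {before:.1f}x -> {after:.1f}x after 10% more PRs")
```

Under these assumptions the same 10% increase barely registers at 50% utilization and is destabilizing above 90%.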

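Chaining approvals compounds the delay. In a textbook model of independent queues in series the expected waits add, so two reviewers each at 80% utilization cost 4 + 4 = 8 service times of waiting. Real approval chains can behave worse than that, because a change that one reviewer sends back may have to pass through the earlier reviewers again. Treated as multiplicative, two reviewers at 80% give 4x times 4x, or 16 times the baseline service time. Three reviewers: 64 times. Four: 256 times. This is why adding a required approver in response to an incident has a nonlinear effect on delivery speed. You are not just adding a queue; you are compounding the queues already there. Avery Pennarun has written about this effect and notes the compounding actually makes “10x per layer” a conservative estimate in real teams.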
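The multiplicative figures can be reproduced in a few lines, alongside the additive total that a textbook model of independent queues in series would give. Both functions are hypothetical sketches. Which one better matches a real approval chain depends on how often changes bounce back for re-review, and that is something to check against your own data.

```python
def wait_multiple(rho: float) -> float:
    """M/M/1 expected wait as a multiple of the mean service time."""
    return rho / (1.0 - rho)


def chained_wait_additive(rho: float, reviewers: int) -> float:
    """Independent queues in series: expected waits simply add up."""
    return reviewers * wait_multiple(rho)


def chained_wait_multiplicative(rho: float, reviewers: int) -> float:
    """The compounding heuristic: each layer multiplies the delay."""
    return wait_multiple(rho) ** reviewers


if __name__ == "__main__":
    rho = 0.80  # every reviewer assumed equally busy
    for n in (1, 2, 3, 4):
        add = chained_wait_additive(rho, n)
        mul = chained_wait_multiplicative(rho, n)
        print(f"{n} reviewer(s): additive {add:.0f}x, multiplicative {mul:.0f}x")
```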

Branch protection rules and CODEOWNERS files in GitHub make adding blocking review requirements trivially easy. The path for configuring post-merge or non-blocking review is far less visible in the UI. Teams accumulate gate requirements through incident response: something breaks, a new required reviewer is added, and nobody ever removes it. The consequences are asymmetric: whoever removes a gate owns the next incident it might have caught, while the delay from keeping it is spread thinly across every change and attributed to no one. So the gates keep accumulating.

The Symmetry That AI Broke

Before AI coding tools became widespread, there was a rough symmetry between the cost of writing a change and the cost of reviewing one. Both were human-speed operations. A feature that took two days to write might take a few hours to review, and a small team’s overall contribution rate was bounded by how fast its members could write.

AI tools broke the production side of that equation. Generating a pull request is nearly free now. Reviewing it has not moved.

If a developer who previously opened two pull requests per week can now open eight, that developer’s work arrives in the review queue at four times the rate. The reviewer, doing the same work at the same pace, falls behind. Nor does generated code make each individual review cheaper. METR’s research found that many patches that pass SWE-bench would not actually be mergeable in the target projects, because technical correctness and architectural fit are different things, and evaluating the latter is cognitive work that does not get faster because code was generated rather than written by hand.
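The jump from two to eight pull requests per week can be sketched as a simple backlog model. This is a deterministic week-by-week tally, not a queueing simulation, and the team size and review capacity are made-up numbers chosen only to illustrate the shape of the problem.

```python
def weekly_backlog(arrivals_per_week: float, reviews_per_week: float, weeks: int) -> list[float]:
    """Unreviewed PRs at the end of each week, starting from an empty queue."""
    backlog = 0.0
    history = []
    for _ in range(weeks):
        # The backlog cannot go negative: spare reviewer capacity is simply unused.
        backlog = max(0.0, backlog + arrivals_per_week - reviews_per_week)
        history.append(backlog)
    return history


if __name__ == "__main__":
    developers = 5          # hypothetical team size
    review_capacity = 15.0  # hypothetical PRs the team can review per week
    print("2 PRs/dev/week:", weekly_backlog(developers * 2, review_capacity, 4))
    print("8 PRs/dev/week:", weekly_backlog(developers * 8, review_capacity, 4))
```

With these assumed numbers the first case never builds a backlog, while the second grows by 25 unreviewed pull requests every week for as long as the rates hold.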

The original article on debugging leadership frames this precisely: the constraint moved, but nobody changed the process to address where it went.

What DORA Research Shows

The DORA State of DevOps report has been measuring software delivery performance for over a decade. Its four key metrics (lead time for changes, deployment frequency, change failure rate, and mean time to restore) paint a clear picture of what separates high-performing from low-performing teams.

Elite performers achieve lead times under one hour from commit to production and deploy multiple times per day. Low performers measure lead times in weeks to months. The gap is not explained by individual coding speed. It is explained by CI/CD pipeline maturity, automated testing coverage, deployment automation, and cultural norms around small batch sizes and rapid review.

Nicole Forsgren’s research for the book Accelerate identified the technical practices that predict organizational software delivery performance: trunk-based development with frequent small commits, automated test suites that run in under ten minutes, feature flags to decouple deployment from release, and fast rollback capabilities. These practices all compress the time between code complete and production. None of them makes code faster to write; all of them reduce the accumulation in the pipeline downstream of writing.

A team with a 48-hour code review cycle and a 30-minute CI pipeline is not constrained by how fast engineers type. A tool that makes engineers generate code twice as fast feeds work into that 48-hour wait at twice the rate.
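Little’s law (average items in a system = arrival rate × average time in the system) makes the same point numerically. The sketch below assumes, optimistically, that the review cycle stays at 48 hours when arrivals double; the arrival rates are hypothetical.

```python
def prs_in_review(arrivals_per_day: float, review_cycle_hours: float) -> float:
    """Average number of PRs waiting in review, by Little's law (L = lambda * W)."""
    return arrivals_per_day * (review_cycle_hours / 24.0)


if __name__ == "__main__":
    cycle_hours = 48.0  # review cycle from the paragraph above
    for arrivals in (5.0, 10.0):  # hypothetical PRs opened per day
        waiting = prs_in_review(arrivals, cycle_hours)
        print(f"{arrivals:.0f} PRs/day -> about {waiting:.0f} PRs sitting in review")
```

Doubling the arrival rate doubles the pile of work in review even in this best case. If the extra load also lengthens the review cycle, the pile grows faster still.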

What Actually Happens at the Review Stage

Even setting aside the queue dynamics, the review stage has its own quality problems that are worth naming. SmartBear’s study on peer code review found that reviewing faster than roughly 500 lines per hour produces dramatically worse defect detection. Reviewers working at speed focus on surface concerns because the cognitive overhead of understanding deep logic across a large diff is too high. Google’s ICSE 2018 study by Sadowski et al. found that in practice, most review comments concern code style, naming, and minor refactoring rather than logic errors or production-bound bugs.
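As a rough sketch of what the 500-lines-per-hour ceiling implies for reviewer time, the function below converts a weekly stream of pull requests into the minimum review hours needed to stay under that rate. The PR counts and the 400-line diff size are hypothetical.

```python
def min_review_hours(prs_per_week: int, lines_per_pr: int, max_lines_per_hour: float = 500.0) -> float:
    """Minimum weekly review hours needed to stay at or below the given review rate."""
    return prs_per_week * lines_per_pr / max_lines_per_hour


if __name__ == "__main__":
    lines_per_pr = 400  # hypothetical average diff size
    for prs in (2, 8):
        hours = min_review_hours(prs, lines_per_pr)
        print(f"{prs} PRs/week of {lines_per_pr} lines -> at least {hours:.1f} review hours")
```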

A second reviewer might catch around 20% of what the first reviewer missed, with declining returns for each additional reviewer added. The bureaucratic cost does not decline in the same proportion.

If AI tools increase the rate at which code arrives for review without improving its architectural quality or fit, reviewers face more surface area and more demand for the same cognitive work they were already doing. The review gets harder, not easier.

The Diagnostic Value

There is one genuinely useful thing AI coding tools do for teams stuck in this situation, even when they do not improve delivery speed: they make the actual constraint visible and undeniable.

When coding was slow, slow shipping could be plausibly attributed to slow coding. Now that the code takes half a day rather than three days, the three-day PR queue is exposed as the real problem. Teams that adopt AI tools and see no improvement in deployment frequency are learning something important about where their constraint actually is. That is valuable information, even when it is uncomfortable.

The organizations that see genuine improvement from AI tools tend to share a few characteristics: they already had fast review cycles, they used trunk-based development or very short-lived branches, their CI pipelines ran quickly, and their deployment processes were automated and low-friction. In those environments, removing friction from code writing actually propagates through to the delivery metric, because writing was genuinely close to the constraint.

For teams where the pipeline was the bottleneck before, faster writing just moves pressure onto the part of the system that was already struggling.

Where to Start

Measurement is the only honest starting point. Most teams do not have precise data on where time accumulates between code complete and production deploy. If you instrument your pipeline and find that lead time averages four days, and code review accounts for three of them, you have identified your constraint. Investing in better tooling for reviewers, reducing required approval chains, adopting feature flags to make changes smaller, or building automated checks that reduce the cognitive load on reviewers will all move the needle in ways that are directly attributable to the measurement.
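One way to take that measurement is to record a timestamp at each handoff and average the gaps. The sketch below is a minimal illustration under stated assumptions: the event names (`pr_opened`, `pr_approved`, `merged`, `deployed`) are hypothetical and would need mapping to whatever your tooling actually records, and the sample data is made up to mirror the four-day example above.

```python
from datetime import datetime
from statistics import mean

# Hypothetical stage boundaries: (stage name, start event, end event).
STAGES = [
    ("review", "pr_opened", "pr_approved"),
    ("merge", "pr_approved", "merged"),
    ("deploy", "merged", "deployed"),
]


def stage_hours(changes: list[dict]) -> dict[str, float]:
    """Mean hours spent in each stage across a list of change records."""
    result = {}
    for name, start_key, end_key in STAGES:
        durations = [
            (change[end_key] - change[start_key]).total_seconds() / 3600.0
            for change in changes
        ]
        result[name] = mean(durations)
    return result


def constraint(changes: list[dict]) -> str:
    """Name of the stage where changes spend the most time on average."""
    hours = stage_hours(changes)
    return max(hours, key=hours.get)


# Made-up sample records, for illustration only.
changes = [
    {"pr_opened": datetime(2024, 1, 1, 9), "pr_approved": datetime(2024, 1, 4, 9),
     "merged": datetime(2024, 1, 4, 10), "deployed": datetime(2024, 1, 5, 9)},
    {"pr_opened": datetime(2024, 1, 8, 9), "pr_approved": datetime(2024, 1, 11, 9),
     "merged": datetime(2024, 1, 11, 11), "deployed": datetime(2024, 1, 12, 9)},
]

if __name__ == "__main__":
    for stage, hours in stage_hours(changes).items():
        print(f"{stage}: {hours:.1f} hours on average")
    print("largest stage:", constraint(changes))
```

Averages hide long tails, so a fuller version would likely report medians or percentiles as well; the mean is used here only to keep the sketch short.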

The DORA Quick Check takes about five minutes and gives a reasonable starting point for locating where a team sits on the performance distribution and which practices are likely to produce improvement. It does not ask about which AI coding tools you use.

Faster code writing is not a bad thing. It is a good thing in the right context. The context where it matters most is a team that already has a fast, automated, low-friction path from commit to production. For everyone else, the bigger problem was always somewhere else in the pipeline, and the honest engineering work is finding it.