The Three Problems AI Coding Tools Don't Touch

Source: lobsters

Andrew Murphy’s post arguing that code-writing speed was never your real problem landed with enough force on Hacker News to generate 205 comments, which means a lot of developers recognized something true in it. The argument is correct. But I think the framing of “a single bigger problem” understates the situation. There are at least three distinct constraints that sit downstream of writing speed, each one large enough to absorb all the velocity gains AI tools can provide, and most teams are constrained by more than one of them simultaneously.

Writing code is not the hard part

Developers spend roughly 30-35% of their time on tasks that directly produce code. A 2021 McKinsey survey put it at about 35%; the Stack Overflow Developer Survey has landed in a similar range for years. The rest goes to meetings, requirements gathering, reviewing code, waiting on CI pipelines, resolving review comments, deployment coordination, and context switching.

GitHub’s research on Copilot measured a 55% reduction in task completion time for isolated coding exercises. Apply that to the 35% fraction, and you get roughly a 19% improvement on total working time, under ideal conditions, assuming no second-order effects. The 65% is untouched. If any part of that 65% is the binding constraint, the delivery time for features barely moves.
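The arithmetic above can be sketched in a few lines. This is a minimal illustration, assuming the 35% coding share and 55% task-time reduction quoted above apply uniformly; the function names are hypothetical and chosen only for this example.

```python
def total_time_saved(coding_fraction: float, coding_time_reduction: float) -> float:
    """Fraction of total working time saved when only the coding share gets faster."""
    return coding_fraction * coding_time_reduction


def overall_speedup(coding_fraction: float, coding_time_reduction: float) -> float:
    """Whole-workload speedup (Amdahl-style): old total time divided by new total time."""
    remaining_time = 1.0 - total_time_saved(coding_fraction, coding_time_reduction)
    return 1.0 / remaining_time


if __name__ == "__main__":
    # Figures from the text: 35% of time spent coding, 55% faster on that part.
    saved = total_time_saved(0.35, 0.55)
    speedup = overall_speedup(0.35, 0.55)
    print(f"share of total time saved: {saved:.2%}")
    print(f"overall speedup factor:    {speedup:.2f}x")
```

Even if the coding portion dropped to zero time, the saving in this model could never exceed the 35% share itself.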

I build Discord bots. When I use AI tools for the mechanical parts (boilerplate command handlers, embed formatting, permission scaffolding, event listener wiring), the work genuinely goes faster and is less tedious. But I am essentially a solo developer on these projects, so the constraints look different for me than for a team. Three things slow me down: not knowing exactly what a command should do in an ambiguous case, testing behavior that depends on Discord’s API in ways that are hard to simulate locally, and making architectural calls about state management across shards. AI tools have changed none of these.

For teams, those same friction sources scale up and interact with one another, and three of them deserve specific attention.

Constraint one: requirements clarity

Code review is commonly identified as the binding constraint after writing speed, and the queuing theory supports that view. But upstream of review is the question of whether what’s being reviewed was the right thing to build. A pull request for a feature specified ambiguously is not reviewable in any useful sense. Reviewers can check that the code does what the spec says, but if the spec was wrong, the review passes a feature that should have been rejected at the requirements stage.

Requirements clarity is the hardest constraint to see because it manifests as rework, not wait time. A PR gets merged, the feature ships, and then someone notices it doesn’t match what was actually needed. The cost shows up as a second ticket, a rollback, a hotfix, or an expanding scope on the next feature. None of this appears in review latency metrics or deployment frequency.

AI tools make this constraint more acute. A developer using Copilot or Cursor can implement a feature specification at roughly twice the historical rate. If the specification was imprecise, they can implement the imprecise specification twice as fast, leading to twice as many review cycles where reviewers catch that the implementation is technically correct but functionally wrong. Faster writing with unclear requirements produces a higher volume of correct-but-wrong code.

The DORA research identifies this indirectly. Unclear requirements are consistently among the top-reported blockers across the annual State of DevOps survey, alongside technical debt and slow review cycles. The problem predates AI acceleration and will outlast it.

Constraint two: review bandwidth

This is the constraint most commonly named, and the queuing math behind it is genuinely important. An M/M/1 queue at 80% utilization produces an average wait in queue four times longer than the service time. Chain reviewers in sequence and the delay stacks up: in the simplest tandem model, two stages at 80% each add 4 + 4 = 8 service times of pure waiting, and three add 12, before counting any PR that is sent back and has to rejoin the queue. The multiplier is also steeply nonlinear in utilization: at about 91%, a single stage already waits 10 service times, which is the territory of Avery Pennarun’s observation that each review layer makes you roughly 10x slower.
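The single-queue figure can be checked with the standard M/M/1 result that the mean wait in queue is ρ / (1 − ρ) service times, where ρ is utilization. The sketch below assumes a single reviewer with random (Poisson) arrivals and exponential review times, which real review queues only approximate; the function names are hypothetical.

```python
def mm1_wait_multiplier(utilization: float) -> float:
    """Mean time waiting in an M/M/1 queue, expressed in multiples of the mean service time."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    return utilization / (1.0 - utilization)


def utilization_for_multiplier(multiplier: float) -> float:
    """Inverse of the above: the utilization at which the wait reaches a given multiple."""
    if multiplier < 0.0:
        raise ValueError("multiplier must be non-negative")
    return multiplier / (1.0 + multiplier)


if __name__ == "__main__":
    for rho in (0.5, 0.8, 0.9, 0.95):
        wait = mm1_wait_multiplier(rho)
        print(f"{rho:.0%} utilized: wait is {wait:.1f}x the service time")
    print(f"a 10x wait needs about {utilization_for_multiplier(10.0):.0%} utilization")
```

The steepness of the curve is the point: moving a reviewer from 80% to 90% busy more than doubles the wait.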

When AI tools increase per-developer PR production rate without a corresponding increase in review capacity, Little’s Law is precise about what happens: average work in progress equals throughput times average cycle time. If PRs arrive faster than reviewers can clear them, open PRs accumulate, and with reviewer throughput fixed, cycle time rises in direct proportion to work in progress: double the open PRs and the average cycle time doubles. The delivery of any individual feature takes longer, not shorter, despite the code having been written faster.
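Little’s Law can be rearranged to estimate cycle time from two numbers most teams already have. This is a minimal sketch, assuming long-run averages over a reasonably steady period; the function name and the example figures are hypothetical.

```python
def average_cycle_time_days(open_prs: float, merged_per_day: float) -> float:
    """Little's Law rearranged: cycle time = work in progress / throughput."""
    if merged_per_day <= 0:
        raise ValueError("throughput must be positive")
    return open_prs / merged_per_day


if __name__ == "__main__":
    # Hypothetical team: reviewers clear 5 PRs per day in both scenarios.
    before = average_cycle_time_days(open_prs=10, merged_per_day=5)
    after = average_cycle_time_days(open_prs=20, merged_per_day=5)
    print(f"10 open PRs: {before:.1f} days per PR")
    print(f"20 open PRs: {after:.1f} days per PR")
```

With throughput held at the reviewers’ capacity, the only lever that shortens cycle time in this model is reducing the number of PRs in flight.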

Review bandwidth is constrained by both time and attention. A reviewer at 80% utilization is not just busy; they are cognitively loaded. Code review done well requires following logic, checking edge cases, and evaluating design decisions. SmartBear’s study of code review at Cisco found that inspection rates above roughly 500 lines per hour dramatically reduce defect detection. A reviewer working through AI-generated PRs that arrive faster than before is either doing shallower reviews or falling behind. Neither outcome improves delivery.

Constraint three: deployment confidence

The third constraint rarely gets named as such. Teams that lack confidence in their deployment process compensate by batching changes, adding approval layers, requiring manual verification steps, and reducing deployment frequency. The psychological mechanism is straightforward: when deploying is risky, you deploy less, which means larger batches, which means higher per-deployment risk, which means deploying is even riskier.

The 2024 DORA State of DevOps Report found that elite performing teams, those deploying multiple times per day, had change failure rates below 5%. Low-performing teams, those deploying monthly or less, had higher change failure rates. The direction of causality is counterintuitive but well-supported: more frequent deployment with smaller changes produces lower failure rates, not higher ones. The mechanism is batch size; smaller changes are easier to test, review, and roll back.
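The batch-size mechanism can be illustrated with a toy model. This is not taken from the DORA report: it assumes each change independently has a small, fixed chance of causing a failure, so a deployment fails if any change in it does. The 2% per-change figure and the function name are hypothetical.

```python
def batch_failure_probability(per_change_failure: float, changes_in_batch: int) -> float:
    """Chance that a deployment fails, assuming each change fails independently."""
    if not 0.0 <= per_change_failure <= 1.0:
        raise ValueError("per_change_failure must be a probability")
    if changes_in_batch < 0:
        raise ValueError("changes_in_batch must be non-negative")
    return 1.0 - (1.0 - per_change_failure) ** changes_in_batch


if __name__ == "__main__":
    # Hypothetical: every change carries a 2% chance of breaking something.
    for batch_size in (1, 5, 20, 50):
        risk = batch_failure_probability(0.02, batch_size)
        print(f"{batch_size:>2} changes per deploy: {risk:.0%} chance the deploy fails")
```

The model leaves out the second half of the argument, that a failed large batch is also harder to diagnose and roll back, so if anything it understates the cost of batching.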

Automated testing coverage is the primary driver of deployment confidence. Teams with high coverage can merge code with confidence that the test suite will catch regressions. Teams without it rely on human review and manual verification, which are slow, expensive, and less reliable. AI-generated code does not come with tests unless you specifically generate them, and recent METR research found that many patches passing automated benchmarks would not be merged into their target projects, largely because correctness on test cases and correctness in context are different properties.

What the adoption results tell you

The practical value of this framing is diagnostic. A team that adopts AI coding tools and observes no improvement in deployment frequency or lead time for changes has, at no additional cost, identified where their real constraint lives. If review queue length grows, review bandwidth is the constraint. If rework rates increase, requirements clarity is the constraint. If deployment frequency remains flat despite faster code production, deployment confidence is the constraint.
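The three diagnostic rules above can be written down as a small check over before-and-after metrics. This is a sketch only: the field names, the 5% tolerance, and the example numbers are hypothetical, and a real team would need to decide how to measure each signal.

```python
from dataclasses import dataclass


@dataclass
class DeliveryMetrics:
    review_queue_length: float  # average number of PRs awaiting review
    rework_rate: float          # share of shipped work that needed a follow-up fix
    deploys_per_week: float     # deployment frequency


def diagnose(before: DeliveryMetrics, after: DeliveryMetrics, tolerance: float = 0.05) -> list[str]:
    """Name the constraints suggested by metrics taken before and after AI tool adoption."""
    grew = 1.0 + tolerance  # changes smaller than the tolerance count as "flat"
    constraints = []
    if after.review_queue_length > before.review_queue_length * grew:
        constraints.append("review bandwidth")
    if after.rework_rate > before.rework_rate * grew:
        constraints.append("requirements clarity")
    if after.deploys_per_week <= before.deploys_per_week * grew:
        constraints.append("deployment confidence")
    return constraints


if __name__ == "__main__":
    before = DeliveryMetrics(review_queue_length=10, rework_rate=0.10, deploys_per_week=5)
    after = DeliveryMetrics(review_queue_length=20, rework_rate=0.10, deploys_per_week=5)
    print(diagnose(before, after))
```

The function returns a list, not a single answer, because more than one constraint can bind at once.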

This is Goldratt’s Theory of Constraints running as an empirical test. Improving a non-constraint does not improve throughput; it pressurizes the actual constraint until the bottleneck becomes undeniable. Teams that adopted Copilot expecting faster delivery and saw review queues lengthen did not misuse the tool. They found where the constraint was. The tool worked as a diagnostic.

The teams that extract genuine delivery improvement from AI acceleration are the ones that treat the adoption as a reason to invest in the full pipeline simultaneously. Automated test coverage to build deployment confidence. Smaller PRs with clear scope to reduce review load per unit. Trunk-based development to avoid integration delays. Clear specification practices before any code, AI-generated or otherwise, gets written. The Accelerate research by Forsgren, Humble, and Kim documented these practices in 2018 as the primary predictors of software delivery performance; they remain so now, with AI tools having changed nothing about that.

Writing code faster was never the problem. The problems were always sitting in the other 65%.
