
Solve the Code-Writing Problem and Inherit a Different One

Source: hackernews

There is a pattern that shows up whenever a manufacturing process gets optimized at one stage without examining the whole system. The improved stage produces faster, the queue in front of the next stage grows, and total throughput stays flat or degrades. Eliyahu Goldratt formalized this in The Goal under the name Theory of Constraints: a chain’s capacity is set by its weakest link, and improving any other link yields nothing.

Software teams are subject to the same dynamics. Andrew Murphy’s recent post on the subject puts it plainly: if code writing speed felt like your bottleneck, you were already working in an unusual situation. For most teams, the constraint lives downstream of the editor.
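The constraint argument can be made concrete with a toy model. The sketch below assumes a strictly serial delivery pipeline whose steady-state throughput is capped by its slowest stage; the stage names and weekly capacities are hypothetical numbers chosen for illustration, not measurements.

```python
# Toy Theory of Constraints model: a serial pipeline delivers no faster
# than its slowest stage. All capacities are hypothetical, in pull requests per week.

def pipeline_throughput(capacities):
    """Return the steady-state throughput of a serial pipeline."""
    return min(capacities.values())

def bottleneck(capacities):
    """Return the name of the stage that limits throughput."""
    return min(capacities, key=capacities.get)

stages = {"write": 20, "review": 8, "ci": 30, "deploy": 12}

# Doubling code-writing capacity leaves the limiting stage untouched.
faster_writing = {**stages, "write": 40}

# Raising review capacity moves the constraint to the next-slowest stage.
faster_review = {**stages, "review": 16}

before = pipeline_throughput(stages)                   # 8, limited by review
after_writing = pipeline_throughput(faster_writing)    # still 8
after_review = pipeline_throughput(faster_review)      # 12, now limited by deploy
```

In this model, only the change to the review stage moves the delivered number, and it immediately exposes deployment as the next constraint.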

Where Time Actually Goes

The DORA State of DevOps research, running since 2014 and now maintained by Google, tracks four key metrics for software delivery: deployment frequency, lead time for changes, change failure rate, and time to restore service. The findings are consistent across years and sample sizes in the tens of thousands. Elite teams deploy on demand, multiple times per day. Low performers deploy monthly or less. The gap between them is not explained by how fast engineers write code.

Lead time, the elapsed time from a commit to that commit running in production, is where the texture of the problem shows up. On high-performing teams it is measured in hours. On low-performing ones it is measured in weeks. That spread does not come from typing speed. It comes from review latency, CI pipeline duration, deployment approval chains, environment availability, and the time code spends sitting in queues waiting for human attention.
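As a rough sketch of how lead time might be computed, assume each change is recorded as a pair of timestamps: when it was committed and when it reached production. The record shape and the sample values below are hypothetical; a real team would pull them from version control and deployment logs.

```python
from datetime import datetime
from statistics import median

def lead_times_hours(changes):
    """Hours from commit to production for each (committed, deployed) pair."""
    return [
        (deployed - committed).total_seconds() / 3600
        for committed, deployed in changes
    ]

# Hypothetical sample: (commit timestamp, production deploy timestamp).
changes = [
    (datetime(2024, 3, 4, 9, 0), datetime(2024, 3, 4, 13, 0)),    # 4 hours
    (datetime(2024, 3, 4, 10, 0), datetime(2024, 3, 6, 10, 0)),   # 48 hours
    (datetime(2024, 3, 5, 15, 0), datetime(2024, 3, 5, 21, 0)),   # 6 hours
]

hours = lead_times_hours(changes)
median_lead_time = median(hours)  # 6.0
```

The median is used here rather than the mean so that one change stuck in a queue for days does not hide how long a typical change takes.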

The research in Accelerate by Nicole Forsgren, Jez Humble, and Gene Kim argues that the relationship is predictive: deployment frequency and lead time are leading indicators of organizational performance, including commercial outcomes. The proposed mechanism is feedback speed. Shorter cycles mean defects surface sooner, experiments conclude faster, and engineers spend less time maintaining mental context across long-lived branches.

The Queue Problem Is Multiplicative

Little’s Law from queueing theory states that the average number of items in a queue equals the average arrival rate multiplied by the average time each item spends waiting. In a code review process, this means that if engineers open pull requests faster without a corresponding increase in review throughput, the queue grows at least proportionally, and it grows without bound once arrivals exceed what reviewers can clear. If each pull request must pass two sequential reviews, the expected total wait is the sum of the two stage waits, but each stage’s wait rises nonlinearly as its reviewers approach full utilization, and context-switching overhead on the reviewers adds to it.
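A small worked example shows how quickly a review queue reacts to a higher arrival rate. Little’s Law itself holds for any stable queue. The wait-time formula below is an added assumption: it treats the review stage as an M/M/1 queue (random arrivals, a single pooled reviewer, exponential service times), which is a simplification. The rates are hypothetical, in pull requests per day.

```python
def queue_length(arrival_rate, avg_time_in_system):
    """Little's Law: L = lambda * W."""
    return arrival_rate * avg_time_in_system

def mm1_time_in_system(arrival_rate, service_rate):
    """Average time in system for an M/M/1 queue: W = 1 / (mu - lambda).

    Assumption for illustration only; real review queues are not M/M/1.
    """
    if arrival_rate >= service_rate:
        return float("inf")  # the queue never stabilizes
    return 1.0 / (service_rate - arrival_rate)

REVIEW_CAPACITY = 6.0  # hypothetical: pull requests reviewed per day

results = {}
for arrivals in (5.0, 5.5, 5.75):
    wait_days = mm1_time_in_system(arrivals, REVIEW_CAPACITY)
    results[arrivals] = (wait_days, queue_length(arrivals, wait_days))

# results[5.0]  -> (1.0 day,  5.0 PRs in the system)
# results[5.5]  -> (2.0 days, 11.0 PRs in the system)
# results[5.75] -> (4.0 days, 23.0 PRs in the system)
```

Under these assumptions, a 15 percent rise in arrivals (from 5 to 5.75 per day) quadruples the wait, because the reviewers were already close to full utilization.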

This is the irony that AI coding tools currently produce. Tools like GitHub Copilot, Cursor, and Claude in agentic mode have meaningfully increased the rate at which developers can produce working code. That production rate feeds directly into the pull request queue. Reviewers, whose throughput is constrained by cognitive load and meeting schedules rather than typing speed, face more to review. The constraint tightens.

Some teams paper over this by lowering review standards. That shrinks the queue temporarily, but the lost review quality shows up later as a higher change failure rate, more incidents, and more time spent on recovery. The DORA metrics capture this trade-off: teams that optimize for speed while neglecting change failure rate tend to regress on both dimensions over time.

What High Performers Actually Do Differently

Trunk-based development is one of the practices that most reliably separates high performers from low performers in the DORA data. In trunk-based development, engineers commit to a single shared branch multiple times per day rather than maintaining long-lived feature branches. This keeps pull requests small, review latency low, and integration conflicts rare. The practice sounds simple, but making it safe requires significant investment in automated testing, feature flags, and deployment infrastructure.

Automated testing coverage is the other major lever. When a CI pipeline can run a comprehensive test suite in under ten minutes with high confidence in the results, the human review burden shifts from correctness verification to design and intent. That shift reduces review time and makes reviews higher value. Without it, reviewers must either spend longer verifying correctness manually or accept unknown risk.

Psychological safety, which Forsgren’s research identifies as a predictor of organizational performance, affects how quickly reviews happen and whether defects get surfaced promptly. Teams where raising problems is treated as evidence of incompetence accumulate hidden defects. Teams where surfacing problems is normal and low-stakes catch defects earlier and fix them faster. This has nothing to do with how fast anyone types.

The Measurement Problem

Most engineering teams do not measure lead time. They measure story points, sprint velocity, or lines of code. These metrics are easy to collect and easy to optimize in ways that look good on a dashboard without improving delivery throughput. A team that inflates story point estimates while slowing deployment frequency is, by the DORA definition, moving backward.

Lead time and deployment frequency are harder to game because they measure outcomes visible to users, not activity internal to the team. A feature is not delivered until it is in production. A bug fix does not reduce customer exposure until it is deployed. These metrics force a team to confront the full delivery pipeline, including all the stages that have nothing to do with writing code.
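Deployment frequency and change failure rate can be derived from little more than a deployment log. The sketch below assumes a hypothetical log of (date, caused_failure) entries; how a team decides that a deployment "caused a failure" is a local definition and is not specified here.

```python
from datetime import date

# Hypothetical deployment log: (deploy date, whether the change caused a production failure).
deployments = [
    (date(2024, 3, 4), False),
    (date(2024, 3, 5), False),
    (date(2024, 3, 7), True),
    (date(2024, 3, 12), False),
    (date(2024, 3, 14), False),
]

def deployment_frequency_per_week(log):
    """Deployments per week over the span covered by the log (inclusive of both ends)."""
    dates = [deployed_on for deployed_on, _ in log]
    span_days = (max(dates) - min(dates)).days + 1
    return len(log) / (span_days / 7)

def change_failure_rate(log):
    """Fraction of deployments that caused a failure in production."""
    failures = sum(1 for _, failed in log if failed)
    return failures / len(log)

frequency = deployment_frequency_per_week(deployments)  # 5 deploys over 11 days
failure_rate = change_failure_rate(deployments)         # 1 failure in 5 deploys
```

Neither number depends on how much code was written, only on what reached production and whether it held up there.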

The SPACE framework, proposed by Forsgren and others in ACM Queue, expands the measurement surface to five dimensions: satisfaction and well-being, performance, activity, communication and collaboration, and efficiency and flow. It explicitly pushes back against single-axis metrics like commit frequency or lines of code, arguing that developer productivity is a multidimensional phenomenon. Activity metrics, which most teams default to, measure only one narrow slice.

The Actual Problem Space

The post-AI-tooling landscape for most teams looks like this: the inner loop, meaning the time between starting to write code and having a working implementation, has compressed significantly. The outer loop, meaning the time between a working implementation and users benefiting from it, has not changed much. Requirements churn before code is written, review queues after code is written, deployment approvals, environment constraints, and stakeholder sign-offs all operate on timescales that have nothing to do with typing speed.

The teams that will see the most benefit from AI coding tools are the ones that have already invested in the outer loop: short-lived branches, fast CI, automated deployment, small pull requests, high reviewer availability. For those teams, accelerating the inner loop has a compounding effect because the constraint is now genuinely in the inner loop. For teams that have not made those investments, faster code generation moves the queue from a manageable size to an overwhelming one.

Fixing the inner loop first is the wrong sequence. The right sequence is to identify your actual constraint using real delivery metrics, fix that, and then reassess. If code generation speed genuinely shows up as the bottleneck after the outer loop is healthy, the AI tools are waiting. But most teams are not at that stage, and no amount of autocomplete will get them there.
