Your Velocity Metric Is Watching the Wrong Stage of the Pipeline

Source: hackernews

Teams are buying AI coding tools to solve a problem that, in most cases, was never what limited them. Code gets written faster. Reviews pile up at the same rate. The pipeline clears on the same schedule. Deployment happens on Thursdays, or whenever someone remembers to push the button, or after a change advisory board convenes.

Andrew Murphy’s recent post makes this observation directly. The more interesting question is why teams kept telling themselves that writing code was the bottleneck, and what a decade of DevOps research shows about where the constraint actually sits.

What DORA Has Been Measuring Since 2014

The DORA (DevOps Research and Assessment) program has tracked software delivery performance since 2014 and publishes the annual State of DevOps report. Their four key metrics are deployment frequency, lead time for changes, change failure rate, and time to restore service. Elite-performing teams are not faster at writing code than low-performing ones. They differ sharply on lead time.

Lead time for changes measures the elapsed time from a commit landing in version control to that commit running in production. The 2023 State of DevOps report put elite performers’ lead time at under one day, and some earlier editions put it under one hour. Low performers measure theirs in weeks or months. That difference is not explained by typing speed or AI autocomplete. It is explained almost entirely by what happens after the code is written.

Lead time decomposes into concrete stages: time for a PR to get picked up for review, time for review to complete, time for CI to execute, time for approval gates to clear, time for the deployment to run, and time for post-deploy verification. Writing the code is one input at the very beginning of this sequence. It is rarely the slow stage.
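The decomposition above can be sketched in a few lines of Python. This is a minimal illustration, not a real integration: the event names and timestamps are invented, and in practice they would come from your version control, CI, and deployment systems.

```python
from datetime import datetime

# Hypothetical timestamps for one change, listed in chronological order.
# The event names are illustrative, not taken from any particular tool.
events = {
    "committed":       datetime(2024, 3, 4, 9, 0),
    "review_started":  datetime(2024, 3, 5, 14, 0),
    "review_approved": datetime(2024, 3, 6, 11, 0),
    "ci_passed":       datetime(2024, 3, 6, 11, 30),
    "gate_cleared":    datetime(2024, 3, 7, 10, 0),
    "deployed":        datetime(2024, 3, 7, 10, 20),
    "verified":        datetime(2024, 3, 7, 11, 0),
}

def stage_durations(events):
    """Return the hours elapsed between each consecutive pair of events."""
    names = list(events)
    return {
        f"{a} -> {b}": (events[b] - events[a]).total_seconds() / 3600
        for a, b in zip(names, names[1:])
    }

durations = stage_durations(events)
lead_time_hours = sum(durations.values())          # total commit-to-verified time
slowest_stage = max(durations, key=durations.get)  # where the change waited longest

for stage, hours in durations.items():
    print(f"{stage:35s} {hours:6.1f} h")
print(f"lead time: {lead_time_hours:.1f} h, slowest stage: {slowest_stage}")
```

In this made-up example the change spends 74 hours in the pipeline, and the longest single wait is the gap between the commit and the start of review. Running CI took half an hour.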

Little’s Law and the Queue That Keeps Growing

There is a reason software delivery is modeled using operations research tools. A software delivery pipeline behaves like a queue, and Little’s Law describes what happens to queues when you change their inputs without changing their processing capacity.

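Little’s Law: L = λW. The average number of items in a system equals the arrival rate multiplied by the average time each item spends in the system. The law holds for a stable system, one where work leaves as fast as it arrives, and review capacity caps how fast work can leave. If you increase λ (more PRs per day, because developers write code faster) with the same reviewers and the same review bandwidth, arrivals move closer to that cap, and in any queue with uneven arrivals the number of PRs waiting, L, grows faster than λ does. Since W = L/λ, the average time each PR spends in the system rises too. Push λ past review capacity and the system is no longer stable: the queue grows every day until something else gives.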
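A toy simulation makes the queue growth concrete. It is a deliberately simplified sketch: arrivals and review capacity are fixed daily numbers chosen for illustration, with none of the variability a real team sees.

```python
def simulate_review_queue(arrivals_per_day, reviews_per_day, days):
    """Deterministic toy model of a review queue.

    Each day a fixed number of PRs arrive and reviewers clear at most
    `reviews_per_day` of them. Returns the end-of-day backlog for each day.
    """
    backlog = 0
    history = []
    for _ in range(days):
        backlog += arrivals_per_day
        backlog -= min(backlog, reviews_per_day)
        history.append(backlog)
    return history

REVIEWS_PER_DAY = 10  # assumed, fixed review capacity

# Before: 8 PRs/day arrive, comfortably under capacity.
before = simulate_review_queue(8, REVIEWS_PER_DAY, days=20)
# After: developers write faster, 12 PRs/day arrive, same reviewers.
after = simulate_review_queue(12, REVIEWS_PER_DAY, days=20)

# Rough wait for a PR that joins the back of the queue on the last day.
wait_days_after = after[-1] / REVIEWS_PER_DAY

print("backlog before:", before[-1])
print("backlog after: ", after[-1])
print("approx. wait after (days):", wait_days_after)
```

With arrivals under capacity the backlog stays at zero. With arrivals two PRs per day over capacity, the backlog grows by two every day, reaching 40 PRs after 20 days, which is about four working days of review sitting in the queue.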

This is not a metaphor. When a team adopts AI coding tools and sees PR volume climb without adding review capacity, they are running this experiment live. Code arrives faster; the queue grows longer; lead time gets worse.

Eliyahu Goldratt’s Theory of Constraints makes the same point from a manufacturing perspective, and Gene Kim and his co-authors applied it directly to software delivery in The Phoenix Project. Every system has one constraint that determines its overall throughput. Improving a non-constraint activity does not improve throughput; it just changes where work accumulates. If review is the constraint, faster code writing is not an improvement to the system. It is acceleration toward a wall.
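The constraint argument reduces to a one-line calculation: in a serial pipeline, steady-state throughput is the minimum of the stage capacities. The sketch below uses invented weekly capacities purely to show the shape of the effect.

```python
# Hypothetical capacities, in changes per week, for each stage of a pipeline.
stage_capacity = {"write": 30, "review": 12, "ci": 60, "deploy": 25}

def pipeline_throughput(capacity):
    """Return (constraint stage, throughput) for a serial pipeline.

    Steady-state throughput cannot exceed the slowest stage's capacity.
    """
    constraint = min(capacity, key=capacity.get)
    return constraint, capacity[constraint]

constraint, throughput = pipeline_throughput(stage_capacity)

# Doubling a non-constraint stage (writing) leaves throughput unchanged.
faster_writing = {**stage_capacity, "write": 60}
_, throughput_faster_writing = pipeline_throughput(faster_writing)

# Raising the constraint (review) is what moves the number.
more_review = {**stage_capacity, "review": 20}
new_constraint, throughput_more_review = pipeline_throughput(more_review)

print(constraint, throughput)                  # the baseline constraint
print(throughput_faster_writing)               # unchanged by faster writing
print(new_constraint, throughput_more_review)  # improved by more review capacity
```

With these numbers, doubling writing capacity from 30 to 60 leaves throughput at 12 changes per week. Raising review capacity from 12 to 20 lifts throughput to 20, and review is still the constraint.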

Where Developer Time Actually Goes

A 2023 benchmark report from LinearB found that engineers spend roughly 13% of their working hours writing new code. Review, context switching, meetings, and waiting for feedback account for the majority of the rest. A study from DX found that uninterrupted flow time was the strongest single predictor of both developer productivity and satisfaction, and the primary enemy of flow time was context switching between active work and reviewing others’ code.

These numbers vary by team and codebase, but the direction is consistent across multiple sources. Writing code is a small fraction of the total time a change spends moving from idea to production. Optimizing that fraction improves the experience of writing code; it does not meaningfully improve delivery throughput.

AI-Generated Code Creates a Harder Review Problem

There is a specific way AI tools worsen the review bottleneck rather than simply leaving it unchanged. Code written with AI assistance tends to be more verbose and less idiomatic than code written by someone who fully understands the context. Reviewers have more to read, can take less for granted, and have to verify more carefully.

A GitClear analysis tracking the impact of AI coding assistants on code quality found that code churn, meaning code that is written and then substantially revised or reverted within a short window, increased alongside total code output. More code arriving at review, with higher churn rates, means each review cycle takes longer and more often generates multiple back-and-forth iterations.

For open-source projects, this effect is already visible at the maintainer level. Projects are reporting higher PR volume with lower average quality per PR. Maintainer review bandwidth is inelastic; it is constrained by the number of people who understand the codebase deeply enough to review changes responsibly. Adding PR volume to that system without adding maintainer capacity degrades average review latency for everyone.

What the Actual Interventions Look Like

If review throughput is the constraint, the interventions that improve delivery performance are the ones that increase review capacity or reduce review burden per change, not the ones that increase the rate at which PRs arrive.

On the tooling side: automated static analysis that runs before human review begins, test coverage gates, better diff visualization, and code ownership routing that assigns PRs to the right reviewer immediately rather than leaving them to sit unassigned. Tools like GitHub’s pull request summaries and automated review assistants are at least attempting to address the right stage of the pipeline.

On the process side: explicit work-in-progress limits on the review queue, review sessions scheduled into sprints as first-class work rather than treated as interruptions to real work, and lead time tracked as a primary engineering metric alongside deployment frequency. Teams that measure what they are trying to optimize tend to improve it.

On the architectural side: smaller, more frequent changes reduce review complexity per unit of work. Trunk-based development, feature flags, and continuous delivery practices all make individual changes easier to review by reducing their scope and blast radius. A PR that touches 40 lines across 3 files gets reviewed faster and more accurately than a PR that touches 400 lines across 18 files, regardless of how the code was written.
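Of the tooling interventions listed above, ownership routing is the easiest to sketch. The example below is an assumption-laden illustration: the team names and patterns are invented, matching uses Python’s standard `fnmatch` rather than any hosting platform’s real pattern syntax, and the "last matching rule wins" convention is borrowed only loosely from CODEOWNERS-style files.

```python
from fnmatch import fnmatch

# Hypothetical ownership rules, checked in order; the last match wins.
OWNERS = [
    ("*", "platform-team"),
    ("services/billing/*", "billing-team"),
    ("*.sql", "data-team"),
]

def owner_for(path, rules=OWNERS):
    """Return the owning team for one path, or None if no rule matches."""
    owner = None
    for pattern, team in rules:
        # fnmatch's "*" also matches "/" so patterns apply at any depth.
        if fnmatch(path, pattern):
            owner = team
    return owner

def route_reviewers(changed_files, rules=OWNERS):
    """Return the sorted set of teams that should review a PR."""
    teams = {owner_for(path, rules) for path in changed_files}
    teams.discard(None)
    return sorted(teams)

reviewers = route_reviewers([
    "services/billing/api/invoice.py",
    "services/billing/migrations/001.sql",
    "README.md",
])
print(reviewers)
```

The point of routing at PR creation is that nothing waits for a human to notice it and decide who should look. A PR touching only billing code goes to one team, while the example above fans out to three because it also touches a SQL migration and a top-level file.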

Why Velocity Points at the Wrong Stage

Teams keep reaching for code-writing speed improvements because those improvements are legible. A developer who completes five stories per sprint is visibly more productive than one who completes three. The review queue is less visible because it sits between completion and delivery, in a part of the process that most velocity metrics do not capture.

AI coding tools entered the market inside this same measurement frame. Tools that made developers faster at writing code had an easy value proposition: point at the velocity number, watch it go up. Tools that improve review efficiency are harder to evaluate because their impact shows up in lead time for changes, not in story points per sprint.

This is gradually shifting. DORA metrics, the SPACE framework from researchers at GitHub and Microsoft Research, and the DX Core 4 model are all attempts to capture the full delivery pipeline rather than just its first stage. As teams adopt these frameworks and start tracking lead time seriously, the review bottleneck becomes visible, and the value of addressing it becomes quantifiable.

Murphy’s observation lands at a useful moment. The disconnect between AI tool adoption and actual delivery improvement is becoming harder to ignore as code output rises without a corresponding reduction in lead times. The question teams need to answer is not whether their developers write code fast enough. It is how long changes sit between written and shipped, and which stage of that journey is slowest.

That answer is usually review. It was usually review before AI tools existed. Making the problem clearer is something, even if it is not the same as solving it.
