
Velocity Is Not Productivity, and AI Codegen Is Making That Gap Visible


Measuring software productivity has been an unsolved problem since before personal computers existed. Fred Brooks wrote in 1975 that you cannot measure software output in lines of code any more than you can measure aircraft construction by weight. Tom DeMarco expanded on this in the 1980s: measuring productivity in LOC incentivizes writing verbose, unstructured code rather than good code. These critiques never fully landed, because the alternatives — function points, story points, DORA metrics, the SPACE framework — all require more effort to collect and are easier to argue about.

AI code generation does not create a new measurement problem. It takes the existing one and makes it urgent.

What the Headline Numbers Actually Measured

When GitHub published its 2023 research on Copilot, conducted with researchers at Microsoft and the NBER, the headline finding was that developers completed specific coding tasks 55% faster. That number spread widely and still appears in executive presentations and vendor marketing today. But the study measured task completion time for discrete, well-defined exercises: implement an HTTP server, parse a CSV file, write a function to process a list. Participants worked alone, against a clock, on tasks chosen to be representative of what Copilot handles well.

Those tasks are not representative of what makes software development slow.

Software development slows down because of misunderstood requirements discovered three weeks into implementation, surprising failure modes that only appear under production load, coordination costs across teams with different assumptions, code review bandwidth that does not scale with team output, architectural decisions that turn out to be wrong six months later, and the accumulated weight of past shortcuts. None of these factors appear in a controlled task-completion study, because the study measures something different: how fast a developer can translate a fully specified problem into code.

The translation step is the easy part. The hard part is everything before and after it.

What More Rigorous Research Found

The more careful measurements produce a more complicated picture.

A 2024 study by RAND Corporation examined Copilot’s impact on professional developers using their actual production work over several months, measuring real GitHub contributions rather than controlled exercises. The observed productivity gains were substantially smaller than the headline figures from vendor-funded research, and for some cohorts productivity declined. Developers spent significant time reviewing, correcting, and re-prompting AI output, which eroded the raw generation-speed advantage in ways that a task-completion benchmark cannot capture.

GitClear’s analysis of anonymized commit data across a large sample of repositories found a striking trend: the proportion of “churn code,” code written and then deleted or substantially rewritten within two weeks, roughly doubled between 2021 and 2024, a period that coincides with widespread AI coding tool adoption. Code that gets discarded represents effort that delivered no value. A team that generates twice as much code while half of it gets thrown away has not doubled its productivity.

The source article at Antifound.com makes the core point directly: generating code and building software are not the same activity. This distinction is worth taking seriously because the industry’s measurement apparatus consistently conflates them.

The Review Bottleneck That Does Not Scale

There is a systems-level problem that individual productivity measurements miss entirely.

If every developer on a team generates code faster because of AI assistance, code review does not automatically scale by the same factor. Reviewers are still human, and the cognitive cost of reviewing code you did not write is high regardless of how that code was generated. Each pull request requires understanding the intent, tracing the logic, spotting the edge cases, and verifying that nothing subtle was missed. AI-generated code is often stylistically consistent and superficially plausible, which makes it easier to skim and harder to audit carefully.

The result is a throughput bottleneck with a lag. Individual throughput increases; review capacity stays flat; pull requests queue up; review quality degrades under volume pressure; more bugs ship. The team measures productivity by commits per developer per week and sees the metric improve. Users measure productivity by how reliably the software works and may experience something different.
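The arithmetic of that lag is easy to make concrete. The toy model below assumes constant weekly PR arrivals and fixed review capacity (both numbers are invented for illustration); its only point is that once generation outpaces review, the backlog grows without bound rather than settling at a new steady state.

```python
def review_queue(weeks: int, prs_per_week: int,
                 review_capacity: int, backlog: int = 0) -> list[int]:
    """Toy model: track an unreviewed-PR backlog week by week.

    Arrivals and review capacity are held constant; any excess
    arrivals accumulate in the queue.
    """
    history = []
    for _ in range(weeks):
        backlog = max(0, backlog + prs_per_week - review_capacity)
        history.append(backlog)
    return history

# A team that doubles generation from 20 to 40 PRs/week while review
# capacity stays at 25 adds 15 unreviewed PRs to the queue every week.
```

Real teams respond by reviewing faster and less carefully, which is exactly the quality-degradation path described above: the queue stays bounded only because review rigor absorbs the difference.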

This dynamic is not hypothetical. The DORA State of DevOps research, which has tracked software delivery performance across thousands of organizations for over a decade, consistently finds that technical practices like thorough code review, automated testing, and small incremental changes predict high performance far more strongly than any specific tool adoption. AI coding tools do not change what predicts success; they make it easier to generate volume that bypasses the practices that make software reliable.

The Security Debt That Accumulates Quietly

Code quality problems compound over time in ways that do not appear in short-term productivity measurements.

A 2021 study from researchers at Stanford and NYU found that developers assigned security-sensitive coding tasks produced insecure code at a meaningfully higher rate when they used an AI coding assistant than when they worked without one. The failure mode was not obvious: the AI did not generate visibly broken code. It generated plausible code that focused the developer’s attention on the suggested approach, reducing the careful scrutiny that writing from scratch would have prompted. Common vulnerability classes in AI-assisted code include SQL injection via string concatenation, path traversal from unsanitized file inputs, and missing authentication checks where the generated function body looked complete but skipped a critical validation step.
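The string-concatenation injection pattern is worth seeing side by side with its fix. This is a generic illustration using Python’s built-in sqlite3 module, not code from the study; the table and function names are invented.

```python
import sqlite3

def find_user_unsafe(conn: sqlite3.Connection, name: str):
    # The shape assistants often suggest: readable, plausible, injectable.
    # Input like "x' OR '1'='1" rewrites the WHERE clause and matches
    # every row.
    query = f"SELECT id, name FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()

def find_user_safe(conn: sqlite3.Connection, name: str):
    # Parameterized query: the driver handles quoting, so input stays data.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()
```

Both versions look complete in a diff, return identical results for benign input, and pass the kind of happy-path test a benchmark task would check, which is precisely why the unsafe one survives a skim-level review.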

These vulnerabilities do not appear in task-completion benchmarks because benchmark tasks are not security-sensitive. They accumulate in production codebases and surface in security audits, incident reports, and CVEs months or years after the code was merged.

What Productivity Actually Looks Like

The SPACE framework, developed by Nicole Forsgren, Margaret-Anne Storey, and collaborators and published in ACM Queue in 2021, offers a more complete picture of developer productivity across five dimensions: Satisfaction and wellbeing, Performance, Activity, Communication and collaboration, and Efficiency and flow. The argument is not that all five must be measured simultaneously, but that any single-dimension measurement tells you something incomplete and often misleading.

Activity metrics — things like commits per day, lines of code written, and tasks completed per sprint — are the easiest to collect and the metrics AI tools most directly improve. Performance metrics like defect rates, deployment success rates, and mean time to restore are harder to measure and improve more slowly or not at all. Satisfaction metrics, which matter for retention and sustained output, are harder still and frequently ignored.

When an organization measures the impact of AI coding tools using activity metrics alone, it will see improvement almost by definition. The tools are designed to increase activity. Whether that activity translates into performance and satisfaction is an empirical question, and the answers currently available are less optimistic than vendor research suggests.

Where Codegen Is Genuinely Useful

None of this argues that AI coding tools provide no value. They handle a real category of work well: boilerplate, scaffolding, repetitive structural transformations, integrations between APIs with well-documented behavior, and tests for straightforward logic. For a developer who knows precisely what they want and can evaluate what they receive, the tools save time on the mechanical parts of translating intent to code.

The critical qualifier is “can evaluate what they receive.” The productivity gain is real for developers who read generated code carefully, understand the domain well enough to spot errors, and treat AI output as a draft rather than a deliverable. The gain erodes, or turns negative, when generated code gets merged without that review, when the team lacks the context to audit what was produced, or when velocity pressure makes careful review feel optional.

Building Discord bots and working on systems code, I find AI generation useful for the genuinely mechanical parts: message schema parsing, command dispatch tables, API wrapper boilerplate. The parts that require understanding state transitions, race conditions in async event loops, or the subtle protocol behaviors in Gateway reconnect logic are where AI-generated code most often fails in ways that look fine until they do not. That is not a failure of the tools; it is a description of where the difficulty actually lives.
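A command dispatch table, one of the mechanical parts named above, has roughly this shape. The sketch is deliberately library-free and all names are illustrative; a real bot would hang handlers off a Discord library’s event loop and message objects. This is the kind of structure codegen produces reliably, because it is pure translation with no hidden state.

```python
from typing import Callable

CommandHandler = Callable[[list[str]], str]
COMMANDS: dict[str, CommandHandler] = {}

def command(name: str):
    """Decorator registering a handler in the dispatch table."""
    def register(fn: CommandHandler) -> CommandHandler:
        COMMANDS[name] = fn
        return fn
    return register

@command("ping")
def ping(args: list[str]) -> str:
    return "pong"

@command("echo")
def echo(args: list[str]) -> str:
    return " ".join(args)

def dispatch(message: str) -> str:
    """Route a '!command arg...' message to its registered handler."""
    name, *args = message.removeprefix("!").split()
    handler = COMMANDS.get(name)
    return handler(args) if handler else f"unknown command: {name}"
```

Nothing here involves reconnect logic, concurrent state, or ordering guarantees, which is exactly why it is safe territory for generation: every behavior is visible in the code itself.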

The Measurement Trap

The broader problem Antifound identifies is one of measurement selection. When a metric is easy to improve and gets tracked, organizations optimize for it, and the metric decouples from the underlying thing it was supposed to represent. Lines of code were recognized as a flawed productivity metric in the 1970s, and yet organizations still default to output-based proxies because they are easy to collect.

AI coding tools make output-based metrics very easy to improve. They do this by accelerating the translation step without necessarily improving anything before or after it. The requirements are still unclear. The architecture is still discovered through implementation. The review is still limited by human attention and bandwidth. The production behavior is still determined by how well the code was understood, not how quickly it was written.

Faster code generation is a real improvement in one dimension. Calling it a productivity improvement requires a narrow definition of productivity, one that the field has been arguing against for fifty years, and one that AI tools are now optimizing for at scale.
