Martin Fowler recently summarized research by Annie Vella, who studied how 158 professional software engineers work with AI tools. Her central finding is deceptively simple: engineers are shifting from creation-oriented work to verification-oriented work. Fowler takes that finding and extends it into a structural proposal: there is now a middle loop sitting between the traditional inner loop (write, build, test, debug) and outer loop (commit, review, CI/CD, deploy, observe).
The framing is clean. But I think it undersells how strange and demanding this new loop actually is, and what it will cost engineers who do not take it seriously.
What the Inner Loop Actually Provided
Writing code has always done more than produce a working artifact. It was also epistemically useful. When you write a function, you are forced to resolve every ambiguity before the compiler lets you proceed. The act of writing is a continuous audit of your own understanding. If your mental model of a system has a gap, writing code through that gap either produces an error or produces something that obviously does not work. The act of creation provides feedback on the quality of your knowledge.
This is what cognitive psychologists sometimes call the generation effect: information you produce yourself is retained more durably than information you passively encounter. You do not just write code to ship features; you write code to consolidate and test your understanding of the problem domain.
AI-assisted development is eroding that loop. A 2023 GitHub study showed developers completing tasks 55.8% faster with Copilot. The number is real. What is also real is that the mechanism behind the speedup is AI absorbing the steps that used to force you to resolve ambiguity yourself.
Verification Is Not the Same as Review
Vella’s key distinction is that the verification work in the middle loop is not the same as traditional code review. Code review has well-understood properties: you know the author, you can ask questions, you are reviewing a finished artifact against a known spec, and the review happens after the code has been tested locally. There is a shared context between reviewer and author.
Middle loop verification has none of those properties. The “author” is a language model that has no stake in the outcome and will not be available to explain its reasoning. The code may look entirely correct while containing subtle semantic errors. A 2024 Purdue University study found that GitHub Copilot produced incorrect code in approximately 40% of test cases, while developers accepted suggestions at high rates. The problem is not that the code looks wrong. It is that it looks right.
This is the confidence calibration problem. Human-written code carries epistemic signals: naming uncertainty, defensive edge case handling, comments that mark “I’m not sure about this.” AI-generated code does not. The stylistic confidence of the output is nearly uniform across correct and incorrect generations. A function with an off-by-one error and a correct function are structurally indistinguishable at a glance.
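To see how little signal there is, consider a hypothetical pair of helpers (invented for illustration, not drawn from any study). Both read with identical confidence; only execution or a careful trace separates them.

```python
def last_full_window(data, size):
    """Return the final `size` items of `data` (assumes size <= len(data))."""
    return data[len(data) - size:]      # correct

def last_full_window_ai(data, size):
    """Same intent, same style, same apparent confidence."""
    return data[len(data) - size + 1:]  # off by one: silently drops an item
```

At a glance the two slices are interchangeable; the second quietly returns one item too few on every call.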
The BCG study on GPT-4 introduced the concept of the “jagged frontier”: AI produces strong gains on tasks inside its capability boundary and degrades performance on tasks outside it. The degradation persisted because participants could not reliably identify which side of the frontier a given task was on. Supervisory skill is, in large part, the ability to locate that frontier accurately before committing to an AI-generated output.
The Skills Inversion
The conventional wisdom is that AI tools provide the largest productivity gains to junior engineers. The actual data suggests the opposite. Engineers who already understand a domain deeply can evaluate AI output efficiently; engineers who do not have that foundation accept plausible-looking errors as correct.
This creates an inversion in how domain knowledge functions. In traditional development, domain knowledge was a prerequisite for producing good code. With AI assistance, domain knowledge becomes primarily a prerequisite for evaluating generated code. The production pathway changes; the underlying requirement does not.
What does change is that the feedback loop for building that knowledge atrophies. A junior engineer who grows up primarily supervising AI never experiences the inner loop forcing them to resolve ambiguity. They get the speedup without getting the epistemological audit. This is not a hypothetical concern. Lisanne Bainbridge articulated the underlying mechanism in 1983, writing about industrial process control, in what she called the Ironies of Automation: the more capable an automated system becomes, the more critical human oversight becomes when automation fails, but the less opportunity humans have to practice and maintain the skills needed for that oversight.
Aviation has grappled with exactly this problem since autopilot systems became standard. The FAA’s analysis of automation dependency found that manual flying skills degraded among pilots who relied heavily on automation, and that degradation was not always visible until an unusual situation required manual intervention. The response was mandatory manual flying requirements, recurrent training, and Crew Resource Management as a distinct discipline covering how to maintain situational awareness while monitoring rather than operating.
Software engineering has no equivalent. There are no mandatory “manual coding” proficiency requirements, and no structured recurrent training for the failure mode where AI-generated code is subtly wrong in a security-sensitive context.
What the Middle Loop Actually Requires
The supervisory work Vella describes is not just reading generated code carefully. It involves at minimum four distinct competencies:
Intent specification is the work of decomposing an ambiguous requirement into a concrete, verifiable task description. AI tools produce better output when given clearer prompts, but producing clear prompts requires understanding the problem deeply enough to anticipate where ambiguity will cause the model to make wrong assumptions. This is hard to teach and harder to shortcut.
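One way to practice intent specification is to resolve the ambiguities as executable assertions before delegating anything. A minimal sketch (the requirement and function name are invented): the request “truncate user names to 20 characters” hides at least three decisions a model will otherwise guess at.

```python
def truncate_name(name: str, limit: int = 20) -> str:
    """Disambiguated intent, fixed before delegation:
    - `limit` counts characters, not bytes
    - truncated names end with an ellipsis and never exceed `limit`
    - names at or under the limit pass through unchanged
    """
    if len(name) <= limit:
        return name
    return name[: limit - 1] + "…"

# The assertions are the specification; an AI-generated replacement
# either satisfies them or visibly fails.
assert truncate_name("Ada") == "Ada"
assert len(truncate_name("x" * 50)) == 20
assert truncate_name("x" * 50).endswith("…")
```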
Output evaluation is not syntax review. It covers semantic correctness, security implications, architectural compatibility, and adherence to invariants that are not visible in the immediate context. A Stanford study found that developers using AI coding assistants were more likely to introduce security vulnerabilities than those who were not, specifically because AI-generated code omitted defensive checks that experienced developers write by habit, and reviewers did not catch the omissions.
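The failure mode is easy to reproduce in miniature. Below is an invented illustration of the pattern, not code from the study: a file-reading helper that passes every happy-path test while omitting the traversal check an experienced developer writes by habit.

```python
import os

def read_user_file_unsafe(base_dir: str, filename: str) -> bytes:
    """Plausible generated code: correct on every ordinary input."""
    with open(os.path.join(base_dir, filename), "rb") as f:
        return f.read()

def read_user_file(base_dir: str, filename: str) -> bytes:
    """The habitual version: reject paths that escape base_dir."""
    path = os.path.realpath(os.path.join(base_dir, filename))
    if not path.startswith(os.path.realpath(base_dir) + os.sep):
        raise ValueError("path escapes base directory")
    with open(path, "rb") as f:
        return f.read()
```

Nothing in a diff of the first function looks wrong; what is missing is a check that was never generated, and reviewers are far worse at spotting absences than mistakes.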
Correction and steering is diagnosing why the AI produced wrong output and reformulating the prompt or context to redirect it. This requires a theory of what the model is doing, which is not the same as a theory of what the code should do.
Context maintenance is tracking what has been delegated across a session, what constraints apply, and how partial outputs fit together. Agentic tools like Cursor’s Composer or Aider can span dozens of files in minutes. The cognitive overhead of maintaining coherent context across that surface area is the engineer’s responsibility, not the tool’s.
The GitClear analysis of 211 million lines of code found a significant increase in code churn correlating with AI tool adoption. Code was being generated faster than it was being verified correctly before merge. The middle loop was not being executed well.
The Structural Problem with Current Measurement
Frameworks like DORA metrics and the DX Research tooling built around developer experience were developed against the two-loop model. They measure deployment frequency, lead time for changes, change failure rate, mean time to recovery. Those are outer loop metrics. They do not directly measure the quality of middle loop work.
An engineer who accepts AI output uncritically, ships it quickly, and triggers a fast CI pipeline will score well on DORA metrics right up until the change failure rate spikes. The supervisory quality of middle loop work is largely invisible to current instrumentation.
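The gap is visible in what the metrics consume. A minimal sketch (field names invented): change failure rate is computed entirely from deployment outcomes, so nothing about how a change was verified before merge can enter the number.

```python
from dataclasses import dataclass

@dataclass
class Deploy:
    lead_time_hours: float
    failed: bool
    # Absent by design: whether the change was AI-generated,
    # how it was verified, and by whom.

def change_failure_rate(deploys: list[Deploy]) -> float:
    """Outer loop metric: counts failed deploys over all deploys."""
    return sum(d.failed for d in deploys) / len(deploys)

deploys = [Deploy(2.0, False), Deploy(1.5, False),
           Deploy(3.0, True), Deploy(0.5, False)]
assert change_failure_rate(deploys) == 0.25
```

An organization could instrument every field here perfectly and still learn nothing about the quality of its middle loop work.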
Vella’s research concluded in April 2025. Fowler notes that since then, model capabilities have improved substantially. SWE-bench resolution rates climbed from roughly 12% in early 2024 to above 40% by mid-2025. The inner loop is being absorbed faster than organizations are developing the supervisory practices to handle it. That is the specific risk the middle loop framing surfaces.
Where This Leads
The METR study from 2025 found that experienced open-source developers were, on average, roughly 19% slower with AI assistance, even while believing they were faster, with high variance across tasks. The slowdown cases are almost certainly where supervisory work dominated or failed: where the engineer did not catch, or caught only at great cost, that the AI had gone outside its competence boundary.
The middle loop is not going away. The productive response is to treat it as a first-class engineering discipline rather than an informal adjustment to existing practice. That means developing explicit criteria for evaluating AI output, building in deliberate inner loop practice to maintain the underlying knowledge that evaluation requires, and building measurement infrastructure that makes supervisory quality visible.
Fowler calls this shift traumatic, and that word is calibrated. The definition of competent engineering work is changing in a direction that most career development paths, most team structures, and most productivity measurement frameworks are not yet equipped to support. The middle loop has a name now. Naming it is the easy part.