
The Middle Loop Runs on Skills Built by the Inner Loop

Source: martinfowler

There is a structural tension in the emerging model of AI-assisted software engineering that Martin Fowler surfaced recently, drawing from Annie Vella’s thesis research on 158 professional software engineers. The short version: AI is taking over the inner loop (write, build, test, debug), and a new “middle loop” is forming in its place, one where engineers direct AI, evaluate its output, and correct it when it goes wrong. Vella names this work “supervisory engineering.”

The framing is useful. But there is a dependency hiding inside it that does not get named explicitly: competent supervision requires having done the supervised work yourself, or something close to it.

What the Inner Loop Was Actually For

The inner loop vocabulary entered engineering culture through developer experience research, popularized heavily by Microsoft’s developer tooling teams and formalized in frameworks like the SPACE model and DORA metrics. The canonical definition: write code, build, test, debug. Tight, local, fast, measured in seconds to minutes, happening entirely on your own machine.

But the inner loop was never just a productivity metric. It was where engineering intuition got formed. The cognitive psychology literature has a name for this: the generation effect. Information you produce yourself is retained more deeply and durably than information you read or review. Writing a function, watching it fail, tracing the failure, fixing it, and watching it pass is a different cognitive event than reading a function someone else wrote and deciding whether it looks correct. Both are forms of knowing, but they build different things.

The inner loop was the training loop. Not deliberately, not by design, but functionally. Engineers who spent years grinding through it developed pattern recognition that made them effective at everything else. The outer loop (commit, review, CI/CD, deploy, observe), where the social and systemic feedback lives, depends on participants who developed their judgment in the inner one.

The Ironies of Automation

This is not a new problem. In 1983, Lisanne Bainbridge published a paper titled “Ironies of Automation” about nuclear plant operators and industrial process control. Her central observation was that increasing automation creates a paradox: the more reliable the automation, the less practice the human operator gets, and therefore the less capable they become at the thing automation most needs them to do, which is to take over when automation fails.

Aviation research documented the same pattern in pilots. Studies on autopilot dependence found that manual flying skills degraded measurably when pilots spent most of their time monitoring rather than flying. The FAA has published guidance on automation dependency risks for this reason.

Software engineering is not nuclear plant operation in terms of stakes, but the structural dynamic is the same. Supervisory work requires understanding what correct behavior looks like, catching subtle failures in plausible-looking output, and knowing when to override. Those capabilities were built by doing the work. When the work gets automated away before the capabilities are built, supervision becomes pattern matching against output that looks authoritative without giving the supervisor grounds to evaluate it.

What Supervisory Engineering Actually Requires

The distinction Vella draws between supervisory engineering and conventional code review is worth taking seriously. Code review assumes the reviewer understands the intent behind the code, has some shared context with the author, and can at least discuss the design decisions with the person who made them. Supervisory engineering on AI output has different preconditions. The AI has no intent in any meaningful sense; it has pattern completion. Plausible-looking code can be wrong in ways that require domain understanding to catch, not just familiarity with the codebase’s conventions.

The empirical data on how often AI code is wrong is not encouraging. A 2024 Purdue University study found that 52% of ChatGPT's answers to programming questions contained incorrect information, and that study participants overlooked the errors about 39% of the time. The GitClear analysis of 211 million lines of code found elevated churn rates correlating with AI tool adoption, copy-paste patterns increasing, and deliberate refactoring decreasing. A Stanford study on security found developers using AI coding assistants were more likely to introduce vulnerabilities.

The BCG “jagged frontier” research found something particularly relevant here: AI-assisted workers performed worse on tasks outside the AI’s capability boundary because they could not reliably identify which side of the boundary they were on. Supervisory engineering requires knowing where that boundary is. That knowledge is harder to develop if you have not spent time inside the loop you are now supervising.

The Productivity Gap and What It Signals

The difference between controlled and real-world productivity numbers is worth examining. The GitHub Copilot study from Peng et al. found 55.8% faster task completion on a controlled inner-loop task. The METR study of experienced open-source developers working on real tasks in their own repositories found no speedup at all: developers took about 19% longer with AI tools, even though they believed the tools had made them roughly 20% faster.

That gap lives somewhere. Part of it is context-switching and prompt iteration. Part of it is verification overhead. Part of it is the correction cycles that controlled studies did not need to capture because tasks were scoped to be within the AI’s reliable range. Addy Osmani has called this the 70% problem: AI reaches approximately 70% functional quickly, and the remaining 30% often costs more than the first 70% saved. That cost is paid in the middle loop, and it is paid by whoever is supervising.
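
To make the arithmetic behind the 70% problem concrete, here is a minimal sketch in Python. The function name, the ten-hour task, the one hour of prompting, and the multipliers are all illustrative assumptions, not figures from Osmani or from any of the studies above.

```python
def net_hours_saved(manual_hours, ai_percent, ai_hours, finish_multiplier):
    """Hours saved versus doing the whole task by hand (negative means slower).

    Every parameter is a hypothetical input chosen for illustration:
      manual_hours       time to do the whole task without AI
      ai_percent         share of the task the AI gets working, e.g. 70
      ai_hours           time spent prompting to reach that share
      finish_multiplier  how much slower the remaining work is than it
                         would have been by hand (1.0 means no penalty)
    """
    remaining_by_hand = manual_hours * (100 - ai_percent) / 100
    assisted_total = ai_hours + remaining_by_hand * finish_multiplier
    return manual_hours - assisted_total


# Assumed scenario: a 10-hour task where the AI reaches 70% in 1 hour.
for multiplier in (1.0, 3.0, 4.0):
    saved = net_hours_saved(10, 70, 1, multiplier)
    print(f"finish multiplier {multiplier}: {saved:+.1f} hours saved")
```

Under these assumed numbers, the savings disappear once the last 30% takes three times as long as it would have by hand, and anything beyond that is a net loss. The real multiplier is unknown and varies by task; the point is only that the outcome hinges on the supervisor's cost to finish, not on how fast the first 70% arrived.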

At the outer loop, there is already measurable signal. Projects like Express.js have had to respond to floods of AI-generated pull requests that shift review burden without providing proportional value. Generating a patch has become nearly free; reviewing it has not.

The Bootstrapping Problem

Fowler notes that Vella’s research concluded in April 2025, before the current generation of models, and that model improvements have probably accelerated the shift toward supervisory engineering rather than reversed it. That seems right. But it also means the bootstrapping problem compounds over time.

A software engineer who spent ten years in the inner loop before models got good has a foundation to supervise from. They have seen enough broken code, failed tests, and subtle bugs to recognize when AI output looks right but is wrong. That pattern recognition did not come from reading; it came from generating, failing, and correcting, which is exactly the generation effect in operation.

The more concerning scenario involves engineers who enter the field after inner-loop automation is already established. They will spend most of their time in the middle loop, directing and evaluating AI output, without the training that makes that evaluation reliable. The Fowler, Parsons, and Joshi conversation from January 2026 frames part of this as a specification problem: the “what” needs to be stated precisely before the AI can reliably generate the “how.” Specifying precisely what you want from a system requires understanding the system well enough to know what correct looks like. That understanding does not come from watching the AI build it.

What This Means in Practice

The middle loop is real. Vella’s framing of supervisory engineering as a distinct category of work is useful and probably accurate. The shift from creation-oriented to verification-oriented work is happening, and it is not simply a relabeling of existing review practices. Directing AI, evaluating its output across a domain you understand well, and correcting it precisely enough that the next iteration improves rather than drifts is skilled work.

But the middle loop is not a self-sufficient loop. It runs on capabilities that the inner loop built, and those capabilities have a shelf life if they stop being exercised. Any team thinking seriously about AI-assisted engineering workflow has to ask not just how to make the middle loop efficient, but how to ensure the people running it have the foundation to run it well. That may mean preserving some inner-loop work deliberately, not because it is more efficient than letting AI do it, but because the engineers who will supervise AI need to develop the judgment that makes supervision reliable.

Bainbridge wrote about this in 1983 in a different industry. The irony she identified still holds: the better the automation, the more critical the human’s understanding of what the automation is doing, and the fewer natural opportunities there are to develop that understanding. The middle loop is a real advance in how we think about what engineers do. The question it leaves open is how engineers who work primarily in it will develop the expertise it requires.
