
The Middle Loop Borrows Against Skills It's Helping You Forget

Source: martinfowler

Annie Vella studied how 158 professional software engineers actually work with AI tools, and her research, summarized by Martin Fowler, lands on a framing worth taking seriously: the dominant activity for engineers is shifting from creation to something Vella calls supervisory engineering work, meaning directing AI, evaluating its output, and correcting it when it goes wrong. She proposes that this work lives in a new loop, the middle loop, sitting between the inner loop of writing and debugging code and the outer loop of commits, review, CI, and deployment.

The framing is useful because it names something that many engineers have been experiencing without a vocabulary for it. It also has a dependency that the current conversation mostly skips over: supervisory engineering quality is not free-standing. It borrows from the same skills that AI tooling is actively helping engineers practice less.

What the Loops Actually Are

The inner loop is write-build-test-debug, running in seconds to minutes. The outer loop is commit-review-CI-deploy-observe, running in hours to days. These labels have been in circulation in developer productivity discussions for a while; Microsoft’s developer experience documentation has used this framing explicitly, and the SPACE framework from Forsgren et al. counts efficiency and flow among its dimensions of developer productivity, which is where inner loop feedback latency shows up.

What the inner loop provides, beyond producing code, is a dense feedback channel. When your compiler rejects a type mismatch in three seconds, or your test fails with a stack trace pointing directly to the broken assumption, you are building a mental model of the system through high-frequency error. That model accumulates over years of inner loop work. It is what lets you look at a 200-line diff and sense, before reading carefully, where the complexity is hiding.

The middle loop, as Vella and Fowler describe it, is the layer where engineers prompt an AI system, read what it produces, and decide whether it is correct. Cadence is faster than a code review, slower than autocomplete. There is no compilation step, no test output, no stack trace; there is only the engineer’s judgment applied to code that looks authoritative because a fluent model generated it.

The Ironies of Automation, Applied

Lisanne Bainbridge published a paper in 1983 about industrial process control systems, titled “Ironies of Automation.” The core argument was precise and uncomfortable: the more reliable an automated system, the less practice human operators get at the manual skill the system replaced, which means that when the automation fails and manual intervention is needed, the least prepared person for the task is the operator whose entire job is now to supervise it.

Aviation worked through this problem empirically and at significant cost. Studies like Casner, Geven, and Williams in Human Factors documented measurable degradation in manual flying skills among commercial pilots with heavy autopilot reliance. The FAA eventually issued formal guidance on automation dependency. Crew Resource Management became a formal discipline. Manual flying requirements were reintroduced specifically to maintain the skill base that made automated flight supervision meaningful.

Software engineering is in the early phase of this same discovery. BCG’s “jagged frontier” research found that AI-assisted workers performed worse than unassisted on tasks outside the AI’s capability boundary, specifically because they could not tell which side of the boundary they were on. That failure is a supervision failure, and supervision failures trace back to gaps in the supervisory knowledge base.

What Supervisory Engineering Actually Requires

The specific cognitive demand of middle loop work is adversarial reading: approaching generated code looking for what is wrong, not confirming what looks right. This is not how people read code by default, especially code that is syntactically clean, idiomatic, and apparently complete.

The Purdue study from 2024 found that roughly 40% of GitHub Copilot suggestions contained errors, while developers accepted them at high rates without catching those errors. GitClear’s analysis of 211 million lines of code found elevated code churn in AI-assisted codebases: code that was written, merged, and then substantially reworked or reverted within weeks. Throughput went up while durability went down.
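To make the churn idea concrete, here is a minimal sketch of how such a rate could be computed from per-line history. It is not GitClear's method: the `LineRecord` type, the `churn_rate` function, and the 14-day default window are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class LineRecord:
    """History of one added line (hypothetical record shape)."""
    added_on: date
    reworked_on: Optional[date] = None  # first later rewrite or revert, if any


def churn_rate(lines: List[LineRecord], window_days: int = 14) -> float:
    """Share of added lines reworked or reverted within the window.

    The 14-day default is an assumed value; the article only says "within weeks".
    """
    if not lines:
        return 0.0
    window = timedelta(days=window_days)
    churned = sum(
        1
        for line in lines
        if line.reworked_on is not None
        and line.reworked_on - line.added_on <= window
    )
    return churned / len(lines)


history = [
    LineRecord(date(2025, 1, 6), date(2025, 1, 9)),   # reworked after 3 days
    LineRecord(date(2025, 1, 6), date(2025, 2, 15)),  # reworked after 40 days
    LineRecord(date(2025, 1, 6)),                     # never touched again
    LineRecord(date(2025, 1, 6)),                     # never touched again
]
rate = churn_rate(history)
```

Only the first line counts as churn under the assumed window, so the rate here is one in four. A longer window would also count the second line, which is why any churn number is only meaningful alongside the window it was measured with.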

The gap between what the middle loop demands and what engineers are currently doing in it is not a tooling problem in any simple sense. Detecting that a generated function handles an edge case incorrectly requires understanding what the correct behavior is. Recognizing that the AI chose the wrong abstraction for a module requires having built enough similar modules to know what good abstraction choices look like. Supervisory skill is downstream of domain knowledge, and domain knowledge accumulates through the inner loop work that is increasingly being delegated away.

Addy Osmani’s analysis of what he calls the 70% problem points at a related pattern: AI gets you to a rough functional prototype quickly, and then the cost of the remaining work (edge cases, integration, correctness under real conditions) often exceeds the initial time saved. That cost moved into the middle loop and became less visible, because outer loop metrics look fine while the accumulated debt compounds.

The Measurement Problem

The outer loop has DORA metrics: deployment frequency, lead time for changes, change failure rate, mean time to recovery. These are lagging outcome measures taken at the organizational boundary. An engineer who catches every AI error before commit and one who catches none look identical by DORA until the change failure rate moves months later.
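A rough sketch of two of these metrics shows why they cannot see review quality. The `Deployment` record and both function names are hypothetical, and real DORA tooling defines these measures with more care.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import List


@dataclass(frozen=True)
class Deployment:
    """One production deployment (hypothetical record shape)."""
    committed_at: datetime
    deployed_at: datetime
    caused_failure: bool


def change_failure_rate(deploys: List[Deployment]) -> float:
    """Fraction of deployments that led to a failure in production."""
    if not deploys:
        return 0.0
    return sum(d.caused_failure for d in deploys) / len(deploys)


def median_lead_time_hours(deploys: List[Deployment]) -> float:
    """Median hours from commit to deployment (requires at least one record)."""
    hours = [
        (d.deployed_at - d.committed_at).total_seconds() / 3600
        for d in deploys
    ]
    return median(hours)


deploys = [
    Deployment(datetime(2025, 3, 3, 9), datetime(2025, 3, 3, 11), False),
    Deployment(datetime(2025, 3, 4, 9), datetime(2025, 3, 4, 13), False),
    Deployment(datetime(2025, 3, 5, 9), datetime(2025, 3, 5, 15), True),
    Deployment(datetime(2025, 3, 6, 9), datetime(2025, 3, 6, 17), False),
]
cfr = change_failure_rate(deploys)
lead = median_lead_time_hours(deploys)
```

Nothing in the input records what the engineer did between prompt and commit. Two engineers with very different review habits produce the same numbers until one of them ships a failure.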

There is no established measure for middle loop precision: what fraction of AI outputs an engineer correctly identified as wrong before committing, weighted by the severity of the errors they missed. The only signal available is coarse and lagging: something like GitClear’s code churn data. The inner loop, by contrast, is instrumented continuously by the build system and test runner. Every failure is timestamped and attributed. Engineers working primarily in the middle loop accumulate expertise in a lower-density feedback environment, and the field has not built tooling to compensate for that.
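One way such a measure could be defined, purely as an illustration, is a severity-weighted catch rate. The `ReviewedOutput` record, the severity scale, and the `weighted_catch_rate` function are all assumptions; the sketch also presumes ground truth about defects, which in practice is only available in hindsight.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ReviewedOutput:
    """One AI-generated change reviewed before commit (hypothetical record)."""
    had_defect: bool       # ground truth, usually known only in hindsight
    flagged: bool          # did the engineer reject or correct it before commit?
    severity: float = 1.0  # assumed weight, e.g. 1 = cosmetic, 5 = outage


def weighted_catch_rate(outputs: List[ReviewedOutput]) -> Optional[float]:
    """Severity-weighted share of defective outputs caught before commit.

    Returns None when no reviewed output had a defect, because the
    rate is undefined in that case.
    """
    defective = [o for o in outputs if o.had_defect]
    total = sum(o.severity for o in defective)
    if total == 0:
        return None
    caught = sum(o.severity for o in defective if o.flagged)
    return caught / total


reviews = [
    ReviewedOutput(had_defect=True, flagged=True, severity=1.0),
    ReviewedOutput(had_defect=True, flagged=False, severity=5.0),
    ReviewedOutput(had_defect=False, flagged=False),
    ReviewedOutput(had_defect=True, flagged=True, severity=2.0),
]
rate = weighted_catch_rate(reviews)
```

In this sample the engineer caught two of three defective outputs, but the one they missed carried most of the severity, so the weighted rate is well below the unweighted two thirds. That gap is the point of weighting by what was missed.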

Kief Morris’s related work on humans and agents in software engineering loops, also on the Fowler site, frames this as the human’s role shifting to managing the loop rather than being in it: designing the feedback cycle, defining what correctness means before the agent runs, setting checkpoints that make evaluation tractable. That framing is correct, but designing a feedback cycle for an AI system requires understanding what failure modes look like, and that understanding comes from having been in the inner loop long enough to recognize them.
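As a small illustration of defining correctness before the agent runs, the sketch below writes acceptance checks first and then evaluates a candidate against them. Everything here is hypothetical: the `CHECKS` list, the `evaluate` helper, and `generated_slugify`, which stands in for code an agent might return.

```python
from typing import Callable, List, Tuple

Slugify = Callable[[str], str]

# Acceptance checks written before any code is generated (hypothetical spec).
CHECKS: List[Tuple[str, Callable[[Slugify], bool]]] = [
    ("lowercases input", lambda f: f("Hello") == "hello"),
    ("turns a space into a hyphen", lambda f: f("a b") == "a-b"),
    ("handles empty input", lambda f: f("") == ""),
    ("collapses repeated spaces", lambda f: f("a  b") == "a-b"),
]


def evaluate(candidate: Slugify) -> List[str]:
    """Return the names of the checks the candidate fails."""
    failures = []
    for name, check in CHECKS:
        try:
            passed = check(candidate)
        except Exception:
            passed = False  # a crash counts as a failed check
        if not passed:
            failures.append(name)
    return failures


def generated_slugify(text: str) -> str:
    """Stand-in for agent output: plausible, clean, and subtly incomplete."""
    return text.lower().replace(" ", "-")


failures = evaluate(generated_slugify)
```

The candidate reads as correct and passes the obvious cases, yet fails the repeated-spaces check. The check exists only because someone knew in advance that repeated separators are a typical failure mode, which is the kind of knowledge the inner loop builds.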

The Junior Engineer Problem

Vella’s research finished in April 2025, and Fowler’s observation is that subsequent model improvements have accelerated rather than reversed the trends she documented. If that is right, the most consequential long-term effect is on engineers entering the field now.

The inner loop was where foundational engineering intuition formed. Debugging forced mental models of system execution. Stack traces taught causality tracing. Writing a function incorrectly and watching it fail under a test you also wrote created the tight feedback that made expertise durable. Engineers who enter the middle loop without first accumulating that foundation are supervising AI outputs without the knowledge base that makes supervision reliable.

Aviation addressed this by making manual flight practice mandatory, not optional. The principle was that competent supervision of automated systems required the ability to do the supervised task without automation. Software engineering has not developed an equivalent principle, and the current hiring and leveling systems, still calibrated around inner loop competence demonstrated through interview exercises, are not measuring middle loop skill either.

The middle loop is real, and the Vella/Fowler framing of supervisory engineering names something genuinely new. The part that needs more attention is what supervisory engineering depends on, and how teams plan to maintain the foundation that makes supervision meaningful when the inner loop that built it is being handed to the model.
