The Middle Loop: How AI Supervision Became a New Engineering Discipline
Source: martinfowler
Annie Vella’s research into how 158 professional software engineers actually use AI tools surfaced a finding that feels obvious in hindsight but is rarely stated this directly: AI is not just speeding up the inner loop, it is creating an entirely new loop that didn’t exist before. Martin Fowler picked up her work and gave it a name that stuck: the middle loop.
The inner loop and outer loop model has been around for at least a decade in DevOps and developer tooling circles. The inner loop is the tight local cycle: write code, build, run tests, debug, repeat. It happens in seconds or minutes, entirely on your machine, and it’s the work that feels most like programming. The outer loop is the slower institutional cycle: commit, open a pull request, wait for CI, get reviewed, merge, deploy, observe. It happens over hours or days and involves other people and systems.
This two-tier model maps cleanly onto the traditional shape of software work. The inner loop is where skill is expressed. The outer loop is where quality is verified and value is shipped. Tools over the last decade have relentlessly compressed both: faster builds, incremental testing, trunk-based development, feature flags that decouple deploy from release.
AI coding assistants don’t fit neatly into either tier. They operate at inner-loop speed, but they introduce a verification burden that is qualitatively different from debugging your own code. When you write a function yourself, you understand why every line is there. When a model generates a function, you know what it is supposed to do, but you must reconstruct why it was built that way before you can trust it. That reconstruction takes time, and it demands a different kind of attention than writing does.
Vella describes this as supervisory engineering work: the effort of directing AI, evaluating its output, and correcting it when it’s wrong. Her study found participants perceived a clear shift away from creation-oriented tasks toward this verification-oriented mode. This is not the same as code review in the outer loop, which is adversarial, asynchronous, and focused on whether the change belongs in the codebase at all. Supervisory work is continuous, synchronous, and focused on whether the code is even approximately correct before it becomes a candidate for review.
Why This Structural Shape Is Familiar
Every major automation wave in software history has followed the same pattern. A tool takes over an execution task that developers were doing by hand, but introduces a new supervisory layer that requires different skills.
When compilers replaced hand-written assembly in the 1950s and 1960s, the initial objection was that compilers produced inefficient code and that real programmers needed low-level control. Grace Hopper’s A-0 compiler and the work that followed eventually answered that objection with performance. But the more important change was structural: programmers stopped being instruction authors and became specification authors. The craft moved up a level. Optimizing register allocation stopped being a core skill; designing algorithms and data structures became the core skill.
When CI/CD pipelines replaced manual testing gatekeeping in the 2000s and 2010s, the execution of test suites became automated. But this didn’t eliminate the verification work. It shifted it. Engineers who previously ran tests became engineers who designed, wrote, and maintained test infrastructure. The outer loop shortened, but the cognitive work moved from running things to defining the systems that run things.
AI coding tools are following this same structural trajectory. The inner loop execution — text entry, boilerplate generation, test scaffolding, debugging common errors — is becoming automated. But someone still has to read the output, understand it well enough to trust or distrust it, and decide what to do when it’s wrong. That is the middle loop. It is slower than typing but faster than a PR review cycle. It happens dozens of times per session rather than once per feature.
What Supervisory Engineering Actually Requires
The skills demanded by the middle loop are not a subset of the skills used to write code from scratch. They overlap, but they are distinct, and it matters to know which skills transfer and which do not.
Reading code for comprehension is easier than writing it, but reading AI-generated code for correctness is harder than reading code you wrote yourself, because you have no memory of the reasoning behind the choices. A variable named result in your own code is shorthand for a mental model you built while writing. A variable named result in generated code is just a string until you reconstruct the intent. This means that fast, accurate code reading is a genuine skill that the middle loop exercises constantly, and it’s one that many developers have left underdeveloped because the inner loop rewarded writing speed over reading speed.
Deciding when AI output is trustworthy requires calibrated skepticism. This is not the same as paranoid line-by-line verification of everything, which would eliminate any productivity benefit. It requires pattern recognition for the classes of errors that models produce reliably: plausible-looking but subtly wrong library usage, confident hallucination of APIs that don’t exist, correct logic for the wrong problem, edge cases that weren’t specified but matter. This kind of calibration comes from having made the mistakes yourself and seen what they look like in production.
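A minimal, hypothetical sketch of the “plausible-looking but subtly wrong” class: the function names, the log structure, and the bug itself are illustrative assumptions, not drawn from Vella’s study. Asked to “return the N most recent log entries,” a model might produce code that passes a casual read but sorts in the wrong direction:

```python
# Hypothetical illustration of a plausible-but-subtly-wrong generation.

def most_recent_bad(entries, n):
    # Reads naturally, but sorted() is ascending by default, so this
    # returns the n OLDEST entries, not the n most recent.
    return sorted(entries, key=lambda e: e["ts"])[:n]

def most_recent_fixed(entries, n):
    # The fix a supervising engineer would apply: sort descending.
    return sorted(entries, key=lambda e: e["ts"], reverse=True)[:n]

logs = [{"ts": 1, "msg": "a"}, {"ts": 3, "msg": "c"}, {"ts": 2, "msg": "b"}]
print([e["msg"] for e in most_recent_bad(logs, 2)])    # ['a', 'b'] — wrong
print([e["msg"] for e in most_recent_fixed(logs, 2)])  # ['c', 'b'] — right
```

Nothing in the wrong version fails to run or raises an error; only an engineer who has seen this failure mode before, and who checks the output against intent rather than against syntax, will catch it.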
Prompt specification is a form of requirements writing. If the middle loop involves directing AI, then the skill of stating intent precisely enough that AI can act on it usefully is non-trivial. Vague specifications produce code that passes casual inspection but misses the actual need. Good specification requires the same analytical clarity as good requirements documentation, with the added constraint that it needs to be terse enough to fit in a single conversational turn.
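One way to make a specification precise, sketched here under illustrative assumptions (the function name truncate_label and its behavior are invented for the example), is to state the intent as executable acceptance checks before asking for an implementation. Each check pins down an edge case that a vague prompt like “shorten long labels” would leave to the model’s guess:

```python
def truncate_label(text: str, max_len: int) -> str:
    """Truncate text to at most max_len characters, appending '…' if cut."""
    if max_len <= 0:
        return ""
    if len(text) <= max_len:
        return text
    # The ellipsis counts against the budget, so keep max_len - 1 chars.
    return text[: max_len - 1] + "…"

# The spec, stated as checks rather than prose.
assert truncate_label("hello", 10) == "hello"          # short text unchanged
assert truncate_label("hello world", 8) == "hello w…"  # ellipsis counts toward the limit
assert truncate_label("hi", 0) == ""                   # non-positive budget yields empty
```

Whether the ellipsis counts toward the limit, and what happens at a zero budget, are exactly the kinds of decisions a model will make silently if the specification does not make them first.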
The Skill Atrophy Problem
Vella raises a concern that deserves serious weight: if engineers spend most of their time supervising AI output rather than writing code from scratch, the skills required to write code from scratch may atrophy. This has a parallel in GPS navigation research, which has documented decline in spatial navigation ability among heavy GPS users. The skill doesn’t vanish, but it weakens without regular exercise.
For software development, the implications compound in a particular way. Calibrated skepticism about AI output requires knowing what correct code looks like and having models of failure modes. Those models come from having built and debugged complex systems yourself. An engineer who has primarily supervised AI-generated code for five years will have a different failure-mode repertoire than one who spent five years writing code before AI was capable. The former may supervise well in familiar domains and poorly in unfamiliar ones, with less ability to distinguish between the two.
This is not an argument against using AI tools. It is an argument for being deliberate about which skills get practiced and which get delegated. The historical parallels are instructive here too: assembly programmers who understood what compilers were doing had an advantage over those who treated the compiler as a black box. The mental model of the layer below, even when you’re not operating at that layer, produces better judgment about when the automation is trustworthy and when it’s about to fail you.
What Comes Next
Fowler notes that Vella’s research concluded in April 2025, before the most recent round of model improvements. His sense, shared by most people watching this closely, is that stronger models accelerate the shift rather than reversing it. Faster, more capable inner-loop automation means more supervisory work per unit of time, not less.
The two-tier model of software development worked well when the inner and outer loops were clearly separated by scale, speed, and who was involved. The middle loop breaks that separation. It’s personal and fast like the inner loop, but it’s fundamentally about evaluating quality and correctness rather than creating something. The engineering community is still building the vocabulary and the practices for working well at this tier.
Vella’s contribution is to name it clearly enough that we can start to reason about it. The middle loop is not a transitional artifact that will disappear when models get better. It is the new shape of engineering work.