The Middle Loop: What Supervising AI Actually Demands of Software Engineers

Martin Fowler recently highlighted research by Annie Vella into how 158 professional software engineers are actually using AI tools day to day. The central finding is not surprising if you’ve been paying attention: engineers are spending less time creating and more time verifying. But Vella’s framing of what that verification looks like is sharper than most commentary on the subject.

She calls it “supervisory engineering work” — the effort required to direct AI, evaluate its output, and correct it when it’s wrong. Fowler extends this into a spatial metaphor: a middle loop that sits between the inner loop (writing code, running tests, debugging) and the outer loop (committing, reviewing, deploying, observing). AI is absorbing the inner loop. Someone still has to manage the process, and that’s the middle loop.

The metaphor is tidy and I think it’s largely right. But it leaves open a harder question: what does it actually mean to be good at supervising AI, and does the profession have any serious way to develop or assess that skill?

What the Loops Actually Mean

The inner/outer loop framing has been in use at places like Microsoft for years, mostly in the context of developer experience and tooling. The inner loop is local: you’re in your editor, you write a function, you run the tests, you read the stack trace, you fix the bug. It’s tight, fast, and personal. The feedback cycles are seconds to minutes.

The outer loop is organizational: you open a pull request, CI runs, reviewers comment, you merge, deployment happens, you watch dashboards. Feedback cycles here are hours to days. The outer loop is where coordination happens and where other humans enter the picture.

The middle loop Vella describes doesn’t have established tooling or established norms yet. You write a prompt. The AI generates a function, a test suite, a refactor. You read it. You decide if it’s correct. If it’s not, you figure out why and re-prompt. This happens at a cadence somewhere between the inner and outer loops — faster than a code review, slower than autocomplete.

What’s interesting is that this middle loop borrows cognitive demands from both existing loops without inheriting either’s support structure. The inner loop has your IDE, your debugger, your local test runner — tooling built up over decades to give you fast feedback. The outer loop has code review culture, CI systems, deployment pipelines, runbooks. The middle loop has… a chat interface and your own judgment.

What Supervision Actually Requires

Here’s the part that gets elided in most takes on AI-assisted engineering: good supervision of AI output requires the same underlying knowledge as doing the work yourself. Maybe more.

When an AI generates a database query, you have to know enough SQL to catch a missing index, an accidental Cartesian join, or an N+1 query pattern lurking behind clean-looking ORM output. When it generates an async function, you have to recognize the subtle cases where it has dropped error handling or introduced a race condition that the tests won’t catch. When it refactors a module, you have to hold in your head what the original invariants were and verify they survived.
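To make the N+1 case concrete, here is a minimal sketch using Python's built-in sqlite3 module. The authors and posts tables, the sample rows, and every function name are hypothetical, invented for illustration. Both functions return the same data and would pass the same correctness test; only counting the statements reveals the difference a reviewer is supposed to notice.

```python
import sqlite3


def make_db():
    """Build a tiny in-memory database (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
        INSERT INTO authors VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Edsger');
        INSERT INTO posts VALUES (1, 1, 'Notes'), (2, 1, 'Engines'), (3, 2, 'Compilers');
        """
    )
    return conn


def titles_n_plus_one(conn):
    """Looks tidy, but runs one query for the authors plus one per author."""
    result = {}
    authors = conn.execute("SELECT id, name FROM authors ORDER BY id").fetchall()
    for author_id, name in authors:
        rows = conn.execute(
            "SELECT title FROM posts WHERE author_id = ? ORDER BY id",
            (author_id,),
        ).fetchall()
        result[name] = [title for (title,) in rows]
    return result


def titles_joined(conn):
    """Same result from a single LEFT JOIN."""
    result = {}
    rows = conn.execute(
        """
        SELECT a.name, p.title
        FROM authors a
        LEFT JOIN posts p ON p.author_id = a.id
        ORDER BY a.id, p.id
        """
    )
    for name, title in rows:
        titles = result.setdefault(name, [])
        if title is not None:  # authors with no posts yield a NULL title
            titles.append(title)
    return result


def count_selects(conn, fn):
    """Run fn(conn) and count the SELECT statements SQLite executed."""
    statements = []
    conn.set_trace_callback(statements.append)
    try:
        result = fn(conn)
    finally:
        conn.set_trace_callback(None)
    selects = [s for s in statements if s.lstrip().upper().startswith("SELECT")]
    return result, len(selects)


if __name__ == "__main__":
    conn = make_db()
    slow, slow_count = count_selects(conn, titles_n_plus_one)
    fast, fast_count = count_selects(conn, titles_joined)
    print(slow == fast, slow_count, fast_count)
```

With three authors the gap is trivial. The point is that the query count of the first version grows with the number of rows, and nothing in a passing test suite will say so.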

None of this is easier than writing the code in the first place. In many cases it’s harder, because you’re evaluating someone else’s work without the exploratory process that would have built your mental model of the problem. You’re reading, not writing, and reading unfamiliar code cold is one of the more demanding things engineers do.

This matters because there’s a version of the supervisory role that’s mostly illusory. You accept AI output without fully understanding it because it looks plausible, tests pass, and review is a formality. That’s not supervision. That’s rubber-stamping. The difference between those two things is entirely in the engineer’s depth of knowledge, and depth of knowledge comes from having done the work yourself.

The Historical Parallel

This isn’t the first time a layer of automation has shifted what engineers are expected to know. Compilers made assembly expertise optional for most programmers, but systems engineers who never learned to read assembly are generally worse at understanding what their code actually does at the hardware level. ORMs made raw SQL optional for most application developers, but engineers who can’t write a query without an ORM tend to produce slower applications and have more trouble diagnosing production database problems.

In both cases, the abstraction was real and valuable. Compilers write better machine code than most humans would by hand. ORMs parameterize queries by default, which closes off the most common SQL injection mistakes. The productivity gains are genuine. But the abstraction also creates a generation of practitioners who are dependent on the tool in ways that limit what they can reason about.

AI coding assistance is a faster and more comprehensive version of this pattern. The inner loop is being abstracted all at once, not piece by piece. The question is whether the engineers who grow up supervising AI output will have the accumulated low-level knowledge to do it well, or whether they’ll be skilled orchestrators who are epistemically dependent on outputs they can’t fully evaluate.

Vella’s research finished in April 2025, before the most recent wave of model improvements, and Fowler notes that those improvements seem to have accelerated the shift rather than changed its direction. That tracks with my own experience. The more capable the model, the more tempting it is to accept output without scrutiny. Capability earns trust, and trust reduces vigilance.

What Atrophies

If the inner loop is where engineers develop intuition about code, the skill that atrophies first when you stop doing it is probably debugging. Not the act of running a debugger, but the reasoning process: forming a hypothesis about what’s wrong, designing a minimal reproduction, eliminating possibilities systematically. That process is how engineers build a mental model of how their system actually behaves as opposed to how they think it behaves.
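As an illustration of that reasoning process, suppose the hypothesis is that a counter loses updates because a read and a write are separated by an await. A minimal reproduction might look like the sketch below. The Counter class and all function names are hypothetical, and asyncio.sleep(0) stands in for whatever real I/O sits between the read and the write.

```python
import asyncio


class Counter:
    """Hypothetical shared state touched by several coroutines."""

    def __init__(self):
        self.value = 0


async def unsafe_increment(counter):
    # Read, yield to the event loop, then write: a lost-update race.
    current = counter.value
    await asyncio.sleep(0)  # stand-in for real I/O between read and write
    counter.value = current + 1


async def safe_increment(counter, lock):
    # Holding the lock across the await makes the read-modify-write atomic.
    async with lock:
        current = counter.value
        await asyncio.sleep(0)
        counter.value = current + 1


async def run_unsafe(n):
    counter = Counter()
    await asyncio.gather(*(unsafe_increment(counter) for _ in range(n)))
    return counter.value


async def run_safe(n):
    counter = Counter()
    lock = asyncio.Lock()
    await asyncio.gather(*(safe_increment(counter, lock) for _ in range(n)))
    return counter.value


if __name__ == "__main__":
    print("unsafe:", asyncio.run(run_unsafe(10)))
    print("safe:  ", asyncio.run(run_safe(10)))
```

A test that awaits unsafe_increment once, or several times in sequence, passes. The bug only shows up when the calls overlap, which is why the reproduction has to encode the hypothesis rather than just rerun the existing tests.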

AI can produce a fix. It cannot produce your mental model. If you accept fixes without understanding them, the model never updates, and you accumulate a growing gap between your understanding of the system and its actual state. That gap surfaces badly in incidents, in architectural decisions, in anything that requires you to reason about the system from first principles.

The outer loop skills are safer because they’re inherently collaborative and explicit. Code review still requires you to articulate what’s wrong and why. Deployment decisions still require you to weigh tradeoffs and communicate them. Those skills aren’t going anywhere. But the inner loop skills are private and tacit, and they’re the ones at risk.

What the Profession Needs to Figure Out

Hiring is still mostly calibrated around inner loop competence. Technical interviews test whether you can write code from scratch, recognize algorithmic patterns, trace through recursive logic. These are legitimate proxies for the reasoning ability that makes someone a good engineer. But they’re increasingly disconnected from what the day-to-day job looks like if supervisory engineering work is genuinely where engineers spend their time.

The more important question, which the profession hasn’t answered yet, is how you develop and evaluate supervisory skill. It’s not enough to say “hire smart people and they’ll figure it out.” The inner loop gave you continuous, concrete feedback on whether you understood the code. If you wrote a broken function, the tests told you. The middle loop’s feedback is slower and more ambiguous. Bad supervision can persist for a long time before manifesting as a problem.

One thing I suspect will matter a lot is whether engineers maintain some deliberate practice of the inner loop even when AI handles most of it. The analogy I keep coming back to is pilots and autopilot. Commercial aviation has grappled seriously with the question of whether pilots who rely heavily on automation maintain the manual flying skills to handle failures. The answer, documented in incident investigations, is often no. The response has been to push for more manual flying practice, not to accept the atrophy.

Software engineering doesn’t have regulatory equivalents, and the stakes of any individual decision are lower. But the structural problem is the same. Supervisory skill is parasitic on the knowledge that comes from doing the supervised work. If you stop doing the work, the knowledge decays, and eventually the supervision degrades with it.

Vella’s coinage of “supervisory engineering” is useful because it gives the phenomenon a name and starts the conversation about what it requires. Fowler’s “middle loop” framing is useful because it locates the new work in relation to existing structures engineers already understand. What comes next is the harder part: figuring out how to build and maintain the knowledge that makes supervision meaningful rather than performative.