
The Middle Loop: What Supervising AI Code Actually Demands

Source: martinfowler

The research that caught my attention this week comes from Annie Vella, who spent time studying 158 professional software engineers and how they work alongside AI tools. Her central finding is something many people sense but haven’t had a name for: the job is shifting from creation to supervision, and that shift is more structurally significant than it first appears.

Vella’s term for it is “supervisory engineering work” — the effort required to direct AI, evaluate its output, and correct it when it’s wrong. Martin Fowler picked up on her work in his March 16 fragments post and added a framing that is genuinely useful: the idea of a middle loop.

Most engineers think in terms of two loops. The inner loop is the fast cycle: write code, run tests, debug, iterate. The outer loop is the slower one: commit, review, CI/CD, deploy, observe production, respond to incidents. These two loops have been stable abstractions for a long time. The inner loop runs in minutes; the outer loop runs in hours to days.

AI tools are eating the inner loop. Code generation, test generation, automated debugging suggestions — these are all inner loop activities, and they’re being absorbed into AI assistance at a rapid pace. But someone still has to decide what to build, evaluate whether the AI’s output is correct, catch the cases where it confidently generated something subtly wrong, and redirect the next iteration. That work lives neither in the old inner loop nor in the outer loop. It is a new layer: the middle loop.

The middle loop runs in minutes to an hour. It is the cycle of prompting, evaluating, accepting or rejecting, and re-prompting. This is supervisory work in the same sense that an engineering manager’s work is supervisory: you are not doing the task yourself, but you are responsible for the quality of what comes out.
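That cycle can be sketched as a small control loop. Everything below is illustrative: `generate` and `evaluate` are hypothetical stand-ins for an AI coding tool and the engineer's judgment, not any real API.

```python
def middle_loop(task, generate, evaluate, max_attempts=3):
    """Hypothetical sketch of the supervisory cycle: prompt, evaluate the
    output, accept it, or re-prompt with feedback until patience runs out."""
    feedback = None
    for _ in range(max_attempts):
        output = generate(task, feedback)     # the automated inner loop
        verdict, feedback = evaluate(output)  # the supervisory judgment
        if verdict == "accept":
            return output
    return None  # discard the AI's attempts and write it yourself
```

The expensive step in this sketch is `evaluate`, which only someone holding the domain model in their head can perform well.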

What the Middle Loop Actually Demands

Supervisory work sounds easier than creation work until you examine what it requires. To verify that AI-generated code is correct, you need to understand:

  • What correct actually means for this problem, which requires holding the domain model clearly in your head
  • The common failure modes of AI code generation: hallucinated APIs, subtle logic errors in edge cases, security vulnerabilities that look plausible at a glance
  • Whether the tests the AI wrote actually test the behavior rather than just passing against the AI’s own implementation
  • When to accept the output, when to ask for a revision, and when to discard it entirely and write the thing yourself

None of these are beginner skills. Effective supervisory engineering requires the same deep technical foundation that effective code authorship requires, plus an additional set of skills specific to evaluating AI output. You need to know what wrong looks like in order to catch it.

The GitClear research from early 2024 documented that AI-assisted codebases showed meaningfully higher rates of code churn — code that was written and then reverted or significantly modified shortly after. That is a signal that supervisory work is not always being done effectively. Engineers are accepting AI output that later turns out to be wrong, and the correction is expensive. The verification step is precisely where productivity gains can evaporate.

The Junior Engineer Problem

The inner loop, which AI is absorbing, is where junior engineers have traditionally developed their craft. Writing a function from scratch, getting a test to fail in the right way, debugging a subtle concurrency issue — these are how you build the mental models that make supervisory work possible later. The repetitive mechanical work of early-career software development is not wasted time; it is the accumulation of pattern recognition that middle-loop judgment will eventually depend on.

If junior engineers spend their formative years primarily in the middle loop — reviewing AI output, accepting or rejecting generated code, writing prompts — do they develop the underlying models they will need to supervise effectively at higher levels of complexity? The honest answer is that nobody knows yet, and anyone claiming confidence about it is not being careful.

The aviation analogy is well-worn but apt here. Modern aircraft have sophisticated autopilot systems that handle the mechanics of flight in cruise. Pilots still train extensively on manual flight, because autopilots sometimes fail, and because recognizing when the autopilot is doing something wrong requires a deep understanding of what it should be doing. Supervisory skill depends on latent creation skill, even when creation skill is not being actively exercised.

Vella’s research finished in April 2025. Fowler notes that the latest generation of models has improved substantially since then, and his sense is that this improvement accelerates the supervisory shift rather than reversing it. That seems right. Better models mean more of the inner loop gets automated, not less. The middle loop grows in scope as models become capable enough to handle more complex generation tasks. The question of how junior engineers build the foundations for that supervision becomes more urgent, not less, as the tools improve.

What Changes About How We Work

Code review changes character in a middle-loop world. In the current outer loop, code review evaluates whether a human engineer made good decisions. When AI generated the code and a human approved it, the reviewer is evaluating both the AI’s output and the human supervisor’s judgment in accepting it. That is closer to auditing than reviewing.

The ability to write clear, precise specifications becomes more valuable. AI tools generate code that is consistent with the prompt they received. If the prompt was ambiguous or incomplete, the code reflects that ambiguity. Engineers who can specify exactly what they want — in detail, with edge cases made explicit — get better output. This is a form of requirements engineering the industry has historically been bad at and has gotten away with because the humans filling in the gaps were themselves engineers who could infer intent. That inference no longer happens automatically.
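As a hypothetical illustration of the difference precision makes, compare a one-line request with a spec that pins down the edge cases, alongside a reference implementation whose branches map onto the spec's bullets. All names here are invented for the example.

```python
import re

# The vague version leaves every edge case to the generator's guesswork:
VAGUE_SPEC = "Write a function that parses a duration string."

# The precise version makes the edge cases explicit and therefore checkable:
PRECISE_SPEC = """
Write parse_duration(s: str) -> int returning total seconds.
- Input is one or more <int><unit> tokens with units h, m, s (e.g. "1h30m").
- Units may appear in any order, but each unit at most once.
- Raise ValueError for empty input, malformed tokens, unknown units,
  or a repeated unit.
"""

def parse_duration(s: str) -> int:
    # Reference implementation satisfying PRECISE_SPEC; each branch below
    # corresponds to one bullet of the spec.
    tokens = re.findall(r"(\d+)([a-z])", s)
    if not tokens or "".join(n + u for n, u in tokens) != s:
        raise ValueError(f"malformed duration: {s!r}")
    seconds = {"h": 3600, "m": 60, "s": 1}
    seen: set[str] = set()
    total = 0
    for number, unit in tokens:
        if unit not in seconds:
            raise ValueError(f"unknown unit: {unit!r}")
        if unit in seen:
            raise ValueError(f"repeated unit: {unit!r}")
        seen.add(unit)
        total += int(number) * seconds[unit]
    return total
```

Every bullet in the precise spec corresponds to a testable behavior, which is what makes the generated code checkable rather than merely plausible.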

Testing strategy shifts upstream as well. If you are generating code rather than writing it, tests can no longer be an afterthought. You need to know what correct looks like before you can evaluate whether the generated code achieves it. Test-driven development in an AI-assisted workflow is not a methodology preference; it is a practical prerequisite for effective supervision. Write the tests first and you have a specification the AI can be evaluated against. Skip them and you are trusting your post-hoc review to catch everything.
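A minimal sketch of what tests-first supervision looks like in practice, with hypothetical names: the cases are fixed before any code exists, and a candidate implementation, whether generated or handwritten, is accepted only if it passes all of them.

```python
# The cases are written before any code is generated, so they serve as the
# acceptance check for whatever the model produces. Decisions that would
# otherwise be ambiguous (drop empty fields, strip whitespace) are recorded
# here up front.
CASES = [
    ("", []),
    ("a", ["a"]),
    ("a,b", ["a", "b"]),
    ("a,,b", ["a", "b"]),      # empty fields are dropped, decided in advance
    (" a , b ", ["a", "b"]),   # surrounding whitespace is stripped
]

def check(candidate) -> bool:
    """Run a candidate implementation against the pre-written cases."""
    return all(candidate(raw) == expected for raw, expected in CASES)

# A candidate such as an AI-generated split_fields can then be accepted or
# rejected mechanically:
def split_fields(raw: str) -> list[str]:
    return [part.strip() for part in raw.split(",") if part.strip()]
```

With `CASES` in place, accepting or rejecting the generated code stops being a matter of reading it carefully and hoping; the spec is executable.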

The Identity Question

The phrase “traumatic change” in Fowler’s summary of Vella’s research is doing real work. The shift from creation to supervision is not just a change in the mechanics of the job; it is a change in the identity structure of the profession. Most engineers became engineers because they found satisfaction in building things. There is a particular kind of flow state available when writing code that supervisory work does not replicate. Supervisory engineering is satisfying in different ways and frustrating in ways that creation work is not. You are fighting a different kind of friction.

None of this means the middle loop is worse work. It may ultimately be higher-leverage work — catching a wrong architectural decision in an AI-generated PR is more valuable than having written the correct architecture yourself. But it requires rethinking what expertise means in this profession, what we should be training, and what we should be hiring for.

The middle loop is real, it is growing, and the skills it demands are not the same as the skills it is replacing. What stays constant is the need to understand systems deeply. The middle loop is no place for someone who cannot tell good code from plausible-looking bad code, and building that discrimination still requires the kind of hands-on creation experience that AI tools are now automating away.
