Martin Fowler’s recent note on Annie Vella’s research is short, but the idea it introduces deserves more unpacking than it gets. Vella studied 158 professional software engineers and found that AI tools are shifting their work from creation to verification, but not the ordinary verification of code review. She named the new pattern “supervisory engineering work”: directing AI, evaluating what it produces, and correcting it when wrong. Fowler maps this onto a spatial metaphor. There is an inner loop, where you write code, run tests, trace failures, and fix bugs. There is an outer loop, where you commit, open pull requests, wait for CI, deploy, and observe. Between them, he proposes, something new is forming: a middle loop where engineers supervise AI doing what they used to do by hand.
The framing is useful. The inner and outer loop vocabulary already had concrete meaning: Microsoft’s developer experience documentation defines the inner loop as the local cycle measured in seconds to minutes, while DORA metrics track the outer loop through deployment frequency and change failure rate. The middle loop lands between them temporally: faster than waiting for a PR review, slower than autocomplete. What it lacks is the support structure that both of the other loops developed over decades.
But the observation that most concerns me is Fowler’s aside about model improvements. Vella’s research concluded in April 2025. Since then, AI coding capabilities have advanced substantially. Fowler’s read is that this has accelerated the shift toward supervisory engineering rather than moderated it. That is the right read, and the reason why is worth examining carefully.
What Supervision Actually Requires
The middle loop is not just reviewing code that someone else wrote. The cognitive task is different in a specific way. When you review a colleague’s code, you have access to a person with intent, memory, and context you can query. You can ask why they made a particular choice. Edge cases they skipped were decisions, even if implicit ones. The review is partly a transfer of understanding.
AI-generated code is syntactically fluent without preserving any of that. The model produces code that looks authoritative because fluency is a statistical property of the generation process, not a signal of semantic correctness. A Purdue University study found that roughly 40% of GitHub Copilot suggestions contained errors, with developers accepting them at high rates. A Stanford study on security found that developers using AI coding assistants introduced more security vulnerabilities than those who did not. The heuristics that make code review fast under normal conditions (trusting a certain class of pattern, recognizing a colleague’s approach, following the logic of an incremental change) do not transfer cleanly to evaluating AI output at volume.
The specification problem compounds this. In the inner loop, your understanding of a problem and the code that solves it develop together. You hit edge cases during construction, before anything ships. When you prompt an AI instead, you must fully specify what you want before the code exists. Gaps in specification produce plausible-looking output and stay invisible until the wrong moment. Addy Osmani’s piece on the 70% problem describes this exactly: AI gets a system to roughly functional quickly, and the remaining work (edge cases, integration, correctness under real conditions) often costs more than the generation saved. That cost moved into the middle loop and became less visible because outer loop metrics looked fine.
Fowler, Rebecca Parsons, and Unmesh Joshi worked through a version of this in an earlier conversation on the what/how loop: the challenge is mapping the “what” precisely enough for AI to generate a correct “how”. That mapping skill is domain knowledge made explicit. Engineers with deep domain experience carry large reserves of implicit specification. Their mental models of the problem space are detailed enough that prompting produces useful output and verification catches meaningful errors. Engineers earlier in their careers have smaller reserves, and AI tools are, as a result, simultaneously most attractive to them and least reliable in their hands.
The Bainbridge Connection
This is not a new pattern. Lisanne Bainbridge’s 1983 paper “Ironies of Automation” described the same structural problem in chemical plants and nuclear facilities. Her first irony: the more reliable automation is, the less practice operators get at the manual skill, so they are least capable exactly when automation fails and most needs them. Her second: automated systems fail in unusual situations, which are precisely the situations where human judgment matters most. Her third: passive monitoring is the wrong cognitive mode for rapid high-stakes problem-solving, and switching modes under pressure is unreliable.
Aviation worked through this for decades. Research on commercial pilots documented measurable degradation in manual flying skills under heavy autopilot reliance. Both the Air France 447 and Asiana 214 accidents had automation dependency as a contributing factor. The FAA responded with explicit guidance recommending more manual flying during initial and recurrent training, essentially building deliberate inner loop practice back into a profession that had optimized it away.
Software engineering has no regulatory equivalent. There is no requirement that engineers who spend their days supervising AI output also spend time building systems from scratch. The BCG “Jagged Frontier” research found that AI-assisted workers performed worse than unassisted ones on tasks outside the AI’s capability boundary, precisely because they could not reliably identify which side of the boundary they were on. That failure mode is a supervision failure, and it traces to a degraded foundation of manual skill.
The generation effect in cognitive psychology captures part of why this matters: information you produce yourself is retained more deeply than information you review. Writing a function, watching it fail, tracing the failure, and fixing it is a different cognitive event from reading a function and deciding whether it is correct. The inner loop was where most of that deep learning happened. The middle loop inherits the output of that learning without providing the conditions to replace it.
SRE as a Precedent for What’s Missing
Site reliability engineering is, structurally, supervisory engineering over production systems. SREs direct automation, evaluate its behavior, and correct it when wrong. The discipline took roughly twenty years to develop the infrastructure that makes this work: SLOs to define acceptable system behavior, error budgets to formalize the tradeoff between reliability and velocity, blameless postmortems to extract learning from failures, toil recognition to make invisible maintenance work visible, and explicit frameworks for attributing the value of avoiding incidents that never happen. The Google SRE book appeared in 2016, about thirteen years after the discipline was founded at Google around 2003.
The middle loop has none of this yet. There are no SLOs for AI-generated code quality, no error budgets for supervision failures, no systematic postmortem process for AI-introduced bugs that made it to production, no toil recognition for the work of evaluating code that is subtly wrong, no measurement of how reliably individual engineers catch errors before they ship. The career attribution problem is acute: an engineer who consistently identifies subtle AI errors before they reach production provides enormous value that is invisible in output-based performance metrics. A team that generates twice as many PRs with AI assistance and merges them just as quickly is not necessarily moving faster on any measure that matters.
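To make the SRE analogy concrete, here is a minimal sketch of what an error budget for supervision failures might look like if a team chose to track one. Everything in it is an assumption for illustration: the `SupervisionWindow` fields, the idea of attributing defects to AI-generated changes, and the 98% target are hypothetical, not an established practice or anything Fowler or Vella proposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SupervisionWindow:
    """Counts for one reporting window. All fields are hypothetical."""
    ai_changes_merged: int         # AI-generated changes that passed human review
    defects_caught_in_review: int  # AI-introduced defects found before merge
    defects_escaped: int           # AI-introduced defects found after merge


def escape_rate(w: SupervisionWindow) -> float:
    """Escaped defects per merged AI-generated change."""
    if w.ai_changes_merged == 0:
        return 0.0
    return w.defects_escaped / w.ai_changes_merged


def catch_rate(w: SupervisionWindow) -> float:
    """Share of known AI-introduced defects that review caught before merge."""
    total = w.defects_caught_in_review + w.defects_escaped
    if total == 0:
        return 1.0  # nothing known to catch, so nothing was missed
    return w.defects_caught_in_review / total


def error_budget_remaining(w: SupervisionWindow, slo_target: float) -> float:
    """Fraction of the window's error budget still unspent.

    slo_target is the assumed share of merged AI changes that must stay
    defect-free (e.g. 0.98). A negative result means the budget is overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if w.ai_changes_merged == 0:
        return 1.0
    allowed_escapes = (1.0 - slo_target) * w.ai_changes_merged
    return 1.0 - w.defects_escaped / allowed_escapes


if __name__ == "__main__":
    window = SupervisionWindow(
        ai_changes_merged=200,
        defects_caught_in_review=18,
        defects_escaped=3,
    )
    print(f"escape rate: {escape_rate(window):.2%}")
    print(f"catch rate: {catch_rate(window):.2%}")
    print(f"budget remaining: {error_budget_remaining(window, 0.98):.2%}")
```

The arithmetic is trivial; the hard part, and the part the field has not solved, is the input data. Someone has to decide which defects count as AI-introduced and record who caught them, which is the attribution work the middle loop currently lacks.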
Kief Morris’s piece on humans and agents in software engineering loops argues that the right place for humans is to design the feedback cycle, not to be inside it. That is the SRE insight applied to the development workflow. Designing a feedback cycle well requires understanding what you are measuring, why certain errors are costly, and what warning signs to wire up. That knowledge came from years inside the loop.
The Bootstrapping Problem
Engineers who spent a decade in the inner loop before capable AI coding tools existed have a foundation to supervise from. They built mental models of how code fails, which patterns are fragile, and what “complete” actually means in their domain. The middle loop spends down that foundation without replenishing it.
Engineers entering the field now face a different situation. If AI handles code generation by default and most working time falls in the middle loop, the training that makes middle loop supervision reliable never happens. The field has no answer to this. Hiring processes still test inner loop skills, but the incentive to build those skills deliberately is eroding as AI tools make generating a working solution increasingly accessible without them.
This is not an argument against AI coding tools. The productivity evidence is real, even if controlled study numbers overstate it compared to full-workflow measurements. The argument is that Fowler’s middle loop is not self-sustaining. It depends on a knowledge base built in the inner loop, and as AI takes more of the inner loop, the field needs to think explicitly about how that knowledge base gets rebuilt, not just consumed.