Annie Vella spent time studying 158 professional software engineers to understand what AI tools were actually doing to their work. The central question she posed is worth quoting directly: “Are AI tools shifting where engineers actually spend their time and effort? Because if they are, they’re implicitly shifting what skills we practice and, ultimately, the definition of the role itself.”
She found that engineers had shifted from creation-oriented work to verification-oriented work, though not the kind of verification that maps to code review or testing. She proposed calling it supervisory engineering work: the effort required to direct AI, evaluate its output, and correct it when it’s wrong. Martin Fowler picked up on her framing in a recent fragment and added a spatial metaphor that makes the structural issue visible.
Three Loops, Not Two
Software engineers have long worked in two loops. The inner loop is the tight, fast cycle on your local machine: write code, build, run tests, debug, repeat. It operates in seconds to minutes. The outer loop is slower and team-facing: commit, open a pull request, wait for CI, get a review, merge, deploy, observe. It runs on a timescale of hours to days.
Fowler’s contribution to Vella’s framing is the observation that supervisory engineering work belongs to a third loop sitting between these two, a middle loop. AI increasingly automates the inner loop, handling code generation, the build-test cycle, and debugging. But a person still has to direct that work, evaluate what the model produced, and catch what it got wrong. That is a loop of its own, with its own rhythm and its own failure modes.
What makes this worth naming as a separate loop, rather than just a variant of code review, is the difference in cognitive situation. In code review you’re reading work produced by a colleague who understands the domain, can be asked questions, and signaled their intent through comments, variable naming, and commit messages. In the middle loop you’re evaluating output from a system whose reasoning is not accessible and which cannot tell you what it was uncertain about. The surface looks similar. The epistemic situation is different.
The Specification Problem
In the inner loop, understanding tends to form through the act of building. You write a function, the test fails because of an unhandled nil pointer, and you reshape your mental model in response. Requirements and architectural constraints get discovered as you collide with them. The loop is dense with feedback: compilers catch type errors, test runners surface wrong assumptions, the profiler shows you where your mental model of the performance profile was incorrect.
The middle loop requires the opposite. You have to specify intent before code exists. The AI generates from a prompt that should already be complete, and gaps in that specification don’t announce themselves. They produce plausible-looking code that works for the common case and fails for the cases you didn’t think to mention.
Consider a concrete example. You prompt a model to implement a feature that updates a record based on user input. The model produces clean, well-structured code that handles the happy path correctly. What you didn’t specify is that concurrent updates are possible and need to be serialized. The model has no way to know this constraint exists unless you tell it; the code it writes is locally plausible given your prompt. In the inner loop, you’d likely encounter this problem because you’d be inside the implementation, and the collision with your concurrency assumptions would force you to address it. In the middle loop, the bug is in what was not asked, not in the quality of what was generated.
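A minimal Python sketch of that gap, under stated assumptions: `Account`, `apply_deposit_unsafe`, `apply_deposit_serialized`, and `run_concurrently` are hypothetical names invented for illustration, and the short `sleep` stands in for whatever validation or I/O sits between reading and writing the record. The first function is the kind of locally plausible code an underspecified prompt might yield; the second adds the serialization the prompt never mentioned.

```python
import threading
import time
from dataclasses import dataclass, field


@dataclass
class Account:
    """Hypothetical record; stands in for any row updated from user input."""
    balance: int = 0
    lock: threading.Lock = field(default_factory=threading.Lock)


def apply_deposit_unsafe(account: Account, amount: int) -> None:
    # Read-modify-write with no serialization: correct for a single caller.
    current = account.balance
    time.sleep(0.001)  # stands in for validation or I/O between read and write
    account.balance = current + amount


def apply_deposit_serialized(account: Account, amount: int) -> None:
    # Same logic, but the whole read-modify-write runs under the record's lock.
    with account.lock:
        current = account.balance
        time.sleep(0.001)
        account.balance = current + amount


def run_concurrently(update, account: Account, workers: int = 20) -> int:
    """Apply `update(account, 1)` from `workers` threads; return the final balance."""
    threads = [
        threading.Thread(target=update, args=(account, 1)) for _ in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return account.balance


if __name__ == "__main__":
    print("unsafe:    ", run_concurrently(apply_deposit_unsafe, Account()))
    print("serialized:", run_concurrently(apply_deposit_serialized, Account()))
```

With twenty concurrent callers the serialized version always ends at 20, while the unsafe version will usually end lower because updates overwrite one another. Both versions pass a single-threaded happy-path test, which is why reading the generated code alone is unlikely to reveal the difference.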
Addy Osmani’s observations on the “70% problem” describe a related dynamic: AI gets you to functional quickly, and then the remaining work (edge cases, integration, correctness under real conditions) costs more time than the initial generation saved. The ratio depends heavily on how well you specified your intent before code generation began.
The Fluency Trap
AI-generated code has a property that makes middle-loop verification harder than it might appear. It is syntactically fluent. The naming is consistent, the formatting is correct, the idioms are appropriate for the language. It reads like it was written by a competent engineer. Code review has historically leaned on stylistic signals as proxies for confidence: well-structured functions with clear naming carry a higher prior probability of being correct. AI output breaks this heuristic, because fluency is a property of the generation process, not a signal of semantic correctness.
The mistakes tend to be semantic rather than syntactic. A 2024 Purdue University study found that roughly 40% of GitHub Copilot suggestions contained errors, and developers accepted them at high rates. The errors weren’t obvious; they were structurally indistinguishable from correct code at the surface level. A separate Stanford study found that developers using AI coding assistants were more likely to introduce security vulnerabilities, particularly through the omission of defensive checks that experienced engineers write habitually.
Those habits matter. They’re accumulated inner-loop specifications, tacit knowledge built through years of encountering the failure modes that result from not applying them. When engineers who built up that tacit knowledge supervise AI systems, they can notice when a generated function is missing a bounds check or skipping error handling. Engineers who have primarily worked in the middle loop from early in their careers will have less of that tacit knowledge to draw on, making their verification less reliable in exactly the situations where verification is most consequential.
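A small, hypothetical Python example of a missing bounds check that reads as perfectly fluent code. The function names and the 1-based `position` convention are assumptions made for illustration only.

```python
def get_item_fluent(items: list[str], position: int) -> str:
    # `position` is 1-based user input. Reads cleanly, but position=0 becomes
    # index -1, which Python treats as "last element" instead of an error.
    return items[position - 1]


def get_item_defensive(items: list[str], position: int) -> str:
    # The habitual check: reject anything outside 1..len(items) explicitly.
    if not 1 <= position <= len(items):
        raise IndexError(f"position {position} out of range 1..{len(items)}")
    return items[position - 1]


if __name__ == "__main__":
    orders = ["first", "second", "third"]
    print(get_item_fluent(orders, 0))  # prints "third": silently wrong
    try:
        get_item_defensive(orders, 0)
    except IndexError as exc:
        print("rejected:", exc)
```

Nothing about the first function looks wrong on the page. A reviewer notices the problem only by already knowing that negative indices wrap around, which is the kind of tacit knowledge the paragraph above describes.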
What Automation Research Has Already Said About This
This structural dynamic is not new to automation theory. Lisanne Bainbridge described it in a 1983 paper called “Ironies of Automation” in the journal Automatica. Her first irony is that the more reliable an automated system becomes, the worse the human operator gets at the task being automated, because they have fewer opportunities to practice it. Her second irony is that automated systems tend to fail in unusual situations, the edge cases where judgment matters most, and humans in monitoring roles are in the wrong mental state for rapid high-stakes problem-solving when automation fails. Passive vigilance is not the same state as active execution.
Aviation has accumulated concrete evidence of this. The FAA documented measurable degradation in manual flying skills among commercial pilots relying heavily on autopilot, and investigations into incidents like Air France 447 cited degraded manual skills as contributing factors. In 2013 the FAA issued Safety Alert SAFO 13002, explicitly recommending that pilots practice manual flight more frequently. Airlines built minimum manual flying requirements into recurrent training.
The parallel is direct. The inner-loop skills AI is automating are also the skills that make middle-loop verification reliable. Using AI to replace inner-loop work while depending on inner-loop expertise to evaluate the output is structurally unstable. The skills don’t sustain themselves as a side effect of supervisory work; they require deliberate practice.
Better Models Make This Harder, Not Easier
Vella’s research concluded in April 2025. Fowler notes that model improvements since then have accelerated the shift toward supervisory engineering rather than reversing it, and the mechanism is worth understanding.
A model that is wrong 40% of the time produces errors that are often obvious: wrong return types, logic that breaks on the first test case. A model that is wrong 10% of the time produces subtler errors: edge cases, race conditions, off-by-one bugs, assumptions that fail for specific input combinations. Catching these requires reasoning about correctness at the level of intent rather than at the level of structure, and that is harder precisely because the erroneous code is structurally indistinguishable from correct code.
Better models also expand scope. Early AI tools were reliable on well-scoped, high-frequency patterns: boilerplate, standard data structures, common API usage. Better models are increasingly applied to higher-stakes territory: authentication logic, cross-cutting refactors, security-sensitive code. BCG research on GPT-4 described a “jagged frontier” where model capability is uneven across task types, and as the frontier expands it becomes harder for engineers to know which side of it they’re on for a given task. The supervisory burden scales with the stakes of what the model is attempting.
GitClear’s analysis of 211 million lines of code found increasing churn in repositories with heavy AI usage, meaning code written and then substantially revised within two weeks. Churn is a signal that verification is not keeping pace with generation speed. The errors pass initial review and surface during integration or testing.
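To make the metric concrete, here is a rough Python sketch of a churn rate using the two-week window described above. It is an illustration only, not GitClear's methodology: the input shape (one `(authored_at, revised_at)` pair per change, with `None` for changes never revised) and the name `churn_rate` are assumptions.

```python
from datetime import datetime, timedelta
from typing import Optional

CHURN_WINDOW = timedelta(days=14)  # "within two weeks", per the text above


def churn_rate(changes: list[tuple[datetime, Optional[datetime]]]) -> float:
    """Fraction of changes that were revised within CHURN_WINDOW of authoring."""
    if not changes:
        return 0.0
    churned = sum(
        1
        for authored_at, revised_at in changes
        if revised_at is not None
        and timedelta(0) <= revised_at - authored_at <= CHURN_WINDOW
    )
    return churned / len(changes)


if __name__ == "__main__":
    day0 = datetime(2025, 1, 1)
    sample = [
        (day0, day0 + timedelta(days=3)),   # revised quickly: counts as churn
        (day0, day0 + timedelta(days=20)),  # revised after the window
        (day0, None),                       # never revised
        (day0, None),
    ]
    print(churn_rate(sample))  # 0.25
```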
The Infrastructure Gap
The inner loop has decades of feedback infrastructure. Compilers, type checkers, test runners, linters, profilers: all designed to close feedback loops quickly and specifically, to tell you immediately if something was wrong and where to look. The outer loop has CI/CD pipelines, code review conventions that have evolved over thirty years, monitoring, and observability systems. Both loops have mature tooling because engineers recognized them as loops worth investing in.
The middle loop has almost none of this. Evaluation is largely manual: read the generated code, run the tests if they exist, reason about whether the approach is sound. There is no tool that tells you whether your prompt was well-specified. There is no mechanism for tracking where models have been consistently wrong on tasks similar to yours. There is no diff interface designed specifically for comparing AI output against intent rather than against a previous version. Prompt engineering best practices are scattered across blog posts and informal community knowledge, not grounded in the kind of systematic study that produced structured code review practices.
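As a sketch of what tracking "where models have been consistently wrong" could look like at its simplest, the Python below keeps a log of supervisory decisions and computes, per task kind, how often an accepted change later turned out to be defective. Everything here is hypothetical: `ReviewRecord`, its fields, the task-kind labels, and `escape_rate_by_kind` are illustrative names, not an existing tool.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewRecord:
    task_kind: str            # illustrative labels, e.g. "auth", "boilerplate"
    accepted: bool            # did the engineer accept the generated change?
    defect_found_later: bool  # did a defect surface after that decision?


def escape_rate_by_kind(records: list[ReviewRecord]) -> dict[str, float]:
    """Per task kind: fraction of accepted changes that later showed a defect."""
    accepted: Counter = Counter()
    escaped: Counter = Counter()
    for record in records:
        if record.accepted:
            accepted[record.task_kind] += 1
            if record.defect_found_later:
                escaped[record.task_kind] += 1
    return {kind: escaped[kind] / count for kind, count in accepted.items()}


if __name__ == "__main__":
    log = [
        ReviewRecord("auth", accepted=True, defect_found_later=True),
        ReviewRecord("auth", accepted=True, defect_found_later=False),
        ReviewRecord("auth", accepted=False, defect_found_later=True),
        ReviewRecord("boilerplate", accepted=True, defect_found_later=False),
    ]
    print(escape_rate_by_kind(log))  # {'auth': 0.5, 'boilerplate': 0.0}
```

A per-kind escape rate like this would at least tell an engineer which categories of task deserve slower, more skeptical review, though a real system would also need a reliable way to attribute later defects back to specific accepted changes.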
The tools that exist, LangSmith and similar evaluation frameworks, are aimed at evaluating AI systems at scale, not at helping individual engineers calibrate their supervisory judgment in the moment. That is a different problem.
What This Demands
Vella’s research identified a shift that is real and structural. Naming it as supervisory engineering work is useful because it clarifies that this is not just code review with a different author. It’s a different cognitive task with different failure modes, and the infrastructure to support it does not exist yet.
Engineers who want to be effective in the middle loop have two independent problems. The first is building the tooling and conventions for middle-loop work: interfaces that make model reasoning transparent, test harnesses that probe the edge cases models tend to miss, ways to systematically track verification accuracy over time. The second is maintaining the inner-loop fluency that makes verification tractable in the first place, through deliberate practice on tasks where you’re generating code yourself, navigating codebases manually, and staying inside problems long enough to encounter their edge cases.
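One possible shape for a test harness that probes edge cases is sketched below in Python. It runs a function against a list of boundary inputs and reports every input that either raises or produces a result violating a stated property. The names `probe_edge_cases` and `percent_of` are hypothetical, and `percent_of` stands in for a generated function under review.

```python
from typing import Any, Callable


def probe_edge_cases(
    func: Callable[..., Any],
    cases: list[tuple],
    is_valid: Callable[[tuple, Any], bool],
) -> list[tuple[tuple, str]]:
    """Return (args, reason) for each case where func raises or fails is_valid."""
    failures = []
    for args in cases:
        try:
            result = func(*args)
        except Exception as exc:  # broad on purpose: any crash is a finding
            failures.append((args, f"raised {type(exc).__name__}"))
            continue
        if not is_valid(args, result):
            failures.append((args, f"invalid result {result!r}"))
    return failures


def percent_of(part: int, whole: int) -> float:
    # Stand-in for generated code: fine for the common case.
    return 100 * part / whole


if __name__ == "__main__":
    boundary_inputs = [(1, 2), (0, 5), (1, 0), (3, 2)]
    findings = probe_edge_cases(
        percent_of,
        boundary_inputs,
        is_valid=lambda args, result: 0 <= result <= 100,
    )
    for args, reason in findings:
        print(args, reason)
```

The harness is only as good as the boundary inputs and the property supplied to it, and choosing those still depends on the inner-loop experience discussed above.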
The aviation response to automation-induced skill degradation was not primarily tooling. It was deliberately engineering humans back into the loop on a schedule, mandating manual practice not as a backup skill but as an ongoing maintenance activity. The structural logic applies here. A middle loop built on top of deteriorating inner-loop expertise is not a stable architecture.