Annie Vella’s study of 158 professional software engineers, synthesized by Martin Fowler, introduces the concept of supervisory engineering and the middle loop. The core finding is a shift from creation-oriented to verification-oriented work. What the research establishes clearly is that the ability to supervise AI output well depends heavily on domain expertise. What it leaves less examined is where that expertise comes from, and what happens when the work that used to build it is being automated away.
The Generation Effect in Software Development
Cognitive psychologists have a term for something most developers know intuitively: the generation effect. Information you produce yourself is retained more durably and understood more deeply than information you passively receive. When you write a function, you resolve every ambiguity in your mental model before the compiler lets you proceed. If your understanding of a system has a gap, you discover it by writing through it.
This is why the early stages of a software engineering career look the way they do. Junior engineers write a lot of code, fix bugs they introduced themselves, trace through systems they imperfectly understand, and slowly build the pattern library that makes experienced engineers effective. The frustration and inefficiency are not incidental. They are the mechanism.
AI tools shorten this cycle. A junior developer who once would have spent two hours building a pagination component from first principles, discovering the difference between cursor-based and offset-based approaches the hard way, can now get a working implementation in thirty seconds. This is described as a productivity gain, and measured in the immediate term, it is.
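The pagination distinction mentioned above is exactly the kind of lesson the two-hour version teaches. A minimal sketch of both approaches, using an in-memory list as a stand-in for a database table (all names here are hypothetical, for illustration only):

```python
# Toy illustration: the same page of results fetched two ways, over an
# in-memory list standing in for a table ordered by ascending id.
ROWS = [{"id": i, "name": f"item-{i}"} for i in range(1, 101)]

def page_by_offset(rows, offset, limit):
    """Offset pagination: skip `offset` rows, take `limit`.
    Simple, but rows inserted or deleted before the offset shift the
    window, so page 2 can repeat or skip items seen on page 1."""
    return rows[offset:offset + limit]

def page_by_cursor(rows, after_id, limit):
    """Cursor (keyset) pagination: take `limit` rows with id > after_id.
    Stable under concurrent inserts/deletes, because the anchor is a
    value in the data rather than a position in the result set."""
    result = [r for r in rows if r["id"] > after_id][:limit]
    next_cursor = result[-1]["id"] if result else None
    return result, next_cursor

# A user reads page 1, then a row they already saw is deleted.
page1, cursor = page_by_cursor(ROWS, after_id=0, limit=10)
rows_after_delete = [r for r in ROWS if r["id"] != 1]

offset_page2 = page_by_offset(rows_after_delete, offset=10, limit=10)
cursor_page2, _ = page_by_cursor(rows_after_delete, after_id=cursor, limit=10)
# Offset page 2 now starts at id 12 (id 11 was silently skipped);
# cursor page 2 still starts at id 11.
print(offset_page2[0]["id"], cursor_page2[0]["id"])
```

The subtle failure mode, skipped or duplicated rows under concurrent writes, is precisely the sort of thing a junior developer learns by hitting it, and precisely the sort of thing a generated implementation can get wrong without anyone noticing.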
The Expertise Paradox
The problem surfaces when you consider what “working” means for AI-generated code. Vella’s finding is that evaluating AI output correctly requires deep domain knowledge. The BCG jagged frontier study found that AI users who could not accurately locate the boundary of AI’s competence showed actual performance degradation compared to non-AI users; they committed to AI-generated output on tasks where the AI was subtly wrong, without recognizing the frontier had been crossed. Research from Purdue found GitHub Copilot producing incorrect code in roughly 40% of test cases, with high uncritical acceptance rates among developers.
These findings converge on a structural problem: the skill required to evaluate AI output is precisely the skill that would previously have been developed by writing and debugging code manually. AI tools offer the most benefit to engineers who already have that skill. They offer diminishing or negative returns to engineers who do not, but those engineers are also less likely to recognize when the returns are negative.
The conventional wisdom holds that AI tools are democratizing, giving junior developers access to output quality that previously required experience. The data suggests this framing is backwards: AI tools are amplifiers for existing expertise, not substitutes for it.
A Missing Longitudinal Dimension
Vella’s study was completed in April 2025, and Fowler notes that model capabilities have improved significantly since then, accelerating the shift toward supervisory engineering. That acceleration makes the pipeline problem more urgent: the better AI gets at generating code, the more the middle loop expands, and the more domain expertise supervisory work requires.
But the research has a temporal gap. The engineers in Vella’s study are experienced professionals who developed their expertise before AI tools were pervasive. Their supervisory judgment was built on years of inner loop work that AI is now automating. Engineers entering the profession today, with AI tools available from the start of their careers, present a different case that the existing research does not yet capture.
This gap is not a criticism of the study. The cohort of engineers who have completed a full junior-to-senior progression with AI tools as a constant is not large enough to study yet. The available data resembles studying expert pilots who learned to fly manually and later adopted autopilot, without yet being able to study pilots who learned primarily with autopilot engaged.
An Aviation Parallel Worth Taking Seriously
The aviation industry went through a structurally similar transition. When autopilot systems became capable enough that most commercial flights were conducted largely hands-off, manual flying proficiency degraded among pilots who relied heavily on automation. The FAA’s response was not to restrict autopilot use; it was to require periodic manual flying proficiency checks and to develop Crew Resource Management as a distinct taught discipline, focused on the human oversight skills needed when automation fails or reaches its limits.
Software engineering has no equivalent. There are no requirements for demonstrating proficiency in manual code construction. There is no structured training program for supervisory engineering. There is no credentialing framework that distinguishes engineers who can accurately evaluate AI output from those who cannot. Lisanne Bainbridge’s 1983 paper on the ironies of automation observed that the more capable an automated system becomes, the more critical human oversight becomes when automation fails, but the less opportunity humans have to maintain the skills that oversight requires. That dynamic is now a software engineering problem.
The closest analog in software is informal mentorship in healthy engineering teams, where senior engineers review junior work and explain why AI-generated solutions do or do not fit the specific context. Those structures are valuable, but they are unevenly distributed and not designed with supervisory skill-building as an explicit goal.
The Hiring Displacement and Its Downstream Cost
This is not a hypothetical trajectory. Reporting across the industry describes significant reductions in junior engineer hiring, partly driven by productivity claims suggesting that senior engineers with AI tools can do what previously required larger teams. The logic is locally coherent: if AI generates the code, fewer people need to generate code.
What this logic misses is the pipeline function that junior roles serve. Junior engineers become senior engineers. Senior engineers are the ones with the domain expertise to supervise AI output effectively. If the pipeline narrows because junior roles are perceived as redundant, the expertise pool available for supervisory engineering work shrinks over a five-to-ten year horizon in ways that are not visible in near-term productivity metrics.
The METR study from 2025 found that experienced open-source developers were roughly 19% slower with AI assistance on familiar codebases, even as they believed they were faster, with high variance across tasks and individuals. That variance is the point: the engineers best positioned to benefit from AI assistance are those with the deepest existing expertise, while junior engineers getting flat or negative returns are also the ones who most need the learning opportunities that AI is displacing.
What Supervisory Engineering Demands from How We Train Engineers
The name matters, as Fowler notes. Calling this supervisory engineering rather than “reviewing AI output” implies that there are better and worse ways to do it, that it can be taught and practiced, that performance can be measured, and that career progression in it looks like something other than just becoming a faster prompt writer.
If the field takes that seriously, the implications for how junior engineers are trained are significant. Deliberate practice of manual implementation alongside AI-assisted work, explicit teaching of domain-specific failure modes in AI output, and structured review processes where supervisory decisions are made visible rather than implicit: all of these would need to be designed into onboarding programs rather than left to incidental discovery.
None of that is especially novel from a pedagogical standpoint. What would be novel is the field recognizing that the generation effect cannot be fully substituted by verification, and treating the development of supervisory skill as something requiring active investment. The GitClear analysis of 211 million lines of code found increasing code churn correlating with AI adoption, code written and then substantially modified or reverted within two weeks. That churn is largely invisible in productivity metrics and hiring decisions. It may be the early signal of what happens when verification skill does not keep pace with generation speed.
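The churn measure referenced above can be made concrete. GitClear’s exact methodology is not described here, so this is a toy sketch of the underlying idea, assuming "churn" means a line rewritten or reverted within two weeks of being authored (the function and data names are hypothetical):

```python
from datetime import date, timedelta

# Toy sketch of the code-churn idea, not GitClear's actual methodology:
# a line counts as "churned" if it is modified or reverted within
# 14 days of being written.
CHURN_WINDOW = timedelta(days=14)

def churn_rate(line_histories):
    """line_histories: list of (day_written, day_rewritten_or_None),
    one entry per line of code. Returns the fraction of lines that
    were rewritten within the churn window."""
    churned = sum(
        1 for written, rewritten in line_histories
        if rewritten is not None and rewritten - written <= CHURN_WINDOW
    )
    return churned / len(line_histories)

histories = [
    (date(2025, 3, 1), None),               # line survived unchanged
    (date(2025, 3, 1), date(2025, 3, 5)),   # rewritten in 4 days: churn
    (date(2025, 3, 1), date(2025, 4, 20)),  # rewritten much later: not churn
    (date(2025, 3, 2), date(2025, 3, 10)),  # rewritten in 8 days: churn
]
print(churn_rate(histories))  # 2 of 4 lines churned
```

A metric like this is cheap to compute from version-control history, which is part of the argument: the early signal exists in the data teams already have, even if it does not show up in the productivity numbers they report.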
The middle loop concept is useful because it names work that currently falls through the cracks of how engineering is measured and taught. The harder follow-on question is how to ensure the expertise required to do that work well continues to develop in a profession that is rapidly automating the experiences that used to build it.