A study of 158 professional software engineers, published as a thesis by Annie Vella and summarized recently by Martin Fowler, puts a name to something many developers have been feeling but struggling to articulate: the emergence of supervisory engineering work. The finding is that AI tools are not simply making engineers faster at the same tasks. They are shifting what engineers spend their time doing, from generating code to directing, evaluating, and correcting AI-generated code. That shift sounds incremental, but the implications run deeper than most productivity framing acknowledges.
The Loop Model and Where It Falls Short
Software developers have long used a two-loop mental model to describe their work. The inner loop is the rapid, local cycle: write code, run it, see what breaks, debug, iterate. The feedback latency here is seconds to minutes. The outer loop is the slower cycle that begins when code leaves your machine: commit, open a pull request, wait for CI, get reviewed, merge, deploy, observe in production. The feedback latency here is minutes to days.
This model has been the foundation of developer experience research, including the DORA metrics framework from Nicole Forsgren, Jez Humble, and Gene Kim, which focuses heavily on outer loop efficiency, and the growing body of work around inner loop measurement from teams like DX Research. The underlying assumption is that these two loops account for where engineering time goes.
Vella’s research suggests that assumption is breaking. AI coding assistants are increasingly automating the inner loop. Code generation, test scaffolding, boilerplate, even debugging suggestions: these are now AI-assisted or AI-handled. The outer loop remains largely human-driven, though AI code review tools are starting to encroach there too. Between the two, a new layer of cognitive work is accumulating that does not fit neatly into either category. Vella calls this the middle loop, and defines it as the supervisory effort required to direct AI, evaluate its output, and correct it when it goes wrong.
What Supervisory Work Actually Looks Like
Supervision is not prompt engineering. The popular framing of prompt engineering suggests that the main skill shift is learning how to ask AI better questions. Vella’s participants described something more continuous and demanding than that. They spent time reading AI output critically, running it mentally against their knowledge of the domain, testing it in ways the AI had not anticipated, and deciding what was usable, what needed modification, and what needed to be thrown out entirely.
This is verification work, but it is different from the verification that shows up in code review. Traditional code review happens after code has been written, mentally constructed, and committed by a human who understood the problem before writing the solution. The reviewer is checking work against a shared understanding of intent. Supervisory verification is different: the engineer is evaluating code they did not mentally construct, for a problem the AI may have partially or incorrectly understood, produced at a speed that outpaces the engineer’s ability to trace through it carefully.
The GitHub Copilot productivity study from Peng et al. (2023) found that developers completed a programming task 55.8% faster with Copilot than without it. That number is real, but the task was a self-contained, greenfield implementation, optimally suited to code generation. The verification cost in that context is low because the problem is well-specified and the output is easy to test. In maintenance work on large codebases, with poorly specified requirements and deep contextual dependencies, the verification cost rises sharply. The generation speed stays roughly constant. The bottleneck moves.
The Skills Inversion
The most interesting structural consequence of supervisory engineering is that it inverts the relationship between domain knowledge and task performance. In traditional software development, deep domain knowledge was primarily a prerequisite for producing good code. You needed to understand the system to write something correct. With AI assistance, domain knowledge becomes primarily a prerequisite for evaluating generated code. You need to understand the system to judge whether what the AI produced is right, even if you no longer need that knowledge to produce the code yourself.
This is a meaningful inversion. It means that AI tools provide the largest productivity uplift to engineers who already know the domain well, because they can evaluate AI output efficiently and catch errors quickly. Engineers who are still learning a domain face a different dynamic: they get fast code but may lack the judgment to assess its correctness, which shifts where their errors occur, from mistakes in writing code to mistakes in supervising it.
The BCG study by Dell’Acqua et al. (2023) on consultants using GPT-4 captured this through the concept of the “jagged frontier”: AI assistance produced strong gains on tasks within the model’s capability boundary and actual performance degradation on tasks outside it, because participants could not reliably identify which side of the boundary they were on. The supervisory skill is, in part, the ability to locate that frontier accurately.
The Verification Bottleneck
Generation speed now exceeds human verification speed in many practical settings. A developer working with an agentic coding assistant can receive hundreds of lines of code, spanning multiple files, in the time it would previously have taken to write a single function. The code may be mostly correct. It may contain one subtle error. The cost of verifying it is not proportional to its quality: you have to read all of it to find the one error, or you have to accept a higher defect rate.
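A toy illustration of that asymmetry, using a hypothetical function of the kind an assistant might generate. The code reads correctly at a glance, and most of it is correct; the cost of supervision comes from the one line that is not:

```python
def moving_average(xs, window):
    """Average of each consecutive window of `window` values."""
    # Plausible AI output: clean, idiomatic, and subtly wrong.
    # range() should run to len(xs) - window + 1, so the final
    # window is silently dropped.
    return [sum(xs[i:i + window]) / window
            for i in range(len(xs) - window)]

data = [1, 2, 3, 4]
print(moving_average(data, 2))  # [1.5, 2.5] — the (3, 4) window is missing
```

Nothing crashes, the output looks reasonable, and a reviewer skimming for style would pass it. Finding the off-by-one requires tracing the loop bounds by hand, which is exactly the verification work that does not scale with generation speed.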
This dynamic shows up in the data. A GitClear analysis of 211 million lines of code (2024) found a significant increase in code churn, meaning code written and then reverted or substantially modified within two weeks, that correlated with AI tool adoption. The code was being generated faster than it was being verified before merge. The verification step was not keeping pace.
This is the middle loop problem in concrete form. The inner loop accelerates. The outer loop stays roughly the same. The work that used to live in the inner loop, the careful mental tracing of what code actually does, migrates into a new supervisory layer that sits between the two, and that layer has no clear tooling, no established metrics, and no formal workflow support.
What This Changes About the Role
Vella’s research finished in April 2025, before the most recent generation of frontier coding models was widely deployed. Fowler’s commentary is that the improvement in models has not reduced the supervisory burden; it has accelerated the shift toward it. More capable AI means more code generated per unit time, which means more code to supervise per unit time, which means supervisory skill becomes more important, not less.
The Stack Overflow Developer Survey 2024 found that 76% of developers were using or planning to use AI tools, with code generation as the primary use case. A significant fraction reported concerns about output accuracy. The concern is well-founded, but the response to it, more careful supervision, is also more expensive than it looks in productivity metrics, because those metrics measure generation speed, not verification quality.
The middle loop concept matters because it gives a name to work that currently falls through the cracks of how engineering is measured and taught. Developer productivity frameworks measure lines of code, commit frequency, PR cycle time, and deployment frequency. None of these capture the quality of supervisory work. A developer who catches every AI error before it reaches review looks, in current metrics, identical to a developer who catches none of them, until the defects surface in production.
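The measurement gap can be made concrete with a toy sketch. The numbers below are invented for illustration, and `dashboard_view` is a hypothetical stand-in for the throughput metrics current frameworks collect:

```python
# Hypothetical numbers: two developers who differ only in how much
# supervisory work they do before their PRs reach review.
dev_a = {"prs_merged": 40, "median_cycle_hrs": 6,
         "ai_errors_caught": 25, "ai_errors_escaped": 1}
dev_b = {"prs_merged": 40, "median_cycle_hrs": 6,
         "ai_errors_caught": 2, "ai_errors_escaped": 24}

def dashboard_view(dev):
    # Only the fields that throughput-oriented metrics actually record.
    return (dev["prs_merged"], dev["median_cycle_hrs"])

print(dashboard_view(dev_a) == dashboard_view(dev_b))  # True: indistinguishable
```

The supervisory signal lives entirely in the fields the dashboard never reads, and the escaped errors surface later, in production, where they are rarely attributed back to a supervision failure.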
Building Skills for the Middle Loop
If supervisory engineering is a real and growing part of the role, the skills it demands deserve explicit attention. The ability to read unfamiliar code critically and quickly has always mattered, but it now matters more than it did when engineers were producing the code they reviewed. The ability to write targeted tests for code you did not write, the ability to identify when an AI has misunderstood the problem statement rather than just the implementation, and the ability to recognize when a solution is locally correct but architecturally wrong: these are supervisory competencies that are distinct from the generative competencies that most engineering training and practice have historically emphasized.
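The first of those competencies, targeted testing of code you did not write, can be sketched briefly. `normalize_whitespace` below is a hypothetical AI-generated function; the supervisory move is to probe the inputs the prompt never mentioned rather than re-read every line:

```python
# Hypothetical AI-generated helper, accepted without line-by-line reading.
def normalize_whitespace(s):
    return " ".join(s.split())

# Supervisory tests target boundaries the AI was never asked about:
assert normalize_whitespace("a  b") == "a b"        # the happy path
assert normalize_whitespace("") == ""               # empty input
assert normalize_whitespace("   ") == ""            # whitespace-only input
assert normalize_whitespace("a\tb\nc") == "a b c"   # mixed whitespace
```

The point is not the specific function but the shape of the work: the tests encode the engineer's model of the problem, and a failure reveals that the AI solved a slightly different problem than the one that was posed.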
The framing of “supervisory engineering” is useful precisely because it separates this work from adjacent categories. It is not code review, because it happens before commit. It is not debugging, because the code may be syntactically and semantically correct while still being wrong for the problem. It is not prompt engineering, because most of the cognitive work happens in evaluation, not generation. It is a new category of engineering labor, and naming it is the first step toward treating it seriously.