Annie Vella’s research on 158 professional software engineers, summarized recently by Martin Fowler, establishes the concept of supervisory engineering and the middle loop: the new layer of work between the traditional inner loop of coding and the outer loop of shipping. Fowler’s commentary includes an observation that deserves more attention than it typically receives. Vella’s study concluded in April 2025, before the most recent round of significant model improvements. Fowler’s assessment is that those improvements have not reduced supervisory burden. They have accelerated the shift toward it.
This runs against what most people intuitively expect. Better models have lower error rates. Lower error rates should mean less to catch, which should mean less verification work. The common expectation is that supervisory engineering is a transitional concern, a gap between current AI capabilities and some future point where AI is reliable enough that verification becomes lightweight. If that expectation were right, the appropriate institutional response to the middle loop would be patience.
The evidence points in the other direction, and the mechanism is worth understanding precisely.
Volume and Capability Scale Together
The acceleration in model capabilities over the past two years is measurable and ongoing. SWE-bench, a benchmark tracking AI resolution of real GitHub issues, recorded resolution rates around 12% in early 2024. By mid-2025, leading models were resolving above 40% of test cases. That is not a modest accuracy improvement on a static task profile. It reflects a change in what AI can take on: more complex multi-file changes, longer chains of reasoning, tasks that previously required substantial human decomposition before AI assistance was useful.
Better models do not just generate existing categories of code more accurately. They expand the category of code they attempt. A model capable of handling a complete feature implementation generates considerably more output per session than one that handles function completion. GitHub’s 2023 Copilot study found a 55.8% speedup on a self-contained implementation task. That speedup applies to more of the task space as capability improves. More code generated per unit time is more code to verify per unit time. The verification demand scales with the generation speed, not against it.
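As a rough illustration of that scaling, verification load can be modeled as generation volume times per-line review cost. All numbers below are invented for the sketch, not taken from the studies above:

```python
# Toy model: verification demand scales with generation volume.
# Every parameter here is an illustrative assumption, not a measured value.

def review_hours(generated_loc: int, review_min_per_loc: float) -> float:
    """Hours of supervisory review implied by a volume of AI-generated code."""
    return generated_loc * review_min_per_loc / 60

completion_loc_per_day = 200   # assumed output of a function-completion assistant
feature_loc_per_day = 1000     # assumed output of a feature-level assistant
cost_per_loc = 0.5             # assumed minutes of careful review per line

print(round(review_hours(completion_loc_per_day, cost_per_loc), 1))  # 1.7
print(round(review_hours(feature_loc_per_day, cost_per_loc), 1))     # 8.3
```

The point of the toy model is only that the review term grows linearly with output: if per-line review cost does not fall as fast as generation volume rises, total supervisory hours go up even as the model gets better.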
Scope Complexity and Supervisory Stakes
Early AI coding assistance was most reliable on well-scoped, high-frequency problem patterns: standard data structures, common library APIs, boilerplate in familiar domains. These have high training data representation and relatively straightforward correctness criteria. The verification cost per suggestion was bounded because the problems were bounded.
As models improve, the competent problem space expands to include code with higher systemic consequences: authentication flows, cross-cutting refactors, database query optimization, security-sensitive input handling. Each of these categories carries higher verification stakes. A mistake in a boilerplate utility function is usually obvious and contained; a mistake in an authorization check can be subtle, pass tests, and survive code review before causing a real incident. Security researchers have documented that AI-assisted developers introduced security vulnerabilities at higher rates than non-assisted developers, with the mechanism being AI omitting defensive checks that experienced engineers write by habit, in code that otherwise looked correct. The issue is not that AI is careless about security in a detectable way. It is that AI expands into security-sensitive problem space without flagging the boundary.
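A hypothetical sketch of that failure mode: both handlers below are syntactically clean and pass the same happy-path test, but the first omits the ownership check an experienced engineer writes by habit. The names and data are invented for illustration:

```python
# Hypothetical document-access handlers. DOCUMENTS maps doc id -> owner.
DOCUMENTS = {"doc-1": "alice", "doc-2": "bob"}

def get_document_unsafe(doc_id: str, current_user: str) -> str:
    """Looks correct: existing document, tidy error handling.
    Missing: any check that current_user owns the document (an IDOR bug)."""
    if doc_id not in DOCUMENTS:
        raise KeyError("not found")
    return f"contents of {doc_id}"

def get_document_safe(doc_id: str, current_user: str) -> str:
    """Same shape, plus the defensive authorization check."""
    if doc_id not in DOCUMENTS:
        raise KeyError("not found")
    if DOCUMENTS[doc_id] != current_user:
        raise PermissionError("forbidden")
    return f"contents of {doc_id}"

# The happy-path test both versions pass:
assert get_document_unsafe("doc-1", "alice") == "contents of doc-1"
assert get_document_safe("doc-1", "alice") == "contents of doc-1"

# The case only the safe version rejects: bob reading alice's document.
print(get_document_unsafe("doc-1", "bob"))  # leaks: contents of doc-1
```

Nothing in the unsafe version looks wrong in isolation; the defect is an absence, which is exactly the kind of error that survives a review focused on what the code does rather than what it fails to do.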
The Harvard Business School and BCG field experiment with consultants using GPT-4 introduced the concept of the jagged frontier: AI produces strong gains on tasks inside its capability boundary and degrades performance on tasks outside it, and users cannot reliably tell which side of the frontier a given task sits on. As capability improves, the jagged frontier moves and expands. Supervisory engineers must maintain an accurate map of a larger, less stable competent zone. The epistemic task of knowing which outputs to trust is not simpler with better models; it is more complex, because the space they confidently cover is larger and harder to characterize.
The Subtlety Inversion
There is a third mechanism that gets less attention. As AI error rates fall, the remaining errors shift in character. A model that was wrong 40% of the time produced many obvious errors: broken logic, wrong return types, missing null checks. Those errors are relatively cheap to catch because they surface quickly in testing or look wrong on inspection. A model wrong 10% of the time produces fewer errors, but those errors have survived the easy filters. They are the edge case semantic bugs, the race conditions in concurrent code, the off-by-one in pagination, the assumption that fails for a specific combination of inputs. These are harder to catch not because the model has regressed, but because better models produce errors that are structurally indistinguishable from correct code.
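The pagination off-by-one is a concrete instance of the class. In this hypothetical sketch (not drawn from any cited study), the buggy page count returns correct results whenever the total is an exact multiple of the page size, so it passes any round-number test and silently drops a trailing partial page in production:

```python
# Hypothetical pagination helpers; the bug is the kind that survives casual review.

def page_count_buggy(total: int, size: int) -> int:
    # Floor division: passes any test where total is a multiple of size.
    return total // size

def page_count_correct(total: int, size: int) -> int:
    # Ceiling division: the final partial page still counts as a page.
    return (total + size - 1) // size

def fetch_all(items, size, page_count):
    """Walk every page; with the buggy count, a trailing partial page is lost."""
    pages = page_count(len(items), size)
    out = []
    for page in range(pages):
        out.extend(items[page * size:(page + 1) * size])
    return out

items = list(range(23))  # 23 items, page size 10 -> pages of 10, 10, 3
print(len(fetch_all(items, 10, page_count_buggy)))    # 20: last 3 items dropped
print(len(fetch_all(items, 10, page_count_correct)))  # 23
```

Both versions are one expression long, both type-check, and both agree on the inputs a reviewer is most likely to try. Distinguishing them requires either reading the arithmetic deliberately or testing the boundary case.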
This is the subtlety inversion: capability improvements reduce the quantity of errors while increasing the average difficulty of detecting those that remain. A 2024 study from Purdue University found AI assistant answers to programming questions incorrect in roughly half of cases, with developers overlooking many of the errors because the answers were articulate and comprehensive. The problem the study documents is that incorrect AI output often reads exactly like correct AI output. As error rates improve, that stylistic indistinguishability does not improve proportionally, because the model has no mechanism for signaling uncertainty. Better models surface fewer obvious errors and more quiet ones.
What the Data Reflects
GitClear’s analysis of 211 million changed lines of code found rising code churn, code written and then substantially revised or reverted within two weeks, correlating with AI tool adoption. That churn is not a signal of AI producing obviously bad code. It is a signal of supervisory verification not keeping pace with generation speed. The errors that produce churn are not the ones that look obviously wrong at review time; they are the ones that pass the verification that actually occurs, then surface later during integration or testing.
The METR 2025 study of experienced open-source developers found that AI assistance made them roughly 19% slower on average, even as participants believed they had been sped up, with high variance across tasks and individuals. That variance is, in part, a function of supervisory quality: engineers with strong domain expertise can evaluate AI output accurately and capture whatever gain is available. Engineers who accept output without sufficient evaluation find that time saved in generation is consumed in later correction. The variance does not narrow as models improve, because supervisory skill differences between engineers are not closed by model improvements.
The Middle Loop Is Not Transitional
Lisanne Bainbridge’s 1983 paper on the ironies of automation observed that operators needed more expertise, not less, as automation became more capable, because the cases that remained in human hands were the hardest ones, the ones automation could not handle. The same dynamic operates in software development, with an additional complication: unlike a process control system with well-defined failure modes, AI coding assistants fail in ways that look like success at the surface level.
Fowler’s observation that model improvements accelerate the shift to supervisory work is consistent with all three mechanisms above. Volume scales with capability. Scope expands into higher-stakes territory as competence grows. Error subtlety increases as obvious errors are eliminated. None of these mechanisms reverse as models improve; they all compound.
This has direct consequences for how the profession should respond. If supervisory engineering were a transitional concern, limited investment would be reasonable: some workflow adjustments, some team norms, some prompting practices, enough to get through until models are reliable enough to reduce verification overhead significantly. The middle loop gets named, and the field waits for it to close.
If supervisory engineering is permanent, and expanding as capability grows, it needs the institutional treatment that permanent disciplines receive: dedicated training, meaningful metrics, career progression frameworks, and explicit space in how engineering time is budgeted. The DORA metrics framework and inner loop tooling from teams like DX Research measure outer and inner loop health, respectively. Neither has instrumentation for supervisory quality, and that gap does not fix itself as models improve.
Vella’s contribution is naming supervisory engineering precisely enough that the field can reason about it. Fowler’s contribution is pointing out that the trajectory runs counter to the reassuring expectation. The harder implication, one that follows from both but is not quite stated by either, is that investing in supervisory skills and infrastructure is not preparation for a temporary period. It is preparation for a future where supervisory demands are higher than today, because the AI generating the code is more capable than today’s.