Annie Vella studied 158 professional software engineers to understand how AI tools were changing their work. Her finding, summarized by Martin Fowler, is that engineers are shifting from creation-oriented tasks to verification-oriented tasks, and that this shift occupies a new cognitive layer between the inner loop of write-test-debug and the outer loop of commit-review-deploy. Fowler calls it the middle loop, and names the work happening there supervisory engineering: directing AI, evaluating its output, and correcting it when wrong.
The naming is useful. But naming a loop is different from knowing how to hire for it, promote people in it, or measure whether someone is doing it well. The more pressing organizational problem is that every standard tool we have for evaluating engineering work is structurally blind to supervisory quality.
What Supervisory Engineering Actually Requires
Vella’s research describes four distinct competencies in supervisory work.
The first is intent specification: decomposing ambiguous requirements into concrete, verifiable tasks in a way that anticipates where ambiguity will cause wrong model assumptions. This is closer to compiler construction than to project management. You are writing input for a system with a specific, imperfect understanding of your domain, and you are trying to pre-empt the ways that imperfect understanding will produce plausible-looking wrong output.
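As a concrete illustration (hypothetical, not drawn from Vella's study), an intent spec can pin ambiguities down as data and attach executable acceptance checks, so that a plausible-but-wrong reading fails mechanically instead of slipping through review:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class TaskSpec:
    goal: str
    pinned_assumptions: list   # ambiguities resolved *before* delegation
    checks: list               # executable acceptance predicates

def verify(spec: TaskSpec, output: Any) -> list:
    """Indices of failed acceptance checks; an empty list means accepted."""
    return [i for i, check in enumerate(spec.checks) if not check(output)]

users = [
    {"email": "A@x.com", "name": "first"},
    {"email": "a@x.com", "name": "dupe"},
    {"email": "b@x.com", "name": "other"},
]

# "Dedupe the user list" is ambiguous: case sensitivity? which record wins?
# The spec decides both, and each decision is backed by a check.
spec = TaskSpec(
    goal="Deduplicate users by email",
    pinned_assumptions=[
        "emails compare case-insensitively",
        "on duplicates, the earliest record wins (stable order)",
    ],
    checks=[
        # no two outputs share a (lowercased) email
        lambda out: len({u["email"].lower() for u in out}) == len(out),
        # earliest-record-wins: the surviving records are these, in order
        lambda out: [u["name"] for u in out] == ["first", "other"],
    ],
)

good = [users[0], users[2]]                  # earliest duplicate kept
plausible_but_wrong = [users[1], users[2]]   # kept the *last* duplicate

print(verify(spec, good))                 # []
print(verify(spec, plausible_but_wrong))  # [1] -- stability check fails
```

The point of the structure is that each pinned assumption corresponds to a check: a model output that resolves the ambiguity differently is rejected by the check rather than by whoever happens to notice.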
The second is output evaluation: checking AI-generated code for semantic correctness, security implications, architectural compatibility, and adherence to invariants not visible in the immediate context. A function may be locally correct and globally wrong. The evaluator has to hold both frames simultaneously.
The third is correction and steering: diagnosing why AI produced wrong output and directing it toward better output. This requires a theory of what the model is doing, not just what correct code should look like. The diagnosis is upstream of the fix.
The fourth is context maintenance: tracking what has been delegated, what constraints apply, how partial outputs fit together. With agentic tools that can touch dozens of files in minutes, this is a genuine cognitive challenge, less like reviewing a PR and more like managing a large parallel implementation effort where you wrote none of the code.
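A minimal sketch of what that tracking might look like as data (all names hypothetical): a ledger of in-flight delegations whose touched-file sets can be intersected to find the places where parallel agent work can silently collide:

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Delegation:
    task: str
    constraints: list                 # invariants this task must preserve
    files_touched: set = field(default_factory=set)
    status: str = "in_flight"         # in_flight | awaiting_review | merged

def conflicts(ledger):
    """Files touched by more than one in-flight delegation: the places
    where partial outputs can silently collide."""
    counts = Counter(f for d in ledger if d.status == "in_flight"
                       for f in d.files_touched)
    return {f for f, n in counts.items() if n > 1}

ledger = [
    Delegation("extract auth middleware", ["keep public API stable"],
               {"auth/middleware.py", "app.py"}),
    Delegation("add request logging", ["no new dependencies"],
               {"logging/setup.py", "app.py"}),
]
print(conflicts(ledger))  # {'app.py'}
```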
None of these map cleanly onto existing engineering skill categories. They are adjacent to design, code review, and technical program management, but distinct from all three. And none of them produce the kinds of artifacts that standard evaluation processes are built to assess.
The Visibility Problem
The output of good supervisory engineering is the absence of defects that would otherwise have been present. That is among the hardest things to measure in any professional domain. Software engineering has always had this problem: senior engineers who prevent problems through architectural decisions and early feedback are systematically undervalued compared to engineers who fix visible problems through heroic debugging. Supervisory engineering makes the visibility problem structurally worse.
Research from Purdue (2024) found that GitHub Copilot produced incorrect code in roughly 40% of test cases, and that developers accepted suggestions uncritically at high rates. The underlying cause is not carelessness. Incorrect AI output is stylistically indistinguishable from correct AI output. Human-written code tends to signal its own uncertainty: defensive edge case handling, comments marking ambiguity, naming that expresses doubt. AI-generated code presents uniform stylistic confidence regardless of semantic correctness.
A developer who catches every subtle error in AI-generated code and a developer who catches none of them produce identical pull requests. The difference between them becomes visible only when production defects are traced back through the review history, and by then the signal is weeks or months behind the behavior that caused it.
This is the dynamic that Lisanne Bainbridge described in “Ironies of Automation” (1983), writing about industrial process control systems rather than software. Her argument was that the more capable an automated system becomes, the more critical human oversight becomes when automation fails, and the less opportunity humans have to practice and maintain the skills that oversight requires. The skills needed for effective supervision are the same skills that direct operation develops. Remove direct operation, and those skills erode.
Aviation confronted this problem directly. FAA analysis of automation dependency found measurable degradation in manual flying skills among pilots who relied heavily on automated systems. The regulatory response included mandatory manual flying requirements and the development of Crew Resource Management as a distinct professional discipline. CRM provides a formal curriculum for maintaining situational awareness while monitoring rather than operating, for recognizing when to override automation, and for the team dynamics of human-machine collaboration. It is, structurally, a training and evaluation framework for supervisory work. Software engineering has produced no equivalent.
What Standard Metrics Cannot See
DORA metrics measure the outer loop: deployment frequency, lead time for changes, change failure rate, mean time to recovery. Developer experience tooling from DX Research and similar efforts measures inner loop friction. Neither captures anything about supervisory quality.
An engineer who accepts AI output without rigorous review looks identical in every standard metric to one who catches every error. Commit frequency is unaffected. PR cycle time may be faster. Deployment frequency goes up. The difference only surfaces when change failure rate rises in production months later.
GitClear’s analysis of 211 million lines of code (2024) found a significant increase in code churn correlating with AI tool adoption: code written and then substantially modified or reverted within two weeks. That is a delayed signal of verification that did not happen at the time of merge. But it is still a lagging indicator. It tells you something went wrong. It cannot tell you whether someone was doing good or poor supervisory work.
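Churn in this sense is mechanical to compute once you have per-line history. A toy version (simplified event format, not GitClear's actual methodology) looks like:

```python
from datetime import datetime, timedelta

def churn_rate(events, window=timedelta(days=14)):
    """Fraction of added lines that are modified or deleted again within
    `window`. Each event is (line_id, event_type, timestamp)."""
    added = {}      # line_id -> time the line was added
    churned = set()
    for line_id, event_type, ts in sorted(events, key=lambda e: e[2]):
        if event_type == "add":
            added[line_id] = ts
        elif line_id in added and ts - added[line_id] <= window:
            churned.add(line_id)    # touched again soon after merge
    return len(churned) / len(added) if added else 0.0

t0 = datetime(2024, 1, 1)
events = [
    ("a", "add", t0), ("b", "add", t0), ("c", "add", t0),
    ("a", "modify", t0 + timedelta(days=3)),    # churn
    ("b", "delete", t0 + timedelta(days=30)),   # outside window, not churn
]
print(churn_rate(events))  # 1 of 3 added lines churned
```

Even this toy version makes the lagging-indicator point concrete: the metric can only fire two weeks after the merge it implicates.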
The organizational consequence of this measurement gap is predictable. Performance evaluation systems reward what they can see. Promotion criteria favor demonstrated skills. If supervisory work is invisible in current tooling, engineers who are exceptionally good at it will not appear exceptional until defects stop appearing in their work over a long enough period. That is a poor signal-to-noise ratio for a six-month review cycle.
What Good Evaluation Might Look Like
The BCG study by Dell’Acqua et al. (2023) on consultants using GPT-4 introduced the concept of the “jagged frontier”: AI assistance produced strong gains on tasks within the model’s capability boundary and meaningful performance degradation on tasks outside it, largely because participants could not reliably identify which side of the boundary they were on. Supervisory skill is, in significant part, calibration about this boundary. Knowing when to trust and when to verify more deeply is a distinct competency from verification itself.
This suggests two dimensions for any evaluation framework. The first is whether someone can catch realistic classes of AI errors given a sample of generated code and a specification. The second is whether they can accurately assess, before generating, which tasks fall within the model’s capability and which require more scrutiny. Both are testable. Neither maps to existing interview formats.
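The calibration dimension in particular can be scored with standard forecasting tools. A sketch, assuming engineers log a pre-generation probability that the model will handle each task without correction, uses the Brier score (lower is better; always guessing 0.5 scores 0.25):

```python
def brier_score(predictions, outcomes):
    """predictions: forecast probabilities in [0, 1] that the generated
    code will be acceptable without correction; outcomes: 1 if it was,
    else 0. Mean squared error between forecast and outcome."""
    return sum((p - o) ** 2
               for p, o in zip(predictions, outcomes)) / len(predictions)

# Two hypothetical engineers facing the same five tasks.
outcomes = [1, 1, 0, 1, 0]
calibrated = brier_score([0.9, 0.8, 0.2, 0.9, 0.1], outcomes)
uniform_trust = brier_score([0.9, 0.9, 0.9, 0.9, 0.9], outcomes)
print(calibrated)      # low: knows where the frontier is
print(uniform_trust)   # high: trusts everything equally
```

The engineer with uniform trust is not worse at verification; they are worse at knowing when verification is needed, which this score isolates.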
Scenario-based assessment for supervisory engineering would present candidates with code containing realistic errors: a security edge case omitted, a subtle off-by-one in a state machine, a function that is locally correct but incompatible with unstated architectural constraints. Evaluating the candidate's response (whether they caught the error, correctly diagnosed its origin, and produced a steered correction) gives a direct signal about supervisory competency. The artifact to evaluate is the reviewer's response to generated code, not the candidate's original output. That makes the assessment less dependent on recall under pressure and more dependent on the domain judgment that supervisory work requires.
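One hypothetical assessment artifact of this kind (an off-by-one in a retry loop rather than a state machine, but the same class of defect) might look like:

```python
# Spec given to the candidate: "call fn at most 3 times before giving up."
# The planted defect: `>` permits a fourth call; `>=` would match the spec.
def call_with_retries(fn, max_retries=3):
    attempts = 0
    while True:
        try:
            return fn()
        except RuntimeError:
            attempts += 1
            if attempts > max_retries:   # off-by-one: allows 4 calls, not 3
                raise

def count_calls(retry_fn):
    """Harness: count how many times a permanently failing fn is invoked."""
    calls = 0
    def flaky():
        nonlocal calls
        calls += 1
        raise RuntimeError("always fails")
    try:
        retry_fn(flaky)
    except RuntimeError:
        pass
    return calls

print(count_calls(call_with_retries))  # 4, not the specified 3
```

The code reads fluently and passes a casual skim; what the assessment measures is whether the candidate checks the loop bound against the spec rather than against the code's own internal plausibility.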
Vella’s research finished in April 2025, before the most recent generation of frontier coding models was widely deployed. Fowler’s observation is that improved models have accelerated the shift toward supervisory work rather than reduced it. More capable AI means more code generated per unit time, which means more to supervise per unit time, which makes supervisory skill more consequential. The field has spent two years building better code generation tools. Building evaluation tools for the work that code generation creates has barely started.