Annie Vella’s research, summarized in Martin Fowler’s March 16 fragment, studied 158 professional software engineers using AI tools and found a consistent pattern: they’re shifting from creation-oriented work to verification-oriented work. Fowler proposes structural vocabulary for it, a middle loop sitting between the inner loop of write/build/test and the outer loop of commit/review/deploy. The middle loop is where you direct AI, evaluate its output, and correct what’s wrong.
The concept is useful, and the name helps. There’s something embedded in the finding, though, that gets less attention than the loop framing itself: the quality of supervisory engineering work is entirely determined by the quality of the inner loop experience the engineer brings to it.
The inner and outer loops have measurement frameworks
The inner and outer loop vocabulary entered mainstream engineering culture through developer experience research. Microsoft’s tooling documentation formalized the inner loop as the fast, local write/build/run cycle. The SPACE framework from Nicole Forsgren and colleagues made inner loop feedback latency a measurable dimension of developer performance. DORA metrics gave the outer loop concrete targets: deployment frequency, lead time for changes, change failure rate, and mean time to recovery, all linked through substantial research to organizational outcomes.
The inner loop has build times, test latency, IDE responsiveness. The outer loop has deployment frequency and change failure rate. The middle loop has nothing equivalent, and that gap matters because the middle loop is where a growing proportion of consequential engineering decision-making now happens. When you accept an AI-generated function, you’re making an evaluation call that propagates through code review, CI, and production. When you re-prompt because something looks wrong, you’re exercising judgment. Neither action is currently measured, tracked, or structured around feedback.
Evaluation has its own cognitive demands
The shift Vella documents, from creation to verification, sounds like a reduction in cognitive load. It is closer to a substitution: the demands of evaluation are different in kind from those of generation. Generating code involves synthesizing requirements, navigating design trade-offs, and resolving ambiguity in the act of writing. Evaluating AI-generated code involves holding intent in your head, reading something stylistically plausible, and determining whether it’s semantically correct. Evaluation also has a failure mode that generation does not: automation bias.
Automation bias is documented across human factors research in aviation and medicine. The core finding is that operators under-scrutinize plausible automated output. A 2012 study in Human Factors by Casner, Geven, and Williams on airline crews showed measurable bias toward accepting automated system outputs even when manual verification would have caught errors. The outputs looked correct; the operators skipped the check.
AI-generated code operates under the same dynamic. The code is syntactically correct, stylistically plausible, and often structurally sound at a surface level. The errors tend to be semantic: wrong API behavior assumed, edge case missed, permission check omitted. A Purdue University study of GitHub Copilot suggestions found roughly 40% contained errors, and developers accepted them at high rates. The acceptance rate is not uniform across engineers, and nothing in current tooling makes that visible.
The Bainbridge parallel
Lisanne Bainbridge’s 1983 paper “Ironies of Automation” documented a structural problem that appears across domains where automation handles the common case: the more reliable an automated system, the less practice its operators get, and the less capable they become when the automation fails and manual control is needed. She was writing about industrial process control; the pattern generalizes to aviation, medicine, and now software development.
Aviation addressed the problem through structured practice: mandated manual flying time and Crew Resource Management as a formal discipline for managing the human/automation interface. FAA guidance on automation policy reflects decades of empirical data on skill degradation with autopilot reliance. The response was to build infrastructure around maintaining the skills that automation was eroding.
Software development is in the early stages of the same dynamic. AI handles code generation, the build/test cycle, and increasingly debugging. Engineers who previously maintained those skills through daily practice now supervise the process instead. The skills required to evaluate AI output reliably (knowing what correct code looks like, recognizing subtle API misuse, catching omitted edge cases) are the same skills that daily code generation was building.
What this looks like in practice
When I use AI assistance on bot code, the speed of my evaluation tracks closely with how well I know the domain. Discord’s slash command routing, rate limiting, permission models: I can scan generated code for problems in those areas quickly because I’ve written that code by hand many times. In areas where I’m less experienced, my evaluation is slower and my confidence about accepting suggestions is lower. The AI’s output looks equally plausible in both cases. My ability to catch errors is not symmetric.
This is the skills inversion embedded in Vella’s finding. AI coding tools deliver the largest productivity gains to engineers who already have strong domain knowledge, because those engineers evaluate AI output accurately and quickly. Engineers earlier in their careers, still building the domain knowledge that makes evaluation reliable, get less benefit from generation speed and face greater exposure to automation bias. The tools are most powerful for the engineers who need them least.
For teams thinking about how to develop engineers in an AI-assisted environment, this matters. The inner loop, with all its friction and manual work, was also where domain expertise accumulated through repetition. Supervisory engineering work requires that expertise to function well. If the inner loop practice that builds it is increasingly handled by AI, the middle loop loses its foundation.
Where the framing leaves off
Vella’s research finished in April 2025. Fowler notes that model improvements since then have accelerated the shift to supervisory work rather than reversing it. The middle loop is expanding faster than the frameworks for managing it are developing.
The inner loop accumulated build tools, linters, test runners, and eventually measurement frameworks through SPACE and DX research. The outer loop got CI/CD infrastructure, then DORA. The middle loop currently has no equivalent. The tooling that would help, something that tracks evaluation quality over time or connects prompt decisions to downstream defect rates, is not part of any mainstream developer toolchain today.
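To make the missing measurement concrete, here is a minimal sketch of what a middle loop metric might compute. Everything in it is hypothetical: the SuggestionEvent schema, the field names, and the middle_loop_summary function are illustrations, not an existing tool or anything Vella or Fowler propose. It also assumes the hard part is already solved, namely that an accepted suggestion can be linked back to a later defect.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SuggestionEvent:
    """One AI suggestion an engineer reviewed (hypothetical schema)."""
    engineer: str
    accepted: bool
    # Assumption: some downstream process can attribute a defect
    # to the accepted suggestion. No mainstream tool does this today.
    caused_defect: bool = False


def middle_loop_summary(events):
    """Per engineer: acceptance rate, and defect rate among accepted suggestions."""
    totals = defaultdict(lambda: {"seen": 0, "accepted": 0, "defects": 0})
    for event in events:
        t = totals[event.engineer]
        t["seen"] += 1
        if event.accepted:
            t["accepted"] += 1
            if event.caused_defect:
                t["defects"] += 1

    summary = {}
    for engineer, t in totals.items():
        accepted = t["accepted"]
        summary[engineer] = {
            "acceptance_rate": accepted / t["seen"],
            # None when nothing was accepted, so the rate is undefined.
            "defect_rate_of_accepted": t["defects"] / accepted if accepted else None,
        }
    return summary
```

Neither number means much on its own. A high acceptance rate paired with a low defect rate suggests accurate evaluation, while the same acceptance rate paired with a high defect rate is what automation bias would look like in the data.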
That’s the gap Vella’s research surfaces, even if the framing doesn’t quite reach it. Supervisory engineering is a loop with its own failure modes, its own skill prerequisites, and currently no feedback infrastructure to help practitioners improve at it. The name for the problem is new. The structural pattern, a system relying on skills that its own tools are quietly eroding, is not.