The Middle Loop Has No Feedback Infrastructure

The evidence is accumulating. Annie Vella studied 158 professional software engineers and found a consistent pattern: time is shifting from creation-oriented work toward verification-oriented work. Martin Fowler picked up her thesis and gave the structural observation a name in his March 2026 fragments post: the middle loop.

The inner loop is the tight cycle of writing, building, and testing. The outer loop is the slower cycle of committing, reviewing, deploying, and observing. The middle loop sits between them: directing AI, evaluating its output, and correcting what is wrong. Vella calls this supervisory engineering work.

That framing is useful. The problem it points to is harder than the naming.

Three Loops, Two With Infrastructure

The inner loop has spent decades accumulating feedback mechanisms. Your editor highlights errors as you type. A well-kept unit test suite runs in seconds. Your debugger shows exactly where execution diverged from expectation. The cognitive contract is tight: do something, see the result, adjust. The SPACE framework from Forsgren and colleagues names efficiency and flow as one of its five dimensions of developer productivity, and fast inner-loop feedback falls squarely within it.

The outer loop has its own infrastructure, more recently built. DORA metrics give teams a vocabulary for outer-loop health: deployment frequency, lead time for changes, change failure rate, time to restore. The Accelerate research showed these metrics predict organizational outcomes. CI systems run your tests against every commit. Code review tools track time-to-merge and comment density. The outer loop is slow by nature, but it has tooling that makes it observable and improvable.

The middle loop has none of this. There is no established metric for supervisory precision, meaning the fraction of flawed AI outputs an engineer correctly identifies as flawed, weighted by severity. There is no equivalent to an SLO governing acceptable rates of AI-introduced bugs reaching review. There is no postmortem format adapted for “the AI generated something plausible but wrong and I did not catch it.”
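To make the first of those missing metrics concrete, here is a minimal sketch of a severity-weighted catch rate. Everything in it is an assumption for illustration: the `FlawedOutput` record, the `SEVERITY_WEIGHTS` values, and the function name are hypothetical, and no established tool or standard defines the metric this way.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical severity weights; a real team would have to calibrate its own.
SEVERITY_WEIGHTS = {"low": 1.0, "medium": 3.0, "high": 10.0}


@dataclass(frozen=True)
class FlawedOutput:
    """One AI output that was later confirmed to be flawed (hypothetical record)."""
    severity: str             # key into SEVERITY_WEIGHTS
    caught_by_engineer: bool  # True if the engineer flagged it before review


def weighted_catch_rate(flaws: List[FlawedOutput]) -> Optional[float]:
    """Severity-weighted share of confirmed flaws that the engineer caught.

    Returns None when there are no confirmed flaws, because the rate is undefined.
    """
    total = sum(SEVERITY_WEIGHTS[f.severity] for f in flaws)
    if total == 0:
        return None
    caught = sum(
        SEVERITY_WEIGHTS[f.severity] for f in flaws if f.caught_by_engineer
    )
    return caught / total


# Illustrative data: three minor flaws caught, one severe flaw missed.
flaws = [
    FlawedOutput("low", True),
    FlawedOutput("low", True),
    FlawedOutput("medium", True),
    FlawedOutput("high", False),
]

unweighted = sum(f.caught_by_engineer for f in flaws) / len(flaws)
print(f"unweighted: {unweighted:.2f}, weighted: {weighted_catch_rate(flaws):.2f}")
```

In this made-up sample the engineer catches three of four flaws, but the one miss is the severe one, so the weighted figure is far lower than the raw count suggests. The hard part is not the arithmetic. The denominator requires knowing which outputs were flawed, and that is usually learned only after something ships.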

You can observe that a new kind of work exists. Measuring whether you are doing it well is a different matter.

What Makes the Middle Loop Different From Review

Vella distinguishes supervisory engineering from conventional code review, and the distinction matters. When you review a colleague’s pull request, you operate with certain assumptions: the author understood the requirements, made considered choices about trade-offs, and can answer questions about why the code works the way it does. Review is adversarial in a mild sense, but it is built on a foundation of shared intent.

Reviewing AI-generated code has different preconditions. The model has no intent, only pattern completion. Code can be syntactically clean, pass the type checker, pass the test suite, and still be subtly wrong in ways that require domain knowledge to catch. The “Jagged Frontier” study, run by Harvard Business School researchers with BCG consultants, found that AI-assisted consultants performed worse than unassisted ones on a task outside the AI’s capability boundary. One reading of that result is that they could not reliably tell which side of the boundary they were on. The output does not tell you whether you are looking at a task the model handles well or one where it is confidently wrong.

This creates the core difficulty of supervisory engineering: the failure mode is invisible until something breaks. A Purdue study found that about half of ChatGPT’s answers to programming questions contained incorrect information, and that participants often failed to notice the errors. A Stanford security study found developers using AI coding assistants were more likely to introduce security vulnerabilities. GitClear’s analysis of 211 million lines of code found elevated churn rates in AI-assisted codebases, a coarse signal that something is getting past review that should not.

These are lagging indicators. By the time churn rates move, the middle-loop failures have already shipped.

The Bainbridge Problem

Lisanne Bainbridge’s 1983 paper “Ironies of Automation” described this dynamic for industrial process control. Her argument: the better an automated system becomes, the less practice human operators get in manual control, and therefore the less competent they become at the moments when the automation fails and manual intervention is needed. The skill automated away is precisely the skill required to catch automation errors.

Software engineering is running the same experiment. The inner loop, where engineering intuition gets built, is the part being automated. The generation effect from cognitive science holds that information you produce yourself is retained more deeply than information you read and evaluate. Writing a function and watching it fail teaches something structurally different from reading a generated function and deciding whether it looks correct. The inner loop was, functionally, the training loop for the judgment that makes the middle loop work.

This creates a compounding problem. Middle-loop supervision quality depends on the inner-loop experience the engineer brings to it. You can catch a subtle concurrency bug in generated code if you have worked through enough concurrency bugs to have an intuition for them. Engineers who move into supervisory engineering before accumulating that foundation are evaluating AI output without the knowledge base that makes evaluation reliable. Better models make this worse rather than better: more plausible output means more output requiring careful evaluation, and more of it passing automated checks.

Fowler notes this directly. The model improvements since Vella’s research concluded in April 2025 have accelerated the shift toward supervisory engineering rather than reversed it.

What SRE Took Twenty Years to Build

Site reliability engineering is structurally similar to supervisory engineering. SREs direct automation, evaluate system behavior, and intervene when the automation produces wrong outcomes. Google started building SRE practices around 2003, and the SRE book did not appear until 2016. Counting the refinement since, the discipline has spent roughly two decades developing its measurement vocabulary: service level objectives, error budgets, toil recognition, blameless postmortems. That vocabulary is what makes SRE improvable as a practice rather than just a job title.

The middle loop is forming without any equivalent infrastructure. There is no principled framework for setting acceptable error rates on AI-generated code. There is no systematic postmortem process for AI-introduced bugs analogous to the blameless postmortem for production incidents. There is no career attribution mechanism for supervisory quality: an engineer who catches every AI error before review and one who catches none look identical by current metrics until change failure rates move months later.

DORA metrics measure the outer loop. The SPACE framework covers developer productivity more broadly, across both inner-loop and outer-loop activity. Neither was designed to capture middle-loop behavior. You could measure middle-loop volume, counting prompts issued and iterations per task, but volume tells you nothing about precision.

What Building This Infrastructure Might Look Like

SRE built its vocabulary by formalizing concepts that were already implicit in how good systems engineers thought about their work. Error budgets made an implicit trade-off between reliability and velocity explicit and negotiable. Toil recognition separated work that could be automated from work that required judgment.
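The error budget idea is simple enough to sketch in a few lines, and the same arithmetic could in principle be pointed at the middle loop. The example below is only an illustration under stated assumptions: the function names, the 95% target, and the notion of an "escaped defect" (an AI-introduced bug that reaches review) are hypothetical, not an established practice.

```python
def error_budget(slo_target: float, total_events: int) -> float:
    """Number of bad events the SLO tolerates: (1 - target) * total events."""
    if not 0.0 <= slo_target < 1.0:
        raise ValueError("slo_target must be in [0, 1)")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the budget still unspent; a negative value means overspent."""
    return 1.0 - bad_events / error_budget(slo_target, total_events)


# Hypothetical middle-loop objective: 95% of AI-assisted changes reach review
# without an escaped defect. Of 200 such changes this quarter, 6 had one.
remaining = budget_remaining(slo_target=0.95, total_events=200, bad_events=6)
print(f"budget remaining: {remaining:.0%}")
```

The value of the budget is less the number than the conversation it forces: a team that has spent most of it has an agreed signal to slow down and review generated code more carefully, and a team with budget to spare has permission to move faster.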

The middle loop needs something similar. A starting point might be tracking AI-introduced bug rates separately from hand-written bug rates in incident postmortems, creating a rough empirical baseline for middle-loop failure. Code review tools could surface when a change is predominantly AI-generated, enabling review checklists calibrated to the actual risk profile of AI output rather than the same checklist applied uniformly. Hiring and performance criteria could include explicit evaluation of supervisory engineering quality, which currently has no standard form.
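A rough version of that baseline could be computed from records a team may already have. The sketch below assumes a hypothetical `Change` record that carries a count of AI-generated lines and a flag set during postmortems; the 0.5 cutoff for "predominantly AI-generated" is likewise an assumption, not a recommendation.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List

AI_MAJORITY_THRESHOLD = 0.5  # hypothetical cutoff for "predominantly AI-generated"


@dataclass(frozen=True)
class Change:
    """A merged change with provenance and postmortem data (hypothetical record)."""
    change_id: str
    ai_lines: int             # lines attributed to AI generation
    total_lines: int          # all lines touched by the change
    linked_to_incident: bool  # set when a postmortem traces an incident here


def origin(change: Change) -> str:
    """Label a change by where most of its lines came from."""
    if change.total_lines == 0:
        return "mostly_human"
    share = change.ai_lines / change.total_lines
    return "mostly_ai" if share > AI_MAJORITY_THRESHOLD else "mostly_human"


def incident_rate_by_origin(changes: List[Change]) -> Dict[str, float]:
    """Share of changes linked to an incident, computed separately per origin."""
    totals = Counter(origin(c) for c in changes)
    incidents = Counter(origin(c) for c in changes if c.linked_to_incident)
    return {label: incidents[label] / totals[label] for label in totals}


# Illustrative data only.
changes = [
    Change("c1", 90, 100, True),
    Change("c2", 80, 100, False),
    Change("c3", 60, 100, False),
    Change("c4", 70, 100, False),
    Change("c5", 10, 100, True),
    Change("c6", 0, 50, False),
    Change("c7", 20, 100, False),
    Change("c8", 50, 100, False),
    Change("c9", 5, 40, False),
]

for label, rate in sorted(incident_rate_by_origin(changes).items()):
    print(f"{label}: {rate:.2f}")
```

The same `origin` label is what a review tool would need in order to pick a checklist. The line counts are the weak point: they require provenance data that most toolchains do not record today.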

None of this is straightforward. Middle-loop failures are harder to attribute than outer-loop failures because the AI’s contribution is typically blended with the engineer’s prompting, context-setting, and editing. The boundary between “the model got it wrong” and “the engineer directed it incorrectly” is not clean.

Addy Osmani’s observation about the “70% problem” points at where the measurement gap lives. AI reaches approximately 70% completion quickly; the remaining work on edge cases, integration, and correctness under real conditions often costs more time than the initial generation saved. That cost is paid in the middle loop, where current metrics cannot see it.

The naming of supervisory engineering is a clarification worth having. The measurement infrastructure that would let teams improve at it is still missing, and history suggests it will take longer to build than the loop itself took to emerge.