
The Middle Loop Has No Feedback Mechanisms

Source: martinfowler

The concept of inner and outer loops in software development has been around for at least a decade. The inner loop is the tight feedback cycle during active coding: write, compile, run, observe, repeat. The outer loop is the broader integration cycle: commit, open a pull request, wait for CI, get reviewed, merge, deploy, monitor. Teams have spent enormous effort compressing both loops, through fast incremental builds, live reload, and hot module replacement on the inner side, and through trunk-based development, fast CI pipelines, and feature flags on the outer.

Annie Vella’s research into how 158 professional software engineers use AI tools, summarized by Martin Fowler in his March 2026 fragments, introduces a third loop sitting between these two. Vella calls the work that happens there “supervisory engineering work.” The framing is worth taking seriously, not just as an academic label but as a practical way to understand what working with AI actually demands, and which of the skills it requires we still have no good way to train or measure.

What the inner/outer loop framing was always about

The inner/outer loop vocabulary entered mainstream engineering culture through developer experience research and tooling discussions. Microsoft’s developer division used it heavily when justifying investments in fast incremental builds, Live Unit Testing in Visual Studio, and later hot reload in .NET. The idea was simple: the faster the inner loop, the more iterations a developer could run per hour, and more iterations meant faster learning and more throughput.

This framing influenced broader DevEx research. The SPACE framework from Nicole Forsgren and colleagues counts efficiency and flow, which is where tooling friction and feedback latency show up, among its dimensions of developer productivity. The outer loop language came from the DevOps movement: the DORA metrics (deployment frequency, lead time for changes, change failure rate, and mean time to restore) all measure properties of the outer loop.

Both loops have established feedback mechanisms. The inner loop: the test goes red or green, the compiler errors, the application crashes or works. The outer loop: CI passes or fails, the review leaves comments, the deploy succeeds or rolls back, the monitor fires or stays quiet. Engineers know what good looks like in each loop and have built extensive tooling and practice around both.

The middle loop has no established feedback mechanisms

What Vella identifies as supervisory engineering work sits in territory the inner/outer loop model never accounted for. When a developer uses a code generation tool to produce a function, they are no longer in the inner loop in the traditional sense: they did not write the code and cannot rely on the cognitive residue of having written it to guide verification. They are also not in the outer loop: no CI pipeline has run, no reviewer has seen it, no one has deployed it.

They are somewhere between, performing a kind of review that is structurally different from both inner-loop debugging and outer-loop code review. The code may look correct. It may pass tests. It may even be correct. It may also contain subtle errors, security holes, or logic that is locally valid but globally wrong, and the person who accepted it has no accumulated understanding of the code that would help them notice.

This is the core problem with the middle loop: it demands active adversarial skepticism toward code that often reads fluently and passes obvious checks. Human-written code, when wrong, tends to be wrong in ways that correlate with the author’s misunderstanding. AI-generated code can be wrong in ways entirely decorrelated from surface fluency. A model can produce syntactically clean, idiomatically written code that quietly misuses an API, mishandles an edge case, or encodes a false assumption that remains invisible until production.
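A small illustration of that decorrelation, offered as a hypothetical example rather than anything drawn from Vella’s study: the function below reads cleanly and passes the obvious spot checks, yet encodes a false assumption about the Gregorian calendar. The function and the chosen years are invented for illustration; `calendar.isleap` from the Python standard library serves as the reference.

```python
import calendar

def is_leap_year(year: int) -> bool:
    """Hypothetical generated code: fluent, documented, and subtly wrong."""
    # False assumption: every fourth year is a leap year.
    return year % 4 == 0

# The obvious spot checks a hurried reviewer might run all pass.
obvious_checks = {2023: False, 2024: True, 2028: True}
passes_obvious = all(is_leap_year(y) == want for y, want in obvious_checks.items())

# Century years expose the error: 1900 and 2100 are not leap years, 2000 is.
disagreements = [y for y in (1900, 2000, 2100) if is_leap_year(y) != calendar.isleap(y)]

print(passes_obvious, disagreements)
```

Nothing in the surface of the code signals the problem. Only a reviewer who already knows the century rule, or who deliberately probes the boundaries, would think to test 1900.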

The automation bias research from aviation and medical imaging is relevant here. Operators who monitor automated systems tend to over-trust outputs that look plausible, precisely because the outputs are usually correct. This calibration problem is well-documented in high-stakes domains and is only beginning to be studied seriously in software engineering contexts. The fluency of modern code generation makes the problem worse, not better.

What supervisory engineering actually requires

The skills Vella groups under supervisory engineering are not individually new, but their combination and the posture they require are. Consider what actually happens when an engineer directs AI to build a feature.

Specification comes first. The engineer must translate a loosely understood requirement into a prompt precise enough that the model produces something close to correct. This is closer to writing a formal test specification than to writing code. The engineer must anticipate failure modes and encode constraints the model has no way to infer from context. Underspecify and the model produces plausible-looking output that misses the point; overspecify and the engineer has essentially written the implementation in a more verbose form.
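One way to make the test-specification analogy concrete is to write the constraints down as executable cases before prompting, then reuse them both as prompt text and as an acceptance check. The sketch below is only an assumption about how a team might do this. The requirement, the `Case` type, and the helper names are all hypothetical, and the two lambdas are stand-ins for model output, not real generations.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Case:
    """One constraint the prompt must state because the model cannot infer it."""
    description: str
    raw: str
    expected: str

# Hypothetical requirement: "normalize usernames before storing them".
SPEC = [
    Case("surrounding whitespace is dropped", "  alice ", "alice"),
    Case("comparison is case-insensitive", "Alice", "alice"),
    Case("inner whitespace is kept, not collapsed", "al ice", "al ice"),
    Case("empty input stays empty rather than raising", "", ""),
]

def unmet(candidate: Callable[[str], str]) -> list[str]:
    """Return the description of every constraint the candidate violates."""
    return [c.description for c in SPEC if candidate(c.raw) != c.expected]

def render_prompt() -> str:
    """Turn the same cases into the constraint list that goes into the prompt."""
    lines = [f"- {c.description}: {c.raw!r} -> {c.expected!r}" for c in SPEC]
    return "Write normalize_username(raw: str) -> str such that:\n" + "\n".join(lines)

# Two stand-ins for model output: one plausible but wrong, one that meets the spec.
plausible = lambda raw: "".join(raw.split()).lower()
conforming = lambda raw: raw.strip().lower()

print(unmet(plausible))   # names the one constraint the plausible version breaks
print(unmet(conforming))  # empty list: every stated constraint holds
```

The trade-off described above is visible here: four cases are cheap to write, but a spec that enumerated every input would amount to the implementation itself.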

Output evaluation comes next. The engineer receives code they did not write and must assess its correctness without authorial context. This requires reading unfamiliar code quickly and forming a judgment about not just what it does but whether what it does matches the intent. It resembles code review, but without knowing the author’s reasoning or being able to ask questions. It also requires knowing enough about the domain and the system to recognize when the model has made a plausible-looking but wrong assumption, which is a high bar.

Steering follows when the output is wrong. The engineer must diagnose the failure in a way that allows useful re-prompting. This diagnostic work is neither inner-loop debugging, where there is a stack trace or failing assertion to interrogate, nor outer-loop process work. It is closer to prompt archaeology: determining what the model understood and did not understand, then formulating a correction that addresses the actual gap rather than just restating the requirement.

The measurement problem

Existing productivity frameworks are poorly suited to measuring middle-loop quality. DORA metrics capture outer-loop outcomes. The traditional inner-loop proxies, commit frequency and lines changed, are being distorted by AI output volumes that have no reliable relationship to engineer throughput. The GitHub Copilot completion acceptance rate measures how often engineers accept suggestions, not whether those suggestions were correct or whether the engineer understood them before accepting.

GitClear’s research into millions of commits found that code churn, the rate at which recently committed code is subsequently reverted or modified, increased significantly in AI-assisted codebases. That is one signal of middle-loop quality failure: code that passed through the middle loop and entered the outer loop but turned out to be wrong enough to require immediate rework. It is a lagging and coarse-grained signal. It tells you the middle loop failed at some point; it tells you nothing about why or how to catch the failure earlier.
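As a rough sketch of what a churn calculation looks like, assuming per-line history is already available: the record type and the 14-day window below are illustrative assumptions, not GitClear’s data format or methodology.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

@dataclass(frozen=True)
class LineRecord:
    """Hypothetical record for one added line; not a GitClear data format."""
    added_on: date
    reworked_on: Optional[date]  # first later revert or modification, if any

def churn_rate(lines: list[LineRecord], window: timedelta = timedelta(days=14)) -> float:
    """Fraction of added lines that were reworked within `window` of being added."""
    if not lines:
        return 0.0
    churned = sum(
        1 for ln in lines
        if ln.reworked_on is not None and ln.reworked_on - ln.added_on <= window
    )
    return churned / len(lines)

history = [
    LineRecord(date(2025, 3, 3), date(2025, 3, 5)),    # reworked after 2 days
    LineRecord(date(2025, 3, 3), date(2025, 4, 20)),   # reworked, but outside the window
    LineRecord(date(2025, 3, 4), None),                # never touched again
    LineRecord(date(2025, 3, 4), date(2025, 3, 18)),   # reworked on day 14, still counted
]

print(churn_rate(history))  # 0.5: two of the four lines churned inside the window
```

The calculation only sees that a line was reworked. It carries no information about whether the original was AI-generated, who reviewed it, or what the reviewer missed, which is why the signal stays coarse.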

The missing measurement is something like “middle-loop precision”: what fraction of AI outputs the engineer correctly identified as wrong before committing them, weighted by the severity of the errors caught versus missed. No team measures this. Very few teams even do informal retrospectives on AI-introduced bugs that ask whether the reviewing engineer had the background to have caught the mistake.
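To make the proposed measure concrete, here is a minimal sketch under stated assumptions: a team keeps a retrospective log of AI outputs later judged defective, with a severity score and a flag for whether the engineer rejected the output before committing. The log format, the 1-to-5 severity scale, and the function name are all hypothetical. As written, the quantity is closer to what information retrieval calls recall than precision; the article’s label is kept for continuity.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReviewedOutput:
    """One AI output that later turned out to be wrong (hypothetical log entry)."""
    severity: int               # assumed scale: 1 = cosmetic ... 5 = security or data loss
    caught_before_commit: bool

def weighted_catch_rate(defective: list[ReviewedOutput]) -> float:
    """Severity-weighted share of defective outputs the engineer rejected in time."""
    total = sum(o.severity for o in defective)
    if total == 0:
        return 1.0  # nothing defective was observed, so nothing was missed
    caught = sum(o.severity for o in defective if o.caught_before_commit)
    return caught / total

log = [
    ReviewedOutput(severity=1, caught_before_commit=True),
    ReviewedOutput(severity=2, caught_before_commit=True),
    ReviewedOutput(severity=5, caught_before_commit=False),  # one severe miss dominates
    ReviewedOutput(severity=2, caught_before_commit=True),
]

print(weighted_catch_rate(log))  # 0.5, although three of four defects were caught
```

The arithmetic is trivial. The hard part is the input: someone has to decide after the fact which outputs were defective and how severe each defect was, and that labeling is exactly what teams currently do not do.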

The skill formation problem

Vella’s research concluded in April 2025, before the current generation of models arrived, and Fowler notes this explicitly, observing that more capable models have only accelerated the shift toward supervisory work. But the question her research raises about skill formation remains unresolved regardless of model capability.

The inner loop was always where foundational engineering intuition formed. Working through a bug, reading a stack trace, running a profiler, understanding why a test failed and then understanding why a fix worked: these are not just productive activities, they are how engineers build the mental models that make them effective at every subsequent task. This is the “germane cognitive load” that learning researchers distinguish from extraneous load, the productive struggle that forms lasting schemas rather than just completing the immediate task.

If AI tools increasingly handle the inner loop, engineers who develop primarily through AI-assisted workflows may complete tasks faster without building those foundational mental models. The output looks right, the PR merges, the feature ships. But the engineer has not developed the debugging intuition or systems understanding that makes a senior engineer capable of the middle-loop and outer-loop judgment that matters at scale.

This concern is already surfacing in teams: juniors who can generate a working solution but cannot explain it, who struggle to diagnose it when it breaks in production, and who cannot make a calibrated judgment about whether to trust a given AI output in a domain where they have limited experience. Supervisory engineering requires precisely the kind of foundation that AI-first workflows, when used without deliberate counterbalancing, may fail to build.

What this points toward

The middle loop will become a more explicit part of how engineering teams structure work and evaluate competence. More capable models do not eliminate the middle loop; they may deepen the demands it places on engineers. A model that produces more convincing wrong answers places a higher burden on the engineer’s verification abilities, not a lower one. Calibrated skepticism toward fluent output is not a skill that improves automatically as model capability improves.

Teams that want to develop middle-loop capability deliberately will need practices analogous to what test-driven development did for inner-loop discipline. Candidates include specification review before AI generation, systematic output auditing, and blameless post-mortems on AI-introduced bugs that ask not just whether the review was careful enough but whether the reviewer had the background to perform it competently in the first place.

The inner loop will not disappear. Engineers who maintain direct hands-on engagement with code will still debug, still run tests, still form mental models through direct work. But the allocation of time is shifting in the way Vella’s research documents, and the skills the middle loop demands are real, specific, and currently going unmeasured in most organizations. Naming the loop is a useful starting point. Building feedback mechanisms for it is the harder and more urgent work.
