The concept of the inner and outer loop has been a useful mental model for software development teams for years. The inner loop covers the tight cycle of writing code, running tests, and debugging: work that happens in the span of minutes or hours. The outer loop covers commit, code review, CI/CD, deployment, and production observation: a cycle measured in hours or days. Between these two loops there has always been a gap, but it was narrow enough that engineers rarely needed to name it.
Annie Vella’s research into how 158 professional software engineers interact with AI, discussed in a recent Martin Fowler fragment, names what has been growing in that gap: supervisory engineering. Her study found that engineers are shifting from creation-oriented tasks to verification-oriented tasks, but not the kind of verification that looks like code review or QA. It is the effort of directing an AI system, evaluating what it produces, and correcting it when it goes wrong.
Fowler frames this as a possible middle loop, a new layer between the inner and outer loops where engineers supervise AI doing what they used to do by hand. This framing is clarifying, but it understates the core challenge. The real problem with supervisory engineering is not the volume of work in the middle loop. It is the nature of the competence that verification requires.
Creation and verification are not symmetric skills
There is a temptation to think of supervisory engineering as a lighter form of engineering, a kind of quality control role layered on top of AI output. The evidence does not support that reading.
Research in cognitive science on expert performance consistently shows that the ability to recognize quality output in a domain is built through extensive production in that domain. Gary Klein’s work on recognition-primed decision making describes how experts evaluate quickly and accurately not because they have learned a separate evaluation skill, but because their production experience has built a dense network of patterns they can match against.
In software engineering, this shows up as the difference between how a senior engineer and a junior engineer read code. The senior engineer does not check each line methodically; they notice things that feel wrong, inconsistencies that signal deeper problems, patterns that suggest the author did not understand the constraints. This ability comes from years of writing, debugging, and running code, not from years of reading it.
Supervisory engineering asks engineers to apply this pattern recognition to AI output at volume and at pace. But if the inner loop work that builds that pattern library is increasingly being done by AI, then the question becomes: how does the next generation of engineers develop the competence to supervise output they have never learned to produce themselves?
The historical parallel is instructive
This tension has appeared before in the profession’s history. The rise of CASE tools in the 1980s and 1990s promised to automate significant portions of software design and code generation. They failed partly for technical reasons, but also because the engineers using them could not evaluate the output without understanding what the tools were doing internally. The gap between what the tool could generate and what engineers could verify became the limiting factor.
Low-code platforms in the 2010s faced a related problem. They were productive for people who already knew software well enough to recognize the platforms’ limitations, and brittle in the hands of people who did not have that foundation.
Automated testing presents a more successful case. When continuous integration and automated test suites became standard in the early 2000s, they did change where engineers spent their time. But writing tests well requires understanding the code under test deeply, and reading test failures requires understanding what the test was actually checking versus what it appears to be checking. Automation shifted the inner loop but did not eliminate the need for the underlying knowledge; it redirected it.
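The gap between what a test appears to check and what it actually pins down can be made concrete with a small, invented example (the `apply_discount` function and both tests are hypothetical, not drawn from any real codebase):

```python
import unittest

def apply_discount(price, rate):
    """Subtract rate% from price."""
    return price - price * rate / 100

class TestDiscount(unittest.TestCase):
    def test_discount_applied(self):
        # Appears to verify that a 10% discount works, but only
        # asserts the result got smaller -- a 1% or a 99% discount
        # would pass just as easily.
        self.assertLess(apply_discount(100, 10), 100)

    def test_discount_exact(self):
        # Actually pins the behavior down.
        self.assertEqual(apply_discount(100, 10), 90)
```

Reading a green run of the first test as evidence that the discount logic is correct requires knowing what `assertLess` does and does not establish, and that knowledge comes from having written such tests, not only from watching them pass.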
The question with current AI systems is whether supervisory engineering will follow the automated testing model, where underlying knowledge is redirected and preserved, or the CASE tools model, where the gap between what the system produces and what engineers can evaluate grows until the tool itself becomes the liability.
What the middle loop actually demands
If supervisory engineering is going to function as a sustainable practice, it needs to be understood as a skilled discipline with specific demands, not as a managerial overlay on top of coding.
Directing AI effectively requires knowing enough about the problem space to decompose it into tasks the AI can handle, and to recognize when the AI is solving the wrong problem. This is not prompt engineering in the narrow sense of crafting phrasing. It is closer to the work of a senior engineer breaking down a feature for a team: knowing which parts are well-understood and which need investigation, which constraints are hard and which are assumptions worth questioning.
Evaluating AI output requires reading code at a different level than an author reads their own. The engineer reviewing AI-generated code cannot reliably ask what the system was thinking, and must assess correctness, security implications, performance characteristics, and maintainability without the context that usually accompanies human-authored code. AI-generated code tends to look locally plausible, so the patterns that would normally flag problematic code at a glance may not trigger until the reviewer examines it more deeply.
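What "locally plausible" means in practice: every line reads fine in isolation, and the defect only surfaces with knowledge of how the language actually behaves. A hypothetical illustration (the function below is invented, not taken from any model's output):

```python
# Each line here is individually reasonable; the bug is structural.
def collect_tags(item, tags=[]):
    """Normalize an item name and append it to a tag list."""
    tags.append(item.lower())
    return tags

first = collect_tags("Alpha")
second = collect_tags("Beta")
# second is ['alpha', 'beta']: the default list is created once at
# function definition and silently shared across every call.
```

A reviewer who has never debugged this class of failure has no pattern to match against; nothing in the code triggers suspicion at a glance.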
Correcting AI output is the most underappreciated part of the middle loop. When AI-generated code is wrong, directly editing the output is often not the most productive path. Changing the input (the prompt, the context, the scaffolding, or the constraints) until the system produces something closer to correct tends to work better. This requires a working theory of why the AI produced the wrong output in the first place, which requires understanding the system's failure modes. Those failure modes are not always predictable or consistent, which makes this harder than debugging deterministic code.
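That input-side correction can be sketched as a loop that regenerates with tightened constraints instead of patching the output by hand. Everything here is illustrative: `generate` and `passes_checks` are stand-ins for a model call and a verification step (tests, linters, review heuristics), not real APIs.

```python
def supervise(task, generate, passes_checks, max_rounds=3):
    """Regenerate with accumulated constraints rather than hand-editing output."""
    constraints = []
    for _ in range(max_rounds):
        output = generate(task, constraints)
        ok, failure = passes_checks(output)
        if ok:
            return output
        # Encode the supervisor's theory of the failure as an
        # explicit constraint on the next attempt.
        constraints.append(f"avoid: {failure}")
    raise RuntimeError("supervision failed; drop into the inner loop")

# Toy stand-ins: this "model" succeeds once two constraints accumulate.
def fake_generate(task, constraints):
    return f"{task} ({len(constraints)} constraints)"

def fake_checks(output):
    return "2 constraints" in output, "wrong edge case"

result = supervise("parse dates", fake_generate, fake_checks)
```

The loop's value lives in the line that appends a constraint: it forces the supervisor to articulate why the output was wrong, which is exactly the failure-mode knowledge the surrounding text describes.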
The acceleration problem
Vella’s research concluded in April 2025, before several significant model capability improvements. The underlying shift she documented has very likely continued, which means the middle loop is growing faster than the profession’s understanding of how to work well within it.
The practical risk is that teams optimize for throughput: AI generates more code, engineers review more code per day, and the quality bar for each individual review drops as volume increases. Under this pressure, supervisory engineering degrades into rubber-stamping. The middle loop exists on paper but does not function as a genuine quality gate.
The structural risk is longer term. If engineers spend years in the middle loop without time in the inner loop, they will progressively lose the production experience that makes their supervision valuable. Senior engineers today can evaluate AI output effectively because they have years or decades of inner loop experience to draw on. Junior engineers joining the workforce now may spend their formative years in supervisory roles instead of in the inner loop work that would have built that experience. Whether they will develop equivalent competence through supervision alone is an open empirical question, and the profession has not answered it yet.
This is not an argument against AI-assisted development. The productivity gains are real and the trajectory is not reversible. It is an argument against assuming that supervisory competence transfers automatically from senior engineers or emerges organically from supervisory work itself; it has to be developed and maintained deliberately.
Naming it is not solving it
The value of Vella’s research and Fowler’s framing is that they give the profession a vocabulary to discuss what is happening. Supervisory engineering as a concept makes the middle loop visible, which is a precondition for thinking carefully about it.
But naming a new loop does not resolve the tension at its center. Supervision that works requires deep production knowledge. Deep production knowledge requires practice in the inner loop. If the inner loop is being automated away, the profession needs a concrete account of how supervisory engineers develop and maintain the competence that makes their supervision worth having.
That account does not yet exist in any systematic form. The research on developer experience with AI tools has focused heavily on throughput and satisfaction metrics, not on long-term skill development or the quality characteristics of supervised output. Building a clear-eyed understanding of what the middle loop demands, and how to develop the people who work in it, is probably the most important engineering education problem of the next several years.