Supervisory Engineering Is Not Softer Work

Source: martinfowler

The inner loop and outer loop framing has been part of how developers talk about their work for years. The inner loop is the tight cycle of edit, build, test that a developer runs locally, sometimes dozens of times per hour. The outer loop is the integration layer: commit, pull request, CI pipeline, review, deploy, observe. The two loops have different cadences, different failure modes, and different feedback densities.

Martin Fowler’s recent fragment picks up a research thread from Annie Vella, who studied 158 professional software engineers to understand how AI tools were changing where they spent their time. Her finding was a shift from creation-oriented to verification-oriented work, which she frames as “supervisory engineering”: the effort of directing AI, evaluating its output, and correcting it when it’s wrong. Fowler extends this into a structural claim. There is a new loop forming between the inner and outer loops, a middle loop where engineers manage AI doing the work that used to be the inner loop itself. That framing is worth sitting with, because it carries real implications for what the job is, what skills it requires, and how you get better at it.

What the inner loop actually was

The inner loop was never just typing code. It was the place where engineering judgment accumulated through tight feedback. When you write a function and immediately run the tests, you get information: your mental model was right or it wasn’t, the type signature works or it doesn’t, the edge case you forgot about shows up immediately. Do this for years, and the feedback shapes your intuitions about what code tends to work, what tends to break, and where complexity hides.

The loop had to be tight to be formative. The compiler error you get in thirty seconds teaches you something different than the CI failure you see thirty minutes later. Immediate feedback loops are how experts build the pattern recognition that makes them fast. The developer inner loop concept that Microsoft and others have written about extensively is not just a productivity metric; it is a description of where expertise gets built.

If AI is automating the inner loop, engineers who rely on it are opting out of that feedback. The code still gets written. The tests still run. But the engineer is not in the loop; they are outside it, reading a summary of what happened.

What supervisory engineering actually requires

Vella’s framing of “supervisory engineering” as directing, evaluating, and correcting AI output maps cleanly onto what you see in practice with tools like Cursor, GitHub Copilot, Claude Code, or Aider. The experience is something like: you describe what you want, the tool produces code, you read it, you decide whether it’s right.

That sounds easier. It often is, for straightforward things. But evaluation is a specific skill, distinct from creation. Reading code for correctness requires knowing what correct looks like, which is a function of having written a lot of code, read a lot of code, and seen a lot of ways code can be wrong without looking wrong.
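As a small illustration of code that is wrong without looking wrong, consider the sketch below. Both function names are hypothetical and the "generated" version is invented for this example, not taken from any real tool's output. It reads cleanly and passes a test whose input length happens to be a multiple of the chunk size, which is exactly the kind of test a hurried reviewer might write.

```python
def chunk_generated(items: list, size: int) -> list:
    """Hypothetical AI-produced helper: split items into chunks of `size`.

    Looks reasonable, but the range stops early, so any trailing
    partial chunk is silently dropped.
    """
    return [items[i:i + size] for i in range(0, len(items) - size + 1, size)]


def chunk_reviewed(items: list, size: int) -> list:
    """Version after review: iterate over every start index up to len(items)."""
    return [items[i:i + size] for i in range(0, len(items), size)]


if __name__ == "__main__":
    even = list(range(6))
    odd = list(range(7))

    # On the "happy path" the two versions agree.
    print(chunk_generated(even, 3) == chunk_reviewed(even, 3))  # True

    # With a leftover element, the generated version loses data quietly.
    print(chunk_generated(odd, 3))  # [[0, 1, 2], [3, 4, 5]]
    print(chunk_reviewed(odd, 3))   # [[0, 1, 2], [3, 4, 5], [6]]
```

Nothing here raises an error or fails a type check. Catching it depends on the reader already knowing that slicing loops tend to go wrong at the boundary, which is knowledge that usually comes from having written and debugged such loops.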

The risk is circular. If engineers stop doing the inner loop work, they accumulate less of the knowledge that evaluation depends on. The supervisory role becomes harder the less you understand what the AI is doing. You can catch obvious errors; you miss the subtle ones. This is not hypothetical. It is the same dynamic that plays out in any domain where automation replaces skilled practice: the operator skill required to oversee the automation slowly degrades because the practice that built it is no longer happening. Aviation researchers have documented this as automation bias and skill decay in pilots; there is no reason to think software engineering is immune.

There is also the direction problem. Describing what you want to an AI, precisely enough that it produces the right thing, is not a trivial skill. It requires understanding the problem deeply enough to specify it, which is most of the engineering work anyway. The difference is that the specification is now in natural language rather than code, and the feedback on a bad specification comes later and is harder to interpret. A bad function signature fails immediately at compile time. A bad prompt produces plausible-looking code that fails in production, or that never fails at all and just does the wrong thing quietly.
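A minimal sketch of how an underspecified request can turn into quietly wrong behavior, assuming a hypothetical prompt along the lines of "round the score to the nearest whole number". The prompt, the function names, and the intended tie-breaking rule are all assumptions made up for this example. The first function is a plausible response; the second encodes what the requester might actually have meant.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_score_generated(value: float) -> int:
    # Plausible answer to the vague prompt. Python's built-in round()
    # sends exact ties to the nearest even integer.
    return round(value)


def round_score_intended(value: float) -> int:
    # Assumed intent (hypothetical): exact ties round away from zero,
    # the rule most people learn in school.
    return int(Decimal(str(value)).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


if __name__ == "__main__":
    for score in (2.4, 2.5, 2.6, 3.5):
        print(score, round_score_generated(score), round_score_intended(score))
```

The two functions agree on 2.4, 2.6, and 3.5 and disagree only on 2.5. Neither is a bug in the usual sense; the prompt never said how ties should be handled, and nothing in the toolchain flags the gap.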

The middle loop needs its own infrastructure

One reason the inner loop was effective as a learning environment is that it had dense feedback infrastructure. Compilers, type checkers, test runners, linters, profilers: all of them were built to close the loop quickly and specifically. You make a change, something tells you whether it was right.

The middle loop has almost none of this. Evaluating AI output is largely manual. You read the code. You run the tests if there are tests. You think about whether the approach makes sense. There is no tool that tells you whether your prompt was well-specified, whether the AI chose the right abstraction, or whether the code it produced will cause problems at scale three months from now.

This matters for how engineers learn in the new environment. If the inner loop built expertise through high-density feedback, and the middle loop has low-density feedback, then engineers working primarily in the middle loop will accumulate expertise more slowly and in less targeted ways. The question of how to build good feedback infrastructure for the middle loop is one that the tooling ecosystem has not seriously addressed yet. There are emerging evaluation frameworks like LangSmith and internal “evals” pipelines at AI labs, but these are aimed at evaluating AI systems, not at helping engineers evaluate their own supervisory decisions.

Code review existed to catch what tests missed, and review culture developed conventions over decades: what to look for, how to communicate it, what quality standards matter. The middle loop needs analogous conventions and tools for evaluating AI output, and they do not exist in mature form. The closest thing is prompt engineering lore, which is largely informal, scattered across blog posts and Discord servers, and not grounded in any systematic study of what makes AI-directed engineering go wrong.

The outer loop is also under pressure

The outer loop includes commit, review, CI/CD, deploy, and observe. AI is starting to affect these too, though less completely than the inner loop. Code review is the most obvious pressure point. If AI can generate large volumes of plausible code quickly, the review queue grows. Reviewers face more code per unit time, some of it AI-generated in ways that are syntactically correct but architecturally questionable. The conventions that made review tractable, around PR size, scope, and description, were calibrated for human creation rates.

There is also the observability question. The outer loop’s observe step is where you learn whether your software actually behaves as intended in production. This is another high-density feedback environment, and it is one where AI assistance remains thin. The engineers who understand production observability well are the ones who can connect runtime behavior back to code decisions, which requires understanding the code at a level of depth that comes from having written it, or something like it.

A role under reconstruction

Fowler notes that Vella’s research concluded in April 2025, before recent model improvements, and that his sense is the shift toward supervisory engineering has only accelerated since. That seems right. The capability curve for AI coding tools has been steep enough that practices which were stable eighteen months ago are already changing.

What that means for the role is genuinely unsettled. Supervisory engineering is not a lower-skill version of the old job. Directing, evaluating, and correcting AI output well requires deep knowledge of what correct looks like, which is the product of the very practice AI is now replacing. The role is differently demanding, and it requires building new skills while maintaining old ones that are now practiced less frequently.

Whether the middle loop solidifies as a permanent structure or turns out to be a transitional state as models improve further is not clear. If models get good enough to perform their own code reviews and interpret their own production failures, the loop model will need revision again. For now, the middle loop is where a growing fraction of engineering work actually happens, and it deserves the same attention to tooling, feedback infrastructure, and deliberate learning that the inner and outer loops received over the past thirty years.
