
The Specification Gap at the Heart of Supervisory Engineering

Source: martinfowler

The research from Annie Vella, summarized by Martin Fowler, names something many engineers have been experiencing without a vocabulary for it: a middle loop forming between the inner loop of writing and testing code and the outer loop of commits, review, and deployment. The work that happens there is supervisory engineering: directing AI systems, evaluating what they produce, and correcting it when it is wrong.

The framing captures the structure accurately. But there is a specific challenge inside it that gets less attention than the evaluation side does: to supervise effectively, engineers need to specify what they want before the AI produces code. This sounds unremarkable until you consider that the traditional engineering process is precisely the opposite. Engineers have historically formed their understanding through the act of building, not before it.

How Building Creates Understanding

The inner loop is tight for a reason. When you write a function and immediately run a test, you get information in seconds. The test fails because you forgot to handle a nil pointer, or because your type assumption was wrong, or because the edge case you considered minor turns out to be central. Each failure reshapes the mental model. Across years of inner loop work, this feedback densifies into pattern recognition. Senior engineers do not just know more; they carry a more accurate internal model of how systems behave, built through thousands of tight feedback cycles.

This mode of working is generative in the strict sense. The understanding does not exist complete at the start and then get expressed in code; it forms through the code. You discover requirements by trying to implement them. You discover architectural constraints by hitting them. The inner loop is a learning environment disguised as a productivity tool.

The middle loop inverts this sequence. Before you can direct an AI effectively, you need to know what you want. Before you can evaluate what it produced, you need the standard against which to evaluate. Both precede the code rather than follow from it. The AI does not participate in the generative discovery; it generates from a specification that is already supposed to be there.

Where the Gaps Appear

Consider a practical example. You are adding a feature to a workflow engine that manages state transitions between documents in a review pipeline. In the inner loop model, you would write the state transition handler, run the tests, encounter the failing case for concurrent updates when two reviewers act simultaneously, reconsider the data model, add locking or optimistic concurrency handling, and discover through implementation what the correct behavior should be.

In the middle loop model, you write a prompt describing the feature. The AI produces a state transition handler. It handles the common case correctly. It does not handle concurrent updates, because you did not specify them in the prompt. The code passes the tests you wrote before the feature existed. You review it; it looks correct. Unless concurrent transitions were already in your specification before the prompt, you will not notice the omission. The AI built exactly what was asked. The bug is in what was not asked.
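To make the omission concrete, here is a minimal sketch of the two handlers, assuming an in-memory store and a version counter on each document. The names (Document, ALLOWED, apply_naive, apply_checked, two_reviewers) are hypothetical and chosen only for illustration; optimistic concurrency is one of the options mentioned above, not the only valid design.

```python
from dataclasses import dataclass


class StaleUpdateError(Exception):
    """The reviewer acted on a snapshot that is no longer current."""


@dataclass(frozen=True)
class Document:
    state: str
    version: int = 0


# Hypothetical transition table for the review pipeline.
ALLOWED = {
    ("in_review", "approve"): "approved",
    ("in_review", "reject"): "rejected",
}


def apply_naive(store: dict, doc_id: str, seen: Document, action: str) -> None:
    """What a prompt that never mentions concurrency plausibly yields:
    compute from the snapshot the reviewer saw, then overwrite."""
    store[doc_id] = Document(ALLOWED[(seen.state, action)], seen.version + 1)


def apply_checked(store: dict, doc_id: str, seen: Document, action: str) -> None:
    """Same transition, but it refuses to act on a stale snapshot.
    Assumption: a real store would do this compare-and-write atomically."""
    current = store[doc_id]
    if current.version != seen.version:
        raise StaleUpdateError(
            f"snapshot has version {seen.version}, store has {current.version}"
        )
    store[doc_id] = Document(ALLOWED[(seen.state, action)], seen.version + 1)


def two_reviewers(apply) -> str:
    """Two reviewers load the same snapshot, then act one after the other."""
    store = {"doc-1": Document("in_review")}
    seen_by_a = store["doc-1"]
    seen_by_b = store["doc-1"]
    apply(store, "doc-1", seen_by_a, "approve")
    try:
        apply(store, "doc-1", seen_by_b, "reject")
    except StaleUpdateError:
        pass  # reviewer B would be asked to reload and decide again
    return store["doc-1"].state
```

With apply_naive, two_reviewers ends in "rejected": the second write silently replaces the approval. With apply_checked it ends in "approved" and the second reviewer gets an error. Both handlers pass any test that exercises one reviewer at a time, which is why the difference only shows up if the specification asked for it.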

This failure mode is structurally different from what happens when a human engineer makes the same mistake. When you are writing the code yourself, you encounter the concurrent update problem because you are inside the problem. You feel the gap in the design as you try to implement around it. The inner loop gives you a signal. In the middle loop, the specification gap does not announce itself. It produces plausible-looking output and remains invisible until production reveals it.

This is part of what a Stanford study found when it documented that developers using AI coding assistants were more likely to introduce security vulnerabilities. AI-generated code often omits defensive checks that experienced engineers write habitually. The “habit” is another term for accumulated inner loop experience, tacit specifications that the engineer applies without articulating them. The AI does not have those habits. The engineer needs to specify them explicitly, which requires knowing they are needed before writing the prompt.

Fowler’s January 2026 conversation with Parsons and Joshi on “what vs. how” specification points at the same tension. The engineer’s job in the middle loop shifts toward articulating the “what” precisely enough for the AI to generate the correct “how.” What that framing does not fully surface is that the “what” was traditionally discovered through the process of figuring out the “how.” Specifying the what in advance requires a prior pass of discovery that the inner loop used to do automatically, embedded in the work itself.

Why Experience Changes the Middle Loop

This reframes the well-documented observation that experienced engineers navigate the middle loop more reliably than junior engineers. The usual explanation emphasizes evaluation skill: experienced engineers are better at identifying when AI output is wrong. That is true, but it is downstream of a prior advantage: experienced engineers carry richer implicit specifications.

When a senior engineer writes a prompt, years of inner loop experience encode constraints they do not fully articulate. They are thinking about the current feature while their background knowledge generates a richer picture of what correct means, including edge cases, failure modes, and architectural constraints they have encountered before. Their specification, even when partly implicit, is more complete before the first token of AI output appears.

Junior engineers face a compounding difficulty. Their implicit specifications are thinner because the inner loop work that builds them has barely begun. They do not know what they do not know to specify, which means the AI builds what was asked and leaves gaps the engineer cannot see. This is not a diligence problem; it is a knowledge problem. The generative discovery that would have revealed those gaps in the inner loop no longer happens in the middle loop.

The METR study measuring experienced developers on real open-source tasks found that AI assistance made them roughly 19% slower on average, even though the developers themselves estimated that it had sped them up by about 20%. That gap is plausibly, in part, a specification problem. Tasks where requirements are well-defined in advance are the ones most likely to show gains; tasks where understanding needs to form through implementation are more likely to show smaller or negative gains, because the middle loop cannot substitute for generative discovery.

Specification as a Distinct Practice

The practical implication is that the most consequential work in the middle loop often happens before the first prompt. Thinking through edge cases before the code exists. Writing down invariants explicitly. Identifying failure modes that would surface through implementation and surfacing them upstream instead. This is closer to how a software architect works than how a developer works, specifying the shape of a solution before implementation begins.
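As a rough illustration of writing invariants down before any code exists, the sketch below states a few rules for the review pipeline example as named predicates over a document's state before and after a change. Everything here is an assumption made for illustration: the Doc fields, the TERMINAL set, and in particular the "two approvals" rule, which is a made-up business rule.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Doc:
    state: str
    version: int
    approvals: int


# Hypothetical terminal states for the review pipeline.
TERMINAL = {"approved", "rejected"}

# Each invariant takes (before, after) and returns True when it holds.
Invariant = Callable[[Doc, Doc], bool]

INVARIANTS: Dict[str, Invariant] = {
    "terminal states are final":
        lambda before, after: before.state not in TERMINAL or after == before,
    "version strictly increases on any change":
        lambda before, after: after == before or after.version > before.version,
    # Made-up business rule, included only to show a domain invariant.
    "approved requires at least two approvals":
        lambda before, after: after.state != "approved" or after.approvals >= 2,
}


def violations(before: Doc, after: Doc) -> List[str]:
    """Return the names of all invariants broken by this change."""
    return [name for name, holds in INVARIANTS.items() if not holds(before, after)]
```

For example, violations(Doc("in_review", 3, 1), Doc("approved", 4, 1)) reports the approval rule, while a change that only records a second approval reports nothing. The point of the exercise is less the checker than the list: each entry is a constraint that would otherwise have surfaced only while implementing, and it can now go into the prompt and into review.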

This is not how most engineers have worked, and it requires a different kind of discipline. The inner loop tolerates incomplete specifications because the feedback is tight enough to iterate toward correctness quickly. The middle loop does not offer the same tolerance. Specification gaps compound rather than surface immediately, and the feedback, a subtle bug in production weeks later, is slower and more expensive to address.

Engineering practice has mature vocabulary for the poles of this spectrum. Formal specifications exist in one direction, covering everything from type signatures to TLA+ models of distributed system invariants. At the other end, conversational prompting is the minimum, a rough description of intent that leaves the AI significant latitude. The working space most engineers are in, a middle ground that is more precise than a ticket but less formal than a spec, has no established conventions, no tooling to validate completeness, and no shared vocabulary for describing quality. Prompt engineering occupies this space informally, but its guidance is scattered across blog posts and Discord servers and is not grounded in systematic study of how specification quality affects output quality.

Kief Morris’s work on humans and agents on the Fowler site frames the human role in the middle loop as designing the feedback cycle rather than being inside it: deciding what correctness means before the agent runs, setting checkpoints that make evaluation tractable. That is the architectural mode in practice. It is also a description of work that requires the understanding to be externalized before generation, not formed through it.
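One way to picture a checkpoint of this kind is as a small acceptance gate written before the agent runs. The sketch below is an assumption about what such a gate could look like, not a description of Morris's approach: Case, CHECKPOINT, and run_checkpoint are hypothetical names, and the candidate is assumed to be a function that maps (state, action) to the next state.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Case:
    name: str
    state: str
    action: str
    expected: Optional[str] = None   # expected next state, or
    raises: Optional[type] = None    # expected exception type


# Written before generation: what "correct" means, edge cases included.
CHECKPOINT = [
    Case("reviewer approves", "in_review", "approve", expected="approved"),
    Case("reviewer rejects", "in_review", "reject", expected="rejected"),
    Case("approved is final", "approved", "reject", raises=ValueError),
    Case("unknown action", "in_review", "publish", raises=ValueError),
]


def run_checkpoint(candidate: Callable[[str, str], str], cases: List[Case]) -> List[str]:
    """Run a generated implementation against the cases; return failure messages."""
    failures = []
    for case in cases:
        try:
            result = candidate(case.state, case.action)
        except Exception as exc:
            if case.raises is None or not isinstance(exc, case.raises):
                failures.append(f"{case.name}: unexpected {type(exc).__name__}")
            continue
        if case.raises is not None:
            failures.append(f"{case.name}: expected {case.raises.__name__}, got {result!r}")
        elif result != case.expected:
            failures.append(f"{case.name}: expected {case.expected!r}, got {result!r}")
    return failures


def plausible_candidate(state: str, action: str) -> str:
    """Stand-in for generated code: handles the common case, ignores the rest."""
    table = {("in_review", "approve"): "approved", ("in_review", "reject"): "rejected"}
    return table.get((state, action), state)
```

Running run_checkpoint(plausible_candidate, CHECKPOINT) returns two failures, one for each case the candidate quietly tolerates instead of rejecting. The gate is only as good as the cases in it, which is the same specification problem again, but it moves the decision about correctness to before generation.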

Fowler notes that Vella’s research concluded in April 2025, and that subsequent model improvements have accelerated the shift toward supervisory engineering. Vella’s word for the experience is traumatic. Some of that is the identity question: engineers became engineers because of the generative experience of building. The middle loop offers different satisfactions. But the specification problem is also genuinely hard in a way that the pure evaluation framing does not fully capture. It asks engineers to front-load understanding they are accustomed to discovering in motion, to externalize knowledge that the inner loop kept implicit, and to do this reliably in a domain where the cost of specification gaps is borne later and elsewhere. Building practice and vocabulary for that work is part of what makes the middle loop tractable as a professional discipline, and the field has not made much progress on it yet.
