Annie Vella studied 158 professional software engineers to understand how AI tools were changing their work. Her research, summarized by Martin Fowler, offers a useful framing: a middle loop is forming between the inner loop (write, build, test) and the outer loop (commit, review, deploy, observe). She calls the work that happens there supervisory engineering: directing AI, evaluating its output, and correcting it when wrong.
The framing names something real. But it also describes a practice that engineers in one specialization have been doing for years under a different name, and that specialization developed feedback mechanisms and practice vocabulary that the middle loop will need to develop independently or borrow.
Site reliability engineering is supervisory engineering over production systems. SREs do not write the application code they run. They write automation that manages applications, monitor metrics to evaluate whether systems are behaving correctly, and intervene when they go wrong. Directing a system, evaluating its output, correcting it when wrong: Vella’s definition fits almost exactly, applied to production infrastructure rather than code generation.
This parallel is worth taking seriously not because SRE and AI-assisted coding have identical problems, but because SRE had to build a practice vocabulary and feedback infrastructure for supervisory work from scratch. That process took well over a decade, and what it produced is specific, teachable, and potentially transferable.
What SRE Built
Google introduced the SRE discipline around 2003. The first SRE book from Google appeared in 2016. Between those two points, the discipline developed a set of concepts that made supervisory work on systems tractable.
Service Level Objectives are the most important. An SLO is a formally stated threshold for acceptable error, not a target for perfection. A system with a 99.9% availability SLO can be down 0.1% of the time and still be meeting it. Having a stated threshold makes supervisory decisions answerable. The engineer on call can ask “are we within our error budget?” rather than “is this good enough?”, and the first question has an observable answer.
The error budget follows from the SLO: if you commit to 99.9% availability, you have approximately 43 minutes per month of allowed downtime. That budget is spent as incidents occur, its remaining balance is observable, and it creates a signal for supervisory decisions. When the budget is nearly exhausted, you become more conservative. When it is flush, you can take more deployment risk. The budget is not just a metric; it is a feedback mechanism that shapes behavior.
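The arithmetic behind that figure is short enough to write down. The sketch below assumes a 30-day month and an availability SLO expressed as a fraction; the function names are illustrative and do not come from any SRE tooling.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return 1.0 - downtime_minutes / budget


if __name__ == "__main__":
    # 0.1% of a 30-day month is about 43.2 minutes.
    print(round(error_budget_minutes(0.999), 1))
    # After 30 minutes of downtime, roughly 31% of the budget is left.
    print(round(budget_remaining(0.999, 30.0), 2))
```

A remaining fraction near zero is the "be more conservative" signal; a value close to one is the "room for deployment risk" signal.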
The postmortem is SRE’s retrospective mechanism. When a system failure reaches production, the postmortem documents what happened, why monitoring did not catch it earlier, what the supervisory failure was, and what specific changes reduce the probability of recurrence. The process is blameless by design, aimed at improving the system rather than attributing fault, which makes postmortems honest rather than defensive.
Toil recognition is the third key concept. SRE distinguishes manual, repetitive, automatable work from engineering work that scales. The framework gives SREs a principled way to decide which supervisory tasks should be automated and which require human judgment, preventing both under-automation (too much toil) and over-automation (human judgment eliminated where it matters).
The Middle Loop Has None of This
The middle loop has no equivalent to an SLO. There is no stated acceptable rate for AI-introduced bugs per sprint, no error budget for supervisory failures, no observable threshold that tells you whether your AI-assisted workflow is performing within acceptable parameters. The engineer directing AI has no way to answer “are we within our error budget?” because no error budget has been defined.
The middle loop has no systematic postmortem process. When an AI-introduced bug reaches production, the investigation typically reconstructs what the AI did and why the review missed it. There is no standard framework for that investigation, no shared vocabulary for failure modes, and no systematic documentation of patterns that would help the next engineer recognize the same failure mode before it ships. The inner loop accumulated this kind of knowledge through code review culture; the outer loop accumulated it through incident retrospectives. The middle loop has not built either equivalent.
The middle loop also lacks a framework for deciding which supervisory checks to automate and which to keep as human judgment. There are emerging tools for automated code review and AI output evaluation, but no principled approach to deciding where automation stops and human oversight begins. SRE would call this the toil question. The middle loop has not named it yet.
Why the Signal Problem Is Different
SRE’s feedback mechanisms work because production systems produce natural signals. Latency, error rate, throughput: these are observable because production traffic is real and continuous. The SLO is checkable because the system either served requests within the latency target or it did not.
The middle loop’s quality signals are harder to measure. Whether AI-generated code is correct depends on the complete intended behavior of the system, which may not be captured in tests, may not manifest until edge conditions arise, and may depend on requirements that were never fully specified. The production signal arrives, but it arrives late, often months after the supervisory decision that allowed the code to ship.
This makes the middle loop’s measurement problem harder than SRE’s, not a variation on the same problem. Some approaches exist: tracking AI-generated code churn rates as a proxy for supervisory quality is a measurement GitClear has applied to real codebases, finding elevated churn in AI-assisted projects relative to human-written code. Treating the rate of security vulnerabilities in merged AI-generated code as a form of error budget is another option. These signals are coarse and lagging. They are also better than nothing, and SRE started with coarser signals than it uses today.
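One way to picture such a signal as an error budget is the sketch below. Everything in it is an assumption made for illustration: the `ReviewWindow` record, the 2% target defect rate, and the posture thresholds are hypothetical and are not drawn from Vella's research, GitClear's methodology, or any established practice.

```python
from dataclasses import dataclass


@dataclass
class ReviewWindow:
    """Hypothetical counts for one reporting window (e.g. a quarter)."""
    merged_ai_changes: int      # AI-assisted changes that were merged
    defective_ai_changes: int   # of those, later reverted, hotfixed, or flagged


def budget_consumed(window: ReviewWindow, target_defect_rate: float) -> float:
    """Fraction of the allowed defects already used up in this window."""
    allowed = window.merged_ai_changes * target_defect_rate
    if allowed == 0:
        return 0.0 if window.defective_ai_changes == 0 else float("inf")
    return window.defective_ai_changes / allowed


def review_posture(consumed: float, tighten_at: float = 0.8) -> str:
    """Map budget consumption to a coarse supervisory posture (assumed thresholds)."""
    if consumed >= 1.0:
        return "freeze"    # budget exhausted: stop taking on new risk
    if consumed >= tighten_at:
        return "tighten"   # nearly exhausted: add review depth
    return "normal"


if __name__ == "__main__":
    window = ReviewWindow(merged_ai_changes=200, defective_ai_changes=3)
    consumed = budget_consumed(window, target_defect_rate=0.02)
    print(consumed, review_posture(consumed))
```

Because defects surface late, a window like this would be scored well after it closes, which is exactly the lag described above.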
The measurement gap also changes what kind of postmortem is possible. SRE postmortems can pinpoint the moment a metric crossed a threshold, correlate it with a deployment, and trace the causal chain with some confidence. Middle loop postmortems on AI-introduced bugs face a harder attribution problem: the supervisory failure could have been in the prompt, the specification, the review, or the test coverage, and the artifact does not record which. Blameless postmortems help, but the raw material for them is thinner.
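Part of that thinness could be addressed by recording the attribution explicitly at postmortem time. The following is a minimal sketch of what such a record might look like, assuming the four failure loci named above; the `FailureLocus` and `Postmortem` types are hypothetical, not an existing standard.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class FailureLocus(Enum):
    """Where the supervisory failure occurred (the four loci named in the text)."""
    PROMPT = "prompt"
    SPECIFICATION = "specification"
    REVIEW = "review"
    TEST_COVERAGE = "test_coverage"


@dataclass(frozen=True)
class Postmortem:
    """Hypothetical blameless record: what happened and where supervision failed."""
    summary: str
    loci: frozenset[FailureLocus]  # a failure may span more than one locus


def locus_counts(postmortems: list[Postmortem]) -> Counter:
    """Tally how often each locus appears across a set of postmortems."""
    counts: Counter = Counter()
    for pm in postmortems:
        counts.update(pm.loci)
    return counts


if __name__ == "__main__":
    history = [
        Postmortem("Off-by-one in pagination", frozenset({FailureLocus.REVIEW})),
        Postmortem(
            "Missing auth check on new endpoint",
            frozenset({FailureLocus.SPECIFICATION, FailureLocus.REVIEW}),
        ),
    ]
    for locus, n in locus_counts(history).most_common():
        print(locus.value, n)
```

A tally like this is the beginning of a shared vocabulary: recurring loci point to where the supervisory process, rather than any individual, needs to change.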
The Timing Problem
SRE developed gradually alongside the systems it was supervising. The tooling and practice vocabulary emerged over years in response to real problems at real scale. The SRE book appeared thirteen years after the discipline was founded at Google. The middle loop is forming in a compressed timeline, driven by model capability improvements steep enough that practices stable eighteen months ago are already outdated.
Fowler notes that Vella’s research concluded in April 2025 and that model improvements since then have accelerated the shift toward supervisory engineering. That means the middle loop needs to develop its SLO equivalents, its postmortem culture, and its toil recognition framework at a pace SRE never had to match.
The structural insights SRE produced are available to adapt. Supervisory work needs explicit error thresholds, retrospective processes, and principled automation boundaries. These are not SRE-specific ideas; they are properties of any supervisory discipline that functions reliably at scale. Taking them seriously for the middle loop is faster than reinventing them from scratch, and the middle loop does not have the luxury of a thirteen-year development window to find out the hard way.