What Forty Years of Automation Research Says About Supervisory Engineering
Source: martinfowler
The shift Annie Vella documented among 158 professional software engineers, from creation-oriented to verification-oriented tasks, has a specific name in a different field. Martin Fowler recently summarized her work, introducing the framing of a “middle loop”: a new layer of engineering effort between the inner loop (writing code, testing, debugging) and the outer loop (commit, CI/CD, review, deploy). AI tools are increasingly automating the inner loop. The outer loop remains largely human. The middle loop is where engineers now direct AI work, evaluate its output, and correct it when it is wrong. Vella calls this “supervisory engineering work.”
In aviation, process control, and nuclear power, researchers spent four decades studying exactly this structural shift. The literature from that work is specific, and it does not offer uncomplicated reassurance.
The Three Ironies
In 1983, Lisanne Bainbridge published “Ironies of Automation” in the journal Automatica. It became one of the most cited papers in human factors research, and its core argument has held across industries and decades. The central observation: the more reliable an automated system becomes, the worse the human operator gets at the task the system is automating, because the human has fewer opportunities to practice it. That is the first irony: when the automation eventually fails, the human is least prepared to take over at precisely the moment they are most needed.
The second is that automated systems tend to fail in unusual situations, the edge cases where the task is hardest and operator judgment matters most. The third is that the human, having spent time in a monitoring role, is in the wrong cognitive mode for rapid high-stakes problem-solving when failure occurs. Passive vigilance is not the same mental state as active execution, and switching between them under pressure is not reliable.
Bainbridge wrote about chemical plants, aircraft, and power stations, but the structure of the argument was not specific to those domains. It followed from the relationship between automation and human skill, and that relationship does not change when you substitute a language model for a PID controller.
Fowler notes something that seems counterintuitive: Vella’s research finished in April 2025, before the latest generation of models, but his sense is that improvements in models have “only accelerated a shift to supervisory engineering.” You might expect the opposite. Better AI should mean fewer mistakes to catch, less correction needed, and less time in the middle loop. Bainbridge’s framework explains why that expectation is wrong. As AI becomes more capable at inner loop tasks, the engineer practices those tasks less, and the supervisory role expands not because the AI is worse but because the engineer’s active participation in creation has decreased. The middle loop grows as the inner loop shrinks.
The Fluency Problem
The verification task in supervisory engineering carries a specific difficulty worth naming separately. AI-generated code is syntactically fluent. A novice’s code signals its own uncertainty through style: inconsistent naming, awkward control flow, patterns copied from documentation without quite fitting the context. You can often tell from reading it that something might be wrong. AI-generated code does not carry those tells. It is well-formatted, idiomatically consistent, and internally coherent; it reads like code written by a competent engineer.
The mistakes AI makes tend to be semantic: subtle misreadings of the requirement, edge cases handled incorrectly, architectural assumptions that are locally plausible but globally wrong. Catching these requires reasoning about correctness at the level of intent. You have to understand the requirement well enough to verify the implementation against it, which is cognitively expensive and rewards deep domain understanding rather than familiarity with surface patterns.
Code review on human-written code includes a useful shortcut: you can infer the author’s intent from context, ask them questions, or check adjacent code for clarifying patterns. The human was present for the reasoning. AI-generated code was produced from a prompt, and the reasoning is not preserved unless explicitly elicited. The diff view does not tell you what the model was trying to do when it made a particular choice, and the model does not volunteer that information unprompted.
This asymmetry matters because human reviewers have historically relied on style and structure as proxies for confidence in correctness. A well-structured function with clear naming and consistent patterns has a higher prior probability of correctness than a tangled one. AI code breaks that heuristic. The fluency is a property of the generation process, not a signal of semantic correctness, so the usual shortcuts do not apply.
Skill Atrophy as a Structural Issue
The inner loop skills that AI is most rapidly automating (translating a requirement into a working implementation, debugging from first principles, building a mental model of a codebase through the act of writing it) are also the skills that make verification tractable. You can only catch what AI gets wrong if you have enough domain and architectural understanding to have an independent opinion about the correct approach.
Using AI to replace inner loop work while relying on those inner loop skills to supervise it is structurally unstable. The skills that make supervision possible are exactly the skills that get less practice as supervision becomes the primary mode of work. This maps directly onto Bainbridge’s first irony.
The timing of failures follows her second irony as well. AI tools fail most consistently in novel situations: unusual requirements, unusual interactions between components, edge cases with sparse training data coverage. Those are the situations where engineering judgment matters most, and also the situations where an engineer who has been primarily supervising AI will have the least recent practice at the relevant skills.
Aviation confronted this problem directly. Investigations into both Air France 447 and Asiana 214 cited degraded manual flying skills as contributing factors. In both cases, crews encountered situations that automation was not handling well, found themselves cognitively unprepared to take over, and made errors that more practiced manual pilots would likely have avoided. The FAA addressed the same concern in Safety Alert for Operators (SAFO) 13002, issued in 2013, which recommended that pilots practice manual flight more frequently to counteract automation-induced skill degradation. Airlines built minimum manual flying requirements into recurrent training programs. The insight was that incidental practice would not maintain skills that were no longer the default mode of operation.
What Software Engineering Needs
The structural response in aviation and process control was not primarily tooling. It was deliberately engineering the human back into the loop on a schedule, for the specific purpose of skill maintenance. Operators were rotated through higher-workload manual tasks. Simulators were used not just for emergency training but to maintain everyday proficiency at tasks the system usually handled. In nuclear power, procedural checklists were redesigned to require active cognitive engagement rather than passive verification; researchers found that checkbox-style confirmation without genuine attention produced the vigilance decrement that Bainbridge had identified as inevitable in pure monitoring roles.
For software engineers, this suggests something specific: the engineers who will be most effective in the middle loop are those who continue to practice inner loop skills deliberately, in contexts where AI is not doing the work for them. That means writing code from scratch without assistance, navigating unfamiliar codebases manually, and taking the driver's seat in pair programming sessions rather than reviewing. The goal is to maintain the substrate that makes verification possible, because that substrate will not sustain itself as a side effect of supervisory work.
The tooling gap matters separately. The middle loop currently lacks the infrastructure that the inner and outer loops have accumulated over decades. The inner loop has debuggers, REPLs, type systems, fast test runners, hot reload. The outer loop has CI/CD pipelines, code review tooling, monitoring, and observability systems. The middle loop, as Fowler’s framing makes clear, currently runs on a chat interface and a diff view. The gap between the cognitive demands of supervisory engineering work and the tools available to support it is substantial.
Building that infrastructure requires understanding what supervisory engineering actually demands: tools that help engineers reason about intent, surface the assumptions embedded in generated code, and track where models have been consistently wrong for a given class of task. That is a different design problem from the one that produced linters and static analyzers, though it draws on related ideas. It requires treating the human as a supervisor rather than an author and designing feedback loops accordingly. What does a “mode confusion” warning look like for a language model? What is the software equivalent of the glass cockpit alert that tells a pilot the autopilot has disengaged?
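As one illustration of the last of those tools, here is a minimal sketch of a log that tallies how often reviewed AI output needed a semantic correction, grouped by task class. Everything in it is hypothetical: the ReviewLog name, the task class labels, and the min_reviews and threshold defaults are assumptions chosen for the example, not part of Vella's or Fowler's work or of any existing tool.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ReviewLog:
    """Hypothetical tally of review outcomes for AI-generated changes."""

    # task class -> [reviews seen, reviews that needed a semantic fix]
    counts: dict = field(default_factory=lambda: defaultdict(lambda: [0, 0]))

    def record(self, task_class: str, needed_fix: bool) -> None:
        """Record one reviewed change and whether the reviewer had to correct it."""
        entry = self.counts[task_class]
        entry[0] += 1
        if needed_fix:
            entry[1] += 1

    def fix_rate(self, task_class: str) -> float:
        """Fraction of reviewed changes in this class that needed a fix."""
        seen, fixed = self.counts.get(task_class, (0, 0))
        return fixed / seen if seen else 0.0

    def needs_close_review(self, min_reviews: int = 5, threshold: float = 0.3) -> list[str]:
        """Task classes with enough history and a fix rate at or above the threshold.

        The defaults are arbitrary placeholders, not recommended values.
        """
        return sorted(
            task_class
            for task_class, (seen, fixed) in self.counts.items()
            if seen >= min_reviews and fixed / seen >= threshold
        )


# Example: the model was corrected on 3 of 5 date-handling changes
# and on 0 of 5 routine endpoint changes.
log = ReviewLog()
for needed_fix in [True, False, True, True, False]:
    log.record("date arithmetic", needed_fix)
for _ in range(5):
    log.record("CRUD endpoint", False)

flagged = log.needs_close_review()
print(flagged)
```

A tally like this only tells the supervisor where to spend attention. It does not replace the intent-level reasoning that verification requires, and it is only as good as the reviewer's honesty in recording corrections.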
Bainbridge’s paper was written about physical systems with measurable failure modes and decades of operational data. Software supervisory engineering is newer and less instrumented. But the structural dynamics she described are not specific to those industries, and the problems her research identified are appearing in software in recognizable forms. The forty years of work her paper catalyzed in aviation, process control, and nuclear engineering is not a perfect map, but it is a far better starting point than treating the middle loop as a problem with no prior art.