Aviation Automated Expert Work Decades Before Software Did. Here's What It Learned.
Source: martinfowler
The framing in Martin Fowler’s latest fragments post, drawing on Annie Vella’s research into how 158 professional software engineers work with AI, is that a new “middle loop” is emerging between writing code and shipping it. AI handles more of the inner loop. Engineers supervise that work, evaluate the output, and correct what goes wrong. The shift from creation to verification is real, documented, and accelerating.
What the software engineering discussion has been slower to absorb is that other industries have been through this transition before, have made serious mistakes in how they handled it, and have developed institutional responses worth studying. Aviation has been here, and its experience amounts to a detailed case study in what happens when expert execution gets automated and supervision takes its place.
The Air France 447 Problem
On June 1, 2009, Air France Flight 447 crashed into the Atlantic Ocean after the autopilot disconnected due to ice-crystal contamination in the pitot tubes. The aircraft entered an aerodynamic stall that was recoverable with proper technique, but the pilots never recovered it.
The investigation found that the primary contributing factor was loss of situation awareness and inadequate manual flying skills among pilots who had spent the vast majority of their careers with the autopilot engaged. The automation worked as designed; it disengaged when it could no longer make reliable flight decisions, which was exactly what it was supposed to do. The failure was in the supervisory layer. The pilots were experienced, qualified, and certified. They had the required hours. But modern commercial aviation had so thoroughly automated the inner loop of flight that pilots rarely engaged in manual stick-and-rudder flying during normal operations. When the automation handed control back, the human supervisors were not prepared to take it.
The FAA and EASA response was systematic. They introduced requirements for mandatory manual flying practice, recurrent training specifically designed for automation-dependent failure scenarios, and new standards for what “proficiency” means in an age of flight management systems. These regulations explicitly required pilots to maintain skills that are never needed under normal operations, because those skills are what make competent supervision possible when the automation fails or goes wrong.
What Aviation Developed That Software Engineering Has Not
The institutional response in aviation did several things that software engineering has not yet developed equivalents for.
First, aviation established minimum manual practice requirements. A commercial pilot certificate requires demonstrating manual flight proficiency regardless of whether the autopilot can handle everything better. The reasoning is explicitly about supervisory competence: you cannot reliably supervise something you can no longer do.
Second, aviation developed scenario-specific training for automation failure modes. Simulator training requirements were extended to include scenarios that specifically exercise judgment in automation-degraded conditions. The training is not just about using the automation well; it covers what happens when the automation makes a wrong decision and the human needs to recognize and override it.
Third, aviation formalized crew resource management (CRM) as a distinct discipline. CRM covers how humans work together in high-automation environments: how to communicate when automation takes ambiguous action, how to maintain shared situational awareness across a crew that is monitoring rather than operating, how to challenge a decision made by a system rather than a person. The software engineering equivalent, code review, was not designed for the supervisory engineering context and does not address any of these questions.
The Medical Imaging Parallel
Radiology offers a different version of the same transition. Automated detection systems, now using deep learning, identify anomalies in imaging studies with sensitivity that approaches trained radiologist performance in specific, well-defined tasks. The productivity gains are real. So is the supervisory challenge.
The documented problem in radiology is automation complacency: the tendency of radiologists to reduce their scrutiny of a scan when an automated system has flagged nothing. Studies from the early 2010s onward found that radiologists who reviewed a scan after seeing a “no finding” result from an automated system missed significantly more lesions than radiologists who reviewed without the automated signal. The automation was not degrading individual radiologists; the supervisory role it placed them in was changing how they read.
Radiology’s response included training programs specifically designed to maintain active evaluation skills alongside automated tools, and quality standards that treat supervisory effectiveness as a measurable competency. Calibration assessment became part of professional practice: how accurate is your confidence, and where does it systematically diverge from actual error rates?
Software Engineering’s Current Exposure
The software equivalent of automation complacency is already visible in the data. Vella’s research captures it qualitatively. Quantitatively, the GitClear 2024 report found meaningfully higher code churn in AI-assisted codebases; churn is partly a proxy for inadequate supervision, since it counts code accepted from AI output that was later recognized as wrong and had to be revised or reverted. The Pearce et al. (2022) finding that roughly 40% of GitHub Copilot suggestions in security-sensitive contexts contained vulnerabilities reinforces this; the supervisory layer is not catching everything it needs to catch.
What software engineering currently lacks, compared to where aviation and radiology are:
- Proficiency standards for manual coding that persist even as AI assistance becomes the norm. There is no equivalent to the manual flight hours requirement. An engineer who has not written a function from scratch in six months has no formal obligation to maintain that capability.
- Structured training for AI-specific failure modes. What does a hallucinated API look like in practice? What are the characteristic error patterns in AI-generated authentication logic? How do you recognize when an AI has solved a slightly different problem than the one you posed? These failure modes are identifiable and teachable, but the field has not yet systematized that knowledge.
- Calibration assessment. How confident should a developer be when accepting AI-generated code, and how does that confidence compare empirically to the actual error rate for different task types? Radiology has developed systematic approaches to this question; software engineering has not.
- A formal equivalent of CRM. How do teams share supervision responsibility when AI generates code that multiple engineers will eventually touch? How do you communicate about the provenance and verification status of AI-generated components across a codebase?
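The calibration point above can be made concrete. A minimal sketch, assuming a team logs each accepted AI suggestion as a pair of (stated reviewer confidence, whether the code later proved correct); the function name and sample data are illustrative, not drawn from any existing tool:

```python
from collections import defaultdict

def calibration_report(reviews):
    """Group accepted-AI-code reviews by the reviewer's stated
    confidence and compare it to the observed correctness rate."""
    buckets = defaultdict(list)
    for confidence, was_correct in reviews:
        # Bucket stated confidence to the nearest 10%.
        buckets[round(confidence, 1)].append(was_correct)
    report = {}
    for conf, outcomes in sorted(buckets.items()):
        observed = sum(outcomes) / len(outcomes)
        report[conf] = {
            "stated": conf,
            "observed": observed,
            "gap": conf - observed,  # positive gap = overconfidence
            "n": len(outcomes),
        }
    return report

# Hypothetical log: (stated confidence, code later found correct)
reviews = [(0.9, True), (0.9, False), (0.9, True), (0.9, True),
           (0.7, True), (0.7, False), (0.7, True), (0.7, False)]
report = calibration_report(reviews)
```

On this sample, reviewers who said they were 90% confident were right 75% of the time, a measurable overconfidence gap of the kind radiology already tracks as a professional competency.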
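The provenance question also lends itself to a lightweight sketch. Nothing like this is standardized today; the record shape below is one hypothetical way a team could annotate AI-generated components, attached via a review tool or commit trailer, so verification status travels with the code:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum

class VerificationDepth(Enum):
    UNREVIEWED = "unreviewed"        # accepted as generated
    SKIMMED = "skimmed"              # read, but no tests written
    TESTED = "tested"                # reviewer-authored tests pass
    REIMPLEMENTED = "reimplemented"  # reviewer independently rederived the logic

@dataclass
class ProvenanceRecord:
    """Per-component annotation making supervision status visible
    to the next engineer who touches the code."""
    component: str
    generated_by: str                # e.g. model or tool name
    reviewed_by: str
    depth: VerificationDepth
    notes: str = ""
    recorded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

record = ProvenanceRecord(
    component="auth/session_tokens.py",
    generated_by="code-assistant",
    reviewed_by="engineer-a",
    depth=VerificationDepth.TESTED,
)
```

The design choice that matters is the explicit depth scale: a binary “reviewed” flag hides exactly the distinction, skimmed versus tested versus reimplemented, that a CRM-style protocol would need to surface.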
The Cognitive Foundation
The cognitive science behind these concerns is specific enough to be actionable. The generation effect is a well-replicated finding in memory research: information you produce yourself is retained more durably and accessibly than information you passively encounter. When an engineer writes a function, they exercise the same mental models that will let them recognize when an AI writes that function incorrectly. When they review AI output, they exercise a different, shallower encoding of the same knowledge.
This does not mean supervision is impossible without active generation. It means that the depth of the mental model supporting supervision degrades if it is not regularly refreshed through production. For the domains where AI code is most error-prone, including security-sensitive logic, complex concurrency, and domain-specific correctness requirements, deliberate inner-loop practice is maintenance of a supervisory capability rather than inefficient nostalgia.
Knowing how to implement a JWT handler from scratch is not valuable because you will often write one by hand when the AI can produce it in seconds. It is valuable because that knowledge is what makes you a reliable judge of whether the AI’s attempt is actually correct, and whether the edge cases it handled match the ones that matter in your system.
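To make that concrete, here is a hedged sketch of minimal HS256 JWT verification using only the Python standard library; it is illustrative, not a production implementation. The `alg` header check is the kind of edge case at stake: accepting whatever algorithm the token claims (including `none`) is a classic vulnerability class that a reviewer who has never implemented verification by hand is poorly positioned to catch.

```python
import base64
import hashlib
import hmac
import json

def b64url_decode(data: str) -> bytes:
    # JWTs strip base64url padding; restore it before decoding.
    return base64.urlsafe_b64decode(data + "=" * (-len(data) % 4))

def b64url_encode(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def verify_jwt(token: str, secret: bytes) -> dict:
    header_b64, payload_b64, sig_b64 = token.split(".")
    header = json.loads(b64url_decode(header_b64))
    # Do NOT trust the token's own header to choose the algorithm.
    # Accepting "none" (or an unexpected algorithm) lets an
    # attacker forge tokens that pass verification.
    if header.get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    signing_input = f"{header_b64}.{payload_b64}".encode()
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    # Constant-time comparison; plain == can leak timing information.
    if not hmac.compare_digest(expected, b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    return json.loads(b64url_decode(payload_b64))
```

An AI can emit a version of this in seconds; the supervisory question is whether the reviewer notices when the emitted version skips the algorithm check or uses `==` on the signature.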
The Institutional Gap
Vella’s research identifies a real transition. The question her work opens is what institutional response software engineering will develop. Aviation’s response was not to slow automation or preserve roles that automation could handle better. It was to be systematic about what the supervisory role requires and to build training and standards around maintaining that capability. The result is a profession where automation does most of the work, the human is responsible for a narrower and more demanding form of judgment, and there are explicit standards for what that judgment requires.
Software engineering is earlier in that process. The automation is here; the supervisory role is emerging; and the standards for supervisory competence have not been written. Vella’s work is one input to that process. The aviation and radiology precedents suggest that developing those standards early, before a series of high-profile failures makes them obviously necessary, is worth the effort.
The specific things that would move the field forward are not complicated in concept, even if they are difficult to standardize across a heterogeneous industry: formal failure mode taxonomies for common AI coding errors, deliberate practice requirements that maintain inner-loop skills, and team protocols for tracking the provenance and verification depth of AI-generated code. None of these exist in any widespread form today. Aviation had to learn their necessity the hard way. Software engineering does not have to.