From Making to Supervising: What the AI Coding Shift Actually Costs the Profession
Source: martinfowler
Martin Fowler ended his summary of Annie Vella’s research on 158 professional software engineers with the word “traumatic.” He was describing the shift from creation-oriented engineering work to what Vella calls supervisory engineering work: the effort required to direct AI, evaluate its output, and correct it when it is wrong. That word is chosen carefully by a careful observer. The structural part of the argument — that a new middle loop has emerged between the inner loop of write-test-debug and the outer loop of commit-review-deploy — is worth taking seriously on its own terms. But “traumatic” points at something the structural framing alone does not fully explain. It points at professional identity.
When the Artifact Is No Longer Yours
Software engineering has always been defined by building. The canonical description of the job is writing code, which means producing the primary artifact of value. Engineers are evaluated on what they ship. Senior engineers are people who have built more and harder things. The craft has never been easily separable from the production of code, and most of the language engineers use to describe their work — “I built this,” “I wrote this,” “I shipped this” — reflects that inseparability.
Supervisory engineering produces no artifact in that sense. The supervisory engineer’s primary output is the quality of their evaluation. An engineer who correctly identifies that an AI-generated authentication handler contains a session fixation vulnerability has done their job well; the corrected code that ships is indistinguishable in the final artifact from code a skilled human engineer wrote directly. The supervisory work is invisible in the output. The value it created is real and load-bearing, but it does not produce anything you can point to.
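The session fixation case is worth making concrete. The sketch below is hypothetical, not drawn from the research: `SESSIONS`, `login_vulnerable`, and `login_fixed` are invented names, and a real handler would live behind a web framework. It shows the kind of one-line omission a supervising engineer has to catch, a pre-authentication session ID surviving login, and the standard fix of rotating the ID at the privilege change.

```python
import secrets

# Hypothetical in-memory session store, for illustration only.
SESSIONS = {}  # session_id -> {"user": str | None}

def get_or_create_session(session_id=None):
    """Return an existing session ID or create an anonymous session."""
    if session_id in SESSIONS:
        return session_id
    new_id = secrets.token_hex(16)
    SESSIONS[new_id] = {"user": None}
    return new_id

def login_vulnerable(session_id, username):
    """Flawed handler: keeps the pre-auth session ID after login.
    An attacker who planted session_id before authentication now
    holds an authenticated session (session fixation)."""
    SESSIONS[session_id]["user"] = username
    return session_id

def login_fixed(session_id, username):
    """Correct handler: rotates the session ID at privilege change,
    so any ID known before authentication becomes worthless."""
    SESSIONS.pop(session_id, None)   # invalidate the pre-auth session
    new_id = secrets.token_hex(16)   # issue a fresh, unguessable ID
    SESSIONS[new_id] = {"user": username}
    return new_id
```

Both handlers pass a naive "user can log in" test; only a reviewer who knows why session IDs must rotate at authentication will flag the first one, which is exactly the evaluation work that leaves no trace in the artifact.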
This is not a minor adjustment to the job. It is a shift in what “doing the job well” means, at a level deep enough that existing career frameworks, performance evaluation systems, and professional self-concepts do not accommodate it cleanly. The engineers Vella studied were navigating this in real time, and what she found was a kind of disorientation: the work they were doing felt important and demanding, but the frameworks they had for understanding their own contribution did not quite fit.
What Other Professions Learned About This Transition
The analogies that have dominated discussion of AI and automation come from industrial process control and aviation, fields where operators supervise increasingly automated systems. Those analogies are useful for the skills question. For the identity question, a closer set of parallels comes from creative professions that were restructured by desktop tools in the 1980s and 1990s.
Graphic designers in the early 1980s built careers around production craft: hand-lettering, paste-up, phototypesetting, mechanical preparation of print layouts. These were skilled technical practices. Desktop publishing software — PageMaker, then QuarkXPress, eventually the Adobe suite — absorbed the production workflow entirely within about a decade. Designers did not lose their jobs, but the job changed. The practitioners who adapted best were the ones who recognized that their domain knowledge transferred into judgment about what the tools produced. The ones who struggled were the ones whose professional identity was inseparable from the specific production methods that had been automated.
The same pattern played out in architecture with CAD, in editorial with digital layout tools, in audio production with digital audio workstations. In each case, skilled practitioners found that judgment became their primary value precisely when the tooling removed the need for them to produce by hand. The practitioners who got through it did not abandon production skill; they maintained enough of it to evaluate correctly. What they changed was their relationship to it: production became the substrate for evaluation rather than the point.
What software engineering is navigating now is structurally similar, with one important difference. The previous transitions moved production work into specialized tooling that operated on well-defined domains. Desktop publishing handled layout; it did not write the text. CAD handled draftsmanship; it did not design the building. AI coding assistants operate in the same conceptual space as the engineer, which means the supervisory skill required is more demanding. Evaluating AI-generated code requires understanding both what the code should do and whether what the AI produced actually does it, for a system the AI only partially understands.
The Measurement Problem
The profession’s metrics amplify the identity disruption. Software engineering has spent years building productivity measurement frameworks oriented around production: deployment frequency, commit velocity, PR cycle time. The DORA framework, which has become the standard for engineering organization health, measures the outer loop. DX Research has done useful work on inner loop measurement. Neither framework has instrumentation for the middle loop, because the middle loop produces no output that current tooling can count.
An engineer who accepts AI output uncritically and an engineer who catches every error before it reaches review will, in current metrics, look identical until defect rates in production diverge. The supervisory quality is invisible to the measurement system, and invisible work does not get rewarded, resourced, or included in performance reviews.
A GitClear analysis of 211 million lines of code found a significant increase in code churn, meaning code written and then substantially modified within two weeks, correlating with AI tool adoption. The churn eventually shows up as change failure rate in outer loop metrics, but the supervisory gap that caused it occurred long before any metric registered a problem. The measurement system reports clean right up until it does not, and until then the engineers doing careful supervisory work are indistinguishable from the ones doing negligent supervisory work.
This creates a specific kind of pressure on engineers who take supervisory work seriously: they carry a cognitive overhead that peers who skip it do not carry, and productivity metrics do not capture the difference. That pressure compounds the identity question. The craft is becoming supervisory; the measurement system still rewards production speed.
The Development Pipeline
The question that has received least attention is what this means for engineers who are early in their careers or have not yet entered the profession. The standard developmental path has been stable for decades: write code, break things, fix things, and build intuition for where correctness lives through repeated production practice. The inner loop was where engineering judgment developed. Debugging sessions, refactors, the experience of tracing an unfamiliar codebase to find a subtle error: these are the experiences that built the evaluation capacity that supervisory engineering now requires.
Vella’s research found that AI tools provide the largest productivity uplift to engineers who already have deep domain knowledge, because they can evaluate AI output accurately. A Purdue University study found that Copilot produced incorrect code in approximately 40% of test cases while developers accepted suggestions at high rates. The problem the study identifies is that incorrect AI output looks like correct AI output; distinguishing them requires the domain knowledge that traditional inner loop practice built. Engineers who grow up primarily in the supervisory mode may get generation speed without the evaluation capacity that makes supervision effective.
Lisanne Bainbridge’s 1983 paper “Ironies of Automation” described this precise dynamic: the more capable an automated system becomes, the more critical human oversight becomes when the automation fails, and the less opportunity humans have to practice the skills that oversight requires. Aviation addressed this with mandatory manual flying requirements and structured recurrent training. Medicine maintains residency structures that preserve hands-on skill development even as diagnostic tooling improves. Software engineering has no equivalent: no mandatory production coding requirements, no structured recurrent training for the case where AI-generated code is subtly wrong.
The skills that effective supervisory engineering requires — reading unfamiliar code critically enough to catch semantic errors that look syntactically fine, recognizing when an AI has misunderstood the problem rather than the implementation, identifying when locally correct code violates global architectural constraints — are not learned by prompting AI tools. They develop through production practice. If the profession automates away production practice without building structured alternatives, it will produce engineers who are fluent at directing AI and poor at evaluating what it produces.
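A small, invented illustration of a semantic error that is syntactically fine: both functions below type-check, read plausibly, and agree whenever the total happens to divide evenly, which is exactly the case a hasty test suite tends to cover. Catching the difference requires the reviewer to reason about the partial-page case, not the syntax.

```python
def page_count_plausible(total_items, page_size):
    """Looks correct at a glance and passes any test whose total is
    a multiple of page_size, but floor division silently drops the
    final partial page."""
    return total_items // page_size

def page_count_correct(total_items, page_size):
    """Ceiling division: a partial page still needs a page."""
    return -(-total_items // page_size)
```

For 10 items at 3 per page, the first returns 3 and the second returns 4. Nothing in the language, the types, or a superficial read distinguishes them; only the evaluator's model of what the code should do does.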
What “Traumatic” Actually Means
Trauma, in the sense Fowler is gesturing at, describes a disruption that outpaces the construction of frameworks to make sense of it. The definition of competent engineering work is changing faster than the field’s career development paths, team structures, hiring criteria, and productivity measurement frameworks can adapt. Engineers are being asked to do work that matters in ways that existing professional scaffolding does not recognize or support.
Practitioners from other creative professions that made the same transition say the same thing about the intermediate period: the work was real and demanding, the tools were genuinely useful, and the frameworks for understanding what they were doing lagged by years. The professions that navigated it best built explicit frameworks for supervisory judgment as a discipline, maintained production skill deliberately rather than letting it atrophy, and invested in the training paths that preserved the epistemological foundation the automated tools required.
Vella’s contribution is naming the middle loop clearly enough that the field can reason about it. Fowler’s contribution is recognizing that the structural naming is not enough on its own, that there is something identity-level happening that deserves the weight of the word traumatic. What comes next is harder: developing the measurement infrastructure to make supervisory quality visible, building the training frameworks that preserve production judgment while accommodating AI assistance, and constructing the professional language for what it means to do supervisory engineering well. Those are organizational and institutional problems, not individual skill problems, and they will not be solved by individual engineers figuring out better prompting strategies.