
The Career Formation Problem in AI-Assisted Engineering

Source: martinfowler

Annie Vella’s research, covered in Martin Fowler’s March 16 fragments post, coins a useful term: supervisory engineering. In a study of 158 professional software engineers completed in April 2025, Vella found that practitioners were spending increasing amounts of time directing AI systems, evaluating their output, and correcting them when they diverged from intent. Fowler adds a structural frame: a middle loop, sitting between the fast inner loop of write/build/test/debug and the slow outer loop of commit/review/CI/deploy/observe. The middle loop runs in minutes to an hour. It is where prompt/evaluate/accept-or-reject/re-prompt lives.
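The cycle Fowler describes can be sketched in a few lines. This is a hypothetical illustration, not an implementation from either post; the callable names (`describe`, `generate_patch`, `evaluate`, `refine`) are placeholders, and the crucial part, the human judgment, hides behind `evaluate`.

```python
def middle_loop(describe, generate_patch, evaluate, refine, max_attempts=3):
    """Sketch of the prompt / evaluate / accept-or-reject / re-prompt cycle.

    All four callables are hypothetical placeholders supplied by the caller;
    the supervisory judgment lives entirely inside `evaluate`.
    """
    prompt = describe()
    for _ in range(max_attempts):
        patch = generate_patch(prompt)        # ask the model for a diff
        accepted, feedback = evaluate(patch)  # the supervisory judgment
        if accepted:
            return patch                      # hand off to the outer loop
        prompt = refine(prompt, feedback)     # re-prompt with corrections
    return None                               # fall back to writing it by hand


# Toy stubs: the "model" succeeds once the prompt mentions the edge case.
result = middle_loop(
    describe=lambda: "parse the header",
    generate_patch=lambda p: f"patch for: {p}",
    evaluate=lambda patch: ("edge case" in patch,
                            "handle the empty-file edge case"),
    refine=lambda p, fb: f"{p} ({fb})",
)
print(result)  # accepted on the second attempt
```

The point of the sketch is where the weight falls: everything except `evaluate` can be automated, which is exactly why the quality of that one step dominates the loop.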

This frame clarifies something important about where the profession is now. But it also surfaces a structural problem that the research period could not fully capture, because the problem is generational.

What the Middle Loop Requires

Supervisory engineering is not a lesser version of implementation work. Evaluating AI-generated code requires understanding what correct code looks like, what failure modes are plausible, what security properties the domain demands, and which edge cases the model is likely to miss. When you write code yourself, the act of writing surfaces these questions. You hit the bounds check. You see the race condition. You read the postmortem. The knowledge accumulates through friction.

The verification paradox is that catching subtle AI errors requires knowing the domain well enough to recognize what wrong looks like. That knowledge is not acquired by supervising AI. It is acquired by implementing things, making mistakes in those implementations, and developing a mental model of how this class of code behaves under adversarial conditions.

The GitClear 2024 research found higher code churn rates in AI-assisted codebases than in codebases without AI assistance, meaning code was written and then significantly modified or reverted at higher frequency. That is a supervision failure signal: code that looked acceptable at merge time but did not hold up. A Stanford 2022 study found that developers using AI coding assistants were more likely to introduce security vulnerabilities than those who were not. Neither of these findings says AI assistance is net negative. They say that the quality of the supervisory layer matters, and the supervisory layer is only as good as the judgment of the person running it.

The Experience-Level Inversion

Senior engineers get the most benefit from current AI tools. Junior engineers report the most problems. This is not surprising once you understand what supervision requires. A senior engineer with a decade of inner loop work has a large reservoir of implementation knowledge to draw on. When an AI-generated diff contains a subtle concurrency bug or an off-by-one in a boundary condition, the senior engineer’s pattern recognition catches it. They have seen that failure before, or something structurally similar.
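A concrete, hypothetical illustration of the kind of bug described above (not taken from the article or the cited studies): a binary search whose diff reads fine and whose happy-path tests pass, but which silently misses elements at the boundary.

```python
# Hypothetical example of a plausible-looking off-by-one in a boundary
# condition, the kind of subtle bug pattern recognition is for.
def buggy_search(xs, target):
    """Binary search over a sorted list. Reads fine; passes casual tests."""
    lo, hi = 0, len(xs) - 1
    while lo < hi:                 # subtle bug: should be `lo <= hi`
        mid = (lo + hi) // 2
        if xs[mid] == target:
            return mid
        elif xs[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1                      # when lo == hi, xs[lo] is never checked


def fixed_search(xs, target):
    """Same search with the boundary handled correctly."""
    lo, hi = 0, len(xs) - 1
    while lo <= hi:                # the loop must still run when lo == hi
        mid = (lo + hi) // 2
        if xs[mid] == target:
            return mid
        elif xs[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1


# Middle elements are found either way; the bug only bites at the edges.
print(buggy_search([1, 3, 5], 3))  # 1  -- looks fine
print(buggy_search([1, 3, 5], 5))  # -1 -- last element silently missed
print(fixed_search([1, 3, 5], 5))  # 2
```

A reviewer who has written binary searches by hand tends to check the loop condition and the final index reflexively; a reviewer who has only ever accepted them does not.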

A junior engineer in month three of their career, running an agentic coding tool, may not have that reservoir. The code ships. The tests pass. The review looks fine because the reviewer is also relying on AI-assisted review. The supervision failure is invisible until production, under load, or under attack.

Frontier models’ resolution rate on SWE-bench climbed from around 12% in early 2024 to above 40% by mid-2025. Model capability improvements have accelerated the absorption of inner loop work, not reversed it. Fowler notes this directly. The direction of travel is clear. What is less clear is what it does to the careers of people who enter the profession now.

The Production Lag Problem

The engineers currently doing supervisory work at scale grew up in the inner loop. They spent years writing code, finding bugs, reading other people’s code, and developing implementation instincts through repetition. They are now applying those instincts to the supervision problem. The reservoir is large and was built over time.

Engineers entering the profession today face a different situation. The inner loop is being absorbed from day one. A new hire at many companies will spend their first months primarily in the middle loop: describing tasks, evaluating output, accepting or rejecting diffs. They will ship code. The productivity numbers will look acceptable. But the implementation substrate that effective supervision depends on is not being built through this work.

You cannot tell from the output whether a supervisory engineer has that substrate. Code ships either way. But the quality of supervision differs in ways that only become visible when something goes wrong in a way the model has not seen before, or when the failure mode is one that requires deep domain knowledge to even recognize as a failure.

Aviation faced a version of this problem. Autopilot handles most of cruise flight, and modern commercial aircraft could fly many routes with minimal pilot input under nominal conditions. But the FAA requires pilots to train extensively on manual flight before certification, and added startle-and-surprise training requirements specifically because automation complacency incidents showed that pilots who had not maintained manual skills could not intervene effectively when automation encountered something outside its parameters. The manual training exists not because manual flight is the expected mode, but because recognizing when automation is doing something wrong requires deep understanding of what it should be doing.

Inner loop practice is the engineering equivalent of that manual flight training. It is not inefficiency. It is maintenance of the evaluation substrate.

The Hiring Signal Problem

The traditional technical interview has well-documented limitations, but it tested at least one thing: whether a candidate can implement something correctly under time pressure, without assistance. The criticisms of whiteboard interviews are mostly valid. The alternative, giving candidates a task and watching them use AI tools, is a better proxy for actual middle loop work. But it does not clearly test the judgment layer.

A candidate who is fluent with AI coding tools can produce working code for a moderately complex task in an interview without knowing whether the code is secure, whether it will hold up under load, or whether the approach is architecturally sound. The output looks fine, the process looks fluent, and the candidate passes.

What the interview did not surface is whether the candidate has the implementation substrate to evaluate what the AI produced. That knowledge is not visible in an hour-long session. It becomes visible when something breaks in production and someone needs to diagnose it from first principles.

What Deliberate Practice Looks Like

There is no obvious institutional response to this problem yet. But the shape of a response is visible in adjacent fields.

Deliberate inner loop practice, periodically writing code without AI assistance on purpose, functions as skill maintenance rather than gatekeeping. Code katas and exercises done by hand build the pattern recognition that supervision requires. Implementing something in a domain before using AI assistance in that domain establishes the baseline mental model. And reading AI-generated diffs more carefully than you would read a human’s, because AI output is syntactically fluent and plausible-looking in ways that make errors harder to spot, is a supervision discipline worth developing explicitly.

For teams onboarding junior engineers, there is a reasonable case for structured inner loop rotations: periods where the engineer is expected to implement things without AI assistance, not to slow them down, but to build the substrate that will make their supervision more valuable over the following years. Whether organizations have the patience for this, given the productivity pressure around AI adoption, is a separate question.

Fowler’s framing of this shift as a traumatic change to the profession is accurate. The engineers who navigate it well will likely be those who treat implementation skill as something that requires active maintenance even when the tools make it optional, and who recognize that the middle loop is only as reliable as the judgment layer underneath it.
