
The Silent Dependency in Supervisory Engineering

Source: martinfowler

Annie Vella’s research on 158 software engineers, summarized by Martin Fowler, identifies a shift from creation-oriented work to verification-oriented work. The name she gives the new mode is supervisory engineering: directing AI, evaluating its output, correcting it when wrong. The framing is accurate and useful. But it leaves open a question that the data cannot answer, and that nobody has a clean solution to yet.

Supervisory engineering presupposes something. To evaluate whether AI-generated code is correct, you need to know what correct looks like. To catch a security flaw in generated code, you need a mental model of what attack surfaces exist and how they manifest. To recognize that an AI-written migration will deadlock on a large table, you need enough database experience to know why that happens. The supervisory skill is downstream of implementation experience. You cannot evaluate what you have never built.

This creates a circular dependency in the shift Vella documents. The more time engineers spend in the supervisory mode, the less time they spend in the implementation mode that builds the knowledge supervisory work requires. At the individual level this is manageable for experienced engineers: the reservoir built over years draws down slowly. At the organizational and generational level, it is a different problem.

What Changes When You Stop Writing the Code

The inner loop, as developers have traditionally experienced it, is not just a productivity cycle. It is a feedback mechanism that builds intuitions that do not transfer well through reading or instruction. You learn that off-by-one errors cluster around loop termination by making them repeatedly until the pattern becomes visible. You learn how memory pressure degrades application behavior by observing it under load. You develop a sense for which code paths will be slow by writing code and profiling it, not by studying algorithmic complexity in the abstract.
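The boundary intuition described above can be made concrete. A minimal sketch (the function and its name are invented for illustration): the natural first draft of a "find the last occurrence" loop walks forward and quietly returns the first match, and even the corrected backward walk hinges on getting two boundary values in `range` exactly right.

```python
def last_index_of(items, target):
    """Return the index of the last occurrence of target, or -1."""
    # The classic slip: a forward loop like
    #     for i in range(len(items)): ...
    # returns the FIRST occurrence, not the last.
    #
    # Correct: walk backward, minding both boundaries.
    # range(len(items) - 1, -1, -1) starts at the last valid index and
    # stops after index 0; writing 0 as the stop would skip index 0.
    for i in range(len(items) - 1, -1, -1):
        if items[i] == target:
            return i
    return -1
```

The off-by-one lives entirely in those `range` arguments, which is exactly the kind of detail that becomes visible through repetition rather than instruction.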

These intuitions are what a supervisory engineer draws on when evaluating AI output. Consider a generated function like this:

def get_active_users(db, min_logins=10):
    users = db.query("SELECT * FROM users").fetchall()
    return [u for u in users if u.login_count >= min_logins]

The function fetches every row into application memory before filtering in Python. A developer who has traced a production query performance problem through a profiler recognizes this immediately. Someone without that experience sees a function that passes its tests and approves it. The lesson arrives six months later, under production load, in a way that is expensive and disorienting.
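What the experienced reviewer would ask for instead is roughly this: push the filter into SQL so the database returns only the matching rows. A sketch using sqlite3 for a self-contained demonstration (the original snippet's `db` object, table schema, and data are assumptions made here for illustration):

```python
import sqlite3

def get_active_users(db, min_logins=10):
    # Filter in SQL so the database returns only matching rows,
    # rather than materializing the whole table in Python.
    cur = db.execute(
        "SELECT name, login_count FROM users WHERE login_count >= ?",
        (min_logins,),
    )
    return cur.fetchall()

# Tiny in-memory demonstration.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (name TEXT, login_count INTEGER)")
db.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("ada", 42), ("bob", 3), ("cleo", 10)],
)
rows = get_active_users(db)
```

On a three-row table both versions behave identically, which is why tests alone do not catch this; the difference only appears at production scale.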

Recognizing an SQL injection in AI-generated query construction requires understanding how parameterization works at the driver level, not just knowing that injection exists as a category. Catching a race condition in generated concurrent code requires having debugged race conditions before. The GitClear research from early 2024 found meaningfully higher code churn in AI-assisted codebases: code written, then reverted or substantially revised shortly after. One plausible explanation is that supervisory review is accepting outputs that are subtly wrong, in ways that only surface later. That would be consistent with supervision that is under-informed rather than lazy or careless: supervision missing the pattern recognition that comes from having been burned by similar code before.
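The parameterization point is worth seeing concretely. A minimal sqlite3 sketch (the table and the attacker input are invented for illustration): string formatting lets the input rewrite the query's structure, while a driver-level parameter can only ever be data.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (name TEXT, role TEXT)")
db.execute("INSERT INTO users VALUES ('ada', 'admin')")

attacker_input = "' OR '1'='1"

# Vulnerable: the input is spliced into the SQL text, so its quote
# characters become part of the query's structure.
unsafe = db.execute(
    "SELECT * FROM users WHERE name = '%s'" % attacker_input
).fetchall()

# Parameterized: the driver sends the SQL and the value separately;
# the input cannot change the query's structure, only its data.
safe = db.execute(
    "SELECT * FROM users WHERE name = ?", (attacker_input,)
).fetchall()

print(len(unsafe), len(safe))  # the unsafe query matches every row
```

Knowing that the second form is safe because the value never passes through the SQL parser, rather than because a linter approved it, is the driver-level understanding the paragraph above describes.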

The Junior Engineer Problem Is the Acute Version

The concern about experienced engineers gradually drawing down their implementation reserves is real but slow-moving. The more acute version involves engineers who are entering the profession now.

Software engineering has traditionally developed skills through exposure. A junior engineer working on a production codebase would spend years writing real code, debugging failures, reading unfamiliar codebases before modifying them. The inner loop was where craft knowledge accumulated: not in the concepts learned from documentation, but in the repetitions, the mistakes, and the debugging sessions that followed.

If the inner loop is substantially automated, junior engineers may spend their formative years evaluating AI output, accepting or rejecting generated code, writing prompts. The question is whether that experience builds the substrate that makes supervisory judgment reliable. The honest answer is that nobody knows yet. The data simply does not exist for a generation of engineers trained primarily in supervisory mode.

The aviation domain has a documented version of this problem. Studies following increased autopilot use in commercial aviation found that manual flying skills degraded among pilots who relied heavily on automation, and that degradation was not always visible until an unusual situation required manual intervention. The response was mandatory manual flight training hours and periodic proficiency checks specifically designed to maintain skills that automation had made routine to bypass. Software engineering has no equivalent policy, and it is not obvious what one would look like.

A Stanford study from 2022 found that developers using AI coding assistants were more likely to introduce security vulnerabilities than those who were not. The cause was not malice or carelessness; it was that the AI-generated code often omitted the defensive checks that experienced developers write by habit, and the developers reviewing it did not catch the omissions. The supervisory skill failed because the underlying pattern recognition was not there.

What Deliberate Practice Might Actually Mean

The aviation analogy points toward an implication that organizations have not thought through carefully yet. If supervisory engineering is the primary mode of daily work, and if that mode is insufficient to maintain the implementation fluency that supervision requires, then implementation practice needs to be maintained through something other than daily work.

This is not an argument against AI tools. It is an argument that using AI tools competently may require deliberate off-AI practice, in the way that high-level mathematical competence requires drilling operations that calculators can handle. The surgeon who uses robotic assistance still trains on cadavers. The practice maintains a capability the workflow has otherwise deprioritized.

What this looks like in practice is not fully worked out, but the shape is visible. Debugging sessions conducted without AI suggestion as a form of practice rather than a constraint. Architecture decisions made before AI tools are consulted, then compared to what the tools suggest. Code exercises in domains where AI assistance is unavailable or disabled. These are not productivity activities in the near term. They are investments in the quality of the supervision those tools will receive.

The METR 2025 study on AI assistance for real software engineering tasks found roughly 20% average time savings with high variance, and noted that the average obscured a distribution with a long left tail: some tasks improved substantially, others regressed. The regression cases are likely where supervisory judgment mattered most and where the required pattern recognition was thin.

The Identity Piece Fowler Is Right To Name

Fowler calls this shift a traumatic change, and that framing is precise rather than dramatic. Engineers who built their professional identity around the act of creating code face a shift where that activity is increasingly automated. The transition is not just a skills question; it is an identity question.

The Vella research suggests this is being experienced, not theorized. Engineers in the study report the shift happening in their daily work. What remains an open question is whether engineers who adapt well to supervisory work are those who maintain implementation practice as a deliberate activity, or whether the adaptation proceeds through other mechanisms.

The Stack Overflow Developer Survey from 2024 found that 76% of developers were using or planning to use AI tools, but only 43% reported high trust in the output. That gap between adoption and trust is where supervisory engineering lives. The engineers navigating it most effectively have probably calibrated their trust through enough implementation experience to know where specific tools tend to fail, rather than arriving at their judgments through intuition alone.

That calibrated distrust is the thing at risk if the bootstrap problem is real. And the data Vella gathered, incomplete as it is given the April 2025 cutoff before the current generation of models, suggests it is.
