Annie Vella asked a precise question when she studied how 158 professional software engineers use AI tools: are these tools shifting where engineers spend their time and effort? Because if they are, she reasoned, they are implicitly shifting which skills get practiced, and ultimately, what the role means.
Her answer was yes, and the shape of that shift is worth understanding carefully.
The Loop Model and Where AI Lands
Developer productivity research has long used a two-loop model. The inner loop is what you do in a single session: write code, run tests, read error output, fix the bug, repeat. It is tight, fast, and deeply personal. The outer loop is what happens at a larger cadence: commit, open a pull request, wait for CI, get review, merge, deploy, observe in production. It is slow, social, and process-heavy.
AI coding tools have been attacking the inner loop from multiple directions at once. GitHub Copilot and its successors fill in function bodies and boilerplate. Cursor and similar editors let you describe a change in natural language and accept a diff. Tools like Claude Code operate at an even higher level of abstraction, taking a task description and producing multi-file changes autonomously. The inner loop is not gone, but it has been compressed. What used to take thirty minutes of focused typing now sometimes takes five minutes of reviewing.
The outer loop, by contrast, has been slower to change. CI pipelines still run on the same infrastructure. Code review still happens in the same tools. The social and organizational overhead of shipping has not been meaningfully reduced by AI. If anything, the pressure on it has increased, because a faster inner loop produces more changes to review and ship.
Vella’s research identifies a third loop sitting between these two, a middle loop that did not exist in the same form before AI tools became central to daily work. She calls the effort required to operate in it “supervisory engineering work”: directing AI, evaluating its output, and correcting it when it goes wrong.
What Supervisory Engineering Actually Demands
The phrase “supervisory work” might suggest something passive, like watching a process run and occasionally pressing a button. The reality is different. Consider what happens when you hand a non-trivial task to an AI coding agent.
First, you have to decompose the problem well enough that the AI can operate on it. This is not the same skill as decomposing a problem for yourself. When you write code yourself, your mental model of the problem and the code evolve together; you can hold loose ends in your head and resolve them as you go. When you hand a task to an AI, you need to specify it precisely enough up front that the agent can reason about scope, dependencies, and edge cases without constant interruption. Under-specification produces confident but wrong output. Over-specification produces output that technically satisfies your description but misses the actual intent.
Second, you have to evaluate the output, and this is harder than it sounds. The AI will produce code that looks correct, is syntactically valid, passes type checks, and may even pass the tests you asked it to write. The question is whether it solves the actual problem. Spotting the gap between “this code does what I described” and “this code does what I needed” requires a deeper understanding of the problem domain than the task description itself contains. Research on AI-assisted coding has found that reviewers tend to over-trust plausible-looking AI output. This is a form of automation bias, a phenomenon documented decades before large language models existed.
Third, when you find an error, you have to diagnose it in a system you did not write line by line. You cannot rely on the same mental map you would have if you had written the code yourself. You are debugging a foreign codebase that was generated for you, often with subtly different assumptions than you would have made.
None of these are new skills in the abstract. Decomposing problems, reviewing code, and debugging unfamiliar codebases are all things senior engineers have always done. What is new is that they now constitute a much larger fraction of total engineering time, while the creation work that used to occupy most of the day has shifted to the tools.
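To make the second point concrete, here is a small hypothetical sketch in Python. The task description, both function names, and the sample data are invented for illustration and do not come from Vella's study. It only shows how code can pass a test that mirrors the description while still missing the intent.

```python
# Hypothetical task description handed to an agent:
#   "Remove duplicate email addresses from the list."

def dedupe_as_described(emails):
    """Plausible generated code: satisfies the literal description."""
    return list(set(emails))


def dedupe_as_needed(emails):
    """Assumed real intent: addresses that differ only by case or
    surrounding whitespace are duplicates, and the first occurrence
    keeps its original position."""
    seen = set()
    result = []
    for email in emails:
        key = email.strip().lower()
        if key not in seen:
            seen.add(key)
            result.append(email.strip())
    return result


# A test written from the description alone passes.
sample = ["a@example.com", "a@example.com", "b@example.com"]
assert sorted(dedupe_as_described(sample)) == ["a@example.com", "b@example.com"]

# Input closer to real data exposes the gap.
messy = ["Ann@Example.com", "ann@example.com ", "bob@example.com"]
print(len(dedupe_as_described(messy)))  # 3: nothing was treated as a duplicate
print(dedupe_as_needed(messy))          # ['Ann@Example.com', 'bob@example.com']
```

Nothing in the first function is wrong by the letter of the task. Catching the problem depends on the reviewer knowing what the data looks like, which is exactly the knowledge the description left out.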
The Skills That Get Left Behind
Vella’s finding that engineers are shifting from creation-oriented to verification-oriented work has a consequence that is easy to understate: the skills that do not get practiced do not stay sharp.
Writing code from scratch is not just a means to an end. It is also the primary way engineers build a mental model of how systems behave, where the failure modes are, and what the performance characteristics look like. When you implement a cache invalidation strategy yourself, you encounter the edge cases by hitting them. When you review AI-generated cache invalidation code, you have to reason about those edge cases abstractly, without the embodied experience of having built and broken something similar.
This is not an argument against using AI tools. The productivity gains are real and the tools are getting better quickly. It is an observation about what the middle loop costs over time, particularly for engineers early in their careers. Senior engineers bring to supervisory work a reservoir of inner-loop experience that lets them spot AI errors quickly. Juniors who primarily operate in the middle loop are building a different kind of expertise, one that may be harder to accumulate and harder to measure.
The career ladder problem is real too. Most engineering leveling frameworks reward artifacts: systems designed, features shipped, incidents resolved, code written. The middle loop produces fewer artifacts that are clearly attributable to the engineer. When the AI writes the code, who gets credit for the feature? When the engineer catches a subtle correctness error in AI output, that catch may be more valuable than writing the code would have been, but it shows up nowhere in performance review systems built for the creation-heavy workflow.
The April 2025 Horizon Problem
Vella’s research concluded in April 2025. Fowler notes this as a potential limitation, because models have improved significantly since then. Claude 3.5 Sonnet was the dominant coding model at that point; the landscape has since shifted substantially toward agents that can maintain context over much longer tasks, use tools, and iterate on their own output before presenting results.
This does not mean the research is wrong. The more likely implication is that the middle loop has grown since April 2025, not shrunk. As models get better at the inner loop, the part of it that engineers personally execute gets shorter, and the fraction of time spent in supervisory work increases. Better models produce more plausible output, which means more output that needs careful evaluation rather than immediate rejection. The verification work intensifies even as the creation work decreases.
There is also a second-order effect. As agents become capable of longer autonomous runs, the decomposition problem at the start of a task becomes more consequential. A mistake in how you specify a task to an agent that will run for ten minutes is cheap to catch. A mistake in how you specify a task to an agent that will run for two hours, touching dozens of files, is expensive. The quality of your problem decomposition now acts as a direct multiplier on the cost of errors.
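A rough back-of-the-envelope sketch of that multiplier, in Python. Every number here is made up, and the model rests on a deliberately crude assumption: a specification mistake wastes the whole agent run plus the time spent reviewing its output.

```python
def expected_wasted_minutes(p_spec_error, run_minutes, review_minutes):
    """Expected time lost to a bad task specification.

    Simplifying assumption (not from the research): if the spec is wrong,
    the entire run and the review of its output are thrown away.
    """
    return p_spec_error * (run_minutes + review_minutes)


# Hypothetical inputs: the same 20% chance of a flawed spec in both cases.
short_run = expected_wasted_minutes(0.2, run_minutes=10, review_minutes=5)
long_run = expected_wasted_minutes(0.2, run_minutes=120, review_minutes=60)

print(round(short_run, 1), round(long_run, 1))
print(round(long_run / short_run, 1))  # how many times costlier the long run is
```

With these invented inputs the long run is twelve times more expensive in expectation, even though the chance of a flawed specification is identical. The only lever the engineer controls in this toy model is that probability, which is to say the quality of the decomposition.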
Building for the Middle Loop
If the middle loop is where engineers now spend a significant and growing fraction of their time, it deserves the same kind of tooling investment that the inner loop has received. The inner loop has linters, formatters, language servers, test frameworks, debuggers, and profilers. The middle loop has, mostly, the same code review interfaces that existed before AI tools.
What would tooling built for supervisory engineering actually look like? Probably something that makes AI output provenance visible, so reviewers know which parts of a diff were written by a model and which were written or edited by a human. Probably something that surfaces uncertainty, flagging regions of generated code where the model’s confidence was lower or where the problem specification was ambiguous. Probably better integration between the task description, the generated output, and the test results, so the verification loop has a coherent view of what was asked and what was produced.
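One way to picture that is as a data structure. The Python sketch below is purely illustrative: the class names, the fields, and the 0.7 threshold are all hypothetical, and the confidence score in particular assumes a signal that current tools may not expose. It ties a task description to provenance-tagged diff regions and test results, then picks out the regions a reviewer should read first.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Literal, Optional


@dataclass
class Hunk:
    path: str
    author: Literal["model", "human"]   # provenance of this region of the diff
    confidence: Optional[float] = None  # hypothetical model-reported score in [0, 1]
    spec_ambiguous: bool = False        # hypothetical flag: the task description was unclear here


@dataclass
class ReviewBundle:
    task_description: str
    hunks: List[Hunk] = field(default_factory=list)
    tests_passed: Dict[str, bool] = field(default_factory=dict)

    def needs_attention(self, threshold: float = 0.7) -> List[Hunk]:
        """Model-written hunks that are low-confidence or tied to an ambiguous spec."""
        return [
            h for h in self.hunks
            if h.author == "model"
            and (h.spec_ambiguous
                 or (h.confidence is not None and h.confidence < threshold))
        ]


bundle = ReviewBundle(
    task_description="Add retry with backoff to the HTTP client",
    hunks=[
        Hunk("client.py", "model", confidence=0.55),
        Hunk("client.py", "human"),
        Hunk("config.py", "model", confidence=0.9, spec_ambiguous=True),
        Hunk("tests/test_client.py", "model", confidence=0.95),
    ],
    tests_passed={"tests/test_client.py": True},
)

print([h.path for h in bundle.needs_attention()])  # ['client.py', 'config.py']
```

The design choice worth noting is that the task description, the diff, and the test results live in one record, so the reviewer's view starts from what was asked rather than from the diff alone.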
Vella’s framing of supervisory engineering work is a useful clarification at a moment when a lot of conversation about AI and software development stays at the level of productivity statistics and benchmark scores. The aggregate numbers matter. The structural shift in what engineers do every day matters more.