Annie Vella spent time studying 158 professional software engineers and their AI-assisted workflows, and her central finding is one that a lot of people sense without having a clean name for. Martin Fowler picked up on her work in a recent post and added a structural frame that clarifies the phenomenon: engineers are shifting from what Fowler calls the inner loop (write, build, test, debug) to a new middle loop that sits between that and the outer loop of commit, review, CI/CD, and deploy. The middle loop is supervisory work, and it runs on a timescale of minutes to an hour: prompt, evaluate, accept or reject, re-prompt.
The conversations about this shift have mostly focused on two challenges: the volume problem (AI generates code faster than humans can write it, which expands the surface area you’re reviewing) and the domain knowledge problem (catching subtle errors requires deep familiarity with the problem space). Both are real. But there is a third challenge that receives less attention and that I think is actually more structurally interesting: the confidence problem.
What Makes Human Code Review Legible
When you review code written by a human colleague, the code itself carries epistemic signals. A junior developer’s implementation of an unfamiliar pattern tends to look uncertain: variables named temp or result, functions that handle the happy path cleanly and accumulate awkward special cases around the edges, comments that explain what the code does rather than why. A senior developer working in a domain they know well tends to write differently: the edge cases are anticipated and named, the structure reflects a mental model of the problem, the abstractions chosen have a particular shape that comes from experience.
These signals are not infallible guides to correctness, but they are information. They tell you where to look more closely. A reviewer who spots a function with three levels of nested conditionals and a variable named x knows to slow down. The code’s surface reflects the state of understanding that produced it.
AI-generated code does not have this property. The confidence of the output is nearly uniform across correct and incorrect generations. A function that correctly implements a binary search and a function that has an off-by-one in the boundary condition look, at a glance, structurally identical. Both are cleanly formatted. Both have descriptively named variables. Both fit the surrounding code style. The legibility markers that signal caution in human code are absent.
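As a minimal sketch of that point (illustrative functions, not output from any particular model): two binary searches that are visually indistinguishable, one correct and one with a boundary off-by-one.

```python
def find_index(items, target):
    """Correct binary search: returns the index of target in a sorted list, or -1."""
    low, high = 0, len(items) - 1
    while low <= high:              # inclusive bound: the low == high case is checked
        mid = (low + high) // 2
        if items[mid] == target:
            return mid
        elif items[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1


def find_index_subtly_wrong(items, target):
    """Identical formatting, naming, and structure -- but `low < high` exits
    before the final candidate is examined, so a target sitting at the last
    narrowed position is reported as absent."""
    low, high = 0, len(items) - 1
    while low < high:               # off-by-one: never checks low == high
        mid = (low + high) // 2
        if items[mid] == target:
            return mid
        elif items[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1
```

A reviewer skimming a diff sees the same clean surface in both; only actively probing the boundaries (a one-element list, a target at the last position) separates them.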
This is not a bug in the AI; it follows from how these models work. Language models are trained to produce text that is stylistically consistent with high-quality training data. The model has no persistent representation of uncertainty about a specific implementation choice that would surface as code that looks uncertain. Uncertainty, if it appears at all, shows up in the probability distribution over tokens, not in the surface form of the output. By the time you’re reading the diff, that distribution is invisible.
The Calibration Gap
Traditional code review lets reviewers use authorship as a prior. You know which engineers write rock-solid concurrent code and which ones produce subtle race conditions, not because of favoritism but because of observed history. That prior informs how deeply you probe different parts of a PR.
With AI-generated code, you cannot use authorship as a prior in the same way. You can use tool-level priors: specific models have known tendencies toward specific failure modes. Models trained primarily on Python tend to produce Python idioms that don’t translate cleanly when generating other languages. Copilot-era models were notoriously unreliable about security-sensitive constructs like query parameterization. The current generation hallucinates library APIs at a lower rate than earlier generations but still does it in areas with sparse training data.
Building this calibration takes time and requires deliberately tracking errors. When AI output is wrong, the productive question is not just “what did it get wrong” but “is this a class of error I should be watching for from this tool on this type of task.” That kind of cataloguing turns individual errors into a personal map of a tool’s failure topology.
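One lightweight way to do that cataloguing (a purely illustrative structure of my own; nothing here comes from Vella's or Fowler's work) is to log each caught error with the tool, the task type, and the error class, then tally the triples:

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ObservedError:
    tool: str          # which assistant/model produced the code
    task: str          # e.g. "async-io", "sql-migration", "auth"
    error_class: str   # e.g. "missing-error-path", "hallucinated-api"


def failure_topology(log):
    """Tally (tool, task, error_class) triples so that recurring
    combinations surface as the classes to actively interrogate."""
    return Counter((e.tool, e.task, e.error_class) for e in log)
```

The most frequent triples become the personal map of where a given tool fails on a given kind of task.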
The Stanford 2022 research on AI coding assistants and security found that developers using these tools introduced more security vulnerabilities than those who were not using them. The explanation wasn’t that the developers were less careful; it was that the AI-generated code omitted defensive checks experienced developers write by habit, and the reviewers didn’t catch the omissions. That is an uncalibrated supervision failure: the reviewers were not looking for what was missing; they were evaluating what was there.
Watching for absence is different from watching for incorrectness, and it requires knowing in advance what should be present. AI tools will generate a function that handles the success case cleanly and omit the cleanup on the error path. They will generate authentication logic that compares passwords without timing-safe comparison. They will generate a migration that works fine on a small table and deadlocks under load on a large one. None of these look wrong; they look like plausible code that is missing something. You only catch the absence if you know to look.
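A concrete sketch of the first pattern (hypothetical functions, invented for illustration): both versions read as complete, but only one releases the file handle when serialization fails.

```python
import json


def write_report_plausible(path, data):
    """Reads as complete: open, write, close. But if json.dumps raises,
    close() is never reached -- the cleanup on the error path is absent."""
    f = open(path, "w")
    f.write(json.dumps(data))
    f.close()


def write_report_correct(path, data):
    """The absence-check a reviewer must know to make: does every path
    that acquires the resource release it? The context manager guarantees
    the handle is closed even when serialization raises."""
    with open(path, "w") as f:
        f.write(json.dumps(data))
```

Nothing in the first version looks wrong; the bug is the line that isn't there.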
What Calibrated Supervision Looks Like
The engineers I’ve seen supervise AI output most effectively have developed specific, conscious interrogation patterns. Not generic “review the code carefully” discipline, but targeted questions they ask about specific classes of output.
For authentication code: was secrets.compare_digest or its equivalent used for any comparison that involves user-provided data? For database access in Python: is the filter applied in the query or in application code after fetching? For any async code: does every code path that acquires a resource release it, including on exception? These are not checks that emerge from reading carefully; they are checks that emerge from having been burned by, or having debugged, each of these failure modes.
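The first of those checks, spelled out (a minimal sketch; the function names are mine):

```python
import secrets


def token_matches_naive(supplied: str, expected: str) -> bool:
    """What generated auth code often looks like: `==` short-circuits at the
    first differing character, so response time leaks how long a prefix of
    the token matched -- a timing side channel."""
    return supplied == expected


def token_matches_safe(supplied: str, expected: str) -> bool:
    """The interrogation target: a constant-time comparison that takes the
    same time whether the first character differs or the last."""
    return secrets.compare_digest(supplied, expected)
```

Both functions return the same answers on every input; the difference is invisible to behavioral testing, which is exactly why it has to be checked for rather than discovered.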
This is calibration in the same sense that a security auditor has a list of checks they apply to code regardless of who wrote it. The list is built from known vulnerability patterns and updated as new patterns are discovered. A supervisory engineer’s calibration checklist is personalized to the specific tool they are using and the specific domain they are working in, but the structure is the same: known failure modes, actively interrogated.
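Such a checklist can even be partially mechanized. This is a toy sketch of the structure (the triggers are crude textual heuristics I made up, meant to say "slow down and look here", not to prove a bug; real checks would be AST- or semantics-based):

```python
import re

# Each entry: (known failure mode, textual trigger that warrants a closer look).
CHECKLIST = [
    ("timing-unsafe secret compare", re.compile(r"==\s*expected")),
    ("SQL built with an f-string",   re.compile(r"execute\(\s*f[\"']")),
    ("bare except swallows errors",  re.compile(r"except\s*:")),
]


def review_flags(source: str) -> list[str]:
    """Return the checklist items whose trigger appears in the source text."""
    return [name for name, pattern in CHECKLIST if pattern.search(source)]
```

The value is not in the crude triggers but in the shape: a personal, updatable list of known failure modes applied to every generation regardless of how confident it looks.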
The METR 2025 study of experienced developers working on real tasks in mature open-source repositories found the opposite of what its participants expected: with AI assistance they were roughly 19% slower on average, even though they estimated the tools had sped them up. Uncalibrated supervision is a plausible contributor to that gap: subtle errors allowed through at generation time cost more to find and fix downstream than the generation saved.
The Uncomfortable Implication
Fowler notes that Vella’s research concluded in April 2025, before the latest generation of models substantially improved software development capabilities. His assessment is that the improvements accelerate the shift to supervisory engineering rather than reversing it. Better models mean more of the inner loop gets absorbed, not less, which means the middle loop grows in scope and importance.
But better models also mean the calibration problem gets harder, not easier. A model that generates more correct code, more confidently, provides less feedback signal about where its failures cluster. The errors become rarer, so each one is encountered less often and calibration takes longer to build. When a tool is bad enough that it makes obvious errors regularly, you learn quickly where not to trust it. When a tool is good enough that its errors are infrequent and subtle, the calibration work is harder and the cost of an uncalibrated mistake is higher.
Vella’s term, supervisory engineering, is useful because it names something that was happening without a name. The middle loop framing is useful because it clarifies the structural position of this work. What remains is understanding what makes supervision competent versus incompetent, and confidence calibration is a significant part of that answer. You cannot review code well if you don’t know where to look, and knowing where to look, for AI-generated code, is a skill you build by tracking errors carefully over time.
That is less novel as a concept than it sounds. It is, in essence, the same thing that makes experienced engineers good at code review in general. The difference is that the failure modes are different, the signals that flag uncertainty are absent, and the calibration has to be rebuilt from scratch for each tool in each domain. Nobody yet has the decades of accumulated instinct for this that they have for human-authored code. We are all, at the moment, relatively early in the calibration curve.