There’s a failure mode I keep running into when using AI coding tools, and I’ve started to think it’s the central failure mode: the code looks exactly like what I wanted, compiles cleanly, passes a quick read-through, and is subtly wrong in a way that only surfaces later.
This is the argument at the heart of a sharp piece making the rounds on Lobsters. LLMs don’t reason about correctness. They generate the most statistically plausible token sequence given your prompt. For common patterns — CRUD routes, sorting algorithms, string manipulation — plausible and correct overlap enough that you barely notice the difference. But at the edges, they diverge, and the model has no reliable way to know it has crossed that line.
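String manipulation is a good miniature of this divergence. Here's a sketch (function names are mine, purely illustrative): a string reversal that is exactly what the common pattern looks like, correct for ASCII, and wrong at the edges — `split('')` splits surrogate pairs, so any astral character like an emoji gets scrambled.

```javascript
// The pattern-matched answer: plausible, idiomatic, correct for the common case.
function reverseNaive(s) {
  return s.split('').reverse().join(''); // splits surrogate pairs in half
}

// An edge-aware version: spreading a string iterates by code point, so
// surrogate pairs stay intact. (Combining marks can still reorder; fully
// correct reversal needs grapheme segmentation.)
function reverseByCodePoint(s) {
  return [...s].reverse().join('');
}

console.log(reverseNaive('hello'));           // 'olleh' — plausible and correct agree
console.log(reverseByCodePoint('a👍b'));      // 'b👍a'
console.log(reverseNaive('a👍b') === 'b👍a'); // false — the surrogates got scrambled
```

Nothing in the naive version's shape signals the bug; you only see it when an input from outside the common case arrives.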
The Confidence Problem
What makes this genuinely dangerous isn’t the errors. It’s the presentation. A model that was uncertain would hedge, produce garbage output, or refuse. Instead it produces clean, idiomatic, well-commented code with the same confidence it brings to “hello world.” There’s no signal in the output that tells you which category you’re in.
I’ve hit this building Discord bots. I’ll ask for help with some edge case in the Discord.js event lifecycle, and I’ll get back something that looks completely reasonable — handles the right events, uses the right API shapes — but misunderstands something subtle about when certain caches are populated, or gets the order of operations wrong in a way that only fails under specific guild conditions. The code isn’t random noise. It’s a plausible interpolation from training data that doesn’t quite match my actual situation.
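To make that concrete, here's a minimal sketch of the cache pitfall, using stub objects rather than real discord.js ones (`fakeGuild` and `countHumansFromCache` are hypothetical names I'm introducing). The underlying behavior is real: in discord.js, `guild.members.cache` only holds members the client has actually seen, and without the GuildMembers intent or an explicit fetch it is usually incomplete.

```javascript
// Stand-in for a guild whose member cache is only partially populated —
// the normal state for a bot that hasn't fetched members explicitly.
const fakeGuild = {
  memberCount: 3, // what Discord reports for the guild
  members: {
    // only one member has been seen, so only one is cached
    cache: new Map([['1', { id: '1', user: { bot: false } }]]),
  },
};

// Plausible-but-wrong: counts humans by iterating the cache, silently
// assuming the cache mirrors the full member list.
function countHumansFromCache(guild) {
  let humans = 0;
  for (const member of guild.members.cache.values()) {
    if (!member.user.bot) humans += 1;
  }
  return humans; // undercounts whenever the cache is cold
}

console.log(countHumansFromCache(fakeGuild)); // 1, though the guild has 3 members
```

The code handles the right shapes and reads cleanly; the error lives entirely in an assumption about when the cache is populated, which is exactly the kind of thing a plausible interpolation gets wrong.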
What This Changes About How You Should Work
I don’t think the answer is to stop using LLMs for code. That’s not where I’ve landed. But I’ve changed how I treat the output:
- Assume correctness is unverified, not confirmed. Getting code that compiles is table stakes, not a green light.
- Write the tests yourself. If you let the model write both the implementation and the tests, you’re letting it grade its own work. It will write tests that confirm what it already wrote.
- Be more suspicious on novel or edge-case paths. Boilerplate has been pattern-matched a million times. Your specific edge case probably hasn’t.
- Read the code before you run it. This sounds obvious, but the speed of AI-assisted coding creates pressure to skip this step.
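The second point is worth spelling out. Here's a sketch of why (using `isLeapYear`, an illustrative example of mine, not from the article): the implementation below is the statistically common answer, and it is subtly wrong.

```javascript
// A plausible, common — and subtly wrong — implementation.
function isLeapYear(year) {
  return year % 4 === 0; // missing the century rules
}

// The tests a model would plausibly write alongside it mirror the
// implementation's own assumptions, so they all pass:
console.assert(isLeapYear(2024) === true);
console.assert(isLeapYear(2023) === false);

// A test you write from the domain rule — divisible by 4, except
// centuries, except multiples of 400 — catches the bug:
console.assert(isLeapYear(1900) === false, '1900 was not a leap year'); // this one fails
console.assert(isLeapYear(2000) === true);
```

If the model grades its own work, the first pair of tests is all you get, and the suite is green. The 1900 case only shows up because you derived it from the rule rather than from the code.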
The Broader Point
I think developers underestimate how much of their mental model of AI coding tools is based on the common case. The common case is fine. The uncommon case is where you discover the model was never reasoning at all — it was completing a pattern that happened to look like reasoning.
The article puts it plainly: the model doesn’t know what correct means in your domain. It knows what correct-looking code looks like in the aggregate. Those are not the same thing, and closing the gap between them is still your job.