
The Verification Tax: Why Working with LLMs Every Day Is Mentally Expensive

Source: hackernews

There is a specific feeling that comes from spending forty-five minutes prompting, correcting, re-prompting, and verifying output for a task that, if you had just done it yourself, would have taken twenty. The LLM was helpful. It also wore you out. Both things are true simultaneously, and that tension is what Tom Johnell’s post captures in a way that clearly resonated: it landed on Hacker News with 339 points and over 200 comments of recognition.

I want to dig into why this happens structurally, because the exhaustion is not random or personal. It comes from a specific mismatch between what LLMs are and what human cognition expects from tools.

Deterministic Tools Don’t Demand Ongoing Trust

When you use grep, curl, or a compiler, you spend cognitive budget once: learning the tool. After that, the cost is near zero. You know exactly what grep -rn 'foo' . will do. The mental model is stable, and once it is built, you can stop thinking about the tool and focus entirely on the problem.

LLMs break this pattern. Every invocation is a fresh negotiation. The same prompt, given twice, can produce outputs that differ in ways that matter. More importantly, you cannot look at an LLM output and know whether it is correct without independently verifying it. That verification cost does not go away with experience, no matter how familiar you become with the tool. The trust relationship has to be re-established with every response.

This is unusual in software. Even probabilistic systems like hash functions or network protocols have well-defined failure modes you can reason about statically. LLMs have failure modes that are statistically distributed, context-dependent, and impossible to predict for any individual call.

The Hallucination Tax

The most commonly cited exhaustion source is hallucination: models confidently asserting things that are not true. But the core problem is not the hallucinations themselves; it is the screening cost.

If a model produces incorrect output some percentage of the time, you have to verify all output to catch it. There is no way to know in advance which responses will be wrong. This means you carry a constant verification overhead on every interaction, regardless of how accurate the model generally is. A model that is right ninety-five percent of the time is still a model you can never fully trust, and that background cognitive task of checking whether a response is actually correct does not get cheaper with familiarity.
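The arithmetic here can be made concrete with a back-of-envelope cost model. This is an illustrative sketch, not a measurement; all of the numbers are assumptions chosen to mirror the forty-five-versus-twenty-minutes example from the opening.

```python
# Back-of-envelope model of the verification tax. All numbers are
# illustrative assumptions, not measurements.

def expected_cost(accuracy: float,
                  prompt_min: float,
                  verify_min: float,
                  redo_min: float) -> float:
    """Expected minutes per task when every output must be verified.

    You always pay for prompting and verification, because you cannot
    know in advance which responses are wrong; with probability
    (1 - accuracy) the output fails and you also pay to redo the work.
    """
    return prompt_min + verify_min + (1 - accuracy) * redo_min

# Even a 95%-accurate model carries the full verification cost:
with_llm = expected_cost(accuracy=0.95, prompt_min=5, verify_min=10, redo_min=20)
by_hand = 20.0  # doing the task yourself, per the article's framing

print(f"with LLM: {with_llm:.0f} min, by hand: {by_hand:.0f} min")
```

The point the sketch makes is that `verify_min` is paid on every interaction regardless of `accuracy`, so improving the model shrinks only the redo term, not the tax itself.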

Compare this to a colleague. You build a mental model of their reliability over time; you learn which domains they are sharp in and which they are sloppy in, and you calibrate your review effort accordingly. With LLMs, calibration helps a little. Code generation in popular languages tends to be more reliable than answers about obscure APIs. But the variance within any given domain remains high enough that you cannot ease off the verification loop.

There is also no failure signal. Incorrect LLM output often reads exactly like correct LLM output. The prose is fluent, the code is syntactically valid, the explanation sounds authoritative. The absence of any surface-level marker of uncertainty means you cannot triage: you have to treat every output as potentially wrong.

Context Is Your Problem, Not the Model’s

Another consistent drain is context management. LLMs do not retain state between sessions. They do not know your codebase, your conventions, your past decisions, or what you tried yesterday that did not work. Every new conversation starts from scratch, and getting the model to a point where its outputs are useful requires re-establishing all of that context first.

For a quick one-off question this is fine. For ongoing work in a complex system, it is overhead that compounds. You end up maintaining a mental prompt preamble that you re-insert at the start of every session: the architecture, the constraints, the things that matter. That work is invisible in productivity estimates but real in cognitive budget.
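That recurring preamble can be made literal rather than mental. A minimal sketch, assuming you keep the standing context as data and prepend it at session start; every field name and value below is hypothetical, and the point is only that this assembly is manual, recurring work:

```python
# A sketch of the "prompt preamble" described above: the context you
# re-establish at the start of every session. All section names and
# contents here are hypothetical placeholders.

PREAMBLE_SECTIONS = {
    "architecture": "Monorepo; services talk over gRPC; Postgres is the only store.",
    "conventions": "Python 3.12, type hints required, no new deps without review.",
    "history": "Tried async workers last sprint; dropped them over ordering bugs.",
}

def build_preamble(sections: dict) -> str:
    """Join the standing context into one block to prepend to a session."""
    return "\n".join(f"[{name}] {text}" for name, text in sections.items())

print(build_preamble(PREAMBLE_SECTIONS))
```

Keeping this current is itself a maintenance task, which is the trade the next paragraph describes.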

Tools like RAG pipelines, persistent system prompts, and memory layers try to address this, and they help at the margins. But they introduce their own complexity. Now you are also managing the context management infrastructure: deciding what to include, keeping it current, and debugging failures that stem from what the model does or does not know at invocation time. The abstraction gains you something and costs you something else.

The Agreement Problem

Models optimize for appearing helpful. This means they tend toward agreement, toward answering the question you asked rather than the question you should have asked, and toward producing output that looks correct rather than output that is correct.

When you are uncertain whether an approach is sound and you ask the model to help you implement it, the model will help you implement it, confidently, even when the approach is flawed. This is subtler than hallucination and harder to defend against. You can check facts with external sources; checking whether the framing of a problem is fundamentally misguided is harder, especially when the model is actively reinforcing that framing in every response.

Experienced practitioners work around this explicitly, by asking the model to argue against their approach, identify failure modes, or steelman alternatives. That helps, but it is overhead. You are running two conversations where you used to run one, and the second conversation exists specifically to counteract a tendency baked into the first.
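The two-conversation pattern can be sketched as a pair of deliberately separated prompts. This is a shape, not an implementation: `ask_model` below is a placeholder stub standing in for whatever LLM client you actually use.

```python
# A sketch of the "second conversation" pattern: after the helpful pass,
# explicitly prompt the model to attack the approach. `ask_model` is a
# placeholder; substitute your real client call.

def ask_model(prompt: str) -> str:
    # Stub so the sketch is self-contained; returns an echo of the prompt.
    return f"<model response to: {prompt[:40]}...>"

def implement_with_counterargument(approach: str) -> tuple:
    """Run the helpful pass and the adversarial pass as separate prompts."""
    draft = ask_model(f"Help me implement this approach: {approach}")
    critique = ask_model(
        "Argue against the following approach. List concrete failure modes "
        f"and steelman at least one alternative: {approach}"
    )
    return draft, critique

draft, critique = implement_with_counterargument("cache invalidation via TTL only")
```

Separating the prompts matters: folding the critique request into the implementation request tends to get you a token objection followed by the implementation you asked for.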

Supervision Is Not the Same as Using a Tool

The underlying issue that the LLM fatigue discussion keeps circling around, even when it does not name it directly, is that using an LLM is not like using a tool. It is like supervising a fast, creative, unreliable contractor.

Tool use has a known cognitive shape: learn it once, apply it repeatedly at low cost. Supervision has a different shape: ongoing attention, judgment about what to delegate, verification of output, correction of errors, and the accumulated tiredness that comes from never being able to fully hand something off. Good supervision looks like code review, and code review is cognitively expensive even when the code is mostly good.

This is not a criticism of the technology. It is a description of what the technology actually is. The problem is that the marketing framing around AI coding tools presents them as tools in the zero-ongoing-cost sense, which sets an expectation they cannot meet. The productivity gains are real in specific contexts. So is the supervision overhead.

Where the Workflow Actually Breaks

The fatigue is not evenly distributed. It concentrates in workflows where the review step has been compressed or skipped, either because the model seemed reliable enough or because moving fast felt more important than moving carefully.

Workflows that hold up well are the ones with a clear boundary: the model produces a first draft, a human reviews and owns the result. Boilerplate generation, test scaffolding, first-pass documentation, explaining unfamiliar error messages; these have high signal-to-noise ratios and relatively low verification cost. The model is usually close enough, the stakes of a minor error are bounded, and the time savings are genuine.

Workflows that collapse are the ones where model output feeds directly into something consequential without a review gate, or where the model is being used for tasks that require deep contextual judgment about a system it does not actually know. The tooling is good enough that it feels like it should generalize everywhere, and the gap between “feels like it should generalize” and “actually generalizes reliably” is where the exhaustion accumulates.

The verification step is load-bearing. Removing it to move faster is how you end up with subtle bugs and, eventually, the kind of accumulated trust damage that makes the whole workflow feel worse than just doing the work directly.
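One way to make that load-bearing step structural rather than optional is a hard gate: model output cannot reach anything consequential without explicit human sign-off. A minimal sketch, with the approval mechanism and the patch content as hypothetical placeholders:

```python
# A sketch of a review gate: model output is refused release unless a
# human has signed off. The approval mechanism here is a placeholder.

def review_gate(model_output: str, approved_by=None) -> str:
    """Release model output only after explicit human approval."""
    if approved_by is None:
        raise PermissionError("model output requires human review before use")
    return model_output

patch = "def add(a, b): return a + b"  # hypothetical model-generated code
released = review_gate(patch, approved_by="reviewer@example.com")
```

The value of making the gate explicit is that skipping review becomes a visible decision someone has to make, rather than a default that happens when everyone is busy.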

The fatigue Johnell describes is real, and it is not going away as models improve. Better models reduce hallucination rates and expand reliable domains, but they do not change the fundamental shape of the relationship: you are supervising something that requires ongoing attention, and ongoing attention is finite. The question is whether the supervision overhead is worth what you get in return, which depends entirely on the task, and on being honest with yourself about which category your current task actually falls into.
