
The Verification Tax Nobody Warned You About

Source: hackernews

There is a specific kind of tired that comes after a long session working alongside an LLM. It is not the tired of having done a lot of hard thinking. It is the tired of having done a lot of checking. Those are different experiences, and the second one feels worse.

Tom Johnell’s post ‘LLMs Can Be Absolutely Exhausting’ landed on the front page of Hacker News with 339 points and over 200 comments, which tells you the experience is widely shared. The exhaustion he describes is real, and I think it deserves more precise language than ‘AI is frustrating sometimes.’

The specific mechanism is what I’d call the verification tax: the cognitive overhead imposed by a system that produces output which is fluent, confident, and structurally plausible, but which may be subtly or catastrophically wrong in ways that are not flagged. You cannot skim LLM output the way you skim documentation. You have to read it carefully, hold it against your own model of the problem, cross-check the parts that seem off, and decide whether the generated code is correct or merely looks correct. That process burns mental energy at a rate that compounds over a long session.

Why Verification Is Uniquely Draining Here

Verification is not new to software development. Code review exists. Reading other people’s pull requests is cognitively demanding. But there are properties of LLM output that make its verification harder than most.

First, the confidence calibration is broken. A human colleague who is uncertain about something usually signals that uncertainty: they hedge, they ask, they leave a comment saying ‘I’m not totally sure about this part.’ LLMs do not do this reliably. A model will describe a nonexistent API with the same prose cadence it uses to describe a real one. The hallucination problem is well-documented academically, but the practical consequence is that you cannot develop a reading strategy based on ‘this part sounds confident so it’s probably fine.’ Everything sounds confident. So everything needs to be checked.

Second, errors are often local and semantically subtle rather than syntactically obvious. If a colleague writes code that fails to compile, you notice immediately. LLM-generated code that does the wrong thing at runtime, or handles a specific edge case incorrectly, or uses a deprecated API that still works but will break in the next library version — that requires domain knowledge to catch, and catching it requires active attention rather than passive scanning.

Third, the output volume is high. A single prompt can produce two hundred lines of code. Reviewing two hundred lines of LLM output is not the same as writing two hundred lines yourself, but it is not free either. Multiply that across a full working day and you have done an enormous amount of reading and checking work, most of which is invisible in any retrospective on what you ‘produced.’

The Verification Paradox

There is a version of this problem that is almost philosophical. If you know enough about a domain to accurately verify LLM output in that domain, you probably could have written the output yourself. The LLM is saving you keystrokes while adding a verification step. Depending on the task, that can be a net positive or a net negative.

For boilerplate, for scaffolding, for code patterns you know well and can verify quickly, the tradeoff is clearly in your favor. For novel problems in unfamiliar territory, the tradeoff gets murkier. You are less equipped to catch errors, which means you need to verify more carefully, which means the verification tax is higher precisely where your baseline productivity is already lower.

This is one reason experienced developers often report more LLM fatigue than junior developers do, at least initially. Juniors may not catch the subtle errors, which means they unknowingly accept them. Experienced developers catch more errors, which means they do more verification work, which is tiring. There is a grim irony in that structure.

Context Drift and Conversation Management

A separate but related source of exhaustion is context management. Long LLM conversations degrade. The model loses track of constraints established early in the session, repeats suggestions you already rejected, or starts solving a slightly different problem than the one you originally described. This is well-studied: attention over long contexts in transformer models is not uniform, and the beginning and end of a context window are attended to more reliably than the middle.

The practical consequence is that you develop a meta-skill around managing conversations: keeping them short, reanchoring context explicitly, knowing when to start a fresh session versus continuing. This meta-skill is real work. It requires you to maintain a model of the model’s current state, which is a second cognitive process running in parallel with the actual task.

Some teams handle this with structured prompting practices: system prompts that restate key constraints, conventions around session length, explicit state serialization between sessions. These are reasonable adaptations, but they are overhead. You are now maintaining tooling for your tooling.
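The explicit state serialization mentioned above can be as simple as writing down the constraints and decisions a session established and replaying them as a preamble when you start fresh. A minimal sketch, where the file path, field names, and helper functions are all hypothetical conventions rather than any established tool:

```python
import json
from pathlib import Path

# Hypothetical location for carried-over session state.
STATE_FILE = Path("session_state.json")

def save_session_state(constraints: list[str], decisions: list[str]) -> None:
    """Persist the constraints and decisions worth carrying into a fresh session."""
    STATE_FILE.write_text(json.dumps(
        {"constraints": constraints, "decisions": decisions}, indent=2
    ))

def build_reanchor_prompt() -> str:
    """Turn saved state into a preamble that restates context for a new session."""
    state = json.loads(STATE_FILE.read_text())
    lines = ["Context carried over from the previous session:"]
    lines += [f"- Constraint: {c}" for c in state["constraints"]]
    lines += [f"- Decision already made: {d}" for d in state["decisions"]]
    return "\n".join(lines)
```

The point is not the particular format; it is that the reanchoring happens mechanically at session start instead of relying on you to remember every constraint under fatigue.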

The Trust Calibration Problem

Human-automation interaction research has a concept called automation bias: the tendency to over-rely on automated system output and under-apply independent judgment. The classic studies come from aviation, where pilots would sometimes follow autopilot into error states that a manual pilot would have caught. The effect is real and well-replicated.

LLMs create a variant of this problem. After enough sessions where the output was good, you start reading it less carefully. Then you get burned by a hallucinated function signature or a logic error in generated business rules. So you overcorrect and read everything with maximum scrutiny. Then you find yourself exhausted by the overhead. Then you start skimming again. The calibration oscillates without ever settling, because the model’s error rate is not stable across domains or prompt types in a way that lets you build reliable intuition.

This is not a problem you solve once. It is a background process that runs continuously.

What Actually Helps

I have found a few things that reduce the verification tax without eliminating the benefits.

Smaller, more constrained prompts produce output that is faster to verify. Asking for a single function rather than a module, asking for an outline before asking for prose, asking for pseudocode before asking for implementation — these produce smaller units of output that are easier to hold in working memory while checking.

Test-first workflows shift the verification burden to the test runner rather than your eyes. If you have a clear spec expressed as tests, you can generate implementation against that spec and let the tests do most of the verification. This does not work for everything, but for well-defined algorithmic tasks it materially reduces the cognitive load of checking.
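The test-first shape can be sketched concretely. Here the spec for a hypothetical `slugify()` helper is pinned down as assertions before any implementation exists, and whatever implementation arrives (hand-written or generated) is accepted or rejected by running them:

```python
import re

# The spec, expressed as tests, written before the implementation.
def test_slugify():
    assert slugify("Hello World") == "hello-world"
    assert slugify("  trim me  ") == "trim-me"
    assert slugify("Already-Slugged") == "already-slugged"

# An implementation generated against that spec. Whether it is correct
# is decided by the test runner, not by careful line-by-line reading.
def slugify(text: str) -> str:
    text = text.strip().lower()
    return re.sub(r"[^a-z0-9]+", "-", text).strip("-")

test_slugify()
```

The verification is not eliminated, but it is front-loaded into writing the assertions, which you do once, instead of re-reading every regenerated implementation.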

Explicit uncertainty prompting helps inconsistently. Asking the model to flag parts it is less certain about, or to list assumptions it is making, produces more useful output sometimes. The technique is not reliable enough to replace verification, but it surfaces issues you might otherwise have to discover yourself.
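One way to apply this consistently is to append the uncertainty request mechanically rather than remembering to type it each time. A trivial sketch, where the wording of the suffix is just one plausible phrasing, not a validated prompt:

```python
# One plausible phrasing of an uncertainty request, appended to every task.
UNCERTAINTY_SUFFIX = (
    "\n\nAfter your answer, add a section titled 'Assumptions and uncertainty' "
    "listing: (1) any assumptions you made, (2) the parts you are least "
    "confident about, and (3) anything that should be checked against "
    "current documentation."
)

def with_uncertainty_prompt(task: str) -> str:
    """Append an explicit request for self-reported uncertainty to a task prompt."""
    return task + UNCERTAINTY_SUFFIX
```

As the paragraph above says, the self-report is not reliable enough to replace verification; it is a cheap way to surface some issues earlier.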

Knowing when to step away from the model entirely is underrated. For problems that require extended, original reasoning, for designs that need to hold together across many constraints, for anything where you need to think deeply rather than produce text quickly, the LLM often adds more noise than signal. The fatigue of LLM work is partly the fatigue of pushing a tool into tasks it is not suited for, and the better calibration is sometimes to close the chat window.

The Longer View

None of this is an argument against using LLMs. The productivity gains on appropriate tasks are real. But the discourse around AI-assisted development has been dominated by announcements and demos, and the honest accounting of costs has lagged behind. The verification tax is a real cost. Context management is a real skill that takes time to develop. Automation bias is a real risk.

The exhaustion that Johnell describes, and that hundreds of HN commenters recognized in themselves, is not a bug you can patch away. It is a structural feature of working with systems that generate plausible output without reliable uncertainty signaling. Acknowledging that clearly is the starting point for managing it well.
