The Verification Tax: What LLM-Assisted Development Actually Costs

Source: hackernews

There is a post by Tom Johnell making the rounds, titled “LLMs can be absolutely exhausting”, that picked up around 250 points on Hacker News and generated 167 comments. The piece landed because it names something that productivity-focused LLM discourse tends to skip: the hidden costs that accrue on the other side of the ledger, slowly, in ways that don’t show up in benchmark graphs.

I’ve been building Discord bots and doing systems work with heavy LLM assistance for long enough to have formed some opinions here. The costs are real, they’re cumulative, and they’re worth naming precisely.

The Verification Tax

Every piece of LLM-generated code comes with an implicit obligation to verify it. This seems obvious, but the size of that obligation is easy to underestimate because the code looks right. LLMs produce locally coherent output: the syntax is clean, the variable names are reasonable, the structure follows familiar patterns. The errors live in the gaps between the surface and the semantics.

Consider something I hit repeatedly in bot development. Discord’s Gateway protocol requires careful handling of the identify rate limit, which allows at most one identify payload per 5 seconds, combined with the heartbeat cycle that must not be disrupted during reconnection. LLMs generate reconnection logic that looks structurally correct, handles the obvious cases, and fails exactly when the network degrades in ways the happy path doesn’t exercise. The code passes inspection if you’re reading for correctness in isolation. It fails when you need it to work.
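To make the failure mode concrete, here is a minimal sketch of the identify-pacing half of that logic using asyncio. The class and function names are mine, not Discord's or any library's, and this deliberately omits the rest of the Gateway lifecycle; the point is that identify pacing and heartbeating are separate concerns, which is exactly the separation generated reconnect code tends to blur.

```python
import asyncio
import time

class IdentifyLimiter:
    """Enforces Discord's Gateway identify rate limit:
    at most one identify payload per 5-second window."""

    def __init__(self, min_interval: float = 5.0):
        self.min_interval = min_interval
        self._last_identify = float("-inf")  # monotonic time of last identify

    async def wait(self) -> None:
        # Sleep only for the remainder of the window, not a full 5 s,
        # so repeated reconnects don't stack unnecessary delay.
        remaining = self._last_identify + self.min_interval - time.monotonic()
        if remaining > 0:
            await asyncio.sleep(remaining)
        self._last_identify = time.monotonic()

async def reconnect(limiter: IdentifyLimiter, send_identify) -> None:
    # Note what is NOT here: the heartbeat task. Cancelling and restarting
    # it belongs to the connection lifecycle, not to identify pacing --
    # conflating the two is the characteristic generated-code mistake.
    await limiter.wait()
    await send_identify()
```

A second reconnect attempt arriving one second after the first waits out the remaining four seconds instead of firing immediately or sleeping a fresh five.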

Verifying this kind of output requires holding two things in mind simultaneously: what the code is doing, and whether what it’s doing is correct for the context it will run in. That’s not lightweight review. For application-layer boilerplate, the tax is small. For anything touching concurrency, protocol behavior, or state management under failure conditions, the tax is substantial, and it accrues on every single generation.

The framing that LLMs save time is true in a narrow sense. The first draft appears faster. What often doesn’t get counted is the time spent on verification, the time spent debugging when verification fails, and the time spent recovering context when a session ends before the debugging is finished.

Context Fragmentation Is a Structural Problem

Current models have context windows that sound large: 128k tokens for GPT-4o, 200k for Claude, larger with some providers. In practice, these limits constrain real work in ways that force awkward choices.

A non-trivial debugging session fills context faster than it seems like it should. You paste the relevant files, the error output, the relevant documentation section, the previous failed attempts. By the time you’ve established enough shared understanding with the model to make progress on a hard problem, you’ve consumed a meaningful fraction of the window. When that session ends, for any reason, the context is gone.
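The arithmetic is easy to sketch. Using the rough four-characters-per-token heuristic (real tokenizers vary, and this is for budgeting only), a back-of-the-envelope accounting of a session's pasted material looks like this; the function names are mine:

```python
def rough_tokens(text: str) -> int:
    # Crude heuristic: ~4 characters per token for English prose and code.
    # Real tokenizers (tiktoken, etc.) differ; good enough for budgeting.
    return len(text) // 4

def budget_report(chunks: dict[str, str], window: int = 128_000) -> str:
    """Summarize how much of the context window the pasted material
    consumes before any conversation has happened."""
    used = sum(rough_tokens(t) for t in chunks.values())
    lines = [f"{name}: ~{rough_tokens(t):,} tokens" for name, t in chunks.items()]
    lines.append(f"total: ~{used:,} / {window:,} ({100 * used // window}% of window)")
    return "\n".join(lines)
```

A few source files, an error log, and a documentation excerpt routinely sum to tens of thousands of tokens before the first exchange, which is why the window feels smaller than the headline number suggests.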

Starting fresh in a new session means rebuilding that context from scratch. You repaste the files, re-explain the constraints, re-establish which approaches have already been tried. This isn’t a small overhead. For a problem that spans multiple sessions because it’s genuinely hard, you might spend more time reconstructing context than making progress.

Tools like Aider address part of this with repo maps: compressed representations of codebase structure built from tree-sitter parse trees, giving the model navigational context without requiring you to paste full files. This helps with orientation but doesn’t solve the more fundamental problem, which is that the model’s understanding of your specific problem state, the failed approaches, the constraints that ruled them out, the partial progress, doesn’t persist.
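The idea is easy to illustrate in miniature. Aider's real repo maps come from tree-sitter parse trees plus a ranking pass; the sketch below does the simplest version of the same thing with Python's stdlib ast module, keeping signatures and discarding bodies. The function name is mine:

```python
import ast

def repo_map_entry(source: str, path: str) -> str:
    """One file's slice of a repo map: top-level function and class
    signatures only, no bodies -- enough for a model to navigate the
    file without the token cost of pasting it whole."""
    lines = [path + ":"]
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            lines.append(f"  def {node.name}({args})")
        elif isinstance(node, ast.ClassDef):
            lines.append(f"  class {node.name}")
            for item in node.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    args = ", ".join(a.arg for a in item.args.args)
                    lines.append(f"    def {item.name}({args})")
    return "\n".join(lines)
```

A few hundred tokens of signatures can stand in for tens of thousands of tokens of implementation, which is the entire trade.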

The session boundary is also where subtle losses happen that are hard to notice. A multi-day debugging effort that spans several sessions tends to drift: the later sessions don’t fully benefit from the reasoning in the earlier ones, and you can end up covering the same ground twice without realizing it because the model doesn’t remember and you don’t have a clean record of what was established.

The False Economy of Saved Time

The headline metric for LLM-assisted development is time saved. GitHub’s research measured a 55% reduction in task completion time for specific exercises. Those numbers hold for a real class of work: bounded tasks with clear specifications in well-trodden domains. They’re less informative about what happens when the task is harder.

Hallucinations in systems programming contexts have a specific failure signature. The model generates a plausible implementation that rests on a false premise about what an API guarantees, what a function’s semantics are, or what invariants a data structure maintains. The code compiles, tests may pass, and the failure materializes in production under conditions that tests didn’t cover.
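A toy example of the shape, chosen by me rather than taken from Johnell's post: re.match reads like a scan-the-string call, but it anchors at position zero; re.search is the scanning variant. Any test whose fixture happens to put the pattern at the start of the string will pass over the false premise:

```python
import re

def has_error_code(line: str) -> bool:
    # FALSE PREMISE: that re.match scans the whole string.
    # It actually matches only at the start; re.search scans.
    return re.match(r"ERR-\d{4}", line) is not None

assert has_error_code("ERR-0042 disk full")      # fixture: pattern at start, passes
assert not has_error_code("worker 3: ERR-0042")  # production: silently misses

def has_error_code_fixed(line: str) -> bool:
    return re.search(r"ERR-\d{4}", line) is not None

assert has_error_code_fixed("worker 3: ERR-0042")
```

Nothing about the buggy version looks wrong in review unless you already know the match/search distinction, which is the general character of these failures: the error lives in an API's semantics, not in the visible code.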

Debugging this kind of issue is expensive because you have to first establish that the implementation is wrong, then understand why it’s wrong, then understand what correct looks like, then implement correct. The LLM is helpful for parts of this process and a liability for others, specifically the parts where it confidently suggests that the original approach was actually fine, with a minor tweak.

I’ve spent longer debugging LLM-generated code than I would have spent writing equivalent code from scratch. Not every time, not even most of the time, but often enough that the time-saved framing requires qualification. The cases where the savings evaporate are precisely the cases where the work was hardest, which means they’re not edge cases you can safely ignore.

Confident and Wrong Is Harder Than Uncertain and Wrong

The emotional dimension of Johnell’s piece is the part that’s easiest to dismiss and the part that holds up most on reflection. Working with a tool that is confidently wrong is different from working with a tool that is uncertain, and the difference is cognitively taxing in a way that’s difficult to describe without sounding like you’re complaining about something trivial.

When a junior developer writes incorrect code, they usually know something is uncertain. They hedge, they ask, they flag the parts they’re not sure about. You calibrate your review accordingly. LLMs don’t hedge except as a stylistic choice. The model presents a fix for a race condition with the same tone as it presents a correct string split. Nothing in the presentation signals that this particular suggestion is outside the model’s reliable knowledge.

This means the cognitive burden of calibration falls entirely on you. You have to maintain your own model of where the LLM is reliable and where it isn’t, apply that model to every output, and never let the presentation of confidence do any of that work for you. Over a long session, or across many sessions over many weeks, this is wearing in the specific way that sustained vigilance is wearing.

There’s also a subtler effect. Correcting confidently stated errors repeatedly creates a mild form of friction that accumulates. The model doesn’t update. It will make the same category of error on the next generation, with the same confidence. You know this going in, but knowing it doesn’t make the fifth instance of the same kind of mistake any less tedious to correct.

This isn’t an argument against using LLMs. It’s an argument for being honest about what working with them requires.

Prompt Iteration Is Real Work

A workflow detail that rarely makes it into LLM productivity discussions is the cost of prompt iteration: the process of trying a prompt, getting an output that misses in a specific way, adjusting the prompt to address that failure, getting an output that misses differently, adjusting again.

For simple tasks, iteration is fast. For harder tasks, especially those involving non-deterministic behavior in concurrent code or protocol implementations, the iteration loop can run many cycles before converging on something worth keeping. Each cycle requires reading the new output in full, comparing it to what you wanted, identifying where it missed, and formulating the adjustment. This is genuine cognitive work, and it compounds when the task is in a domain where the output is hard to evaluate quickly.

Non-determinism compounds this further. Running the same prompt twice against the same model doesn’t guarantee the same output. The variance is usually small, but in systems programming contexts even small semantic variance matters. You might iterate toward a working solution, then rerun the prompt to regenerate a cleaned-up version and get something subtly different. It happens rarely, but often enough that every significant regeneration needs validation.
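One cheap form of that validation is to treat the regenerated version as a new implementation and smoke-test it against the old one on sampled inputs. A minimal sketch, with function names and the example variants invented by me for illustration:

```python
import random

def equivalent_on_samples(old, new, gen_input, trials=200, seed=0):
    """Smoke test for semantic drift after regenerating code from the
    same prompt: check the old and new implementations agree on sampled
    inputs. Returns (True, None) or (False, counterexample).
    Not a proof of equivalence -- a drift detector."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = gen_input(rng)
        if old(x) != new(x):
            return False, x
    return True, None

# Two regenerated variants of a chunker, differing only on the empty string:
def chunk_v1(s):
    return [s[i:i + 3] for i in range(0, len(s), 3)]

def chunk_v2(s):  # subtly different: yields [""] for empty input
    return [s[i:i + 3] for i in range(0, max(len(s), 1), 3)]

ok, bad_input = equivalent_on_samples(
    chunk_v1, chunk_v2, lambda r: "x" * r.randrange(6))
```

Random sampling catches exactly the kind of subtle divergence described above; for anything load-bearing, the existing test suite should run against the regenerated version as well.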

What This Means in Practice

None of this is an argument for abandoning LLM-assisted development. The productivity gains in the right categories are real and the tools are improving. It’s an argument for accounting honestly for the costs that tend to get omitted from the productivity story.

The verification tax is real and scales with task complexity. Context fragmentation is a structural limitation that current tools address only partially. The time savings are genuine but unevenly distributed, with the most expensive cases being exactly the cases where the savings evaporate. The emotional overhead of sustained vigilance against confident-but-wrong output is genuine and accumulates.

Johnell’s piece captures the texture of this experience accurately. The HN discussion around it split between people who recognized the description immediately and people who argued it reflects poor prompting technique. The second camp is applying a solution at the wrong level. Prompt technique helps at the margins. The costs described are structural properties of what these tools are, not artifacts of using them wrong.

The workflow that works, for me, involves explicit context budgets before starting a session, fresh conversations for bounded tasks rather than long threads that accumulate wrong turns, and a maintained list of domains where I’ve found the models unreliable enough to warrant extra scrutiny. This doesn’t eliminate the costs. It keeps them manageable.

The less exciting version of the LLM productivity story is that these tools make certain kinds of work faster while adding a category of overhead that didn’t exist before. The net is often positive. It isn’t always positive. That distinction matters when you’re choosing where to apply the tools and how much to trust what they produce.
