
The Verification Tax: What LLM-Assisted Development Actually Costs in Practice

Source: hackernews

Stavros Korokithakis published a detailed personal account of how he actually uses LLMs to write software, and it hit 500 points on Hacker News within hours. The engagement makes sense. Most writing on this topic falls into one of two unhelpful camps: breathless enthusiasm from people who have spent a week with Copilot, or blanket dismissal from people who tried it once, got a hallucination, and called it useless. A careful, honest workflow post from a working developer is rarer than it should be.

Reading it prompted me to think more carefully about my own patterns over the past year or so, and the thing I keep coming back to is not which tool to use or how to phrase prompts. It is the verification tax.

What the verification tax is

When you write code yourself, you understand it as you write it. The mental model builds incrementally. When an LLM writes code, you receive a finished artifact that you must now evaluate, understand, and decide to accept or reject. That evaluation is not free. For code you would have written quickly and confidently yourself, the time spent reading and validating the generated output can exceed the time you would have spent writing it in the first place.

This sounds like an argument against using LLMs, but it is not. It is an argument for being precise about where they deliver a net benefit, which is a different question from whether they are impressive.

The verification tax is not constant across code types. It scales with:

  • How unfamiliar you are with the domain
  • How much the generated code diverges from patterns you already know
  • How much the code interacts with stateful external systems
  • How performance-critical or security-sensitive the code is

Boilerplate that follows known patterns, parsing logic for well-specified formats, test fixtures, one-off scripts, documentation strings, migration files: these carry a low verification tax. The generated code is either obviously right or obviously wrong, and the cost of a miss is low. LLMs are genuinely great at this work.

Core business logic, concurrency primitives, protocol implementations, anything touching authentication or authorization: these carry a high verification tax. The generated code may look plausible and still be subtly wrong in ways that only surface in production. The code is harder to read critically because it is written in your codebase’s idiom but without your codebase’s history.

Where this shows up in practice

I write a lot of Discord bots. The surface area is mostly event handlers, some state management, a fair amount of API glue code. LLMs are useful for the glue and the boilerplate around the handlers, but I write the actual logic myself. Not because the LLM could not produce something that passes a quick read, but because the event sequencing in Discord’s gateway protocol has edge cases that only become apparent after you have shipped a few bugs. The LLM does not have that scar tissue. I do. Handing off that logic means I then have to reconstruct my intuition for it from the generated code, which is slower than just writing it.

On the systems side, the picture is similar. If I need a memory-mapped ring buffer for a toy project, I will often start with an LLM-generated skeleton, read it carefully, and rewrite the parts that look like generic solutions rather than solutions shaped to the actual constraints. What I am doing in that case is using the LLM as a fast first draft, not as a complete answer. The value is that I do not have to type out the boilerplate around the interesting part.
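To make that concrete, here is the kind of skeleton I mean: a minimal, hypothetical ring buffer over an anonymous memory map in Python. The length-prefixed framing and the wrap-to-start policy at the seam are illustrative assumptions, and the seam handling is exactly the generic part a first draft tends to get wrong for your actual constraints.

```python
import mmap
import struct

class MmapRingBuffer:
    """Toy ring buffer over an anonymous memory map.

    Records are length-prefixed with a 4-byte little-endian count.
    When a record would cross the end of the buffer, the write
    position wraps to the start rather than splitting the record.
    Illustrative sketch only; no reader bookkeeping, no locking.
    """

    def __init__(self, size: int = 4096):
        self.size = size
        self.buf = mmap.mmap(-1, size)  # -1 means anonymous mapping
        self.write_pos = 0

    def push(self, payload: bytes) -> None:
        record = struct.pack("<I", len(payload)) + payload
        if len(record) > self.size:
            raise ValueError("record larger than buffer")
        # Wrap to the start instead of splitting a record across the seam.
        if self.write_pos + len(record) > self.size:
            self.write_pos = 0
        self.buf.seek(self.write_pos)
        self.buf.write(record)
        self.write_pos += len(record)

    def read_at(self, offset: int) -> bytes:
        # Read back one length-prefixed record at a known offset.
        self.buf.seek(offset)
        (length,) = struct.unpack("<I", self.buf.read(4))
        return self.buf.read(length)

buf = MmapRingBuffer(64)
buf.push(b"hello")
print(buf.read_at(0))  # b'hello'
```

Everything interesting about a real version — concurrent readers, torn writes, whether wrapping may overwrite an unread record — is absent here, which is the point: the draft saves the typing, not the thinking.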

That distinction, first draft versus complete answer, is where most workflow posts undersell the nuance.

Context management is the actual skill

Tools like Aider, Cursor, and Claude Code all handle the mechanical parts of LLM integration with your codebase. The differentiator between a productive and an unproductive session is not which tool you are using; it is how much relevant context you have given the model before asking it to produce anything.

Aider has an explicit /add command for putting files into context. Cursor has @ references. Claude Code reads the filesystem directly within a working directory. The interface differs; the underlying principle does not. A model with good context produces code that fits. A model with poor context produces code that is generically correct but locally wrong, using the wrong abstractions, duplicating things that already exist, naming things inconsistently.

The failure mode I see most often, including in my own work, is asking the LLM to write something before explaining enough about the surrounding system. The output looks fine, the verification pass misses the local incorrectness because you are reading for logic rather than for fit, and then you find the problem three sessions later when you are working on something adjacent.

The fix is tedious but reliable: before asking for code, describe the constraints. Not the output you want, but the constraints on the solution. Where this code lives, what it calls, what calls it, what the existing conventions are. This is not natural for people who are used to writing code directly, because when you write code yourself the constraints are implicit in your head. Making them explicit for the LLM also makes them explicit for you, which occasionally reveals that you did not have a clear enough picture to write good code anyway.
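As a sketch, a constraints-first request might look like this. Every project detail here (the file paths, the helper, the Result convention) is invented for illustration; the shape is what matters.

```
Context: this function lives in billing/retry.py. It is called by the
webhook handler after a failed charge, and it calls stripe_client.charge().
Constraints:
- We already have an exponential backoff helper in util/backoff.py; use it,
  do not write a new one.
- Retries must be idempotent; the charge carries an idempotency key.
- Follow the existing convention of returning Result objects, not raising.
Now: write the retry wrapper.
```

Notice that the request itself is one line at the end. The rest is the implicit knowledge you would have used without thinking if you were writing the code yourself.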

The role of tests

Tests shift the verification tax calculation significantly. If you have a test suite that covers the behavior you care about, you can accept LLM-generated code with much less manual review. Run the tests, look at what fails, iterate. The tests become the specification that constrains the LLM’s solution space.

The catch is that LLMs will also generate tests, and LLM-generated tests often test the implementation rather than the behavior. They call the functions with the inputs the implementation handles well and assert on outputs the implementation produces. They do not test the contract; they test the current artifact. If you let the LLM write both the code and the tests, you end up with a self-consistent system that may not be correct.
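To make the distinction concrete, here is a hypothetical example: a small `slugify` function (the function and both tests are invented for illustration), one test that merely echoes an output the current code produces, and one that states properties of the contract instead.

```python
import re

def slugify(title: str) -> str:
    # Implementation under test (could be LLM-generated).
    # Lowercase, collapse runs of non-alphanumerics to "-", trim edges.
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

# Implementation-shaped test: asserts on an output the current code
# happens to produce, for an input it obviously handles.
def test_mirrors_implementation():
    assert slugify("Hello World") == "hello-world"

# Behavior-shaped test: states properties any correct slug must have,
# exercised on a messier input, rather than echoing one known output.
def test_contract():
    result = slugify("  Deja Vu: Part 2!  ")
    assert " " not in result           # no whitespace survives
    assert not result.startswith("-")  # no leading separator
    assert not result.endswith("-")    # no trailing separator

test_mirrors_implementation()
test_contract()
print("ok")
```

The first test passes for any implementation that produced it, including a broken one; the second constrains the solution space, which is the job you want tests doing when an LLM is writing the other half.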

The workflow that actually works is: write the tests yourself (or at least review them with genuine attention), then use the LLM to make them pass. This is not a novel insight; it is basically test-driven development with an LLM as the implementation step. But it is worth stating explicitly because the temptation to let the LLM handle both directions is strong, especially when you are in a hurry.

Knowing when to close the tab

The thread on Hacker News predictably surfaced the usual range of experiences, from people who report 10x productivity to people who find LLMs more trouble than they are worth. The variance is real, and I do not think it is mostly about skill level or prompt quality. It is about fit between the work and the tool.

For greenfield code in a domain the LLM knows well, the tool is excellent. For incremental work in a mature, idiosyncratic codebase, the overhead of establishing context and validating output against institutional knowledge is high. For exploratory work where you are not sure what you want to build, the LLM can be either a collaborator or a distraction depending on how well you can evaluate its suggestions.

The developers who get the most out of LLM-assisted workflows tend to have a clear sense of which bucket they are in at any given moment. They use the tools heavily for the first category, selectively for the second, and sparingly or not at all for the third. The ones who are disappointed tend to apply the tool uniformly regardless of fit, get inconsistent results, and conclude that the tool is unreliable.

It is not unreliable. It is just not a universal substitute for knowing what you are doing.
