The workflow Stavros describes in his recent post resonates because it’s honest about the friction. Writing software with LLMs is not primarily a tool-selection problem. It’s a decomposition and attention problem. The tooling is largely figured out. The harder question is what you actually need to bring to each session to get useful output rather than plausible-looking code that quietly fails at the edges.
I’ve been using Claude and Cursor heavily in my own work on Ralph, my Discord bot, and the experience has shifted my understanding of where the cognitive work actually lives in a programming session.
The verification tax
When you write code yourself, you understand it as it’s being written. The mental model builds incrementally. When an LLM writes code, you receive a completed artifact that you must reverse-engineer to verify. This is more expensive than it sounds, especially for non-trivial logic.
Consider a function that processes Discord interaction payloads and routes them to handlers. If I write it, I know exactly what edge cases I thought about. If an LLM writes it, I need to read it carefully enough to reconstruct that knowledge, then ask: what did it not think about? That’s a harder question than reviewing code you wrote yourself, because you’re auditing for unknown unknowns.
This is the verification tax. It’s not a fixed cost of using LLMs; it varies based on how precisely you specified the task upfront. A function with explicit constraints, clear error handling requirements, and a defined interface produces output that’s much cheaper to verify than an open-ended request that leaves scope to the model.
# Vague: produces hard-to-verify output
# "Write a function to handle Discord commands"
# Precise: produces verifiable output
# "Write a function that:
# - Takes a discord.Interaction and a command name
# - Looks up the handler in COMMAND_REGISTRY
# - Returns None if not found (caller handles the 404)
# - Never raises; catches and logs all exceptions
# - Returns the handler's return value on success"
The second prompt is effectively a spec, and the output can be tested directly against it. The discipline of writing prompts this way is basically the discipline of writing good function signatures before implementing them, which experienced engineers already know is worth the upfront effort.
Context as the unit of work
The session, not the file, is the unit of work when coding with LLMs. A fresh context is a clean slate; a stale one is a liability. Models lose coherence over long sessions in ways that are subtle and hard to detect. The output stays grammatically correct and syntactically valid, but it drifts from the problem at hand. Earlier constraints get silently dropped. Variable names shift. The architecture starts to wander.
The discipline that emerges from this, and that Stavros’s post touches on, is knowing when to start over. This runs counter to the instinct to continue from where you left off. Starting a new session with a concise summary of what has been decided is almost always better than dragging a 40-message context forward.
I’ve started keeping a short working document, something like a task-specific context file, that I paste at the start of new sessions. It captures:
- What the component does and what it connects to
- What decisions have already been made and why
- What the current task is, in precise terms
- Any constraints that aren’t obvious from the code
This is not documentation in the usual sense. Documentation is for humans reading the final code. A context anchor is scaffolding for the model you’re currently talking to. It’s disposable, and you should update it aggressively mid-session as decisions solidify.
The Claude Code CLAUDE.md convention is a formalized version of this idea at the project level. A per-task variant of the same pattern is useful for longer implementation sessions.
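As a sketch, a per-task context file might look like the following. Every component name, decision, and constraint here is illustrative, not taken from Ralph's actual codebase:

```markdown
## Component
reminder_scheduler.py — polls the DB for due reminders, posts to Discord.

## Decisions already made
- Poll every 30s rather than per-reminder timers (simpler crash recovery).
- All timestamps stored as UTC epoch seconds.

## Current task
Add per-user rate limiting: at most 5 reminders delivered per minute.

## Non-obvious constraints
- The Discord client is shared; never block its event loop.
```

The point is density: a few lines that restore the decisions a fresh session would otherwise re-litigate.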
The decomposition skill
The biggest improvement in LLM-assisted workflow comes from getting better at task decomposition, but in a specific sense: identifying the smallest unit of work that has a verifiable output.
LLMs are excellent at implementing well-specified functions. They’re unreliable at designing systems. The line between these two activities is fuzzier than it seems. “Add caching to this endpoint” sounds like an implementation task but is actually a design task. It requires deciding where the cache lives, what the invalidation strategy is, what happens on cache misses under load, and what the failure mode looks like. Give that prompt without answers to those questions and you get an implementation of someone’s implicit assumptions, not yours.
The workflow that works is: you make the design decisions, you specify the interface, you write the test assertions first if possible, then you let the model fill in the implementation. The model is a very fast implementer who needs a complete spec, not an architect.
# Give the model this contract:
async def get_guild_summary(
    guild_id: int,
    *,
    max_age_seconds: int = 300,
    cache: Cache | None = None,
) -> GuildSummary | None:
    """
    Returns a summary of recent guild activity.
    Uses cache if provided and entry is younger than max_age_seconds.
    Returns None if the guild has no recent activity.
    Never raises; logs errors and returns None on failure.
    """
    ...
With that contract, the implementation is nearly mechanical. Without it, the model makes architectural choices you’ll regret during the next refactor.
What LLMs cannot observe
There is a class of bugs that LLMs generate confidently and repeatedly. They involve time, state, and ordering: race conditions, cache invalidation bugs, off-by-one errors in pagination, subtle ordering dependencies between async operations. These aren’t failures of intelligence; they’re failures of observability. The model can’t run your code, can’t watch your database, can’t see the interleaving of your async operations. It reasons statically about dynamic behavior.
The copy-and-patch JIT introduced in CPython 3.13 is a useful analogy: the optimizer can only work with what’s visible at compile time, while actual runtime behavior requires execution to understand. LLMs are in the same position. They pattern-match on static text that represents dynamic behavior, and they’re correspondingly blind to whole categories of runtime bugs.
This doesn’t mean LLMs are useless for these problems. It means you need to serve as the execution environment during review. Run the code. Write integration tests, not just unit tests. Don’t rely on static review of generated code for anything involving shared state, time, or concurrency.
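Here is a contrived sketch of what that blindness looks like (names are illustrative): a check-then-act race where every line passes static review, yet the runtime interleaving duplicates work.

```python
import asyncio

# Check-then-act race: two concurrent callers both see a cache miss
# and both pay for the expensive fetch.
cache: dict[str, int] = {}
fetch_count = 0


async def expensive_fetch(key: str) -> int:
    global fetch_count
    fetch_count += 1
    await asyncio.sleep(0)  # yield point where the interleaving bites
    return 42


async def get(key: str) -> int:
    if key not in cache:                         # both tasks pass this check...
        cache[key] = await expensive_fetch(key)  # ...so both fetch
    return cache[key]


async def main() -> int:
    await asyncio.gather(get("k"), get("k"))
    return fetch_count  # 2, not the 1 a static reading suggests
```

No amount of reading the text reveals the bug; running it does, which is the whole point about serving as the execution environment.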
The asymmetry of breadth and depth
LLMs are dramatically better at breadth than depth. Given a problem, a model can quickly generate five plausible approaches, sketch three different data model designs, or surface a dozen edge cases you hadn’t considered. This compresses the early exploration phase of a problem from hours to minutes, and the value is real.
But depth requires you: correctness under adversarial conditions, edge-case completeness, performance at scale, security properties. The model will reliably produce an implementation that works on the happy path. Whether it works on the paths that matter in production is your problem.
This asymmetry suggests a specific workflow rhythm: use the model heavily in the exploration and scaffolding phase, then shift to a more skeptical, detail-oriented mode during implementation review. The mistake is maintaining the same level of trust throughout a session. The model that did solid work outlining your system architecture is not equally reliable when writing the part that handles malformed input from untrusted users. OWASP’s LLM Top 10 covers the security dimension of this in detail, and most of the risk categories trace back to exactly this over-trust problem.
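The skeptical-review mode for untrusted input is mostly about clamping and rejecting. A hypothetical helper for a pagination argument from a Discord command, with illustrative bounds, shows the shape of it:

```python
def parse_page_arg(raw: object, *, max_page: int = 100) -> int:
    """Defensively parse an untrusted pagination argument.

    Hypothetical helper: the default bound and fallback are illustrative.
    """
    try:
        page = int(raw)  # non-numeric junk raises ValueError/TypeError
    except (TypeError, ValueError):
        return 1
    return min(max(page, 1), max_page)  # clamp negatives and huge values
```

Generated code tends to get the happy path (`"5"` becomes 5) right and the hostile inputs (`None`, `-3`, `10**9`) wrong, so those are the cases to review and test first.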
The long-term question
The real open question is whether the verification tax decreases as models improve, or whether it’s structural. My working hypothesis is that it’s partially structural. The more capable the model, the more ambitious the code it generates, and the harder that code is to verify. The benefit scales with capability, but so does the verification burden.
What shifts is the nature of the work, not the total amount of it. Writing software with LLMs doesn’t reduce the thinking required. It changes what kind of thinking is required: investment moves from implementation toward specification, from writing toward reading, from building toward auditing.
That’s the cognitive shift that posts like Stavros’s are circling around when they describe workflows. The tools are almost beside the point. What matters is developing the habits of mind that make LLM collaboration productive rather than just fast.