
The Session Is the Unit of Work

Source: hackernews

Stavros Korokithakis recently published a detailed account of his personal workflow for writing software with LLMs, and the Hacker News thread that followed accumulated more than 500 comments and a score of 519. That response tells you something about where developers are right now: past the novelty phase, actively refining their practice, and genuinely uncertain whether the person sitting next to them has figured out something they haven’t.

The article is worth reading for its pragmatism. What I want to do here is look at the structural reasons behind why mature LLM workflows share certain patterns, reasons that become clear when you think about what a conversation session actually is from a systems perspective.

Context as a Finite, Degrading Resource

A conversation with an LLM is a stateful object with a fixed capacity that degrades in a specific way as it fills. The transformer attention mechanism doesn’t give equal weight to all tokens in a sequence. Recency matters. Instructions given at the start of a long session receive less influence over the model’s output than instructions given recently.

This is the “lost in the middle” phenomenon documented in Liu et al. (2023). They tested retrieval accuracy across a range of models and found consistent degradation for information positioned in the middle of long contexts, even when the context window was nominally sufficient. The GPT-4 technical report and various Claude system card evaluations from Anthropic both acknowledge this property.

For software development, the consequence is concrete. You open a session by describing your architecture, constraints, coding standards, and the specific behavior you want. Thirty turns later, the model’s behavior is shaped more by recent exchanges than by the initial framing. The constraints you specified at the start are still technically in context, but their influence has decayed. The model drifts toward patterns it has seen frequently in training rather than patterns you specified early in the session.

This is why starting a new conversation for each distinct task isn’t just psychological hygiene. It ensures that your framing sits at the maximum-influence position in the context, not buried under accumulated exchanges.

The Spec Is the Context Payload

One pattern that emerges in most experienced LLM development workflows is writing the spec before asking for code. Not a vague description but something specific: the function signature, the error cases, the constraints, what the code should NOT do, what assumptions it can make.

Compare two prompts for a rate limiting function:

# Prompt A
Write a rate limiter in Python.

# Prompt B
Write a rate limiter in Python with the following spec:
- Sliding window, per-user, keyed by string user ID
- Limit: 100 requests per 60 seconds
- Thread-safe for use with concurrent.futures.ThreadPoolExecutor
- Returns (bool, int): (allowed, retry_after_seconds)
- retry_after_seconds is 0 when allowed is True
- No external dependencies; stdlib only
- Uses collections.deque for the window, not a list
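To make the contract concrete, here is one sketch of what code satisfying Prompt B might look like. The class and method names are my own illustration, not from the article; the spec only fixes the behavior, not the shape.

```python
import threading
import time
from collections import defaultdict, deque

class RateLimiter:
    """Sliding-window, per-user rate limiter. Stdlib only, per the spec."""

    def __init__(self, limit=100, window_seconds=60):
        self.limit = limit
        self.window = window_seconds
        self._lock = threading.Lock()       # one lock guards all per-user windows
        self._hits = defaultdict(deque)     # user_id -> deque of hit timestamps

    def check(self, user_id):
        """Return (allowed, retry_after_seconds); retry_after is 0 when allowed."""
        now = time.monotonic()
        with self._lock:
            window = self._hits[user_id]
            # Evict timestamps that have aged out of the sliding window.
            while window and now - window[0] >= self.window:
                window.popleft()
            if len(window) < self.limit:
                window.append(now)
                return True, 0
            # The oldest surviving hit determines when capacity frees up.
            return False, int(self.window - (now - window[0])) + 1
```

Every bullet in Prompt B maps to a checkable property here, which is exactly what makes the spec double as a review checklist: you can read the code line by line against the list.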

Prompt B doesn’t just give the model more information; it changes the composition of the context payload at the moment of generation. When you write a detailed spec, you’re doing something the model cannot do reliably on its own: resolving the ambiguity space before code generation begins. The model’s job narrows from “figure out what a rate limiter should do” to “implement this specific contract.”

The spec also functions as a review checklist. If you don’t write it before prompting, you have to reconstruct it from the generated code, which is a harder and less reliable process.

Fresh Sessions and Context Poisoning

There’s a failure mode in long LLM sessions that is worth naming explicitly. You start with clean requirements and iterate. The model generates code with an implicit assumption that turns out to be wrong. You correct it. The model acknowledges the mistake and fixes it. But the corrected code still carries subtle artifacts of the original wrong assumption, because the wrong version is now part of the conversation history, and the model’s output distribution at each step is conditioned on everything that came before.

This isn’t analogous to human anchoring bias. It’s a structural property of how autoregressive generation works. Earlier mistakes modify the probability distribution for subsequent tokens in ways that a single correction message doesn’t fully counteract. The correct version is generated from a context that contains its own incorrect predecessor.

The clean fix is to start a new session with the correct requirement stated clearly from the beginning. The final code from the old session, plus the lesson learned, become the seed for the new one. When I’m building bot features, I’ve adopted a rule: if I’ve corrected the same class of error twice in a session, I start fresh. The session’s context has become too noisy to continue efficiently.

Verifiable Steps and the Testing Loop

The other structural pattern in mature LLM workflows is keeping individual steps small enough to verify immediately. The alternative, requesting large amounts of code in a single pass, creates verification problems that compound in a specific way.

With a small step, the model generates thirty lines. You run the code or read it carefully. It either works or you know specifically why it doesn’t. With a large step, the model generates three hundred lines, something fails, and the failure could be anywhere. You don’t know whether fixing the visible failure will expose another hidden one.

Test-driven prompting addresses this directly. You write the test first (or ask the model to write it from the spec), then ask the model to make the test pass. The test is a machine-checkable contract, so you get binary feedback rather than having to audit the code holistically:

# Step 1: Ask the model to write tests for rate_limiter(user_id, limit, window_seconds)
# covering: normal requests, requests at exact limit, requests over limit,
# concurrent access from multiple threads.

# Step 2: Ask the model to implement rate_limiter so the tests pass.

# Step 3: Run the tests. If they pass, you're done. If not, the failure message
# is specific enough to iterate on without re-reading 200 lines of implementation.
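The steps above can be sketched as a concrete test file. The factory name `make_rate_limiter` and the inline reference implementation are my own; in the workflow described, Step 1 would produce only the test functions and the model would supply the implementation in Step 2. The reference implementation is included here just so the tests run standalone.

```python
import threading
import time
from collections import defaultdict, deque
from concurrent.futures import ThreadPoolExecutor

def make_rate_limiter(limit, window_seconds):
    """Hand-written reference so the tests below are runnable on their own."""
    lock = threading.Lock()
    hits = defaultdict(deque)

    def check(user_id):
        now = time.monotonic()
        with lock:
            w = hits[user_id]
            while w and now - w[0] >= window_seconds:
                w.popleft()
            if len(w) < limit:
                w.append(now)
                return True, 0
            return False, int(window_seconds - (now - w[0])) + 1

    return check

# Step 1: the tests, written before (or instead of) reading the implementation.
def test_allows_up_to_limit():
    check = make_rate_limiter(limit=3, window_seconds=60)
    assert all(check("u")[0] for _ in range(3))

def test_denies_over_limit_with_retry_hint():
    check = make_rate_limiter(limit=1, window_seconds=60)
    check("u")
    allowed, retry_after = check("u")
    assert allowed is False and retry_after > 0

def test_users_are_independent():
    check = make_rate_limiter(limit=1, window_seconds=60)
    check("alice")
    assert check("bob") == (True, 0)

def test_concurrent_access_admits_exactly_limit():
    check = make_rate_limiter(limit=50, window_seconds=60)
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(lambda _: check("u")[0], range(100)))
    assert results.count(True) == 50  # the lock makes admission exact, not approximate
```

Each test pins one bullet of the earlier spec, so a failure message names the violated requirement directly instead of forcing a holistic re-read.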

This loop (write test, run test, fix, run test) is structurally different from "write code and hope it works," because LLMs fail quietly. The model doesn’t know what it doesn’t know; it generates plausible-looking code at the same confidence level regardless of whether it’s correct. Tests are the mechanism that forces the question.

Aider, a popular terminal-based LLM coding tool, has built this loop into its architecture explicitly. Its --auto-test flag runs your test suite after each model response and feeds failures back to the model automatically. The design assumes that test feedback is a more reliable signal than human review on the first pass.
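Assuming a repository with a pytest suite, an invocation using aider's documented flags might look like this (file names are illustrative):

```shell
# --auto-test reruns the suite after each model edit; --test-cmd sets the command.
aider --auto-test --test-cmd "pytest -q" rate_limiter.py test_rate_limiter.py
```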

How Problem Decomposition Changes

The most durable effect of working this way is a change in how you decompose problems before touching any code.

In traditional development, you might hold a loose mental model of a feature and refine it as you write. The act of writing clarifies the design. With LLMs, the clarification has to happen before you open the session, because the session’s first tokens shape everything that follows. A fuzzy initial framing means you spend the session fighting drift rather than building.

This forces a useful discipline: specifying before implementing. Not a heavyweight design document, just a precise paragraph stating what this function does, what it doesn’t do, what its inputs and outputs are, and what can go wrong. Writing that paragraph is the actual design work. The code generation that follows is closer to compilation.

The shift doesn’t eliminate the need to understand the code you’re generating. For anything involving shared state, async execution, or protocol-level behavior, understanding remains mandatory. LLMs write wrong code confidently, and catching it requires knowing what correct looks like.

What changes is timing: the understanding is required before the session starts rather than after it ends. That’s a real change in workflow, and in practice a useful one. The forced clarity surfaces ambiguities that would otherwise surface later, in debugging, in code review, or in production.

The workflow improvements from this approach are measurable in day-to-day work. So is the ceiling. Knowing where each one sits is most of what separates a productive LLM workflow from an expensive one.
