Coding Agents Under Pressure: How Session Length Erodes Decision Quality
Source: simonwillison
Simon Willison’s guide to how coding agents work describes the tool loop cleanly: an LLM receives tools, runs in a loop, appends each tool result to the context window, and repeats until done. This is accurate and the mechanics are genuinely simple. What the mechanical description does not fully surface is that the loop’s behavior changes substantially as the session grows. A coding agent at turn 5 is reasoning from a different quality of working state than the same agent at turn 50, even on an identical task. Understanding why that is, and how it shapes session design, is where the practical engineering work lives.
Context Growth Is Not Neutral
Every tool call and every result appends to the context window. By the end of a substantive coding session, the window contains a layered record: the original task, every file read, every search result, every edit confirmation, every test output, every error message, every recovery attempt. None of that earlier content is removed unless the agent framework explicitly compacts it.
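The loop described above can be sketched in a few lines. This is a minimal illustration, not any real framework's implementation: `call_model` and `run_tool` are hypothetical stand-ins for the model API call and the tool dispatcher. The point is structural: every reply and every tool result is appended, and nothing is ever removed.

```python
# Minimal sketch of the agent tool loop. call_model and run_tool are
# hypothetical callables standing in for the model API and tool dispatch.
def agent_loop(task, tools, call_model, run_tool, max_turns=50):
    context = [{"role": "user", "content": task}]  # the context only grows
    for _ in range(max_turns):
        reply = call_model(context, tools)
        context.append(reply)
        if reply.get("tool_call") is None:         # model signals it is done
            return reply["content"], context
        result = run_tool(reply["tool_call"])
        # the tool result joins the transcript permanently
        context.append({"role": "tool", "content": result})
    return None, context
```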
A realistic context budget for a multi-file bug fix looks like this:
- System prompt with tool definitions and project conventions: ~3,000 tokens
- Initial task description: ~200 tokens
- Five file reads averaging 300 lines each: ~15,000 tokens
- Eight search results from grep calls: ~2,400 tokens
- Four edit confirmations: ~200 tokens
- Two test suite runs: ~3,000 tokens
- Model reasoning between steps: ~4,000 tokens
That puts a focused five-file session at roughly 28,000 tokens. Extend the task to ten files with a few false starts and re-reads, and 80,000 to 100,000 tokens is realistic. Against Claude’s 200k context limit or GPT-4o’s 128k, this is not catastrophic. But context length is not the only variable that matters.
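The arithmetic behind that estimate is worth making explicit. The per-item figures below are the same rough assumptions as the list above (around 10 tokens per line of code, illustrative averages throughout):

```python
# Back-of-envelope context budget for the five-file session above.
budget = {
    "system_prompt": 3_000,
    "task": 200,
    "file_reads": 5 * 3_000,      # ~300 lines per file, ~10 tokens per line
    "grep_results": 8 * 300,
    "edit_confirmations": 4 * 50,
    "test_runs": 2 * 1_500,
    "reasoning": 4_000,
}
total = sum(budget.values())
print(total)  # 27800 -- roughly the ~28,000 tokens cited above
```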
The Attention Distribution Problem
Research into transformer attention over long contexts consistently finds that recall is not uniform across the sequence. The "lost in the middle" paper by Liu et al. demonstrates that retrieval of information positioned in the middle of a long context is less reliable than retrieval of information near the beginning or end. For coding agents, this has a concrete implication: early file reads are the most vulnerable.
The first file an agent reads tends to be the most important one for the task. It is often the file containing the bug, the module being refactored, or the interface being extended. By the time the agent has read five more files, run two test cycles, and handled an edit failure, that first file read is buried in the middle of a long context sequence. The agent may still reason correctly about it most of the time. But the probability of misremembering a function signature, confusing a variable name, or losing track of a constraint stated early in the session increases with session length.
This explains a pattern that looks like agent inefficiency but is actually a form of compensatory behavior: re-reading files the agent already loaded. A well-tuned coding agent will reload a file before making a second edit to it, even if it loaded the file forty turns ago. It is not forgetting; it is refreshing a read that may have degraded in effective recall. The redundant read is cheaper than the edit failure that would result from acting on a stale internal representation.
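One way to make that compensatory behavior systematic is to track how deep in the session each read happened. The sketch below is illustrative, not from any agent framework; the turn-based staleness threshold is an assumed heuristic:

```python
# Sketch of tracking read staleness: a read buried more than max_age turns
# deep is treated as degraded and should be refreshed before an edit relies
# on it. The threshold is an illustrative heuristic, not a measured value.
class ReadTracker:
    def __init__(self, max_age=20):
        self.turn = 0
        self.last_read = {}          # path -> turn of the most recent read
        self.max_age = max_age

    def tick(self):
        self.turn += 1               # call once per agent turn

    def record_read(self, path):
        self.last_read[path] = self.turn

    def needs_refresh(self, path):
        # files never read, or read too long ago, warrant a fresh read
        age = self.turn - self.last_read.get(path, -10**9)
        return age > self.max_age
```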
How Long Sessions Compound Errors
The attention problem combines with another dynamic: error accumulation. Early in a session, each turn has a small probability of producing an incorrect assumption or a misjudged edit. If the agent catches the error immediately, it corrects it and the session continues cleanly. If the error is not immediately visible, it propagates. A wrong assumption about a function’s return type in turn 3 may not surface as a visible failure until turn 22, by which point the agent has made several decisions downstream of the incorrect assumption.
At that point, the agent faces a recovery problem in a context that is already 20,000 tokens deep. It has to reason backward through its own history to find the source of the problem, which means attending to the earliest portions of the context, which are the hardest to retrieve reliably. Sessions that go wrong in the middle of a long run tend to spiral rather than self-correct cleanly.
This is why the failure modes of long coding sessions are qualitatively different from the failure modes of short ones. Short sessions fail fast and obviously. Long sessions fail slowly through accumulated drift, which produces confused output that is harder to diagnose from the result alone.
What the Engineering Responses Look Like
The industry response to context pressure has converged on three approaches, each with a different cost-quality tradeoff.
Context compaction collapses prior messages into a summary. Claude Code does this automatically when the context approaches its limit, generating a condensed representation of what the agent has done and what it knows. The compaction preserves semantic content better than raw truncation. The cost is that summaries lose details that were not recognized as important at summarization time. A specific error message from turn 8, summarized away because it appeared to be resolved, may turn out to matter in turn 35. The summary does not contain it.
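The shape of a compaction pass, and its tradeoff, can be shown in a short sketch. This is not Claude Code's implementation; `summarize` stands in for a model-written summary call and `count_tokens` for a real tokenizer:

```python
# Hedged sketch of context compaction: when the transcript nears a token
# budget, everything except the system prompt and the most recent turns is
# replaced by a summary. summarize and count_tokens are hypothetical stubs.
def compact(messages, count_tokens, summarize, limit=150_000, keep_recent=10):
    if sum(count_tokens(m["content"]) for m in messages) < limit:
        return messages                              # no pressure yet
    head = messages[:1]                              # system prompt survives
    middle = messages[1:-keep_recent]
    tail = messages[-keep_recent:]                   # recent turns survive
    summary = {"role": "user",
               "content": "Summary of earlier work: " + summarize(middle)}
    # detail inside `middle` is gone for good -- the tradeoff described above
    return head + [summary] + tail
```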
Scoped subagents avoid the problem by giving each subtask its own fresh context window. Claude Code’s Task tool delegates work to a new agent instance; the subagent completes its task and returns only the final result to the orchestrating agent. The orchestrator accumulates task outcomes rather than full tool call transcripts. Each subagent operates within a context where its early reads remain accessible throughout, because the task is scoped tightly enough that those reads never get buried.
The cost of subagents is coordination overhead. Each delegation involves latency and additional API calls. The subagent’s output has to be informative enough for the orchestrator to reason about, without being so verbose that it generates its own context pressure. Getting that balance right requires careful tool design.
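The delegation boundary itself is simple to express. In this sketch, `run_agent` is a hypothetical function that executes a full tool loop in a fresh context and returns both the final answer and the full transcript; only the answer crosses back:

```python
# Sketch of scoped delegation: the subagent's transcript (file reads, tool
# calls) stays in its own context; only the compact outcome is returned to
# the orchestrator. run_agent is a hypothetical full-loop runner.
def delegate(task, run_agent):
    answer, transcript = run_agent(task)   # fresh context per subtask
    # transcript is deliberately discarded here
    return {"role": "tool", "content": f"Subtask result: {answer}"}
```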
Session decomposition is the simplest response: break large tasks into smaller sessions, each with a clear completion criterion, and persist intermediate results to the filesystem or git rather than to the context window. A session that ends with a commit, a test passing, or a clear checkpoint output has a natural restart point. The next session starts with a fresh context but can reconstruct relevant state by reading committed code and running tests.
This maps closely to how experienced developers manage complex work: not as one long continuous session, but as a series of focused steps, each self-contained. The git history becomes the memory that persists across sessions; the context window is the working memory for a single focused increment.
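A decomposed run can be sketched as a loop over subtasks, each gated by a completion check and persisted by a checkpoint. The callables here (`run_session`, `tests_pass`, `checkpoint`) are hypothetical hooks supplied by the surrounding harness; in practice the checkpoint would typically be a git commit:

```python
# Sketch of session decomposition: each subtask runs in a fresh session,
# must pass a concrete completion check, and is then persisted durably.
# run_session, tests_pass, and checkpoint are hypothetical callables.
def run_decomposed(subtasks, run_session, tests_pass, checkpoint):
    for subtask in subtasks:
        run_session(subtask)          # fresh context window per subtask
        if not tests_pass():
            raise RuntimeError(f"checkpoint failed after: {subtask}")
        checkpoint(subtask)           # durable memory, e.g. `git commit`
```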
Designing for Long-Session Reliability
If you are building tools or workflows that will run inside long coding agent sessions, a few specific design choices make sessions more robust.
Tool outputs should be concise and structured. A tool that returns 5,000 tokens of raw shell output buries information the agent needs to reconstruct quickly later. A tool that returns a structured 200-token summary of the same information is faster to re-read and less likely to degrade in recall. The Anthropic documentation on tool use does not prescribe output formats, which is why tool verbosity varies enormously across agent implementations.
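As a concrete illustration, a test-runner tool can condense raw output into a small structured record before it enters the context. The field names and the pytest-style `PASSED`/`FAILED` markers are illustrative assumptions, not a prescribed format:

```python
# Sketch of condensing raw tool output into a structured summary the agent
# can cheaply re-read later. Field names and output markers are illustrative.
import json
import re

def summarize_test_output(raw: str) -> str:
    failed = re.findall(r"FAILED (\S+)", raw)          # test names only
    passed = len(re.findall(r"PASSED", raw))
    return json.dumps({
        "passed": passed,
        "failed": failed,    # no full tracebacks -- keep the record compact
        "ok": not failed,
    })
```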
Provide orientation context upfront rather than forcing discovery. A CLAUDE.md file that states the project structure, key file locations, and testing conventions lets the agent skip the first 15 turns of exploration. Those 15 turns would have been the freshest content in the context window, crowded out by orientation reads that could have been replaced by a single compact document. The orientation cost is paid once at setup time; it pays dividends on every session against that codebase.
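A minimal orientation file along these lines might look like the following. Every path and convention here is invented for illustration; the value is in the shape, a compact map that replaces a dozen exploratory reads:

```markdown
# Project orientation (illustrative CLAUDE.md)

## Structure
- src/api/        HTTP handlers
- src/core/       business logic; most changes land here
- tests/          pytest suite, mirrors the src/ layout

## Conventions
- Run the tests before and after any edit
- Do not modify files under src/generated/
- Commit after each passing checkpoint
```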
Design tasks to complete in fewer than 30 tool calls when possible. This is not always achievable, but it is worth treating as a design goal. A task that requires 30 tool calls has a realistic context budget of around 50,000 to 70,000 tokens. A task that requires 80 tool calls is operating in territory where attention degradation and error accumulation become significant reliability risks. Breaking the second type of task into three sequentially executed sessions, each with a concrete deliverable, tends to produce better results than running it as a single long session.
Use re-reads as a reliability mechanism for files the agent will edit multiple times. If your agent framework allows it, emit an explicit re-read of any file before a second edit to it. The cost is one tool call. The benefit is that the edit operates from a fresh, high-quality read rather than from a potentially degraded middle-of-context representation.
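That rule fits naturally into a thin wrapper around the framework's edit tool. `read_file` and `apply_edit` are hypothetical callables here; the wrapper simply forces a fresh read before any repeat edit:

```python
# Sketch of the re-read-before-second-edit rule. read_file and apply_edit
# are hypothetical tool callables supplied by the agent framework.
def make_safe_editor(read_file, apply_edit):
    edits_seen = {}                       # path -> number of edits so far
    def safe_edit(path, edit):
        if edits_seen.get(path, 0) >= 1:
            read_file(path)               # refresh a possibly degraded read
        result = apply_edit(path, edit)
        edits_seen[path] = edits_seen.get(path, 0) + 1
        return result
    return safe_edit
```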
The Session as a Design Constraint
Simon Willison’s mechanical description of the agent loop is correct and worth understanding thoroughly. The loop is simple. What is not simple is the dynamics of that loop over time, under pressure from an accumulating context. Treating the session itself as a design constraint, one that should be scoped, checkpointed, and managed with the same care as any other stateful system, is what separates reliably useful coding agent deployments from ones that work in demos and fail unpredictably in production.
The context window is not just a technical parameter to maximize. It is a resource with a quality curve: fresh content at both ends, degraded content in the middle, and a total capacity that sounds generous until you account for what a real task actually generates. Designing around that curve, rather than ignoring it, is the practical work behind agentic engineering.