
Context Window as State: What Happens Inside a Coding Agent Run

Source: Simon Willison

The fundamental architecture of a coding agent is simple: an LLM in a loop, calling tools, observing results, calling more tools. Simon Willison’s guide to coding agents lays this out clearly. What the architecture description elides is what the context window contains at each step, and why that structure shapes everything from individual tool design to which tasks agents can reliably complete.
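That loop can be sketched in a dozen lines of Python. Everything here is a hypothetical stand-in (`call_model`, the tool registry) rather than any real agent's API; the point is that the growing message list is the agent's entire state.

```python
# Minimal agent-loop sketch. `call_model` and the tool functions are
# hypothetical stand-ins for a real LLM API and real tools; the
# accumulating `messages` list is the context window described below.

def run_agent(call_model, tools, system_prompt, task, max_turns=20):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": task},
    ]
    for _ in range(max_turns):
        reply = call_model(messages)      # model sees the full transcript
        messages.append(reply)
        if "tool_call" not in reply:      # plain text: the agent is done
            return reply["content"], messages
        name, args = reply["tool_call"]
        result = tools[name](**args)      # execute the tool
        messages.append({"role": "tool", "content": result})
    return None, messages                 # turn budget exhausted
```

Every turn appends to `messages` and nothing is ever removed, which is why the transcript structure below matters so much.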

What the Context Actually Contains

When a coding agent runs, the context window accumulates a specific kind of conversation. It starts with a system prompt describing available tools and behavioral guidelines. Then comes the user’s task. Then the model’s first tool call. Then the result. Then the next tool call. Then the result. The chain grows with every turn.

By the time an agent has read three files, run two shell commands, and written a patch, the context looks something like this:

[system prompt: ~3,000 tokens]
[user message: "Fix the bug in the auth module"]
[assistant: calls Read on auth/session.ts]
[tool result: 400 lines of TypeScript, ~4,000 tokens]
[assistant: calls Read on auth/middleware.ts]
[tool result: 200 lines, ~2,000 tokens]
[assistant: calls Grep for 'expireSession']
[tool result: 12 matching lines across 5 files, ~150 tokens]
[assistant: calls Edit on auth/session.ts]
[tool result: edit confirmation, ~50 tokens]
[assistant: calls Bash to run tests]
[tool result: test output, ~500 tokens]

That is roughly 10,000 tokens for a moderately simple bug fix, with source code averaging on the order of ten tokens per line. A task touching eight files, running several test cycles, and requiring correction rounds can reach 50,000-80,000 tokens. Modern context windows handle that range without complaint: Claude 3.5 Sonnet supports a 200k-token context, and GPT-4o supports 128k. But context pressure shapes agent behavior well before the hard limit appears.

Why Tool Output Format Is a First-Class Design Decision

Every tool result occupies context tokens that must remain useful as the conversation grows. A tool returning noisy or redundant output burns context space that could hold the next file read or test result.

This is why purpose-built coding agents provide structured tools rather than just wrapping bash. A Read tool returns file contents with line numbers and truncation markers. A Grep tool returns matching lines with filename and line number in a compact format. A Glob tool returns file paths sorted by modification time. If you replaced those with shell equivalents:

find . -name "*.ts" | head -20
grep -rn "expireSession" --include="*.ts"
cat auth/session.ts

You get the same information, but embedded in shell output formatting, potential stderr noise, and no control over truncation. The model still gets what it needs, but the context is noisier. Over a 40-turn session, that noise accumulates. Clean, structured output allows the model to reference earlier results precisely; a blob of shell output requires re-parsing every time the model needs a detail from it.
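A structured Read tool in this spirit might look like the sketch below. The output format (line numbers, an explicit truncation marker) is illustrative, not any particular agent's actual wire format.

```python
def read_tool(path, offset=0, limit=100):
    # Structured file read: line-numbered output with an explicit
    # truncation marker, so the model can cite exact lines later and
    # the context cost of any single read is bounded.
    with open(path) as f:
        lines = f.read().splitlines()
    window = lines[offset:offset + limit]
    body = "\n".join(f"{offset + i + 1:>5}  {line}"
                     for i, line in enumerate(window))
    if offset + limit < len(lines):
        body += f"\n  ... truncated: {len(lines) - offset - limit} more lines"
    return body
```

The truncation marker is the key design element: the model learns exactly how much it has not seen and can issue a follow-up read with a larger `offset` instead of guessing.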

The Claude Code Edit tool illustrates this principle concretely. It takes old_string/new_string pairs rather than line numbers. If the model read a file in turn 5 and decides to edit it in turn 15, line numbers may have shifted due to intermediate edits in that same file. String matching against the old content is more robust than line-indexed replacement, and the design follows directly from how the context window grows during a session. The tool design is downstream of the context mechanics.
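The principle can be sketched in a few lines; this illustrates string-anchored editing generally, not Claude Code's actual implementation.

```python
def apply_edit(text, old_string, new_string):
    # String-anchored edit: match on content rather than line numbers,
    # and refuse ambiguous or stale anchors instead of guessing.
    count = text.count(old_string)
    if count == 0:
        raise ValueError("old_string not found; file may have changed")
    if count > 1:
        raise ValueError("old_string is not unique; provide more context")
    return text.replace(old_string, new_string, 1)
```

Failing loudly on a missing or ambiguous anchor is the point: a stale line number silently edits the wrong line, while a stale string anchor produces an error the model can recover from by re-reading the file.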

Context Pressure and the Limits It Imposes

Context pressure is not purely a token count problem. The “Lost in the Middle” paper (Liu et al., 2023, from Stanford and UC Berkeley) demonstrated that LLM performance on information recall degrades significantly for content placed in the middle of long contexts compared to the beginning or end. Early turns in a conversation are more reliably recalled than middle turns, even when all of it technically fits within the context window.

For a coding agent, this creates a practical constraint: as a session extends, early file reads become less accessible to the model’s reasoning. Long agent sessions tend to include more seemingly redundant reads as the model re-fetches content it already retrieved earlier. The context window fills, but the useful working memory shrinks relative to the total tokens consumed.

The engineering response to this is task decomposition. Rather than running one agent for a complex multi-file refactor, you run coordinated sub-agents, each with a focused task and a fresh context window. Claude Code’s sub-agent support allows an orchestrating agent to delegate scoped tasks to sub-agents, each starting clean and returning a summarized result. The orchestrator accumulates task outcomes rather than raw tool call transcripts. This keeps the orchestrator’s context from filling with the noise of intermediate steps, and each sub-agent operates within a context window where its early reads remain accessible throughout its task.
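The delegation pattern reduces to a small loop, sketched here with a hypothetical `run_subagent` stand-in for real sub-agent dispatch.

```python
def orchestrate(run_subagent, subtasks):
    # Each sub-agent runs with a fresh context window and returns only
    # a short summary; the orchestrator's transcript grows by one
    # summary per task instead of absorbing every raw tool result.
    orchestrator_context = []
    for task in subtasks:
        summary = run_subagent(task)   # fresh context per delegated task
        orchestrator_context.append({"task": task, "summary": summary})
    return orchestrator_context
```

The asymmetry is the point: a sub-agent may burn 50,000 tokens of reads and test runs internally, but only its few-hundred-token summary reaches the orchestrator's context.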

A 200k token context sounds vast until you do the arithmetic. A complex codebase task reading 20 files averaging 300 lines each, at a typical ~10 tokens per line of code, consumes roughly 60,000 tokens in file content alone, before accounting for the system prompt, user messages, tool calls, shell output, and the model’s own reasoning turns.
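Written out explicitly, assuming a rough ~10 tokens per line of code (the real rate depends on the tokenizer and the language):

```python
# Back-of-envelope context budget. TOKENS_PER_LINE is a rough
# rule of thumb for source code, not a measured constant.
TOKENS_PER_LINE = 10

def file_read_budget(n_files, avg_lines):
    return n_files * avg_lines * TOKENS_PER_LINE

content = file_read_budget(20, 300)    # tokens of file content alone
remaining = 200_000 - content          # left for prompts, calls, reasoning
```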

Parallel Tool Calls and What They Actually Change

Most coding agents support parallel tool calls: the model emits multiple tool calls in a single turn and receives all results back simultaneously. This is usually framed as a latency optimization. The context-quality benefit is at least as significant.

When an agent reads three files sequentially, it adds three rounds of model reasoning between file reads. Each reasoning step sees only the files read so far. Parallel reads put all three file contents into the same context batch, so the model can reason across all of them jointly in the next turn. For cross-file analysis tasks, such as finding all callers of a function or understanding an interface spread across multiple modules, parallel reads improve reasoning quality by reducing the number of context boundaries between related information.
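A sketch of batched reads, using threads as a stand-in for however a real agent runtime executes parallel tool calls:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_read(read_tool, paths):
    # Issue all reads concurrently and return the results as a single
    # batch, so the model's next reasoning step sees every file at
    # once instead of interleaving reads with partial reasoning.
    with ThreadPoolExecutor() as pool:
        contents = list(pool.map(read_tool, paths))
    return [{"role": "tool", "path": p, "content": c}
            for p, c in zip(paths, contents)]
```

Whether the runtime parallelizes with threads, async I/O, or sequential execution matters only for latency; the context-quality benefit comes from all results landing in one batch before the model reasons again.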

The Anthropic tool use documentation covers the mechanics. The context-quality angle is less often noted, but for codebase tasks requiring cross-file understanding it matters more than the latency savings.

Shell Access and the Execution Feedback Loop

The most capable coding agents distinguish themselves from code-editing tools by being able to run code, not just write it. Shell access enables a feedback loop that static code editing cannot provide: write code, run tests, observe failures, revise, run again.

This loop is what allows agents to make progress on tasks with non-obvious correctness criteria. Writing syntactically valid code is tractable from static analysis. Writing code that passes a specific test suite requires observing execution outcomes and adjusting from them. SWE-bench results, which measure AI agents solving real GitHub issues, consistently show that agents with test execution capability substantially outperform those limited to file modification alone.

Most coding agents handle shell permissions carefully for this reason. Claude Code’s permission system distinguishes read-only tools (Glob, Grep, Read), file mutation tools (Edit, Write), and execution (Bash). The distinction is not only a safety boundary; it communicates something to the model about the consequences of its own actions. A file edit can be reviewed and reverted. An arbitrary shell command has broader and less predictable effects. Making that distinction explicit in the tool schema means the model can incorporate it into its planning, preferring read-only exploration before committing to mutations.
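As an illustration (not Claude Code's actual policy code), a tiered permission check along these lines might look like:

```python
# Tool tiers mirror the distinction described above; the policy
# logic is an illustrative sketch, not any real agent's schema.
READ_ONLY = {"Glob", "Grep", "Read"}
MUTATING = {"Edit", "Write"}
EXECUTE = {"Bash"}

def requires_confirmation(tool_name, auto_approve_edits=False):
    # Read-only tools run freely, mutations may be auto-approved,
    # execution (and anything unknown) always asks the user.
    if tool_name in READ_ONLY:
        return False
    if tool_name in MUTATING:
        return not auto_approve_edits
    return True
```

Defaulting unknown tools to the strictest tier is the conservative choice: a misclassified read costs a confirmation prompt, while a misclassified shell command costs much more.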

The permission model also shapes user trust. An agent that asks before running commands with broad effects is one users are more willing to run on a non-trivial codebase. Agents with unconstrained shell access and no confirmation prompts handle the same set of tasks but impose more oversight burden on the user, which limits where they get deployed.

What This Means in Practice

The context window mechanics predict where coding agents succeed and where they degrade. Short, focused tasks with clear termination criteria perform better than open-ended refactoring because context remains manageable and early file reads stay accessible throughout the task. An agent asked to fix a specific null check in a named file will outperform one asked to generally improve a module.

Tasks touching many files degrade more than tasks touching few. Every file read adds context pressure, and content in the middle of a long session is less reliably recalled. For large refactoring work, decomposing into file-by-file changes and running each as a focused sub-agent session consistently produces better results than a single long-running agent session. The context window mechanics make this outcome predictable rather than surprising.

Tasks with executable feedback produce more reliable outputs than tasks where correctness is implicit. An agent that can run cargo test after a Rust change receives direct, unambiguous information about whether its edit was correct. An agent adding business logic to an untested system has to reason about correctness from static analysis and type information alone, which requires much stronger prior understanding of the codebase.

All of the patterns that have emerged in serious coding agent practice (dedicated tools over bash wrappers, sub-agent delegation for complex tasks, parallel reads for cross-file analysis, execution feedback loops) follow from the same structural constraint: a model working through a bounded context window, where every token either contributes to the task or dilutes the signal the model needs to make its next decision.
