The Context Window Is the Architecture: How Coding Agents Manage What They Know
Source: simonwillison
The agentic loop at the heart of every coding agent is straightforward: the model generates a tool call, the runtime executes it, the result gets appended to the conversation, and the model decides what to do next. Simon Willison’s guide on how coding agents work walks through this pattern clearly. What the guide does not fully unpack is how the context window constraint, the hard limit on how much information the model can hold in its working memory at once, drives nearly every non-trivial architectural decision these agents make.
Understanding this constraint is the key to understanding why different coding agents are built differently, and why they fail in the specific ways they do.
What the Loop Actually Looks Like on the Wire
Before getting to context management, it is worth being concrete about what the tool-calling protocol looks like. When you define tools for a model via the Anthropic Messages API, you provide JSON Schema descriptions alongside your messages:
```json
{
  "name": "read_file",
  "description": "Read the contents of a file at the given path. Returns file content with line numbers.",
  "input_schema": {
    "type": "object",
    "properties": {
      "path": {
        "type": "string",
        "description": "Absolute path to the file"
      },
      "offset": {
        "type": "integer",
        "description": "Line number to start reading from, zero-indexed"
      },
      "limit": {
        "type": "integer",
        "description": "Maximum number of lines to return"
      }
    },
    "required": ["path"]
  }
}
```
The model does not call this function directly; it generates a structured block describing the call it wants to make:
```json
{
  "type": "tool_use",
  "id": "toolu_01xyzABC",
  "name": "read_file",
  "input": {
    "path": "/project/src/auth.py",
    "offset": 0,
    "limit": 100
  }
}
```
The runtime intercepts this, runs the actual file read, and returns a result that gets appended to the conversation history:
```json
{
  "type": "tool_result",
  "tool_use_id": "toolu_01xyzABC",
  "content": "  1\tclass AuthManager:\n  2\t    def __init__(self, db):\n  3\t        self.db = db\n..."
}
```
This conversation, with tool calls and results interleaved, is what the model sees on its next turn. The context window is the accumulation of this history. Every file read, every shell command output, every previous reasoning step is a chunk of tokens that has been added and cannot be removed.
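The bookkeeping behind this can be sketched in a few lines. The model and tool execution below are stand-ins (`fake_model` and `run_tool` are hypothetical, not a real API); the point is the data flow: tool results are appended to the conversation and nothing is ever evicted.

```python
# Minimal sketch of the agentic loop's bookkeeping, with the model call
# stubbed out. `fake_model` and `run_tool` are illustrative stand-ins.

def fake_model(messages):
    """Stand-in for the LLM: request one file read, then stop."""
    reads = sum(1 for m in messages if m["role"] == "tool")
    if reads == 0:
        return {"type": "tool_use", "id": "toolu_01", "name": "read_file",
                "input": {"path": "/project/src/auth.py"}}
    return {"type": "text", "text": "done"}

def run_tool(call):
    """Stand-in runtime: execute the requested tool call."""
    return {"type": "tool_result", "tool_use_id": call["id"],
            "content": "  1\tclass AuthManager:\n  2\t    ..."}

def agentic_loop(task):
    messages = [{"role": "user", "content": task}]
    while True:
        block = fake_model(messages)
        if block["type"] != "tool_use":
            return messages  # done: the history is the full transcript
        # Both the tool call and its result are appended, never removed.
        messages.append({"role": "assistant", "content": block})
        messages.append({"role": "tool", "content": run_tool(block)})

history = agentic_loop("add rate limiting")
```

After the loop finishes, `history` still contains the original task, the tool call, and the tool result: the context window is exactly this ever-growing list.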
The Accumulation Problem
A 200K token context window sounds large. In practice, a coding agent working through a non-trivial task can exhaust it faster than you might expect.
Consider a task like adding rate limiting to an authentication system. The agent starts by reading relevant files: the auth module, the router configuration, the existing middleware, the test suite. That covers the first several steps. Then it searches for all call sites, reads the database layer, looks at how the session store works. Then it writes the changes, runs the tests, reads the failure output, iterates on the fix, runs the tests again. Each step appends more tokens. The file reads from step three do not disappear after the agent has acted on them. The test output from two iterations ago is still sitting in context when the model decides what to do next.
This is not a pathological case; it is the normal operating condition of a coding agent working on anything beyond trivial changes. The agent is constantly reasoning about the current state of the codebase based on potentially stale evidence accumulated earlier in the conversation.
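A back-of-envelope tally makes the accumulation concrete. Every number below is an illustrative assumption, not a measurement; real token costs depend on the tokenizer and the files involved.

```python
# Illustrative context accumulation for the rate-limiting task described
# above. All token counts are assumed round numbers for the sketch.

steps = [
    ("read auth module",       4_000),
    ("read router config",     2_500),
    ("read middleware",        3_000),
    ("read test suite",        5_000),
    ("grep call sites",        1_500),
    ("read database layer",    4_000),
    ("read session store",     3_000),
    ("write changes (diffs)",  3_500),
    ("test run 1 (failures)",  6_000),
    ("fix + test run 2",       7_000),
]

total = 0
for label, tokens in steps:
    total += tokens
    print(f"{label:24s} +{tokens:>6,}  running total {total:>7,}")

# Ten ordinary steps already consume a meaningful slice of a 200K
# window, and none of the earlier reads can be evicted along the way.
```

Under these assumptions, ten routine steps cost roughly 40K tokens, and a couple more debugging iterations compound from there.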
Retrieval Over Pre-Loading
The response most current agents have converged on is retrieval-based context loading: rather than pre-loading the repository into context, give the agent tools to find and read what it needs on demand, as it needs it.
Claude Code provides Glob and Grep tools for finding files by pattern and content. When starting a task, the agent searches for relevant files rather than loading everything. A task involving authentication will cause it to search for files matching patterns like *auth* or contents like authenticate, read those specific files, and build a focused view of the relevant code rather than a broad view of everything.
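What a grep-style retrieval tool returns can be sketched simply: matching paths and lines, not whole files. This is illustrative, not Claude Code's actual implementation; `FILES` is a hypothetical in-memory stand-in for a repository.

```python
# Minimal sketch of a grep-style retrieval tool. The agent gets matching
# lines with locations, then decides which full files merit a read.
import re

FILES = {  # hypothetical stand-in for a repository on disk
    "/project/src/auth.py":   "class AuthManager:\n    def authenticate(self, user): ...",
    "/project/src/routes.py": "from auth import AuthManager\napp.add_route('/login', login)",
    "/project/src/util.py":   "def slugify(text): ...",
}

def grep(pattern, files=FILES):
    """Return (path, line_number, line) for every line matching pattern."""
    rx = re.compile(pattern)
    hits = []
    for path, content in files.items():
        for n, line in enumerate(content.splitlines(), start=1):
            if rx.search(line):
                hits.append((path, n, line))
    return hits

hits = grep(r"authenticate|AuthManager")
```

Only the matching lines enter the agent's context; `util.py`, which is irrelevant to the task, costs nothing.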
Aider takes a different approach with its repository map: it uses tree-sitter to parse the codebase and generate a compact structural index showing class and function definitions across all files without their implementations. The agent gets a lightweight overview of the entire codebase in a fraction of the context budget, then loads full file contents only for files it needs to edit. This sidesteps the loading problem by substituting a structural summary for full content.
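The repository-map idea can be sketched with the standard library. Aider uses tree-sitter so it works across many languages; this Python-only stand-in uses the stdlib `ast` module to show the same principle: signatures without implementations.

```python
# Illustrative sketch of a repository-map style structural index: emit
# class and function signatures, omit bodies. Not Aider's implementation;
# a Python-only stand-in for its tree-sitter approach.
import ast

def repo_map_entry(path, source):
    """Return 'path: class/def signatures' with implementations omitted."""
    lines = [path + ":"]
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            lines.append(f"  class {node.name}:")
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            lines.append(f"    def {node.name}({args})")
    return "\n".join(lines)

source = '''
class AuthManager:
    def __init__(self, db):
        self.db = db
    def authenticate(self, user, password):
        return self.db.check(user, password)
'''
print(repo_map_entry("src/auth.py", source))
```

The index tells the agent that `AuthManager.authenticate` exists and what it takes, at a small fraction of the tokens the full file would cost.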
Both approaches aim at the same goal: maximize the signal-to-noise ratio of the context by loading specific, relevant information rather than exhaustively loading everything that might be relevant.
Multi-Agent Patterns as Context Scaling
The most aggressive solution to context limits is the multi-agent pattern: spawn a subagent with a fresh context window for a bounded subtask, then return only the result to the parent agent.
This is architecturally straightforward once you realize that the “Agent” tool in systems like Claude Code is just another tool call from the orchestrator’s perspective. The orchestrator generates a tool call that delegates a subtask; the runtime spins up a new model context, runs a complete agentic loop for that subtask, and returns a summary. The orchestrator receives the summary, not all the intermediate file reads and tool results that produced it.
A simplified version of how this looks with a custom orchestration layer on the Anthropic API:
```python
import anthropic

client = anthropic.Anthropic()

tools = [
    {
        "name": "explore_codebase",
        "description": (
            "Delegate codebase exploration to a subagent. Use this to research "
            "how a specific system works before modifying it. Returns a structured "
            "summary of findings without loading intermediate results into your context."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "question": {
                    "type": "string",
                    "description": "What you want to understand about the codebase"
                },
                "paths": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Starting file or directory paths for the exploration"
                }
            },
            "required": ["question"]
        }
    }
]
```
When the orchestrator calls explore_codebase, the runtime runs a subagent that may read dozens of files, accumulate its own context, and produce a summary. The orchestrator receives the summary without receiving all the intermediate reads. Those tokens stay in the subagent’s context, which gets discarded after the task completes.
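The runtime side of that dispatch can be sketched as follows. The subagent loop here is a stub (`run_subagent` is hypothetical); what matters is that its transcript is local state that goes out of scope, while only the summary crosses back into the parent's history.

```python
# Sketch of runtime-side dispatch for an explore_codebase tool call.
# `run_subagent` stands in for a complete agentic loop with a fresh context.

def run_subagent(question, paths=None):
    """Stand-in for a full subagent loop with its own context window."""
    transcript = []  # the subagent's private, disposable context
    for path in (paths or ["/project/src"]):
        transcript.append(f"read {path}: ...thousands of tokens...")
    summary = f"Findings for {question!r}: auth flows through AuthManager."
    return summary  # transcript is discarded; those tokens never escape

def handle_tool_call(call):
    if call["name"] == "explore_codebase":
        summary = run_subagent(call["input"]["question"],
                               call["input"].get("paths"))
        return {"type": "tool_result", "tool_use_id": call["id"],
                "content": summary}
    raise ValueError(f"unknown tool {call['name']}")

result = handle_tool_call({
    "id": "toolu_02", "name": "explore_codebase",
    "input": {"question": "how does auth work?",
              "paths": ["/project/src/auth.py"]},
})
```

Here `result["content"]` is the summary alone; the intermediate reads that produced it never touch the orchestrator's context budget.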
Claude Code uses this pattern for its “Explore” subagent type. An Explore subagent can research a large repository and return a concise description of what it found, without that research cost hitting the parent agent’s context budget. The Claude Agent SDK makes multi-agent orchestration like this composable at the application level, not just an internal implementation detail.
The tradeoff is information loss. The summary the subagent produces is necessarily incomplete compared to the full context it accumulated. An orchestrator working from summaries may miss details that would change its approach. Getting the summary format right, compact enough to preserve context budget but detailed enough to be actionable, is one of the genuinely hard design problems in multi-agent systems.
How Tool Descriptions Shape Behavior
There is a subtlety in the tool definition format that has outsized practical impact: the description field is not just documentation. It is the primary mechanism through which the model decides when and how to use a tool.
Compare two versions of a bash tool definition:
{"name": "bash", "description": "Run a shell command"}
versus:
```json
{
  "name": "bash",
  "description": "Execute a bash command in the working directory. Use for running tests, checking git status, installing packages, and other operations requiring execution. Prefer read_file for reading file contents; prefer grep for searching. Avoid using cat to read files."
}
```
The second version steers the model away from using bash to cat files. This matters because a dedicated read tool can apply pagination and line number annotations that make file contents more useful for the model to reason about. The model reads these descriptions on every turn and uses them to make tool selection decisions. Sloppy descriptions lead to inconsistent tool use; precise ones constrain the model toward behaviors that work better in practice.
The tool schema is the API for the agent. Like any API, the quality of the documentation determines how well it gets used, and unlike most APIs, there is no separate user to read the docs carefully. The model is the user, reading those descriptions under time and token pressure, deciding which tool to call next.
Where the Loop Degrades
The agentic loop works well when each step has clear, machine-readable feedback. Running a test suite and receiving a pass/fail status with a stack trace is high-quality feedback; the model gets specific information about what went wrong and where, in a structured format it can reason about directly.
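The difference between a wall of raw output and structured feedback can be sketched directly. The pytest-style output below is a canned example string, not a live test run, and the parsing is a minimal illustration.

```python
# Sketch of turning raw test output into the machine-readable feedback
# the loop thrives on. RAW is a canned pytest-style example, not a live run.
import re

RAW = """\
FAILED tests/test_auth.py::test_rate_limit - AssertionError: expected 429, got 200
PASSED tests/test_auth.py::test_login
1 passed, 1 failed in 0.42s
"""

def parse_results(raw):
    """Extract which tests failed and why, as structured data."""
    failures = []
    for line in raw.splitlines():
        m = re.match(r"FAILED (\S+) - (.*)", line)
        if m:
            failures.append({"test": m.group(1), "error": m.group(2)})
    return {"ok": not failures, "failures": failures}

report = parse_results(RAW)
```

The model now sees exactly which test failed and with what assertion, rather than rereading the whole output blob on every turn.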
The loop degrades when feedback becomes ambiguous or subjective. “Improve the error handling in this module” has no executable verification step. The agent can make changes and observe that the code still compiles, but it cannot verify that the error handling is meaningfully better. In these cases the model is evaluating its own work, and self-evaluation under ambiguity is where current models produce the most inconsistent results.
Security-sensitive changes expose this limitation sharply. A coding agent can verify that tests pass after a change, but it cannot verify that the change did not introduce a subtle authorization bypass or a timing-sensitive race condition. The verification loop only closes over the feedback mechanisms available to the agent. Anything outside those mechanisms remains invisible, regardless of how many times the loop runs.
This is why treating coding agents as systems with legible mechanics, as Willison’s guide encourages, is more useful than treating them as general-purpose code-writing black boxes. The context window accumulates. The tool descriptions steer behavior. The verification loop closes only where you give it something to close on. Each of these is an engineering variable, not a fixed property of the technology. Understanding them is what separates deploying an agent that works reliably from deploying one that works impressively in demos and inconsistently in production.