Context Is the Only State: The Design Constraint That Shapes Every Coding Agent
Source: simonwillison
Simon Willison published a thorough guide on how coding agents work that is worth reading in full. The core loop it describes is genuinely simple: the model gets a system prompt and conversation history, picks a tool call, the scaffold runs it, the result gets appended to the conversation, repeat. What the guide surfaces, and what I want to dig into here, is how one constraint shapes almost every design decision in these systems: the context window is the agent’s only working memory, and everything in it costs.
The Loop and Why State Matters
The ReAct pattern, formalized in 2022, describes the reasoning-action cycle that underlies every coding agent in production. At the API level, it looks like this:
    done = False
    while not done:
        # Each inference call sees the entire conversation so far.
        response = llm.complete(messages=conversation_history, tools=tool_definitions)
        if response.stop_reason == "tool_use":
            result = execute_tool(response.tool_call)
            # Both the call and its result become part of the context.
            conversation_history.append(response.tool_call)
            conversation_history.append({"role": "tool", "content": result})
        else:
            done = True
There is no hidden state. No database of facts the model accumulates. No memory that persists between invocations. Everything the agent knows at any moment exists in the token stream passed to the next inference call. If a file was read ten turns ago and that read has since been compacted away, or sits so deep in a long conversation that the model no longer attends to it reliably, the agent effectively does not know what it read.
This sounds like an implementation detail, but it explains almost everything about how coding agents are built.
Tool Schemas Are Context Budget Control
Every tool definition the agent receives costs tokens. More importantly, every tool result the scaffold appends also costs tokens. The design of the tool schema determines how aggressively context fills up.
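One way a scaffold can bound the cost of tool results is to truncate them before they are appended. The following is a minimal sketch, not taken from any particular agent: the name truncate_result and the 4,000-character default are assumptions for illustration, and a real harness would more likely budget in tokens than in characters.

```python
def truncate_result(text: str, max_chars: int = 4000) -> str:
    """Keep the head and tail of an oversized tool result and say what was cut."""
    if max_chars < 2:
        raise ValueError("max_chars must be at least 2")
    if len(text) <= max_chars:
        return text
    half = max_chars // 2
    omitted = len(text) - 2 * half
    # An explicit marker tells the model the output is incomplete,
    # so it can ask for a narrower query instead of guessing.
    return f"{text[:half]}\n[... {omitted} characters omitted ...]\n{text[-half:]}"
```

Keeping both ends is a design choice: command output tends to put the invocation at the top and the error summary at the bottom.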
Consider two versions of a file reading tool:
    // Version A
    {
      "name": "read_file",
      "description": "Read a file",
      "input_schema": {
        "properties": {
          "path": { "type": "string" }
        }
      }
    }
    // Version B
    {
      "name": "read_file",
      "description": "Read a file at an absolute path. Use offset and limit for large files. Always re-read before editing to ensure accuracy.",
      "input_schema": {
        "properties": {
          "path": { "type": "string" },
          "offset": { "type": "integer", "description": "Line to start reading from" },
          "limit": { "type": "integer", "description": "Maximum lines to return" }
        }
      }
    }
Version A loads entire files unconditionally. If the agent reads ten files from a medium-sized codebase at about 5,000 tokens each, it has consumed roughly 25% of a 200k-token context before writing a single line of code. Version B teaches the model to take surgical reads and to re-read before editing, encoding best practices into the affordance itself.
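On the scaffold side, a handler for the Version B schema might look like the sketch below. It is an illustration under assumptions rather than any agent's actual implementation: offset is treated as a zero-based line index, and the line-number prefix and the trailing "more lines" hint are my own additions.

```python
from pathlib import Path
from typing import Optional


def read_file(path: str, offset: int = 0, limit: Optional[int] = None) -> str:
    """Return a window of lines from a file, each prefixed with its 1-based line number."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    offset = max(0, offset)
    end = len(lines) if limit is None else min(len(lines), offset + limit)
    window = lines[offset:end]
    body = "\n".join(f"{offset + i + 1}: {line}" for i, line in enumerate(window))
    remaining = len(lines) - end
    if remaining > 0:
        # Tell the model how to continue rather than silently stopping.
        body += f"\n[{remaining} more lines; call again with offset={end}]"
    return body
```

The continuation hint costs a few tokens per call but saves the model from having to guess whether it saw the whole file.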
The Princeton SWE-agent research quantified how much this matters. Changes to tool descriptions and output formatting produced several percentage point swings in SWE-bench scores, independent of the underlying model. The researchers introduced the concept of the Agent-Computer Interface by analogy with HCI, arguing that interfaces need to match model cognition, not just expose capabilities.
Four Strategies for Navigating a Codebase
Given that context is finite, agents cannot read their way through a codebase. They have to navigate. Four strategies have emerged, each with different tradeoffs against the same underlying constraint.
Grep and shell commands are what Claude Code primarily uses. The model calls rg -n "AuthService" --type ts, gets back file paths and line numbers, reads only what it finds. Token cost per query is low. The failure mode is precision: search for a common identifier like config and you get hundreds of results. The navigation overhead before the model has sufficient context typically runs to fifteen or twenty tool calls.
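The precision problem is usually handled by capping results and reporting that a cap was hit. The sketch below uses Python's re module as a stand-in for ripgrep so that it runs without external tools; search_code, the suffix filter, and the max_results default are hypothetical names and values, not part of Claude Code.

```python
import re
from pathlib import Path


def search_code(root: str, pattern: str, suffix: str = ".ts", max_results: int = 50) -> dict:
    """Search files under root for a regex, returning at most max_results hits."""
    regex = re.compile(pattern)
    hits = []
    total = 0
    for path in sorted(Path(root).rglob(f"*{suffix}")):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                total += 1
                if len(hits) < max_results:
                    hits.append(f"{path.relative_to(root)}:{lineno}: {line.strip()}")
    # Reporting the true total lets the model decide to narrow the pattern.
    return {"hits": hits, "total": total, "truncated": total > len(hits)}
```

A search for config that returns "truncated: True, total: 412" is a much cheaper signal than 412 lines of output.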
Repository maps are Aider’s distinctive contribution. At the start of each turn, Aider parses the entire codebase with tree-sitter, extracts function signatures and cross-file references, runs a PageRank-style algorithm to prioritize recently touched files, and injects a compact structural overview into context. The map typically runs one thousand to eight thousand tokens. No index server, no embeddings infrastructure, just a fast parse that gives the model enough structure to navigate without reading files it does not need.
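To give a feel for what a structural overview contains, here is a much-reduced sketch. It swaps tree-sitter for Python's standard ast module, handles only Python source, and omits the ranking step entirely, so it is not Aider's implementation; map_source and its output format are invented for illustration.

```python
import ast


def _signature(node) -> str:
    """Render a function node as 'def name(arg1, arg2)' without its body."""
    args = ", ".join(a.arg for a in node.args.args)
    return f"def {node.name}({args})"


def map_source(filename: str, source: str) -> str:
    """Summarize the top-level classes and functions of one Python file."""
    funcs = (ast.FunctionDef, ast.AsyncFunctionDef)
    lines = [f"{filename}:"]
    for node in ast.parse(source).body:
        if isinstance(node, funcs):
            lines.append(f"  {_signature(node)}")
        elif isinstance(node, ast.ClassDef):
            lines.append(f"  class {node.name}")
            for item in node.body:
                if isinstance(item, funcs):
                    lines.append(f"    {_signature(item)}")
    return "\n".join(lines)
```

The point is the compression ratio: function bodies, imports, and comments are dropped, and only the names the model needs for navigation remain.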
Language Server Protocol queries give exact semantic answers. Go-to-definition, find-references, type information. No false positives from shared strings. Cursor uses LSP integration as a primary navigation mode for exactly this reason. The fragility is environmental: if the project has missing dependencies, misconfigured build paths, or an unsupported language, the language server either fails or returns wrong results. For well-configured TypeScript or Go codebases, LSP is substantially more reliable than grep.
Embedding-based semantic search indexes chunked source files in a vector store and retrieves top-K similar chunks at query time. Cursor’s @codebase command works this way. The advantage is scale and semantic generalization: it handles large codebases and finds code that uses different naming conventions than the query. The cost is infrastructure and noise. Codebases where many files share domain terminology with the query produce high false-positive rates.
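The retrieval mechanics can be shown with a toy example. The sketch below substitutes word-count vectors for learned embeddings, so it illustrates only the ranking step (vectorize, score by cosine similarity, take the top K) and not the semantic generalization a real embedding model provides; embed, cosine, and top_k are hypothetical names, and nothing here reflects how Cursor is implemented.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: Dict[str, str], k: int = 2) -> List[str]:
    """Return the names of the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda name: cosine(q, embed(chunks[name])), reverse=True)
    return ranked[:k]
```

Even this toy version shows where the noise comes from: any chunk that shares vocabulary with the query scores above zero, whether or not it is relevant.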
No strategy is universally better. Production systems increasingly combine them.
The Three Ways to Edit a File
File editing strategies are also shaped by the context constraint, but through reliability rather than budget.
Full rewrites are the simplest path: read the file, regenerate the whole thing, write it back. This works for small files or new files. For files over a few hundred lines, models introduce subtle changes on unchanged lines, hallucinate details from stale context, or silently alter formatting. The token cost also scales directly with file size.
String replacement is what Claude Code’s Edit tool implements:
    {
      "file_path": "/path/to/file.ts",
      "old_string": "const token = cache.get(userId)",
      "new_string": "const token = await cache.get(userId)"
    }
The scaffold does a literal string match. If the string does not exist, it returns an explicit error. If the string appears multiple times without replace_all: true, it rejects the call. These hard failures enable self-correction: the model gets clear signal that something went wrong and can diagnose the mismatch. The brittleness is that a single misquoted character breaks the match, and models quoting code from stale context windows lose accuracy over long sessions.
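The matching rules described above fit in a few lines. This is a sketch of the behavior as the paragraph describes it, not Claude Code's source; apply_edit, EditError, and the error wording are assumptions.

```python
class EditError(Exception):
    """Raised when an edit cannot be applied unambiguously."""


def apply_edit(content: str, old_string: str, new_string: str, replace_all: bool = False) -> str:
    """Replace old_string with new_string using literal matching and hard failures."""
    if not old_string:
        raise EditError("old_string must not be empty")
    count = content.count(old_string)
    if count == 0:
        raise EditError("old_string not found in file")
    if count > 1 and not replace_all:
        # Ambiguity is an error, not a guess: the model must add context.
        raise EditError(f"old_string matches {count} locations; add context or set replace_all")
    return content.replace(old_string, new_string)
```

Each error message doubles as an instruction to the model, which is what makes the hard failure recoverable.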
Fuzzy SEARCH/REPLACE blocks are Aider’s approach. The model produces structured edit blocks and Aider tries exact match first, then falls back to difflib’s SequenceMatcher for near-misses. Minor whitespace differences no longer cause hard failures. The tradeoff is determinism: when the scaffold applies a fuzzy match, the user cannot tell whether it found the intended location or an adjacent one that looked similar. In files with repeated patterns this produces subtle, hard-to-notice errors.
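A simplified version of the exact-then-fuzzy fallback is sketched below. It is not Aider's code: fuzzy_replace, the line-window scan, and the 0.8 threshold are assumptions chosen for illustration. Only difflib.SequenceMatcher comes from the description above.

```python
import difflib
from typing import Tuple


def fuzzy_replace(content: str, search: str, replace: str, threshold: float = 0.8) -> Tuple[str, float]:
    """Try an exact match first, then the most similar window of lines."""
    if not search.strip():
        raise ValueError("search block must not be empty")
    if search in content:
        return content.replace(search, replace, 1), 1.0
    lines = content.splitlines()
    n = len(search.splitlines())
    best_ratio, best_start = 0.0, -1
    for start in range(len(lines) - n + 1):
        window = "\n".join(lines[start:start + n])
        ratio = difflib.SequenceMatcher(None, window, search).ratio()
        if ratio > best_ratio:
            best_ratio, best_start = ratio, start
    if best_start < 0 or best_ratio < threshold:
        raise ValueError(f"no match above threshold (best ratio {best_ratio:.2f})")
    new_lines = lines[:best_start] + replace.splitlines() + lines[best_start + n:]
    trailing = "\n" if content.endswith("\n") else ""
    # Returning the ratio lets the caller surface that a fuzzy match was used.
    return "\n".join(new_lines) + trailing, best_ratio
```

Returning the similarity ratio alongside the result is one way to claw back some visibility: a scaffold could log or report any edit applied with a ratio below 1.0.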
Over many sequential edits, the difference in failure visibility compounds. String replacement gives a clear error on every bad attempt. Fuzzy matching might silently apply twenty edits to slightly wrong locations before the model notices the code does not work.
The Scaffolding Gap
The most practically important finding in recent agent research is that the scaffolding matters more than the model. Run the same base model under a naive “here’s bash, go” scaffold and under a carefully engineered harness with proper tool schemas, output truncation, navigation strategies, and context management, and the two differ by roughly thirty percentage points on SWE-bench Verified. No capability difference in the model explains that gap. It comes entirely from the engineering around the loop.
This has a direct implication for anyone building on top of coding agents or evaluating them. Benchmark numbers reflect the scaffolding as much as the model. When Claude Code scores roughly 49% on SWE-bench Verified and a naive harness using the same Claude model scores 18%, the difference is in how tools are designed, how output is truncated, how context is managed across turns, and how planning is separated from execution.
The observation extends to constraint enforcement. Instructions in a system prompt are advisory. The model follows them most of the time, especially early in a session. Over long sessions they drift toward the middle of context where attention degrades. Instructions that survive compaction do so as paraphrases, which is worse than the original for anything requiring precision. Hard constraints, like “never modify the migrations directory,” belong in scaffolding hooks that run before tool execution, not in CLAUDE.md.
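A hook of this kind can be very small. The sketch below is a generic illustration, not the hook API of Claude Code or any other product: check_tool_call, the tool names in WRITE_TOOLS, and the file_path key are all hypothetical.

```python
from pathlib import PurePosixPath
from typing import Tuple

PROTECTED_DIRS = ("migrations",)            # hypothetical policy
WRITE_TOOLS = {"edit_file", "write_file"}   # hypothetical tool names


def check_tool_call(tool_name: str, tool_input: dict) -> Tuple[bool, str]:
    """Run before tool execution; return (allowed, reason)."""
    if tool_name not in WRITE_TOOLS:
        return True, ""
    parts = PurePosixPath(tool_input.get("file_path", "")).parts
    for protected in PROTECTED_DIRS:
        if protected in parts:
            # The reason is returned to the model as the tool result.
            return False, f"blocked: {protected}/ is protected"
    return True, ""
```

Unlike a system-prompt instruction, this check behaves the same on turn two hundred as on turn one. A shell tool would need its own check, since it can write files without going through either write tool.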
Building With This in Mind
If you are building tooling on top of a coding agent or writing the context files that shape its behavior, a few things follow from the above.
Tool descriptions encode behavior. Write them as instructions, not labels. Include guidance on when to use partial reads, when to re-read before editing, and what the error messages mean.
Context anchor files like CLAUDE.md work best for information that should always be present, because they live at the start of context. They are poor choices for dynamic constraints added mid-session.
Navigation strategy determines session length. An agent that reads entire files instead of using partial reads, repo maps, or grep will hit context limits on medium-sized codebases before completing tasks. This is a scaffolding problem, not a model capability problem.
The most complete technical treatment I have found is Willison’s guide linked above, alongside the SWE-agent paper and Aider’s documentation on repo maps. The implementation details vary across Claude Code, Cursor, and Aider, but the constraint driving the design is the same in all of them.