The Context Window Is the Process: What Coding Agents Are Actually Doing
Source: simonwillison
Simon Willison published a thorough guide on how coding agents work as part of his agentic engineering patterns series. It’s one of the clearest explanations of the mechanics I’ve seen. What I want to do here is take that foundation and push into the parts that matter most for understanding why these systems behave the way they do, and where they break.
Most descriptions of coding agents start with the output: “it reads files, writes code, runs tests.” That’s accurate but it skips the part that actually determines the system’s behavior. The interesting layer is how the agent maintains state between steps, what information it has available at any given moment, and what it genuinely cannot do.
The Loop
At the mechanical level, a coding agent is a ReAct-style loop. The model receives a prompt, emits either text or a tool call, the tool call executes, the result is appended to the conversation, and the model runs again. This continues until the model emits a final answer with no tool call.
This isn’t a novel architecture. ReAct was described in 2022. What changed is that modern models are reliable enough at tool use to make the loop practically useful, and context windows are large enough to accumulate meaningful work history before they fill up.
The loop looks roughly like this in pseudocode:
    messages = [system_prompt, user_task]
    while True:
        response = model.complete(messages)
        messages.append(response)  # the assistant turn, including any tool calls
        if response.stop_reason == "end_turn":
            break
        for tool_call in response.tool_calls:
            result = execute_tool(tool_call)
            messages.append(tool_result(tool_call.id, result))
Every framework wraps this pattern differently, but the mechanics are the same. Claude Code, Cursor, Aider, and Cline are all variations on this loop with different tool sets and prompting strategies.
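To make the mechanics concrete, here is a minimal runnable version of the loop with the model stubbed out. `FakeModel`, `ToolCall`, `Response`, and `execute_tool` are illustrative names for this sketch, not any real framework's API; a real implementation would call a provider SDK instead.

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    id: str
    name: str
    args: dict

@dataclass
class Response:
    stop_reason: str
    text: str = ""
    tool_calls: list = field(default_factory=list)

class FakeModel:
    """Stand-in for a real model API: requests one tool call, then finishes."""
    def __init__(self):
        self.turn = 0

    def complete(self, messages):
        self.turn += 1
        if self.turn == 1:
            return Response("tool_use",
                            tool_calls=[ToolCall("t1", "read_file", {"path": "main.py"})])
        return Response("end_turn", text="done")

def execute_tool(call):
    # A real agent would dispatch on call.name; here we fake a file read.
    return f"<contents of {call.args['path']}>"

def run_agent(model, task):
    messages = [{"role": "system", "content": "You are a coding agent."},
                {"role": "user", "content": task}]
    while True:
        response = model.complete(messages)
        # The assistant turn goes in first, so that each tool result has
        # a preceding tool call to refer back to.
        messages.append({"role": "assistant", "content": response})
        if response.stop_reason == "end_turn":
            return response.text, messages
        for call in response.tool_calls:
            messages.append({"role": "tool",
                             "tool_call_id": call.id,
                             "content": execute_tool(call)})

answer, history = run_agent(FakeModel(), "Summarize main.py")
```

Everything interesting about a given agent lives in what `execute_tool` dispatches to and what the system prompt says; the loop itself barely changes between frameworks.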
The Context Window Is the Only Memory
This is the detail that most explanations underplay. The context window is not a place where “the agent thinks.” It is the agent’s entire working memory. There is no hidden state, no external scratchpad the model maintains between invocations. Everything the agent knows at step N is contained in the message history up to step N.
In systems terms: the context window is the process’s address space. Each model invocation is not a persistent process; it’s a stateless function call that returns the next action given everything that’s happened so far. The agent has no notion of time passing between tool calls except through the content of the messages themselves.
This has concrete consequences:
File contents must be read into context to be used. If the agent reads main.rs at step 3 and then edits lib.rs at step 15, it still has the full content of main.rs in its context window unless something caused it to be evicted. The model has not “loaded the file into memory” in any persistent sense; the text is just there in the conversation history.
Errors are visible only if the tool returns them. When a bash command exits nonzero, the agent sees the exit code and stderr because the tool implementation captures and returns both. If a tool silently swallows errors, the model has no way to know something went wrong.
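The tool implementation has to do that capturing explicitly. Here is a minimal sketch of a bash tool in Python, assuming commands arrive as strings; real implementations also add output truncation and sandboxing.

```python
import subprocess

def bash_tool(command: str, timeout: int = 60) -> str:
    """Run a shell command and surface everything the model needs to see."""
    try:
        proc = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return f"exit_code: timeout after {timeout}s"
    # Return exit code, stdout, and stderr together; omitting any of these
    # means the model cannot tell whether (or why) the command failed.
    return (
        f"exit_code: {proc.returncode}\n"
        f"stdout:\n{proc.stdout}\n"
        f"stderr:\n{proc.stderr}"
    )
```

A tool that returned only stdout would make a failing test suite look identical to a passing one whenever the failures go to stderr.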
Context length determines maximum task complexity. A task that requires reading 30 files and running 50 tool calls accumulates a lot of tokens. Claude 3.7 Sonnet has a 200K token context window. That sounds large until you’re working on a codebase where individual files exceed 1,000 lines and test output is verbose.
Tool Design Is API Design
The tools an agent has access to determine what kinds of tasks it can complete. That sounds obvious, but its consequences are easy to underestimate: a well-designed tool set can make hard tasks tractable, while a poorly designed one will cause the agent to fail even on simple ones.
Consider file editing. A naive tool set might offer read_file(path) and write_file(path, content). To edit a function, the agent would read the entire file, modify the content, and write it back. This works, but it requires the model to regenerate potentially thousands of lines of unchanged code perfectly, consuming tokens and introducing opportunities for hallucination.
A better primitive is str_replace(path, old, new): find this exact string in the file and replace it with this other string. The agent only needs to specify what changes, not the entire file. This is essentially the approach taken by Claude Code's file-editing tools and by Aider's diff-based edit formats.
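A minimal sketch of such a primitive follows. The uniqueness check is the important design detail: it forces the model to quote enough surrounding context to pin down exactly one edit site. The function name mirrors str_replace from the text, but this implementation is illustrative, not Claude Code's.

```python
from pathlib import Path

def str_replace(path: str, old: str, new: str) -> str:
    """Replace exactly one occurrence of `old` in the file with `new`.

    Refusing missing or ambiguous matches forces the model to supply
    enough surrounding context to uniquely identify the edit site.
    """
    text = Path(path).read_text()
    count = text.count(old)
    if count == 0:
        return f"error: string not found in {path}"
    if count > 1:
        return f"error: string occurs {count} times in {path}; include more context"
    Path(path).write_text(text.replace(old, new, 1))
    return f"ok: replaced 1 occurrence in {path}"
```

Returning an error string, rather than raising, matters too: the message lands in the context window, where the model can read it and retry with a longer snippet.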
Similarly, providing grep(pattern, path) as a tool is far better than providing only read_file when the agent just needs to locate something. Reading a 2,000-line file to find three occurrences of a function name wastes context window space that could be used for other information.
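A grep-style tool can likewise be sketched in a few lines: return only matching lines with their line numbers, and cap the output so a pathological pattern can't flood the context window. The names and the cap here are illustrative assumptions, not any framework's actual tool.

```python
import re
from pathlib import Path

def grep(pattern: str, path: str, max_results: int = 50) -> str:
    """Return matching lines as 'lineno: text', capped to protect the context window."""
    regex = re.compile(pattern)
    hits = []
    for lineno, line in enumerate(Path(path).read_text().splitlines(), start=1):
        if regex.search(line):
            hits.append(f"{lineno}: {line}")
            if len(hits) >= max_results:
                hits.append("... (truncated)")
                break
    return "\n".join(hits) if hits else "no matches"
```

The line numbers let the agent follow up with a targeted read of just the relevant region instead of the whole file.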
The Model Context Protocol (MCP) attempts to standardize how tools are exposed to agents across different environments. Whether that standardization succeeds will depend on whether tool authors converge on sensible primitives rather than just wrapping existing CLI interfaces.
Subagents and Delegation
For tasks that exceed a single context window, or that benefit from parallelism, agents can spawn subagents. The parent agent makes a tool call like run_agent(task, context), and the framework spins up a fresh agent with its own context window to handle the subtask.
This is effectively a form of distributed RPC. The parent agent delegates a unit of work, waits for a result, and incorporates that result into its own context. The child agent has no access to the parent’s context except for what’s explicitly passed in.
The implication is that task decomposition matters a lot. The parent agent needs to provide enough context for the child to succeed without passing so much that it duplicates the parent’s entire working memory. Getting this boundary right is one of the harder parts of designing multi-agent pipelines.
Claude Code supports this via the Task tool, which spawns a subagent with a specified prompt and returns its output. The Anthropic agent SDK exposes similar primitives for custom applications.
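The delegation boundary is easy to see in code. Here is a hedged sketch: `run_subagent` and `StubModel` are hypothetical names, the child's ReAct loop is elided to a single model call, and the stub simply pretends the subtask finished.

```python
from types import SimpleNamespace

class StubModel:
    """Stand-in for a real model; pretends the subagent finished its task."""
    def complete(self, messages):
        return SimpleNamespace(text="summary: auth module uses JWT")

def run_subagent(model, task: str, context: str) -> str:
    # Fresh message history: the child sees only what the parent passes in,
    # never the parent's full working memory.
    messages = [
        {"role": "system", "content": "You are a focused subagent."},
        {"role": "user", "content": f"Task: {task}\n\nRelevant context:\n{context}"},
    ]
    # ... in a real framework, the same ReAct loop as the parent runs here ...
    response = model.complete(messages)
    # Only the final text crosses back into the parent's context window.
    return response.text

result = run_subagent(StubModel(), "Summarize the auth module", "auth.py: handles JWT login")
```

Everything the parent wants back has to fit in that single return value, which is why subagent prompts usually spell out the expected shape of the answer.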
Context Management in Practice
When context windows fill up, frameworks have to make decisions about what to discard. The naive approach is sliding window truncation: drop the oldest messages. This breaks badly if an early message contained a file read that later messages depend on.
More sophisticated approaches include:
- Compaction: asking the model to summarize older context before truncating it
- Selective retention: keeping tool results that are likely still relevant (open file contents, test output from the most recent run)
- Re-reading: having the agent proactively re-read files it needs before its context gets too full
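Compaction in particular is simple to sketch. This is a hedged illustration rather than any framework's actual implementation: the roughly-4-characters-per-token estimate and the `StubSummarizer` are assumptions for the example.

```python
class StubSummarizer:
    """Stand-in for a model summarization call (assumption for this sketch)."""
    def summarize(self, messages):
        return "read config.py; ran tests; 2 failures remain"

def estimate_tokens(messages) -> int:
    # Rough heuristic: about 4 characters per token. Real frameworks use
    # the provider's tokenizer or token counts from API responses.
    return sum(len(str(m.get("content", ""))) for m in messages) // 4

def compact(model, messages, budget: int = 150_000, keep_recent: int = 10):
    """Replace the middle of a long history with a model-written summary.

    The system prompt and the most recent messages are kept verbatim;
    everything in between collapses into one summary message.
    """
    if estimate_tokens(messages) < budget:
        return messages
    head, old, recent = messages[:1], messages[1:-keep_recent], messages[-keep_recent:]
    summary = model.summarize(old)
    return head + [{"role": "user",
                    "content": f"[Summary of earlier work]\n{summary}"}] + recent
```

The lossiness is visible right in the return value: fifteen tool calls' worth of detail becomes one paragraph, and anything the summary omits is gone for good.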
Claude Code handles this with a /compact command that triggers context compression. The model summarizes the conversation so far, and that summary replaces the full history. You lose precision but retain the gist of what’s been done.
This is an area where the field is still figuring things out. There’s no established best practice, and different frameworks make different tradeoffs.
The Prompt Injection Problem
When an agent reads files, fetches URLs, or processes user-provided data, that content lands directly in the context window. A file could contain text that instructs the model to do something unintended. This is prompt injection, and it’s harder to solve for agents than for chatbots because agents are specifically designed to act on instructions embedded in external content.
The OWASP Top 10 for LLM Applications lists prompt injection as the top risk. For coding agents with write access to a filesystem and the ability to execute arbitrary commands, the attack surface is substantial. A malicious CONTRIBUTING.md or a poisoned .env.example file could redirect an agent’s behavior in ways that are hard to audit from the outside.
Defensive measures include sandboxing tool execution, running agents without network access by default, and requiring human approval for destructive operations. Claude Code implements several of these: it shows diffs before applying edits and asks for confirmation before running commands it considers risky.
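The approval gate can be sketched as a thin wrapper around command execution. The risk patterns below are a hypothetical, deliberately crude deny-list for illustration; production agents use richer policies, allowlists, and sandboxes rather than regex matching alone.

```python
import re

# Hypothetical patterns flagged as risky; real policies are more nuanced.
RISKY = [r"\brm\b", r"\bgit\s+push\b", r"\bsudo\b", r"\bcurl\b.*\|\s*sh"]

def needs_approval(command: str) -> bool:
    """Crude risk filter: flag commands matching any risky pattern."""
    return any(re.search(p, command) for p in RISKY)

def guarded_bash(command: str, approve=input) -> str:
    """Ask a human before running anything the filter considers risky."""
    if needs_approval(command):
        answer = approve(f"Run risky command? {command!r} [y/N] ")
        if answer.strip().lower() != "y":
            return "denied: user rejected command"
    return f"(would execute: {command})"  # actual execution elided in this sketch
```

The weakness is also visible here: a deny-list only catches the patterns its authors anticipated, which is why sandboxing and no-network-by-default are the sturdier layers.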
What This Means for Using Agents Well
Understanding the execution model changes how you write tasks. Vague prompts work poorly not because the model is confused about semantics, but because it starts reading files at random and filling its context with things that don’t bear on the actual goal. A good task prompt scopes the work, points the agent at the relevant files, and specifies what success looks like.
For longer tasks, breaking work into chunks that fit comfortably within a context window is more reliable than hoping the agent manages its context well on its own. The tools available determine what the agent can actually do, and the context window determines how much it can keep in mind at once.
These are engineering constraints, not model limitations. Working with them rather than against them is what makes the difference between an agent that reliably completes tasks and one that spirals into confusion after ten steps.