Inside the Tool Loop: Context, Edits, and Error Recovery in Coding Agents
Source: simonwillison
The core of every coding agent is a loop that most software engineers would recognize without much explanation. Call the model, execute the tool it requested, append the result to the context, call the model again. Repeat until the model emits a response without a tool call. This pattern appears in every production coding agent from Claude Code to Cursor to OpenAI Codex, and Simon Willison’s agentic engineering patterns guide traces its practical implications in careful detail.
The loop itself is straightforward to implement. What makes coding agents behave differently from one another, and what separates ones that actually work from ones that get stuck, is the set of constraints this architecture imposes and the engineering decisions made to work within those constraints.
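The loop described above is small enough to sketch in full. This is a minimal illustration, not any vendor's implementation: `call_model` and `execute_tool` are injected stand-ins for a real model API and tool dispatcher, and the message shapes are simplified.

```python
# Minimal sketch of the tool loop: call the model, run the requested tools,
# append results to the context, repeat until there is no tool call.
# call_model and execute_tool are hypothetical stand-ins, injected so the
# loop itself stays model-agnostic.

def run_agent(call_model, execute_tool, messages):
    while True:
        response = call_model(messages)       # one model call
        messages.append(response)             # the reply joins the context
        tool_calls = response.get("tool_calls", [])
        if not tool_calls:
            return response["content"]        # no tool call: loop terminates
        for call in tool_calls:               # run each requested tool
            result = execute_tool(call["name"], call["args"])
            messages.append({"role": "tool",  # result becomes an observation
                             "name": call["name"],
                             "content": result})
```

Everything else in this article is, in one way or another, a consequence of this loop: the `messages` list is the agent's only memory, and tool results are its only window into the world.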
The Context Window Is Working Memory
A coding agent has no persistent process state. Between tool calls, the only thing that survives is the conversation history passed into the next model call. The context window holds everything: the system prompt, the user’s request, every file the agent has read, every command it has run, every edit it has made, and all of its own reasoning. When the token budget runs out, the agent must compress, externalize, or discard part of that history.
The ReAct paper from 2022 (Yao et al.) formalized the pattern of interleaving reasoning with action: the model writes out its chain of thought before emitting a tool call, then the tool’s output (the observation) feeds back in as context for the next round. This think-act-observe loop follows directly from the context window being the only shared state. The model cannot remember anything it does not write down, so writing down reasoning before acting is structurally necessary for coherent multi-step behavior.
For long coding tasks, the context window fills up fast. The strategies for handling this exhaustion differ between agents:
Summarization compaction. When Claude Code’s context approaches its limit (around 85% capacity, per Anthropic’s documentation), a secondary model call summarizes the conversation so far. The system prompt and recent exchanges are preserved verbatim; the accumulated history in the middle is replaced with a dense summary. This is lossy by design. The alternative is dropping from the front, which loses the original task instructions and early file reads that informed everything that came after.
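The keep-the-ends, summarize-the-middle shape of compaction can be sketched briefly. This is an illustrative policy, not Anthropic's actual implementation: `count_tokens` and `summarize` are hypothetical helpers, and the 85% threshold is the figure cited above.

```python
# Sketch of summarization compaction. The system prompt and the most recent
# exchanges survive verbatim; the accumulated middle is replaced by a single
# summary message. count_tokens() and summarize() are hypothetical helpers.

COMPACT_THRESHOLD = 0.85  # compact at ~85% of the window, per the article

def maybe_compact(messages, window, count_tokens, summarize, keep_recent=6):
    if count_tokens(messages) < COMPACT_THRESHOLD * window:
        return messages                       # still within budget: no-op
    head = messages[:1]                       # system prompt, kept verbatim
    tail = messages[-keep_recent:]            # recent exchanges, kept verbatim
    middle = messages[1:-keep_recent]         # history to compress (lossy)
    summary = {"role": "user",
               "content": "Summary of earlier conversation:\n"
                          + summarize(middle)}
    return head + [summary] + tail
```

The lossiness is concentrated where it hurts least: the summary sits in the middle, while the instructions that frame the task and the exchanges the model is actively working from stay intact.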
File system as swap space. An agent can write notes to files (PLAN.md, scratch files, TODO.md) and read them back in a later turn. This externalizes state past the context window limit. It is a simple pattern, but architecturally significant: the file system becomes extended memory for the agent’s working state.
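The pattern is almost trivially simple in code; the point is what it buys. A minimal sketch, with `PLAN.md` standing in for whatever scratch file the agent chooses:

```python
# Sketch of the file-system-as-memory pattern: the agent persists its plan
# to disk in one turn and reloads it in a later turn, even after the
# conversation that produced the plan has been compacted away.
from pathlib import Path

def save_plan(workdir, plan_text):
    Path(workdir, "PLAN.md").write_text(plan_text)   # externalize state

def load_plan(workdir):
    plan = Path(workdir, "PLAN.md")
    return plan.read_text() if plan.exists() else None  # None: nothing saved
```

Because the read happens through the ordinary file-read tool, the plan re-enters the context as a fresh observation, paid for once, rather than riding along in every model call.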
Subagents with isolated contexts. Claude Code’s Task tool spawns independent subagents, each with a clean context window. The parent describes the subtask; the subagent runs it and returns a result. Each subagent starts with no accumulated history, which prevents context pollution between parallel workstreams but also means the parent must explicitly give each subagent whatever context it needs. This is the same trade-off you navigate with microservices or with fork-based child processes: isolation costs you shared state.
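The isolation trade-off shows up directly in how a subagent is seeded. A sketch, assuming any tool loop of the kind this article opens with; the function and message shapes here are illustrative, not Claude Code's actual Task tool:

```python
# Sketch of subagent isolation: the subagent's message list starts fresh,
# seeded only with what the parent explicitly passes. Nothing from the
# parent's accumulated history leaks in implicitly. run_loop is any
# agent loop; all names here are hypothetical.

def spawn_subagent(run_loop, call_model, execute_tool, task, context_notes):
    messages = [
        {"role": "system", "content": "You are a focused subagent."},
        # The parent must spell out everything the subagent needs to know:
        {"role": "user", "content": context_notes + "\n\nTask: " + task},
    ]
    return run_loop(call_model, execute_tool, messages)  # returns a result
```

The cost of the clean context is visible in `context_notes`: every relevant fact the parent learned must be restated explicitly, exactly as with a forked child process that shares no memory with its parent.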
Tools as the System Call Interface
The tools available to a coding agent play the same role as system calls in a Unix process. The model operates in a constrained environment; tools are the defined interface through which it reaches outside and interacts with the real world.
The standard toolkit has converged across major coding agents:
- File read (with optional line-range for large files)
- File write and targeted edit
- Bash execution
- Directory listing and glob-based file search
- Regex content search
- Web fetch
Cursor’s leaked system prompt from early 2025 showed tools named codebase_search, read_file, edit_file, run_terminal_cmd, grep_search, file_search, and list_dir. Claude Code’s tools include Read, Write, Edit, Bash, Glob, Grep, WebFetch, and Task for spawning subagents. The names differ; the capabilities are nearly identical across products.
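A tool in this toolkit is exposed to the model as a name, a description, and a parameter schema. The sketch below uses the JSON-schema style that major model APIs use for function calling; the field names and the tool itself are illustrative, not any specific vendor's exact schema.

```python
# One way to declare a line-range-aware file-read tool. The model never sees
# the implementation, only this declaration; the description text is the
# model's entire documentation for the tool. Names here are illustrative.

read_file_tool = {
    "name": "read_file",
    "description": "Read a file, optionally restricted to a line range.",
    "input_schema": {
        "type": "object",
        "properties": {
            "path": {"type": "string",
                     "description": "Path to the file to read"},
            "start_line": {"type": "integer",
                           "description": "1-based first line (optional)"},
            "end_line": {"type": "integer",
                         "description": "Inclusive last line (optional)"},
        },
        "required": ["path"],
    },
}
```

The descriptions do real work: they are the only documentation the model has, which is why tool definitions across products read less like API specs and more like instructions to a junior engineer.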
The bash tool functions as the general escape hatch when no specialized tool covers a need. Agents use it to run test suites, inspect git state, install packages, lint code, and grep for patterns the specialized search tool missed. The output of every command comes back into the context as an observation, making the shell the richest signal source an agent has about the state of the codebase.
Why the Edit Format Is an Engineering Decision
One of the less obvious but consequential choices in coding agent design is how file edits are represented. Two approaches are common.
Full file replacement: read the current file, ask the model for the new contents, write it back. This is easy to implement, expensive in tokens (you pay for the full read and the full write), error-prone on long files (models occasionally omit lines when regenerating hundreds of lines of code), and produces large diffs for small changes.
Targeted replacement: the model specifies a unique string to locate in the file and the replacement text. Anthropic calls this str_replace_editor in their tool definitions; Claude Code uses old_string and new_string parameters. The model must identify exactly what text currently exists before it can edit, which functions as a built-in verification step. If the model hallucinates content that is not in the file, the tool call fails immediately with a descriptive error, and the model can course-correct on the next turn.
The failure mode for targeted replacement is non-unique matches. A file containing twenty identical log statements will produce an ambiguous match. Good implementations handle this by requiring enough surrounding context to make the match unique, or by exposing line numbers as a tiebreaker. The tradeoff is that the model must read the file carefully before editing, rather than generating from memory.
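Both failure modes described above, hallucinated content and ambiguous anchors, can be handled in a few lines. A sketch of a targeted-replacement tool, not Anthropic's actual `str_replace_editor` implementation:

```python
# Sketch of a targeted-replacement edit with the two checks the article
# describes: zero matches means the model referenced content that is not in
# the file; more than one match means the anchor is ambiguous. Both come
# back as descriptive errors the model can read and correct on its next turn.

def str_replace(text, old_string, new_string):
    count = text.count(old_string)
    if count == 0:
        return None, "Error: old_string not found in file"
    if count > 1:
        return None, (f"Error: old_string matches {count} locations; "
                      "include more surrounding context to make it unique")
    return text.replace(old_string, new_string), "OK"
```

The error strings matter as much as the checks: they become observations, so a message that says *why* the edit failed is the difference between a one-turn correction and a blind retry.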
Error Recovery Is a Structural Property, Not a Feature
In a conventional program, an error typically propagates as an exception or terminates the process. In a coding agent, an error is an observation like any other. It goes into the context window, and the model reads it on the next turn.
When a bash command exits with a non-zero code, when a file read fails because the path does not exist, when a test suite reports failures, all of that output becomes data the model can reason about. This self-correcting behavior is not bolted on; it follows directly from the tool loop architecture. The model observes failure and adapts, rather than the developer having to anticipate every failure mode in advance.
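Concretely, a bash tool built for this architecture never raises on failure; it serializes everything, including the exit code, into the observation. A minimal sketch using Python's standard `subprocess` module:

```python
# Sketch of a bash tool whose failures become observations rather than
# exceptions: stdout, stderr, and the exit code all go back into the
# context for the model to reason about on the next turn.
import subprocess

def run_bash(command, timeout=60):
    try:
        proc = subprocess.run(command, shell=True, capture_output=True,
                              text=True, timeout=timeout)
        return (f"exit code: {proc.returncode}\n"
                f"stdout:\n{proc.stdout}\n"
                f"stderr:\n{proc.stderr}")
    except subprocess.TimeoutExpired:
        # Even a timeout is just data: the model learns the command hangs.
        return f"Error: command timed out after {timeout}s"
```

A non-zero exit code is not an error path here at all; it is simply a different string in the context window.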
In practice, agents navigate unfamiliar codebases through exploration and observation. They read a file that seems relevant, follow import paths to discover other modules, run the test suite to establish a baseline, and adjust their plan based on what they find. The path through the task is not computed upfront; it emerges from the sequence of observations.
Cursor’s 2025 leaked system prompt included explicit instructions around error recovery: when a command fails, read the output, identify the specific error, address it before retrying. The instruction is mundane, but it distinguishes agents that recover from agents that retry blindly or give up.
The Permission Boundary
Every coding agent has to decide how much to do autonomously before pausing to confirm. Full autonomy executes every tool call without interruption; full caution asks the user before every write. The engineering challenge is calibrating the middle.
Claude Code’s default permission model requires confirmation for bash commands and file writes, with an opt-in --dangerously-skip-permissions flag for fully autonomous runs. The model assesses risk based on reversibility: reading files is free; writing without a recoverable git state is not. Git checkpointing is the safety net that makes autonomy safer, because the agent can stage or commit after each logical unit of work and the user can review incremental progress.
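A reversibility-based gate of this kind can be sketched as a small policy function. This is one hypothetical policy in the spirit described above, not Claude Code's actual rules; the tool names and destructive-command heuristics are illustrative.

```python
# Sketch of a reversibility-based permission gate: reads are always free,
# destructive commands always ask, and ordinary writes are autonomous only
# when a git checkpoint makes them recoverable (or the user opted out of
# permissions entirely). All names and heuristics here are hypothetical.

READ_ONLY = {"read_file", "list_dir", "grep_search"}
DESTRUCTIVE_HINTS = ("rm -rf", "git reset --hard", "git push --force")

def needs_confirmation(tool_name, args, skip_permissions=False,
                       git_clean=False):
    if tool_name in READ_ONLY:
        return False                    # reading is always reversible
    if tool_name == "bash" and any(h in args.get("command", "")
                                   for h in DESTRUCTIVE_HINTS):
        return True                     # destructive: always ask
    if skip_permissions:
        return False                    # explicit opt-in to full autonomy
    return not git_clean                # writes are safe if recoverable
```

The interesting line is the last one: the question is never "is this a write?" but "can this write be undone?", which is why git checkpointing expands what the agent can do autonomously.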
OpenAI’s Codex takes a different approach by operating on a cloud-hosted git clone of the repository rather than the user’s local working tree. Edits are isolated until the user explicitly merges the agent’s output. This architectural choice moves the permission boundary to the merge step, giving the agent full autonomy within its sandbox without risk to the user’s local state.
These architectural details converged across products not through coordination but through constraint. The context window forces external state management. The need to verify edits before applying them pushes toward targeted replacement. The value of being able to recover from errors pushes toward detailed tool output. Each decision follows from the shape of the underlying problem, which is why implementations from different organizations look so similar beneath the surface.