The Agent Loop Is a Conversation: How Coding Agents Actually Execute Tasks
Source: Simon Willison
Coding agents have become a standard part of software development, but their internals are often treated as black boxes. Simon Willison’s guide on agentic engineering patterns gives a clear map of the territory. This post zooms into the mechanical core of that map: the tool loop, how context accumulates, and what that means for building and debugging agents that actually hold together.
The Loop Is the Architecture
At the foundation, a coding agent is a loop. The loop works like this:
- Assemble a context window: system prompt, conversation history, tool results so far
- Send to the LLM and receive a response
- If the response contains tool calls, execute them, append the results to the context, and go back to step 1
- If the response contains no tool calls, return it to the user as the final answer
This is sometimes called the ReAct pattern, from the 2022 paper by Yao et al. that demonstrated interleaving reasoning traces with actions. The insight is that the model should generate a thought, then an action, then observe the result, then think again. Modern LLM tool-use APIs operationalize this directly. In both the Anthropic and OpenAI APIs, tool calls are structured outputs in the model’s response, and tool results are messages appended before the next call.
Here is what a minimal version of this loop looks like with the Anthropic Python SDK:
```python
import anthropic

client = anthropic.Anthropic()
messages = [{"role": "user", "content": "Read main.py and find any obvious bugs"}]

while True:
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=4096,
        tools=tools,  # tool definitions, declared elsewhere
        messages=messages,
    )
    messages.append({"role": "assistant", "content": response.content})
    # "end_turn" means the model produced a final answer with no tool calls.
    if response.stop_reason == "end_turn":
        break
    # Execute every tool call in the response and send the results back
    # as tool_result blocks in a user message.
    tool_results = []
    for block in response.content:
        if block.type == "tool_use":
            result = execute_tool(block.name, block.input)  # dispatcher, defined elsewhere
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": result,
            })
    messages.append({"role": "user", "content": tool_results})
```
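The loop calls `execute_tool`, which the snippet leaves undefined. A minimal sketch of that dispatcher, assuming a `read_file` and a `bash` tool (names and error-string conventions are illustrative):

```python
import subprocess

def read_file(path: str) -> str:
    # Return file contents, or an error message the model can act on.
    try:
        with open(path, "r", encoding="utf-8") as f:
            return f.read()
    except OSError as e:
        return f"Error reading {path}: {e}"

def run_bash(command: str) -> str:
    # Capture stdout and stderr together so failures are visible to the model.
    proc = subprocess.run(command, shell=True, capture_output=True, text=True, timeout=60)
    return f"exit code {proc.returncode}\n{proc.stdout}{proc.stderr}"

def execute_tool(name: str, tool_input: dict) -> str:
    # Dispatch on the tool name the model chose; unknown tools return an
    # error string rather than raising, so the loop keeps running.
    if name == "read_file":
        return read_file(tool_input["path"])
    if name == "bash":
        return run_bash(tool_input["command"])
    return f"Error: unknown tool {name!r}"
```

Returning errors as strings, rather than raising exceptions, keeps every failure inside the conversation where the model can see and respond to it.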
The context grows with every iteration. Each tool call and its result stays in the message history. By the time an agent has read ten files and run three shell commands, the context window contains all of that, including every intermediate reasoning step. This has direct implications for cost, latency, and what the model can see at any given point in the run.
What Makes Coding Agents Different
General-purpose agents might browse the web, call external APIs, or query databases. Coding agents specifically operate on a filesystem and a shell. Their canonical tool set is:
- read_file: Read the contents of a file at a given path
- write_file / edit_file: Write or apply targeted edits to a file
- bash / run_command: Execute a shell command and capture stdout and stderr
- search / grep: Search for patterns across files
- list_directory: Enumerate paths in a directory
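This tool set reaches the model through the `tools` parameter shown in the earlier loop. A minimal sketch of two definitions in the Anthropic tool-use format, where each entry pairs a description the model reads with a JSON Schema for its input (descriptions here are illustrative):

```python
# Tool definitions as passed to client.messages.create(tools=...).
tools = [
    {
        "name": "read_file",
        "description": "Read the contents of a file at a given path.",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Path to the file to read"},
            },
            "required": ["path"],
        },
    },
    {
        "name": "bash",
        "description": "Execute a shell command and return stdout and stderr.",
        "input_schema": {
            "type": "object",
            "properties": {
                "command": {"type": "string", "description": "The command to run"},
            },
            "required": ["command"],
        },
    },
]
```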
The shell tool is the most powerful and the most dangerous. It subsumes most of the others: you can read files with cat, write them with redirects, and search with grep. But raw shell access also means the agent can delete files, install packages, make network requests, or do anything the process user can do. This is why most mature coding agent frameworks add a confirmation layer, or scope shell execution to a container.
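A confirmation layer of this kind can be sketched with an allowlist of read-only command prefixes and a user-supplied `confirm` callback; the prefix list and function name are illustrative, not any particular framework's API:

```python
import subprocess

# Commands starting with these prefixes are treated as read-only (a rough
# heuristic for a sketch; a real implementation needs more care).
SAFE_PREFIXES = ("ls", "cat ", "grep ", "git status", "git diff")

def guarded_bash(command: str, confirm) -> str:
    # Anything outside the allowlist requires confirm(command) to return True
    # before it runs; rejection goes back to the model as a tool result.
    if not command.startswith(SAFE_PREFIXES):
        if not confirm(command):
            return "Command rejected by user."
    proc = subprocess.run(command, shell=True, capture_output=True, text=True)
    return proc.stdout + proc.stderr
```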
The edit tool deserves particular attention. Writing entire file contents on every change is wasteful; it fills the context with file contents twice, once when reading and once when writing back. Most agents instead use a diff-based or search-replace-based edit tool, where the agent specifies the exact string to replace and what to replace it with. Claude Code uses this approach. The constraint forces the model to be precise about what it is changing, and the tool can reject edits where the search string does not match, surfacing a category of error before it propagates.
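A minimal sketch of such a search-replace edit, including the rejection behavior for missing and ambiguous matches (function name illustrative):

```python
def apply_edit(content: str, search: str, replace: str) -> str:
    # Reject edits whose search string is missing or ambiguous, so the error
    # surfaces to the model instead of silently corrupting the file.
    count = content.count(search)
    if count == 0:
        raise ValueError("Edit rejected: search string not found in file")
    if count > 1:
        raise ValueError(f"Edit rejected: search string matches {count} times; include more context")
    return content.replace(search, replace)
```

The ambiguity check is what forces precision: if the string the model wants to replace appears twice, it must quote enough surrounding context to make the match unique.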
Context Accumulation and Its Consequences
The context window is both the agent’s working memory and its execution log. Every observation appended during a run is visible to subsequent steps. This has effects that are easy to miss.
First, the model’s behavior at step N is conditioned on everything from steps 1 through N-1. An agent that read a file early in a run, then had that file modified by a later tool call, might still reason from the stale content if it does not re-read the file. Some agent frameworks address this by injecting a note in the tool result that a file has changed, or by re-reading modified files automatically.
Second, context pressure is real. The Claude API supports up to 200,000 tokens for current models, but agent runs on large codebases accumulate context quickly. Reading ten moderately large source files can consume 40,000 to 60,000 tokens before the agent has written a single line of output. This drives architectural decisions: agents that can summarize intermediate results, truncate tool outputs, or use a retrieval layer to avoid reading entire files are more robust on large tasks.
Third, tool output quality directly affects model output quality. If a bash command fails silently, or returns a wall of stack trace without the relevant error line, the model’s next step will be based on noise. A well-designed tool truncates from the middle rather than the end (keeping the beginning and end of output where errors usually appear), tags error lines explicitly, and formats structured data as tables or abbreviated JSON rather than raw output.
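The middle-truncation described above can be sketched in a few lines, approximating the token budget with a character budget:

```python
def truncate_middle(text: str, max_chars: int = 8000) -> str:
    # Keep the head and tail of the output, where compiler and test-runner
    # errors usually appear, and elide the middle with an explicit marker.
    if len(text) <= max_chars:
        return text
    keep = (max_chars - 60) // 2  # reserve ~60 chars for the marker line
    omitted = len(text) - 2 * keep
    return text[:keep] + f"\n... [{omitted} characters truncated] ...\n" + text[-keep:]
```

The explicit marker matters: the model should know that output was cut, and roughly how much, rather than reasoning from what looks like complete output.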
Error Recovery
A real coding agent run will encounter errors. Tests fail, files are not where expected, shell commands exit nonzero. The question is whether the agent recovers cleanly or spirals.
The basic recovery pattern is to include error output in the tool result and let the model respond to it. If npm test fails, the full output including the failing assertion and stack trace goes back as the tool result, and the model reads it and decides what to edit. This works well when errors are specific and the model has enough context to understand them.
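The Anthropic API supports flagging a tool result as a failure via an `is_error` field on the `tool_result` block. A sketch of wrapping a shell command that way (helper name is illustrative):

```python
import subprocess

def bash_tool_result(tool_use_id: str, command: str) -> dict:
    # Run the command and build a tool_result block; a nonzero exit code
    # sets is_error so the model treats the output as a failure to fix.
    proc = subprocess.run(command, shell=True, capture_output=True, text=True)
    content = proc.stdout + proc.stderr
    result = {
        "type": "tool_result",
        "tool_use_id": tool_use_id,
        "content": content or f"(no output, exit code {proc.returncode})",
    }
    if proc.returncode != 0:
        result["is_error"] = True
    return result
```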
Where it breaks down is with cascading errors. The model makes an edit that introduces a syntax error, which causes the test command to fail with a parse error rather than a test failure, which the model misinterprets as a test infrastructure problem, which causes it to try reinstalling dependencies. The intervention point is usually the tool layer: an edit tool that validates syntax before writing, or a linter pass that runs automatically after edits, can short-circuit the cascade before it develops.
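For Python files, the syntax check that short-circuits this cascade can be as simple as parsing the new contents before writing them; a language-specific sketch, with an illustrative function name:

```python
import ast

def validated_write(path: str, content: str) -> str:
    # Refuse to write a Python file that does not parse; the parse error
    # goes back to the model as the tool result instead of landing on disk.
    if path.endswith(".py"):
        try:
            ast.parse(content)
        except SyntaxError as e:
            return f"Edit rejected: resulting file has a syntax error at line {e.lineno}: {e.msg}"
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
    return f"Wrote {len(content)} characters to {path}"
```

This keeps the error attached to the edit that caused it, so the model never gets to the stage of misreading a parse failure as a test infrastructure problem.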
Some frameworks add an explicit retry budget: the agent gets N attempts to fix a failing test, and if it has not succeeded after N iterations, it surfaces the problem to the user. This is a reasonable guard for production use, where unbounded loops are a cost and latency risk.
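A retry budget can be sketched as a wrapper around the agent loop, assuming a `run_tests` callable that reports pass/fail and an `attempt_fix` step that asks the agent to edit based on the failure output (both names are illustrative):

```python
def fix_with_budget(run_tests, attempt_fix, max_attempts: int = 3) -> str:
    # run_tests() returns (passed: bool, output: str); attempt_fix(output)
    # drives one round of agent edits based on the failure output.
    for attempt in range(1, max_attempts + 1):
        passed, output = run_tests()
        if passed:
            return f"Tests passing after {attempt - 1} fix attempt(s)."
        attempt_fix(output)
    passed, output = run_tests()
    if passed:
        return f"Tests passing after {max_attempts} fix attempt(s)."
    return f"Giving up after {max_attempts} attempts; surfacing to user:\n{output}"
```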
The System Prompt as Constitution
The loop diagram does not make visible how much work the system prompt does. The system prompt for a coding agent typically specifies:
- What tools are available and how to use them
- When to ask for confirmation before taking destructive actions
- How to format edits (use the search-replace tool, not full file writes)
- How to handle ambiguity (ask the user versus make a reasonable assumption)
- Constraints on what the agent should and should not change
Claude Code’s system prompt includes explicit guidance about preferring targeted edits over full rewrites, reading files before editing them, and not making changes beyond the scope of the request. These constraints are not enforced mechanically at the API layer; they are behavioral nudges that shape how the model uses its tools.
The distinction matters for reliability. A system prompt constraint can be violated if the model is sufficiently confused or if the conversation history is long enough to dilute it. Mechanical enforcement (a tool that simply refuses to write a file that was not first read in the current session) is more robust, but also more rigid. Production coding agents tend to use both: soft constraints in the system prompt for nuanced behavior, hard constraints at the tool layer for invariants that must hold.
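The read-before-write invariant mentioned above is a good example of a hard constraint at the tool layer; a sketch, with an illustrative class name:

```python
class SessionFileTools:
    # Track which paths were read this session and refuse to write any
    # path the agent has not read first.
    def __init__(self):
        self._read_paths = set()

    def read_file(self, path: str) -> str:
        with open(path, "r", encoding="utf-8") as f:
            content = f.read()
        self._read_paths.add(path)
        return content

    def write_file(self, path: str, content: str) -> str:
        if path not in self._read_paths:
            return f"Write rejected: read {path} before writing it."
        with open(path, "w", encoding="utf-8") as f:
            f.write(content)
        return f"Wrote {path}"
```

Unlike a system prompt instruction, this cannot be diluted by a long conversation: the rejection happens regardless of what the model believes it should do.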
Why the Loop Mental Model Matters
Coding agents are not magic. They are a loop, a context window, a set of tools, and a system prompt. Understanding that loop is what separates debugging a stuck agent from staring at it helplessly. When an agent makes a wrong assumption, you can find the tool result that seeded it. When it hallucinates a file path, you can see that it never called list_directory to verify. When it spirals on an error, you can trace which tool output started the spiral.
Willison’s framing in his agentic engineering patterns guide treats this as an engineering discipline. That framing is right. The interesting work in coding agents lives in the tool design, the context management strategy, the error recovery heuristics, and the confirmation boundaries. The model is one component in a system, and it is the system design that determines whether the agent is actually useful.