The Loop Is the Agent: What Actually Happens Inside a Coding Tool
Source: simonwillison
The premise of a coding agent is simple enough to state in two sentences. Give an LLM access to a set of tools, then loop: the model looks at the conversation history, decides whether to call a tool or produce a final answer, and if it calls a tool, the result gets appended to the history and the loop runs again. Simon Willison’s guide to agentic engineering patterns puts it plainly: the “agentic” part is just this loop, and the model drives it.
What makes coding agents interesting is not the loop itself, which any competent developer could implement in an afternoon. The interesting parts are the decisions made by the scaffolding layer that wraps the loop: how tools are defined, how the growing context is managed, how errors are surfaced back to the model, and how the system decides when to stop.
What Tool Calling Actually Looks Like
Modern LLM APIs expose tool calling through a structured mechanism. You define tools as JSON schemas, pass them alongside your messages, and the model responds with either a text completion or a structured tool call object. With Anthropic’s API, that looks roughly like this:
tools = [
{
"name": "read_file",
"description": "Read the contents of a file at the given path",
"input_schema": {
"type": "object",
"properties": {
"path": {
"type": "string",
"description": "Absolute path to the file"
}
},
"required": ["path"]
}
}
]
response = client.messages.create(
model="claude-opus-4-6",
max_tokens=4096,
tools=tools,
messages=messages
)
When the model wants to read a file, it returns a tool_use content block with the tool name and a JSON payload matching the schema. The scaffolding executes the actual file read, wraps the result in a tool_result block, appends both to the message history, and calls the API again.
This is the complete cycle. Every iteration adds at minimum two entries to the conversation: the model’s tool call, and the tool’s result. After ten iterations on a moderate codebase exploration, you have twenty extra message entries on top of whatever code and instructions already occupied the context.
The ReAct Pattern Under the Hood
The intellectual foundation for this loop is the ReAct paper from 2022, which stands for Reason plus Act. The core insight was that interleaving verbal reasoning (“I need to find the function that handles authentication”) with concrete actions (“search for def authenticate”) produces dramatically better results than either pure reasoning or pure action. Language models are good at narrating what they are about to do, and that narration makes subsequent tool calls more accurate.
Coding agents implement this implicitly. The model’s text output before a tool call serves as the reasoning trace. This is why Claude Code, Cursor, and similar tools stream the model’s thinking as it works: that text is not just for user reassurance, it actively shapes the quality of the tool calls that follow.
The pattern also explains why giving a model a large scratchpad before forcing a decision consistently outperforms asking for immediate answers. Systems like Claude’s extended thinking surface this explicitly. In standard coding agent loops, you get a version of it for free through the conversational format.
Context Window as Working Memory
The context window is the agent’s only working memory. Every file it has read, every shell command it has run, every error it has encountered is present as text in that window or it might as well not have happened. This creates a set of practical constraints that shape how agents are designed.
A naive implementation appends everything to the context indefinitely. This works until it does not: you hit the context limit mid-task, the oldest relevant information falls off, or the sheer volume of prior tool results degrades generation quality because the model’s attention is spread too thin. Long-context models help but do not eliminate the problem.
Practical mitigations take several forms.
Selective inclusion. Rather than including full file contents for every read, agents can summarize, truncate, or extract only the relevant sections. Aider, for instance, maintains a repository map that gives the model a structured overview of the codebase using tree-sitter parse trees, so it does not need to read every file cold.
Summarization passes. Some systems periodically compress the conversation history, asking the model to produce a summary of completed work before continuing. This is lossy but extends effective task length.
Tool result limits. Grep output, directory listings, and shell commands can produce enormous amounts of text. Scaffolding that truncates or paginates these results keeps the context from filling with noise.
The tradeoff is always between information density and recency. A highly compressed context loses details that turn out to matter. A verbose context runs out of space. The right balance depends on task length and complexity.
How Errors Flow Back
One of the less-discussed mechanics is how error states propagate through the loop. When a tool call fails, whether because a path does not exist, a shell command exits non-zero, or a file read returns nothing, the scaffolding has several options.
The simplest approach: return the error as the tool result and let the model decide what to do. This works well because it mirrors how a human developer reads a stack trace and adjusts their approach. The model sees FileNotFoundError: /src/auth.py and searches for the file in a different location. The error becomes part of the reasoning trace.
What works less well is silently swallowing errors or returning empty results. The model will often hallucinate that the operation succeeded and proceed on false assumptions. Explicit error text in tool results is almost always preferable.
Some scaffolding layers add retry logic at the tool level, but this creates a risk of masking problems that the model should know about. A file that intermittently fails to read is interesting information. A tool layer that retries silently hides it.
The Scaffolding Is the Product
The LLM is not where the differentiation happens between coding agents. Claude Code, Cursor, Copilot Workspace, and Aider all have access to capable foundation models. The differences in experience come from the scaffolding: which tools are exposed, how their results are formatted, how context is managed, what system prompts shape the model’s behavior, and how the loop terminates.
A well-designed tool set matters enormously. A read_file tool that returns raw bytes is less useful than one that handles encoding, strips binary content, and returns structured metadata alongside the text. A run_shell tool that captures both stdout and stderr and includes the exit code gives the model more to work with than one that returns only stdout.
Tool descriptions matter too. The description field in a tool’s JSON schema is part of the model’s input on every call. Vague descriptions produce vague tool use. Precise descriptions, including what the tool does and does not do and what edge cases to expect, measurably improve model behavior. This is not a minor implementation detail; it is closer to the primary interface between the scaffolding and the model.
This is why building a coding agent from scratch is a useful exercise even if you plan to use an existing one. You quickly discover that the hard problems are not in calling the API. They are in deciding what the agent should be able to see, what it should be able to do, and how to represent the results of those operations back to a model that has only text to work with.
What Changes at Scale
Single-file edits are well within what even a simple agent loop handles reliably. Cross-repository changes, dependency upgrades, and multi-step refactors expose the limits more quickly.
The primary limit at scale is not model capability but context coherence. A model that has read 40 files, run 15 shell commands, and seen 8 error messages may start to lose track of which constraints it established early in the conversation. Checkpointing, where the model explicitly re-states its current understanding and goals, partially addresses this. Some agent frameworks implement this as a structured planning step at the beginning of long tasks.
The secondary limit is tool latency. Each iteration of the loop involves at least one API call and one tool execution. On slow networks or with expensive tools, a 30-step task can take minutes. This is why streaming matters: users need continuous feedback that progress is being made, not a long silence followed by either a completed diff or an error.
The third limit is determinism. Coding agents are probabilistic systems making decisions about deterministic artifacts. A model that decides to use a slightly different import style halfway through a refactor, or that forgets it already handled a particular edge case, produces inconsistent output that requires human review. The scaffolding can enforce some consistency through explicit instructions and structured output schemas, but not all of it.
These limits are not reasons to avoid coding agents. They are parameters that define where agents work well and where they need human intervention. Understanding them from the inside, by understanding the loop and what drives it, is what lets you use these tools where they are strong and not where they are not.