Tool Calls All the Way Down: The Architecture Behind Coding Agents
Source: simonwillison
The agent loop is the foundational mechanism of every coding agent. Simon Willison’s guide to agentic engineering patterns lays out the core mechanics clearly, but the loop has deeper implications for agent behavior, failure modes, and scaffolding quality that are worth working through in detail.
The loop is simple. A client sends a prompt, a conversation history, and a set of tool definitions to an LLM API. The model responds with either a text completion or a tool call. If it calls a tool, the client executes that tool, appends the result to the conversation, and sends everything back to the model. This repeats until the model produces a final text response with no further tool calls.
In pseudocode:
messages = [{"role": "user", "content": user_prompt}]
while True:
    response = llm.complete(messages=messages, tools=tools)
    if response.stop_reason == "end_turn":
        return response.text
    if response.stop_reason == "tool_use":
        # Record the assistant turn once, then return all tool results together
        messages.append({"role": "assistant", "content": response.content})
        results = [tool_result(tool_call.id, execute_tool(tool_call.name, tool_call.input))
                   for tool_call in response.tool_calls]
        messages.append({"role": "user", "content": results})
This is roughly what the loop looks like against Anthropic’s Messages API. OpenAI’s function calling follows the same pattern with different field names. The scaffolding, the code that runs this loop, is the coding agent. The model is stateless; all state lives in the message history passed back with each request.
What Tools Actually Enable
The model’s world is its context window. Without tools, a coding agent can only read what you paste in and respond with text. Tools are the mechanism by which the agent interacts with a real codebase. A minimal coding agent needs at least four things: a file reader to inspect code it hasn’t seen, a file writer to apply changes, a shell executor to run tests and compilers, and a search tool to find relevant code without reading everything.
Those four tools cover the bulk of what coding agents do. The shell executor is the most powerful and the most dangerous, since it can run arbitrary commands. Agents like Aider have used direct shell access for years; newer agents add permission layers and confirmation prompts for destructive operations.
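A minimal sketch of those four tools behind a single dispatch function is below. The tool names and the registry shape are this sketch's own invention, not any particular agent's implementation, and the shell executor deliberately illustrates the danger: it runs whatever it is given, with none of the permission layers a real agent would add.

```python
import subprocess
from pathlib import Path

def read_file(path):
    return Path(path).read_text()

def write_file(path, content):
    Path(path).write_text(content)
    return f"wrote {len(content)} bytes to {path}"

def run_shell(command):
    # The most dangerous tool: executes arbitrary commands.
    # Real agents gate this behind confirmation prompts; this sketch does not.
    proc = subprocess.run(command, shell=True, capture_output=True, text=True, timeout=60)
    return proc.stdout + proc.stderr

def search_code(query, root="."):
    # Naive literal search over Python files; real agents use ripgrep or similar.
    hits = []
    for p in Path(root).rglob("*.py"):
        for i, line in enumerate(p.read_text(errors="ignore").splitlines(), 1):
            if query in line:
                hits.append(f"{p}:{i}: {line.strip()}")
    return "\n".join(hits)

TOOLS = {"read_file": read_file, "write_file": write_file,
         "run_shell": run_shell, "search_code": search_code}

def execute_tool(name, tool_input):
    # Return errors as strings so the model can see them and react,
    # rather than crashing the loop.
    try:
        return TOOLS[name](**tool_input)
    except Exception as e:
        return f"error: {e}"
```

Returning errors in-band rather than raising is a deliberate choice: the model can often recover from a failed tool call if it can read the failure.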
Here is what a basic tool definition looks like in the Anthropic API format:
{
  "name": "read_file",
  "description": "Read the contents of a file at the given path",
  "input_schema": {
    "type": "object",
    "properties": {
      "path": {
        "type": "string",
        "description": "The absolute path to the file"
      }
    },
    "required": ["path"]
  }
}
The description field is not documentation for the developer; it is instruction for the model. The quality of that description determines whether the model uses the tool correctly and when it chooses to use it at all. Vague descriptions produce unpredictable behavior. This is one of the more underappreciated parts of building a reliable coding agent.
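To make the point concrete, here is a hypothetical contrast between a vague description and a precise one for the same tool. The wording is invented for illustration; the schema shape follows the Anthropic format shown above.

```python
# Vague: the model has to guess what is searched, what comes back,
# and when to prefer this over reading files directly.
vague = {
    "name": "search_code",
    "description": "Search",
    "input_schema": {"type": "object",
                     "properties": {"query": {"type": "string"}},
                     "required": ["query"]},
}

# Precise: states the scope, the result format, and when to use it.
precise = {
    "name": "search_code",
    "description": (
        "Search file contents in the current repository for a literal string. "
        "Returns matching lines as 'path:line_number: text'. Use this instead "
        "of read_file when you need to locate a symbol but do not know which "
        "file it is in."
    ),
    "input_schema": {"type": "object",
                     "properties": {"query": {
                         "type": "string",
                         "description": "Literal string to search for (not a regex)"}},
                     "required": ["query"]},
}
```

The two definitions are identical to the type checker; only the model can tell them apart, which is exactly why the description is instruction rather than documentation.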
Context Window Pressure
The most important constraint in the agent loop is the context window. Every tool result gets appended to the conversation history, and that history grows with every iteration. Reading a 2,000-line file, running a test suite that outputs 500 lines of errors, and searching across a large codebase can each consume tens of thousands of tokens. As the context fills up, the model’s effective attention on early messages degrades, and eventually the conversation exceeds the context limit entirely.
Different agents handle this differently. Aider uses a summarization approach, compressing older messages when the context grows too large. Claude Code tracks token usage and prunes or summarizes when approaching limits. Some agents chunk file reads to avoid ingesting large files all at once. None of these solutions are free: summarization loses fidelity, truncation drops information, and chunking adds round trips.
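A crude version of the pruning strategy can be sketched as follows. The four-characters-per-token estimate and the keep-the-first-message-plus-recent-tail heuristic are simplifying assumptions for illustration, not any named agent's actual policy.

```python
def estimate_tokens(message):
    # Rough heuristic: roughly 4 characters per token for English text and code.
    return len(str(message["content"])) // 4

def prune_history(messages, budget=100_000):
    """Keep the first message (the original task) plus as many recent
    messages as fit in the budget; replace the dropped middle with an
    explicit marker so the model knows information was elided."""
    if sum(estimate_tokens(m) for m in messages) <= budget:
        return messages
    kept_tail = []
    used = estimate_tokens(messages[0])
    for m in reversed(messages[1:]):
        cost = estimate_tokens(m)
        if used + cost > budget:
            break
        kept_tail.append(m)
        used += cost
    marker = {"role": "user",
              "content": "[earlier tool results elided to fit the context window]"}
    return [messages[0], marker] + list(reversed(kept_tail))
```

Even this toy version shows the trade-off named above: the elided middle is simply gone, and a decision made there is invisible to later turns unless it was restated more recently.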
This is also why coding agents sometimes lose track of decisions made early in a long session. A model that committed to a particular design in turn 3 of a 40-turn conversation may not weigh that decision heavily when making a related choice in turn 38. The architecture gives the model nothing beyond the tokens currently in its window, and context management is engineering work, not a solved problem.
Planning and the Limits of Statelessness
The agent loop is stateless from the model’s perspective. Each API call gives the model a fresh set of tokens representing the conversation so far. The model has no persistent memory, no background reasoning process, and no internal state between calls. What looks like a planning session is the model reasoning over its own prior text completions, all of which are sitting in the conversation history as ordinary tokens.
The ReAct paper (Yao et al., 2022) formalized the think-act-observe pattern that most current agents follow: the model reasons about what to do, takes an action via a tool call, observes the result, and reasons again. The pattern works well for linear tasks but struggles with tasks that require backtracking or holding multiple hypotheses simultaneously. The conversation history is an append-only log; there is no built-in mechanism for the model to undo a line of reasoning that turned out to be wrong.
Some frameworks address this with explicit planning steps. Before entering the tool loop, the model produces a written plan as a text response. That plan becomes part of the context and serves as an anchor for subsequent decisions. This is a prompt engineering workaround for an architectural limitation, but it works well in practice for moderately complex tasks.
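In loop terms, the workaround is a single extra completion before tool use begins. A sketch, where the prompt wording is invented and `llm` and `loop` stand in for whatever client and tool loop the agent already has:

```python
def run_with_plan(llm, tools, user_prompt, loop):
    """Ask for a written plan first, then hand off to the ordinary tool loop."""
    messages = [
        {"role": "user", "content": user_prompt},
        {"role": "user", "content": "Before using any tools, write a short numbered plan for this task."},
    ]
    # Phase 1: no tools offered, so the model can only answer with text.
    plan = llm.complete(messages=messages, tools=[])
    messages.append({"role": "assistant", "content": plan.text})
    # Phase 2: the plan now sits in the history as ordinary tokens,
    # anchoring every subsequent decision inside the tool loop.
    return loop(llm, tools, messages)
```

Offering an empty tool list in phase 1 is the whole trick: it forces a text response, and that text becomes the anchor.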
Scaffolding Quality Is Half the Product
The scaffolding is everything outside the model: the loop itself, tool implementations, error handling, context management, permission systems, and the system prompt. The model and the scaffolding interact continuously, and the quality of each determines the quality of the agent.
A poorly written file-reading tool that silently truncates large files causes the model to make decisions based on incomplete information, with no way to know the information was truncated. A shell executor that swallows stderr causes debugging loops to spin indefinitely. A system prompt that does not clearly define the agent’s role or constraints produces inconsistent behavior across sessions.
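The fix for the silent-truncation failure is to make the truncation visible in-band, inside the tool result itself. A sketch, with the 50,000-character cap chosen arbitrarily for illustration:

```python
from pathlib import Path

MAX_CHARS = 50_000  # arbitrary cap for this sketch

def read_file(path, max_chars=MAX_CHARS):
    """Read a file, and if it must be cut, say so in the result so the
    model knows it is working with partial information."""
    text = Path(path).read_text(errors="replace")
    if len(text) <= max_chars:
        return text
    omitted = len(text) - max_chars
    return (text[:max_chars]
            + f"\n[TRUNCATED: {omitted} more characters; "
              "re-read with an offset or use search instead]")
```

The marker is addressed to the model, not the developer: it names the failure and suggests a recovery path, which is exactly the kind of scaffolding detail that separates agents of otherwise similar capability.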
This is a significant reason why open-source coding agents vary so much in quality even when they use the same underlying model. Aider has spent years refining its scaffolding, prompting strategies, and context management. The model it uses matters less than the accumulated engineering surrounding it. The same observation applies to commercial agents: the model capability ceiling is often not the binding constraint.
Parallel Tool Calls
Modern LLM APIs support parallel tool calls, where the model returns multiple tool calls in a single response. This allows the client to execute them concurrently before sending all results back together, reducing round trips for tasks like reading several related files before synthesizing an answer.
[
  {"type": "tool_use", "id": "t1", "name": "read_file", "input": {"path": "/src/auth.py"}},
  {"type": "tool_use", "id": "t2", "name": "read_file", "input": {"path": "/src/middleware.py"}},
  {"type": "tool_use", "id": "t3", "name": "search_code", "input": {"query": "authenticate_user"}}
]
Handling parallel tool calls correctly requires matching each result back to its corresponding tool call ID when appending to the conversation history. This is straightforward in the happy path but easy to get wrong when tools fail partially, and returning results out of order or conflating them produces confusing model behavior that can be hard to diagnose.
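A sketch of the result-matching step, keyed on the tool-use id and tolerant of individual failures. The result shape follows the Anthropic format; `execute_tool` is assumed to be whatever dispatcher the agent already has.

```python
def run_tool_calls(tool_calls, execute_tool):
    """Execute each call and return tool_result blocks in the same order,
    matched by id. A failed tool becomes an error result rather than
    aborting the whole batch."""
    results = []
    for call in tool_calls:
        try:
            results.append({"type": "tool_result",
                            "tool_use_id": call["id"],
                            "content": execute_tool(call["name"], call["input"])})
        except Exception as e:
            results.append({"type": "tool_result",
                            "tool_use_id": call["id"],
                            "content": f"error: {e}",
                            "is_error": True})
    return results
```

The same structure works if the calls are executed concurrently, as long as each result is collected back under its original id before the batch is appended to the history.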
What This Architecture Cannot Do Well
Long-horizon tasks that require persistent state across sessions need some form of external memory, written to files, stored in a database, or passed as structured context at the start of each session. The conversation history disappears when the session ends, and there is no standard mechanism for agents to resume mid-task across processes.
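The file-based variant of that external memory can be as simple as a notes file the agent loads at session start and rewrites before exit. The path and the structure of the record are invented for this sketch:

```python
import json
from pathlib import Path

MEMORY_PATH = Path("/tmp/agent_memory.json")  # hypothetical location

def load_memory():
    """Read persisted state at session start; empty on the first run."""
    if MEMORY_PATH.exists():
        return json.loads(MEMORY_PATH.read_text())
    return {"decisions": [], "open_tasks": []}

def save_memory(memory):
    """Persist state before the session, and its history, disappears."""
    MEMORY_PATH.write_text(json.dumps(memory, indent=2))

def memory_preamble(memory):
    # Injected as structured context at the start of the next session,
    # since the model itself retains nothing between processes.
    return "Prior session state:\n" + json.dumps(memory, indent=2)
```

This restores continuity but not fidelity: the next session sees only what was written down, not the reasoning that produced it.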
Tasks requiring genuine parallelism, where multiple agents need to coordinate and share state in real time, require multi-agent orchestration beyond the basic loop. Frameworks like LangGraph address this, but they introduce significant coordination complexity and make debugging substantially harder.
Tasks where the correct action is genuinely ambiguous benefit from human-in-the-loop checkpoints. The model will make a decision based on available context, and that decision may not align with what the user intended. Good agents surface ambiguity before acting rather than making confident wrong choices and propagating them through subsequent steps.
Building a Minimal Agent
If you want to understand coding agents from the inside, building a minimal one is the fastest path. The scaffolding for a basic agent is 100 to 200 lines of code. You need an LLM client, a handful of tool implementations, and the loop. The interesting engineering starts after you have that baseline running: how do you manage context pressure, handle partial tool failures gracefully, and write tool descriptions that produce reliable model behavior?
The fundamental point Simon Willison makes in his guide holds up: coding agents are not magic. They are an LLM, a loop, and a set of tools. The sophistication is in how those three things fit together, and that architecture is entirely legible if you read the code.