The Tool Loop as Architecture: What's Actually Happening Inside a Coding Agent

Source: simonwillison

Simon Willison recently published a thorough guide on how coding agents work as part of his broader series on agentic engineering patterns. It is worth reading on its own terms, but I want to go deeper on a specific angle: the mechanics are deceptively simple, and it is the implications of those mechanics that determine whether an agent succeeds or fails on any non-trivial task.

The basic structure is not mysterious. A coding agent is a loop: the model produces a response, that response may include tool calls, the host application executes those tools, and the results feed back into the next model call. This repeats until the model produces a response with no tool calls, or until some stopping condition is met. That is the whole thing.

But describing the loop abstractly obscures what is interesting about it. Let me walk through what is actually happening at the API level, then trace through the implications.

What the API Sees

When you call a model with tool use enabled, you pass a list of tool definitions alongside your messages. In the Anthropic API, that looks like this:

import anthropic

client = anthropic.Anthropic()

tools = [
    {
        "name": "read_file",
        "description": "Read the contents of a file at a given path. Returns the raw text content.",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {
                    "type": "string",
                    "description": "Absolute path to the file"
                }
            },
            "required": ["path"]
        }
    },
    {
        "name": "run_bash",
        "description": "Execute a bash command and return stdout and stderr combined.",
        "input_schema": {
            "type": "object",
            "properties": {
                "command": {
                    "type": "string",
                    "description": "The bash command to run"
                }
            },
            "required": ["command"]
        }
    }
]

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    tools=tools,
    messages=[
        {"role": "user", "content": "What test files exist in this project?"}
    ]
)

If the model decides to use a tool, the response’s stop_reason will be "tool_use" and the content blocks will include tool_use objects:

{
  "stop_reason": "tool_use",
  "content": [
    {
      "type": "text",
      "text": "Let me check the project structure."
    },
    {
      "type": "tool_use",
      "id": "toolu_01abc",
      "name": "run_bash",
      "input": {"command": "find . -name '*.test.*' -o -name '*.spec.*' | head -50"}
    }
  ]
}

Your application executes that command, then sends a new request with the full conversation history plus the tool result:

messages = [
    {"role": "user", "content": "What test files exist in this project?"},
    {"role": "assistant", "content": response.content},
    {
        "role": "user",
        "content": [
            {
                "type": "tool_result",
                "tool_use_id": "toolu_01abc",
                "content": "./src/utils.test.ts\n./src/api.spec.ts\n./tests/integration.test.ts"
            }
        ]
    }
]

The model then continues. If it needs more information, it issues more tool calls. If it has enough to answer, it produces a final text response with stop_reason: "end_turn". The conversation history grows with each round.
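That round trip is the whole loop. Here is a minimal sketch of it, with the model call and the tool executors injected as plain callables so the loop logic stands alone; `call_model` and `tool_executors` are names I am introducing for illustration, not part of the Anthropic SDK (in practice `call_model` would wrap `client.messages.create`):

```python
# A minimal sketch of the agentic loop described above. `call_model` takes a
# messages list and returns a response dict shaped like the API responses
# shown earlier; `tool_executors` maps tool names to Python functions.

def run_agent_loop(call_model, tool_executors, user_prompt, max_rounds=20):
    """Repeat model calls until a response contains no tool calls."""
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_rounds):
        response = call_model(messages)
        messages.append({"role": "assistant", "content": response["content"]})
        if response["stop_reason"] != "tool_use":
            return messages  # final text answer; the loop is done
        # Execute every tool_use block and feed the results back.
        results = []
        for block in response["content"]:
            if block["type"] == "tool_use":
                output = tool_executors[block["name"]](**block["input"])
                results.append({
                    "type": "tool_result",
                    "tool_use_id": block["id"],
                    "content": output,
                })
        messages.append({"role": "user", "content": results})
    raise RuntimeError("agent exceeded max_rounds without finishing")
```

The `max_rounds` cap is one example of the "stopping condition" mentioned above; without it, a confused model can loop indefinitely.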

The Context Window Is the Agent’s Entire State

This single fact changes how you think about everything else. A coding agent has no memory outside the context window. There is no background database lookup, no episodic memory that persists between calls, no state machine running elsewhere, unless the agent explicitly reads and writes files that serve that purpose.

Every tool call result, every file read, every bash command output accumulates in the context. A session that reads a dozen source files, runs a test suite, debugs failures, and iterates on a fix will have a very long context by the end. Claude’s context window is 200K tokens as of early 2026; GPT-4o supports 128K. Claude Code, Anthropic’s terminal-based coding agent, will summarize earlier parts of a conversation when context gets long, compressing the working memory to make room for new information.

That compression is a design choice with real trade-offs. The compressed summary loses fidelity. Something the model understood clearly from reading raw file output may become ambiguous after summarization. This is one reason very long coding sessions can produce subtly worse results toward the end: not because the model’s capabilities degrade, but because its available context has been summarized and some precision has been lost.
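The mechanism and its information loss can be illustrated with a toy compaction policy. Real agents like Claude Code use the model itself to write the summary; the crude placeholder string here, and the character-based token estimate, are deliberate simplifications to show the shape of the trade-off:

```python
# Toy illustration of context compaction: when the estimated token count
# exceeds a budget, older messages collapse into a single summary block.
# The heuristic of ~4 characters per token and the placeholder summary are
# assumptions for illustration, not how any production agent does it.

def estimate_tokens(messages):
    return sum(len(str(m)) for m in messages) // 4

def compact(messages, budget=50_000, keep_recent=4):
    """Keep recent messages verbatim; replace older ones with a summary."""
    if estimate_tokens(messages) <= budget or len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = {
        "role": "user",
        "content": f"[Summary of {len(old)} earlier messages; detail lost.]",
    }
    return [summary] + recent
```

Everything inside `old` is now one opaque block: whatever precision those raw file reads and command outputs carried is gone for the rest of the session.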

Tool Schemas Are Interface Design

The specific tools an agent has access to, and how those tools are defined, directly shape what the agent can and cannot do. This is worth treating as an API design problem rather than a configuration detail.

Claude Code distinguishes between a broad Bash tool for arbitrary shell execution and more specific tools like Read, Write, Edit, and Glob. The specific tools matter because a constrained, purpose-built tool call is easier for the model to reason about. When you call Read with a specific path, the model’s intent is unambiguous and the application can apply permissions or sandboxing cleanly. When you call Bash with an arbitrary command, the intent is harder to audit and the surface area for unintended behavior is larger.

The description field in each tool schema is not cosmetic. The model uses it to decide when to invoke the tool and how to construct valid inputs. A vague description leads to incorrect tool selection and malformed inputs. A description that clearly states the tool’s purpose, its limitations, and the expected input format produces much more reliable behavior. Anthropic’s tool use documentation is explicit about this: the model relies on the description as its primary signal for tool selection.

Granularity also matters. One mega-tool that accepts a JSON payload describing any file operation is technically equivalent to five separate, narrowly scoped tools, but the model will reason about the five separate tools more reliably. The JSON Schema structure provides the scaffolding the model needs to construct valid inputs without guessing.
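One way to see the benefit concretely: input for a narrow tool can be validated mechanically against its schema, while a mega-tool's free-form payload pushes all of that structure onto the model. A minimal required-field check (a stand-in for full JSON Schema validation, which a real host would use) might look like:

```python
# Validate a proposed tool input against a narrow tool schema. This checks
# only required fields, unexpected fields, and string types; a production
# host would run a full JSON Schema validator instead.

def validate_input(tool_schema, tool_input):
    """Return a list of problems with a proposed tool input."""
    problems = []
    schema = tool_schema["input_schema"]
    props = schema["properties"]
    for field in schema.get("required", []):
        if field not in tool_input:
            problems.append(f"missing required field: {field}")
    for field, value in tool_input.items():
        if field not in props:
            problems.append(f"unexpected field: {field}")
        elif props[field]["type"] == "string" and not isinstance(value, str):
            problems.append(f"{field} should be a string")
    return problems

read_file_schema = {
    "name": "read_file",
    "input_schema": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
}
```

A malformed call to a narrow tool fails loudly and locally; a malformed JSON payload inside a mega-tool often validates structurally and fails only at execution time.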

Parallel Tool Calls

Recent model versions, including Claude 3.5 Sonnet onward, can issue multiple tool calls in a single response. This means an agent can read several files at once rather than sequentially:

{
  "stop_reason": "tool_use",
  "content": [
    {
      "type": "tool_use",
      "id": "toolu_01",
      "name": "read_file",
      "input": {"path": "/src/index.ts"}
    },
    {
      "type": "tool_use",
      "id": "toolu_02",
      "name": "read_file",
      "input": {"path": "/src/utils.ts"}
    },
    {
      "type": "tool_use",
      "id": "toolu_03",
      "name": "read_file",
      "input": {"path": "/package.json"}
    }
  ]
}

You execute all three in parallel, then return all three results as tool_result blocks in the next message. Wall-clock time for a multi-file read drops from N sequential calls to the time of the single slowest one. For an agent doing substantial file exploration, this is a meaningful latency reduction.

The host application is responsible for handling parallel tool calls correctly. A naive implementation that executes the tool_use blocks one at a time, waiting for each result before starting the next, still works functionally, but it leaves the parallelism benefit unrealized. Production implementations of tools like Claude Code and Cursor execute parallel tool calls concurrently and assemble the results before continuing the loop.
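A sketch of the concurrent version, using a thread pool since tool work is typically I/O-bound (file reads, subprocesses). Each result carries its `tool_use_id`, which is what lets the model match results back to calls regardless of completion order:

```python
# Execute a batch of tool_use blocks concurrently and return tool_result
# blocks in the same order. `tool_executors` maps tool names to callables;
# it is an illustrative structure, not an SDK type.
from concurrent.futures import ThreadPoolExecutor

def execute_parallel(tool_use_blocks, tool_executors):
    def run_one(block):
        fn = tool_executors[block["name"]]
        return {
            "type": "tool_result",
            "tool_use_id": block["id"],
            "content": fn(**block["input"]),
        }
    # pool.map preserves input order, so results line up with the blocks.
    with ThreadPoolExecutor(max_workers=len(tool_use_blocks)) as pool:
        return list(pool.map(run_one, tool_use_blocks))
```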

What Breaks

Understanding the loop makes the failure modes clear.

Context overflow is the most common. A session that reads too many large files, or runs commands producing verbose output, will exhaust available context. The model either produces degraded responses or the application must compress, truncate, or summarize, each of which discards information that may have been relevant.

Tool errors that feed back into context are a subtler problem. If a bash command fails with a long stack trace, that trace goes into context. If the model attempts a fix that also fails, the context accumulates error output. Long error chains crowd out the source code the model needs to reason about. A well-designed agent will detect repetitive failure and either truncate verbose error output before inserting it into context, or surface the problem to the user rather than continuing to spiral.
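One defensive policy along these lines is to cap verbose tool output before it enters context, keeping the head and tail, since stack traces usually put the actual error at one end. The character limit here is an arbitrary illustration, not a recommendation:

```python
# Truncate long tool output before inserting it into context, preserving
# the beginning and end where the useful signal usually lives. The 2,000-
# character default is an assumption chosen for illustration.

def truncate_output(text, limit=2_000):
    if len(text) <= limit:
        return text
    half = limit // 2
    omitted = len(text) - limit
    return (f"{text[:half]}\n"
            f"[... {omitted} characters truncated ...]\n"
            f"{text[-half:]}")
```

Detecting *repetitive* failure, the other half of the defense, needs loop-level state: for instance, counting consecutive failed tool calls and surfacing the problem to the user past some threshold.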

Model confusion under accumulated context is harder to measure but real. A model reasoning over 80K tokens of accumulated tool outputs, error messages, file contents, and partial attempts is doing something qualitatively harder than the same model with a clean 5K token context. The same task attempted with a fresh context will often succeed where a long-running session fails. This is an argument for keeping coding agent sessions focused, breaking complex tasks into shorter sub-sessions, and resisting the urge to let a single context window accumulate unbounded state.

The Loop Is Simple; the Surrounding Decisions Are Not

Implementing the basic agentic loop takes maybe fifty lines of Python. The Anthropic SDK’s streaming tool use examples show the full pattern clearly. What is not simple is everything around the loop: which tools to expose, how to describe them, how to manage context growth, when to summarize versus preserve raw output, and how to handle failures without spiraling.

Willison’s framing of the loop as the fundamental building block is correct and worth internalizing. Once you see it clearly, coding agents stop feeling like magic and start feeling like a design space. The model is not doing anything fundamentally different from any other text completion; it is just doing it repeatedly, with external state read through tools and accumulated in context. The craft is in the surrounding system: the tool schemas, the context management policy, the failure handling, and the decision about when to surface a problem to the user rather than attempting to resolve it autonomously.
