
The Tool Schema Is the Real API of a Coding Agent

Source: simonwillison

Something that gets glossed over in most writing about coding agents is how mechanical the core loop actually is. Simon Willison wrote a detailed guide on how coding agents work, covering the architecture at a useful level of abstraction. What I want to focus on is one level lower: the tool schema design layer, which is where most of the behavioral complexity actually lives.

The Loop Is Simple

At its core, the agentic loop has three steps: the model generates a response; if that response contains tool calls, they get executed; the results come back as new messages and the loop repeats. This continues until the model produces a response with no tool calls, or an external stop condition fires.

The conversation format that enables this is standardized across the major model APIs. In the Anthropic Messages API, a tool call from the model looks like this:

{
  "role": "assistant",
  "content": [
    {
      "type": "tool_use",
      "id": "toolu_01XFDUDYJgAACTJJiqiB97k5",
      "name": "read_file",
      "input": { "path": "src/main.rs" }
    }
  ]
}

The caller executes the tool and returns the result as a user message:

{
  "role": "user",
  "content": [
    {
      "type": "tool_result",
      "tool_use_id": "toolu_01XFDUDYJgAACTJJiqiB97k5",
      "content": "fn main() {\n    println!(\"hello\");\n}\n"
    }
  ]
}

Then the model gets called again. The model doesn’t maintain state between calls; it re-reads the full conversation on every turn. State lives in the conversation history, not in the model.
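The whole loop fits in a few lines. Here is a minimal sketch with the model call stubbed out so the mechanics are visible; the message shapes follow the JSON above, but `run_loop`, `TOOLS`, and `fake_model` are hypothetical names, not any SDK's API:

```python
# Minimal agentic loop sketch. The model call is a stub; in a real agent
# it would be an API call that returns assistant content blocks.
TOOLS = {"read_file": lambda path: f"contents of {path}"}

def run_loop(call_model, messages):
    """Repeat: call model, execute tool calls, feed results back as a user message."""
    while True:
        response = call_model(messages)            # list of content blocks
        messages.append({"role": "assistant", "content": response})
        tool_calls = [b for b in response if b["type"] == "tool_use"]
        if not tool_calls:                         # no tool calls: loop ends
            return messages
        results = [{"type": "tool_result",
                    "tool_use_id": call["id"],
                    "content": TOOLS[call["name"]](**call["input"])}
                   for call in tool_calls]
        messages.append({"role": "user", "content": results})

def fake_model(messages):
    """Stub: first turn requests a file read, second turn answers."""
    if len(messages) == 1:
        return [{"type": "tool_use", "id": "toolu_1",
                 "name": "read_file", "input": {"path": "src/main.rs"}}]
    return [{"type": "text", "text": "done"}]

history = run_loop(fake_model, [{"role": "user", "content": "read main"}])
```

Note that `run_loop` never stores anything outside `messages`: the statelessness described above falls out of the structure, since every turn re-sends the accumulated list.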

Tool Schemas Define Behavior

Each tool is described to the model as a JSON Schema object with a name, description, and input parameter specification. This is not documentation in the conventional sense. The model uses the schema to reason about what operations are available and when to reach for them. A poorly chosen tool name or an ambiguous description translates directly into misuse.
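For concreteness, a tool definition in the Anthropic Messages API has a `name`, a `description`, and an `input_schema` in JSON Schema form. The descriptions below are illustrative wording, not a canonical example:

```json
{
  "name": "read_file",
  "description": "Read the contents of a file at the given path. Use this before editing any file.",
  "input_schema": {
    "type": "object",
    "properties": {
      "path": {
        "type": "string",
        "description": "Path to the file, relative to the repository root"
      }
    },
    "required": ["path"]
  }
}
```

Every string in that object is model-facing. The `description` fields are the closest thing the model has to documentation, which is why ambiguity there shows up as misuse at runtime.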

Consider the design decision between a single bash tool that executes arbitrary shell commands versus a set of narrower tools: read_file, write_file, list_directory, search_files. The bash approach is maximally flexible. It also means the model has to generate correct shell syntax, reason about PATH and working directory, and avoid side effects the user didn’t ask for. Claude Code exposes both: a Bash tool for operations that need shell execution, alongside dedicated Read, Write, Edit, Glob, and Grep tools for common file operations. The dedicated tools have tighter schemas, which means more predictable behavior and a cleaner audit trail.

Aider takes a different approach to the edit layer. Rather than giving the model a write-file tool, it teaches the model to output SEARCH/REPLACE blocks, which Aider’s own code applies as patches:

src/main.py
<<<<<<< SEARCH
def old_function():
    return 1
=======
def old_function():
    return 2
>>>>>>> REPLACE

The model never calls a write tool; it describes changes in a structured text format that the tool layer applies. This design moves error handling out of the model and into Aider’s patch application code, which can use fuzzy matching as a fallback when exact matching fails. The trade-off is that the model must reproduce the SEARCH section verbatim, which it occasionally gets wrong on large or syntactically dense blocks. Aider’s edit format benchmarks show SEARCH/REPLACE uses roughly 40 to 60 percent fewer tokens than whole-file rewrites on typical edit tasks, because only the changed region needs to be sent rather than the full file contents.
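The patch-application side can be sketched in a few lines. This is not Aider's actual algorithm, just a minimal illustration of the exact-then-fuzzy pattern, using `difflib` for the similarity score; the function name and the 0.9 cutoff are assumptions:

```python
import difflib

def apply_search_replace(source, search, replace, cutoff=0.9):
    """Apply one SEARCH/REPLACE block; fall back to fuzzy matching (sketch)."""
    if search in source:                          # exact match: simple splice
        return source.replace(search, replace, 1)
    # Fuzzy fallback: slide a window of the same line count over the file
    # and take the closest match above the similarity cutoff.
    src_lines = source.splitlines(keepends=True)
    n = len(search.splitlines(keepends=True))
    best, best_score = None, cutoff
    for i in range(len(src_lines) - n + 1):
        window = "".join(src_lines[i:i + n])
        score = difflib.SequenceMatcher(None, window, search).ratio()
        if score > best_score:
            best, best_score = i, score
    if best is None:
        raise ValueError("SEARCH block not found in file")
    src_lines[best:best + n] = [replace]
    return "".join(src_lines)
```

The fuzzy path is what rescues the model when it reproduces the SEARCH section almost but not quite verbatim, e.g. with a stray extra space.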

Context Is the Budget

Every tool call consumes tokens in both directions. The tool_use block going out costs tokens; the tool_result coming back costs tokens; both persist in the conversation for the remainder of the session. A 200k-token context window sounds generous until you’re reading source files. A 1000-line Python file costs roughly 4,000 to 6,000 tokens as a tool result. Reading ten such files consumes 40,000 to 60,000 tokens before the agent has written a single character of output.

This has a direct implication for tool schema design: the granularity of information returned shapes the token efficiency of every session that uses the tool. A search_files tool that returns 50 matches with full surrounding context per match will saturate context on any moderately complex search. A tool that returns file paths with match counts first, then lets the model request detailed context only where needed, will not. The difference isn’t in the model; it’s in what the schema returns.
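A two-tier search tool along these lines might look like the following. All names and the context-window size are hypothetical; the point is the shape of the return value, not the implementation:

```python
import re

def search_files(pattern, files, detail_path=None, context=2):
    """Hypothetical two-tier search tool. The first call returns only paths
    and match counts; a follow-up call with detail_path returns the matching
    lines (plus surrounding context) for one file the model chose to inspect."""
    rx = re.compile(pattern)
    if detail_path is None:
        summary = []
        for path, text in files.items():
            n = sum(1 for line in text.splitlines() if rx.search(line))
            if n:
                summary.append({"path": path, "matches": n})
        return summary
    lines = files[detail_path].splitlines()
    hits = []
    for i, line in enumerate(lines):
        if rx.search(line):
            lo, hi = max(0, i - context), min(len(lines), i + context + 1)
            hits.append("\n".join(lines[lo:hi]))
    return hits
```

The summary call costs a handful of tokens per file regardless of how many matches each file contains; the expensive detail call happens only for files the model decides are worth reading.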

Claude Code handles this through tool-level truncation: Read caps output at 2,000 lines by default, with offset and limit parameters for slicing into large files. The Grep tool has a head_limit parameter. When context approaches saturation, Claude Code triggers a compaction step: it asks the model to summarize the conversation history into a structured state summary, replaces the full history with that summary, and continues. The quality of that summary determines how well the agent recovers. Summaries that lose failure history cause the agent to retry approaches that already failed, which costs more context and produces no progress.
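The compaction step itself is structurally simple; the hard part is the summarization prompt. A sketch, where the 80 percent threshold, function names, and summary framing are all assumptions rather than Claude Code's actual values:

```python
def maybe_compact(messages, token_count, limit, summarize):
    """If the history nears the context limit, replace it with a
    model-written summary. `summarize` is a callback that makes a
    model call over the full history (sketch)."""
    if token_count < 0.8 * limit:          # assumed threshold
        return messages
    # The summary prompt should explicitly ask for goals, actions taken,
    # failures encountered, and next steps -- losing the failure history
    # is what causes the agent to retry dead ends.
    summary = summarize(messages)
    return [{"role": "user",
             "content": f"[Conversation summary]\n{summary}"}]
```

Framing the summary as a structured state object, rather than free prose, makes it more likely the post-compaction model can pick up where the pre-compaction one left off.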

Some rough numbers from community benchmarks: a simple bug fix involving two or three file reads and a couple of edits consumes around 5,000 to 15,000 tokens total. A feature addition touching five to ten files across multiple iterations runs 20,000 to 60,000. A hard SWE-bench instance can hit 150,000 tokens without difficulty. Hitting the 200k limit on complex tasks is common, which makes compaction strategy a first-class architectural concern rather than an edge case.

Parallel Calls Are Underused

The Messages API supports parallel tool calls: the model can emit multiple tool_use blocks in a single response, and the caller executes them concurrently. All results come back together in a single user message, one tool_result block per original call. When reading five independent files sequentially, the agent waits for each result before issuing the next call. When reading them in parallel, all five results arrive in one turn. The token cost is identical; the latency is not.
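On the caller's side, concurrent execution is a small amount of code. A sketch using a thread pool, with `execute_tool_calls` as a hypothetical name; `ThreadPoolExecutor.map` preserves input order, which keeps each `tool_result` aligned with its `tool_use_id`:

```python
from concurrent.futures import ThreadPoolExecutor

def execute_tool_calls(tool_calls, tools):
    """Run independent tool_use blocks concurrently and package all
    results into the single user message the API expects (sketch)."""
    def run(call):
        return {"type": "tool_result",
                "tool_use_id": call["id"],
                "content": tools[call["name"]](**call["input"])}
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(run, tool_calls))   # order preserved
    return {"role": "user", "content": results}
```

Since tool calls are typically I/O-bound (file reads, subprocess waits, network), threads are enough; nothing here needs process-level parallelism.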

In practice, current models don’t parallelize as aggressively as they could. The most common pattern is sequential even when calls are logically independent, likely because training data skews sequential or because the model hedges on whether the next call genuinely depends on the current result. Getting an agent to batch independent reads consistently is mostly a system prompt problem: explicit instructions to parallelize independent operations produce measurably more efficient sessions.

The System Prompt Is the Configuration Layer

The system prompt is where the agent’s operating logic lives: which tools to prefer for which operations, how to handle errors, when to ask for clarification versus proceeding autonomously, and what scope constraints apply. It is also where behavioral conventions get encoded that the base model wouldn’t infer on its own, like “always read a file before editing it” or “prefer targeted edits over full rewrites.”

The behavioral difference between a useful coding agent and a frustrating one usually lives here. The same model with the same tools will behave very differently depending on how its system prompt handles failure recovery. A system prompt that lacks explicit error recovery guidance produces an agent that retries failed commands without investigating root causes; one that lacks scope constraints produces an agent that modifies files it wasn't asked to touch. These failure modes look like model problems, but they're prompt design problems.

Willison has written about the importance of making agents produce observable reasoning traces, logging what tools were called with what arguments and what was returned. That observability is what makes system prompt iteration tractable: you can see where the agent’s behavior diverged from intent and write a more explicit instruction to cover that case.

Building Agents

If you’re building a coding agent on top of the Claude API, most of the work is scaffolding: a conversation loop, a tool executor, a mechanism to surface errors back to the model as tool_result blocks with is_error: true rather than crashing the executor, and some strategy for context management. The model handles reasoning; you provide the environment.

The interesting decisions are in the tool layer. What operations to expose, how narrowly to scope them, what information to return and in what format, and how failures surface to the model. A tool that throws a Python exception on error will crash the executor; a tool that returns an error string as the tool result lets the model handle recovery. These are not model problems. They are API design problems, where the model is the consumer and the schema is the contract.
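That contract can be enforced with a thin wrapper around every tool call. A sketch, where `safe_execute` is a hypothetical name but the `is_error` field on `tool_result` blocks is the Messages API's actual mechanism:

```python
def safe_execute(tool_fn, call):
    """Surface tool failures to the model instead of crashing the executor:
    catch the exception and return it as an is_error tool_result (sketch)."""
    try:
        content, is_error = tool_fn(**call["input"]), False
    except Exception as e:
        content, is_error = f"{type(e).__name__}: {e}", True
    return {"type": "tool_result",
            "tool_use_id": call["id"],
            "content": content,
            "is_error": is_error}

def read_file(path):
    # Demo tool that fails, as a missing-file read would.
    raise FileNotFoundError(path)

result = safe_execute(read_file, {"id": "toolu_9", "input": {"path": "missing.txt"}})
```

The exception text becomes ordinary conversation content, so the model can read `FileNotFoundError: missing.txt` and decide to list the directory instead of the whole session dying in the executor.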
