
How Tool Schema Design Shapes Coding Agent Behavior

Source: simonwillison

Simon Willison’s guide to agentic engineering patterns describes the high-level structure of coding agents: an LLM that calls tools in a loop, reads results, and repeats until the task is done. That framing is accurate as far as it goes. What it does not fully surface is how much the design of those tools, and the scaffolding layer that dispatches them, determines what the agent can reason about at each step.

This post goes one level deeper: how tool schemas constrain model behavior, how different edit strategies trade off reliability against efficiency, and how context window pressure forces architectural choices that have nothing to do with the underlying model.

What the Model Sees When a Session Starts

When a coding agent begins a session, the LLM receives a system prompt, the user’s task description, and a list of tool definitions in JSON Schema format. Here is roughly what a file-reading tool definition looks like in an Anthropic API call:

{
  "name": "read_file",
  "description": "Read the contents of a file at the given path",
  "input_schema": {
    "type": "object",
    "properties": {
      "path": {
        "type": "string",
        "description": "The file path to read, relative to the project root"
      },
      "start_line": {
        "type": "integer",
        "description": "Optional: start reading from this line number"
      },
      "end_line": {
        "type": "integer",
        "description": "Optional: stop reading at this line number"
      }
    },
    "required": ["path"]
  }
}

That schema is not just documentation. The model uses it to reason about what is available and to generate well-formed tool calls. Because start_line and end_line are defined, the model can request only the fragment it needs instead of loading an entire file. If those fields were absent, every read would pull the whole file into the context window. The tool schema is a design decision with real performance and cost consequences.
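To make the cost difference concrete, here is a minimal sketch of a handler that a scaffolding layer might put behind this schema. The function name mirrors the tool name; treating line numbers as 1-indexed with an inclusive end_line is an assumption, since the schema does not say.

```python
from itertools import islice
from pathlib import Path
from typing import Optional


def read_file(path: str, start_line: Optional[int] = None,
              end_line: Optional[int] = None) -> str:
    """Return the file's text, or only lines start_line..end_line.

    Assumption: line numbers are 1-indexed and end_line is inclusive.
    """
    start = max((start_line or 1) - 1, 0)  # convert to a 0-indexed offset
    with Path(path).open(encoding="utf-8") as handle:
        # islice stops iterating once end_line is reached, so a small
        # range near the top of a large file is cheap to serve.
        return "".join(islice(handle, start, end_line))
```

Called with only a path, the sketch returns the whole file, which is exactly the behavior the optional fields let the model avoid.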

Claude Code exposes about a dozen tools in its default configuration: reading files, editing files, running bash commands, searching with grep and glob patterns, and a few auxiliary operations. The scaffolding, meaning the code that wraps the LLM API calls, controls which tools are offered, how their descriptions are phrased, and how results are formatted before being inserted into the context. This is where most of the product decisions live.
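A rough sketch of that dispatch layer follows. It is not how Claude Code is implemented; the registry, the shape of the tool call dictionary, and the result fields are all hypothetical names chosen for illustration.

```python
from typing import Any, Callable, Dict

# Hypothetical registry type: tool name -> Python callable returning text.
ToolHandler = Callable[..., str]


def dispatch(tool_call: Dict[str, Any],
             registry: Dict[str, ToolHandler]) -> Dict[str, Any]:
    """Run one tool call and wrap the outcome in a labeled result."""
    name = tool_call["name"]
    handler = registry.get(name)
    if handler is None:
        # Errors go back as text so the model can correct its next call.
        return {"tool": name, "is_error": True,
                "content": f"unknown tool: {name}"}
    try:
        output = handler(**tool_call.get("input", {}))
    except Exception as exc:  # report the failure instead of ending the loop
        return {"tool": name, "is_error": True,
                "content": f"{type(exc).__name__}: {exc}"}
    return {"tool": name, "is_error": False, "content": output}
```

The design choice worth noticing is that a failed call still produces a result the model can read, so a bad argument costs one extra turn rather than the whole session.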

A fresh agent session has no prior knowledge of the project structure. The model has to build a mental map from scratch using whatever tools are available. Two dominant strategies exist: text search and structural search.

Text search means running grep-style queries to find files containing a class name, a function signature, or an import path. It is fast and language-agnostic, but it finds string matches rather than semantic relationships. The same identifier can appear in unrelated files, and a function that is aliased or re-exported under another name will not be found by searching for the original name.
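As an illustration, a text search tool can be as small as the sketch below. The function and parameter names are hypothetical, and a real agent would more likely shell out to grep or ripgrep than reimplement it.

```python
import re
from pathlib import Path
from typing import List, Tuple


def search_text(root: str, pattern: str,
                file_glob: str = "**/*") -> List[Tuple[str, int, str]]:
    """Return (relative path, line number, line) for every regex match."""
    regex = re.compile(pattern)
    hits = []
    for path in sorted(Path(root).glob(file_glob)):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                relative = path.relative_to(root).as_posix()
                hits.append((relative, number, line.strip()))
    return hits
```

A query for an identifier returns definitions, call sites, and comments alike, because the search has no notion of which is which.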

Structural search uses a syntax-aware tool, often backed by tree-sitter or a Language Server Protocol connection, to answer questions about code structure: find all callers of a function, show where a type is defined. Tree-sitter provides fast incremental parsing for over 100 languages without requiring a running compiler or LSP server. The tradeoff is setup complexity: the scaffolding needs language-specific grammar files and logic to dispatch structural queries correctly.
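To show what a structural query looks like without pulling in tree-sitter grammars, the sketch below substitutes Python's standard ast module, which only parses Python. It answers "which functions call this name" and, unlike a string search, ignores comments and string literals. It handles bare-name calls only, not method calls, and a call inside a nested function is reported under each enclosing function.

```python
import ast
from typing import List, Tuple


def find_callers(source: str, target: str) -> List[Tuple[str, int]]:
    """Return (enclosing function name, line number) for calls to target."""
    tree = ast.parse(source)
    callers = []
    for func in ast.walk(tree):
        if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for node in ast.walk(func):
            # Match calls of the form target(...); obj.target(...) is an
            # ast.Attribute and is deliberately out of scope here.
            if (isinstance(node, ast.Call)
                    and isinstance(node.func, ast.Name)
                    and node.func.id == target):
                callers.append((func.name, node.lineno))
    return sorted(callers, key=lambda item: item[1])
```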

Most production coding agents default to text search because it works everywhere without per-language configuration. Claude Code uses grep and glob as its primary navigation tools. The model typically starts broad, listing directory structure and searching for entry points, then narrows down. This mirrors how a developer approaches an unfamiliar codebase, which is likely why it works reasonably well despite being semantically shallow. Structural queries would catch things that string matching misses, but the integration cost has kept most agents from shipping them by default.

The Three File Editing Strategies

File editing is where agent implementations diverge most clearly. Three main approaches dominate.

Full file rewrites are the simplest: read the file, generate the complete modified version, write it back to disk. This works for small files but becomes expensive for large ones. The model has to reproduce unchanged content, and any drift between what it read and what it writes risks corrupting the file.

Search-and-replace blocks ask the model to output an exact old string and its replacement. The scaffolding locates the occurrence in the file and patches it in place. Claude Code uses this approach. It is efficient but sensitive to exact matching: if the old string differs from the file by a single whitespace character, or the file has changed since the model last read it, the edit fails and the agent has to retry with corrected context.
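A minimal sketch of the patching step makes the failure mode visible. The function name and error messages are hypothetical, and rejecting ambiguous matches is an assumption of this sketch rather than a description of any particular agent.

```python
def apply_edit(text: str, old: str, new: str) -> str:
    """Replace exactly one occurrence of old with new, or raise."""
    count = text.count(old)
    if count == 0:
        # Typical cause: whitespace drift or a stale view of the file.
        raise ValueError("old string not found; re-read the file and retry")
    if count > 1:
        # Assumption: ambiguous edits are refused rather than guessed at.
        raise ValueError(f"old string matches {count} locations; add more context")
    return text.replace(old, new, 1)
```

The error text matters as much as the check: it is what the model reads before deciding how to retry.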

Unified diffs ask the model to produce output in standard patch format:

--- a/src/server.js
+++ b/src/server.js
@@ -42,3 +42,3 @@ function handleRequest(req, res) {
   let timeout = req.headers['x-timeout']
-  if (!timeout) return res.status(400).send('missing timeout')
+  if (!timeout) timeout = DEFAULT_TIMEOUT
   processRequest(req, timeout)

This format is compact and interoperable with standard tools like patch and git apply. Its weakness is reliability. Aider, which supports unified diffs as one of several edit formats, has published benchmarks showing that models produce malformed diffs at a non-trivial rate, with significant variation across model families and sizes. Getting consistently valid patch output requires careful prompt engineering and often a repair step in the scaffolding that attempts to fix common formatting mistakes before applying the patch.
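One common formatting mistake is a hunk header whose line counts disagree with the hunk body. The sketch below shows what a repair step for that single problem might look like; it is not Aider's implementation. It assumes a single-file patch, so each hunk body runs until the next hunk header, and it treats blank lines as context lines whose leading space was dropped.

```python
import re

HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,\d+)? \+(\d+)(?:,\d+)? @@(.*)$")


def fix_hunk_counts(patch: str) -> str:
    """Recompute each hunk header's line counts from the hunk body."""
    lines = patch.splitlines()
    out = []
    i = 0
    while i < len(lines):
        header = HUNK_HEADER.match(lines[i])
        if header is None:
            out.append(lines[i])  # file headers pass through untouched
            i += 1
            continue
        j = i + 1
        while j < len(lines) and HUNK_HEADER.match(lines[j]) is None:
            j += 1
        body = lines[i + 1:j]
        # Context and removed lines exist in the old file; context and
        # added lines exist in the new one. Lines starting with a
        # backslash ("\ No newline at end of file") count toward neither.
        old_count = sum(1 for line in body if line[:1] in (" ", "-", ""))
        new_count = sum(1 for line in body if line[:1] in (" ", "+", ""))
        old_start, new_start, trailer = header.groups()
        out.append(f"@@ -{old_start},{old_count} +{new_start},{new_count} @@{trailer}")
        out.extend(body)
        i = j
    return "\n".join(out) + ("\n" if patch.endswith("\n") else "")
```

Start line numbers are left as the model wrote them; tools like patch can often tolerate a small offset there, but that is outside what this sketch checks.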

Context Window Pressure and Session State

Each tool call adds to the context: the model’s output, the tool’s response, and any subsequent reasoning. A long session accumulates file contents, bash output, grep results, and intermediate thoughts. Most agents handle this with some combination of truncation, summarization, and selective eviction of older content.
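Selective eviction can be sketched in a few lines. The message shape, the "tool" role label, and the four-characters-per-token estimate are all assumptions made for illustration; a real scaffold would use the provider's tokenizer or usage counts.

```python
from typing import Dict, List

Message = Dict[str, str]  # hypothetical shape: {"role": ..., "content": ...}
STUB = "[evicted: re-run the tool if this is needed again]"


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly four characters per token.
    return len(text) // 4 + 1


def evict_old_tool_results(messages: List[Message], budget: int) -> List[Message]:
    """Stub out the oldest tool results until the history fits the budget."""
    trimmed = [dict(message) for message in messages]  # leave input intact
    total = sum(estimate_tokens(m["content"]) for m in trimmed)
    for message in trimmed:  # list order is oldest first
        if total <= budget:
            break
        if message["role"] != "tool":
            continue  # user and assistant turns are never evicted here
        before = estimate_tokens(message["content"])
        after = estimate_tokens(STUB)
        if before <= after:
            continue  # replacing it would not save anything
        message["content"] = STUB
        total -= before - after
    return trimmed
```

Leaving a stub rather than deleting the message keeps a trace that the tool was called, which gives the model a chance to notice what it no longer has.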

The challenge is that evicting content means the model loses track of what it already checked. It may re-read a file it already loaded, adding tokens without gaining information. Good scaffolding tracks session state and can inject a summary into the system prompt: which files have been read, which edits have been applied, what the current state of the task is.
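A sketch of that bookkeeping is below. The class and its fields are hypothetical; the point is only that the scaffolding, not the model, holds the durable record and can render it as a short block of text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SessionState:
    """Hypothetical record of what the agent has done so far."""
    task: str
    files_read: List[str] = field(default_factory=list)
    edits_applied: List[str] = field(default_factory=list)

    def record_read(self, path: str) -> None:
        if path not in self.files_read:  # repeated reads are listed once
            self.files_read.append(path)

    def record_edit(self, path: str, note: str) -> None:
        self.edits_applied.append(f"{path}: {note}")

    def summary(self) -> str:
        """Render a compact block suitable for a system prompt."""
        read = ", ".join(self.files_read) or "none"
        edits = "; ".join(self.edits_applied) or "none"
        return (f"Task: {self.task}\n"
                f"Files already read: {read}\n"
                f"Edits applied: {edits}")
```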

At 200k tokens for current Claude models, the context window is large, but verbose bash output or a handful of large source files can consume tens of thousands of tokens quickly. Projects with deep file trees and large files push against this limit sooner than most developers expect, and agents that do not actively manage their context will slow down noticeably in the second half of a long session.
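For verbose command output, one simple mitigation is to keep the head and tail and drop the middle, on the assumption that the start of the output and the final error lines carry most of the signal. The function name and the default limit here are hypothetical.

```python
def truncate_output(text: str, max_lines: int = 40) -> str:
    """Keep the first and last lines of long output, marking the gap."""
    if max_lines < 2:
        raise ValueError("max_lines must be at least 2")
    lines = text.splitlines()
    if len(lines) <= max_lines:
        return text
    head = max_lines // 2
    tail = max_lines - head
    omitted = len(lines) - max_lines
    marker = f"[... {omitted} lines omitted ...]"
    return "\n".join(lines[:head] + [marker] + lines[-tail:])
```

The marker line tells the model that output was cut and by how much, so it can ask for a narrower command instead of reasoning from a silently incomplete log.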

Why the Scaffolding Is the Hard Part

SWE-agent, the research system from Princeton that applies LLMs to real GitHub issues, quantified what practitioners have observed informally: the agent-computer interface design matters as much as model selection for task completion rates. Their published results documented how changes to tool descriptions and output formatting produced large swings in success rates, independent of which underlying model was used.

The reason is straightforward. The model reasons from whatever text is in the context window. The scaffolding controls what that text looks like. A tool that returns verbose, unstructured output forces the model to reason from noise. A tool that returns compact, well-labeled output gives the model cleaner signal. Swapping in a better model helps; redesigning the interface layer often helps more.

Most of the visible improvements in coding agent performance over the past year have come from better scaffolding rather than larger models: smarter context eviction, better edit format retry logic, tighter tool schemas with clearer parameter descriptions. The model is the reasoner; the scaffolding is the environment it reasons in. Both matter, but the environment is where most of the remaining headroom lives.
