· 7 min read ·

From Tool Schema to File Edit: The Concrete Engineering of Coding Agents

Source: simonwillison

The agentic loop that powers coding agents works the same way across most implementations. A language model receives a context window containing a system prompt, conversation history, and accumulated tool results; it produces a response that includes zero or more tool calls; those calls execute in some environment; their output appends to the conversation; and the model runs again. Simon Willison’s guide to agentic engineering patterns covers this loop structure and many of the patterns that surround it.

The loop itself is not where the engineering complexity lives. The complexity lives in the tool schemas, the navigation strategies, the file editing mechanics, and the context management logic that surrounds every turn.

Tool Schemas as Behavioral Instructions

Every capability a coding agent has gets described to the model via a JSON schema. A basic file-reading tool might look like this:

{
  "name": "read_file",
  "description": "Read the contents of a file. Always call this before editing a file you haven't read in this session.",
  "input_schema": {
    "type": "object",
    "properties": {
      "path": {
        "type": "string",
        "description": "Absolute path to the file."
      },
      "offset": {
        "type": "integer",
        "description": "Line number to start reading from. Useful for large files."
      },
      "limit": {
        "type": "integer",
        "description": "Maximum number of lines to return."
      }
    },
    "required": ["path"]
  }
}

The description field functions as an instruction. Models have strong priors about common operations from training, but the description text shapes when and how a model chooses to call a tool. The sentence “Always call this before editing a file you haven’t read in this session” is a behavioral rule, not a technical specification. Anthropic’s tool use documentation recommends detailed descriptions specifically because they influence call frequency and sequencing.

This means writing tool schemas is a form of prompt engineering, and the effects compound over a multi-step task. An ambiguously described tool gets called at the wrong time; a well-described one becomes a more predictable building block. Many of the behavioral differences between competing coding agents trace back to schema design decisions rather than to model capability differences.

How Agents Navigate a Codebase

Before any file gets modified, the agent has to find the relevant code. There are four main approaches, each suited to different situations.

Exact path reads work when the model already knows where to look. That knowledge can come from the task description, from a previous search result, or from exploring a directory listing. Reading a known file is fast and deterministic, but it requires prior orientation.

Pattern-based search, usually backed by something like ripgrep, scans file contents for a string or regex. A tool like search_code(pattern="class AuthHandler", path="src/") returns matching lines with their file paths. The model can find a class definition, a function signature, or an import statement without knowing in advance which file contains it. The limitation is that raw line matches lack surrounding context; the model typically needs a follow-up read to understand the code around a match.

Glob-style file matching finds files by name pattern. A call like find_files("src/auth/**/*.py") or find_files("**/*test*") narrows the search space without inspecting file contents. Combined with grep, a model can orient itself in an unfamiliar codebase in two or three tool calls.

Language Server Protocol integration is the most powerful option. LSP tools answer semantic questions: find all call sites for this function, list all implementations of this interface, go to the definition of this symbol. The Language Server Protocol, originally designed to decouple editor intelligence from editor UI, gives a coding agent the same navigation capabilities a developer has in VS Code or IntelliJ. The cost is setup complexity and the requirement that the codebase be in a syntactically valid state. Projects like Pyright, rust-analyzer, and typescript-language-server provide LSP servers for the major languages.

Most production systems combine all four approaches. The agent chooses based on what it already knows about the codebase at each point in the task.

File Editing Strategies

Three main strategies exist for making changes to files, each with a different reliability profile.

Full file rewrite replaces the entire contents of a file in one tool call. The model generates the complete new version and the scaffolding writes it to disk. This is conceptually simple and easy to reason about, but it breaks down on large files. The model has to reproduce every line it is not changing, and reproduction errors accumulate with file length. A 50-line file rewrites reliably; a 500-line file often does not.

Search-and-replace takes an old_string and new_string parameter and performs an exact substitution. This constrains the edit to a specific region, which eliminates most accidental modifications to surrounding code. The failure modes are specific: old_string must exist exactly once in the file, and the model’s representation of the current file contents must match what is on disk. Stale mental models, caused by earlier edits in the same session that the model did not fully track, produce failures here. Claude Code’s Edit tool uses this approach and returns an explicit error when old_string appears zero or multiple times, which forces the model to re-read and reorient before proceeding.

Diff-based editing generates a unified diff and applies it programmatically. This is the most expressive format for representing complex changes, especially those touching multiple disjoint regions of a file. The problem is that generating a syntactically valid unified diff with correct context lines and offsets is harder than it appears; small formatting errors cause application failures and require recovery logic in the scaffolding. Models trained on code corpora have seen plenty of diffs, but producing them reliably under varied conditions is a different matter than recognizing them.

Many systems combine approaches: search-and-replace for targeted single-region edits, full rewrite for small files where reproduction errors are cheap, and diff for large multi-region changes when the scaffolding has reliable error recovery.

The Bash Tool

Most coding agents expose a terminal execution tool, typically called bash or execute_command. This tool is what distinguishes a coding agent from a code generator. Without execution, the model has no signal about whether its changes compile, pass tests, or produce the expected behavior. With execution, the agent can verify each step and correct errors in subsequent turns, forming a closed feedback loop.

The security surface is significant. An unrestricted bash tool can delete files, make network requests, install packages, or run indefinitely. Production deployments handle this with containerization, command allowlists, or purpose-built sandboxes. E2B and similar projects provide cloud sandbox environments designed specifically for agent execution. Anthropic’s guidance on computer use covers the threat model in detail and recommends treating the agent’s execution environment as untrusted by default.

Beyond security, there is the output volume problem. A test suite that runs for two minutes produces megabytes of output. Appending all of it to the context window burns tokens and displaces earlier, potentially more relevant, context. The standard approaches are output truncation (return only the last N lines or characters), selective capture (return only stderr on success, full output on failure), and summary injection (run a second model call to compress long output before feeding it back).

Context as a Depletable Resource

Over a long task, the context window accumulates tool results. File reads, search matches, command output, error messages, and model reasoning all compete for the same fixed budget. Research on transformer attention, including work published at transformer-circuits.pub, has shown that information near the boundaries of a long context receives more attention than information in the middle, which means context management affects output quality, not just token cost.

Systems that handle long tasks well treat context as a resource to be managed explicitly. They truncate tool results to a maximum token count before appending them. They summarize earlier conversation history and replace it with a compressed representation. Some reset the conversation periodically, injecting a task summary and the current state of modified files rather than the full conversation log.

The agents that perform well on multi-file refactors or long debugging sessions tend to be the ones with disciplined context management built into the scaffolding, not simply the ones with the largest available context windows. A 200k-token context window filled with unfiltered tool results is less useful than a 32k window managed carefully.

What This Means in Practice

If you are building a coding agent or evaluating one for production use, the tool schemas and scaffolding design deserve as much attention as model selection.

The description text in your schemas is behavioral configuration, and it should be written with the same care as a system prompt. Your file editing strategy determines failure rates on non-trivial changes, so choose it based on your expected file size distribution and how much recovery logic you want in the scaffolding. Your bash tool setup determines what runs in your users’ environments, and sandboxing should be part of the initial design rather than a retrofit. Your context management strategy determines whether the agent degrades gracefully as tasks grow longer, or whether it starts losing track of earlier decisions halfway through a multi-file change.

None of this is specific to any one model family; the patterns apply to any system built on the agentic loop, and the scaffolding decisions compound in ways that model upgrades alone cannot fix.

Was this interesting?