From Tool Schema to File Edit: The Engineering Decisions Inside Coding Agents

Sebastian Raschka’s breakdown of coding agent components covers the full engineering stack: how agents navigate unfamiliar codebases, how they edit files without corrupting them, how they manage growing context windows, and how they verify their own work. Each of these encodes design decisions that compound into meaningfully different outcomes. The one worth examining most closely is the edit format, because it illustrates a broader principle that applies to the entire tool layer.

The Agent Loop

The structural foundation is a reason-act cycle. The model receives a task, the accumulated history of prior tool calls and their outputs, and the current context window state. It generates either a tool call or a task-complete signal. The scaffolding executes the tool, appends the result to the context, and loops.

while True:
    action = llm.generate(system_prompt + context_window)
    if action.is_task_complete:  # hypothetical flag: the model signalled it is finished
        break
    result = execute_tool(action)
    context_window.append(result)

This basic loop appears in nearly every serious coding agent: Claude Code, Aider, Cursor, and the Princeton research prototype SWE-agent. The differences come from what lives inside that loop: which tools the agent has access to, how those tools are designed, and what context management strategy handles the growing history.

Navigating a Codebase

Before an agent can modify anything, it needs to find the right code. A typical agent has a hierarchy of navigation tools: directory listing, glob matching, text search, file reading with optional line ranges, and sometimes LSP-backed symbol lookup.

Grep-style search handles the common case well. Finding all callers of a function, locating the definition of a class, or identifying which files import a particular module is faster and more precise with exact text search than with embedding-based semantic search. Semantic search is useful for orientation when the task is vague, but for targeted modification tasks, ripgrep is usually the right tool.
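To make the exact-search case concrete, here is a minimal pure-Python stand-in for a grep-style tool. The name `search_text` and its return shape are assumptions for illustration, not any agent's actual tool; a real agent would more likely shell out to ripgrep.

```python
import re
from pathlib import Path


def search_text(root, pattern, glob="*.py"):
    """Return (relative_path, line_number, line) for every matching line.

    Hypothetical helper: a pure-Python stand-in for a grep-style tool.
    """
    regex = re.compile(pattern)
    root = Path(root)
    hits = []
    for path in sorted(root.rglob(glob)):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                relative = path.relative_to(root).as_posix()
                hits.append((relative, number, line.strip()))
    return hits
```

Returning the line number and the matched line together lets the model confirm the location before it decides which file to open.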

Aider developed one of the most effective context strategies here: the repo map. Using tree-sitter to parse the codebase, Aider extracts all function and class signatures across every file and injects a compressed version into the model’s context. The model gets a structural overview of the entire repository without reading every file, which gives it enough orientation to make targeted reads rather than broad sweeps. The cost in tokens is low; the benefit to planning quality is substantial.
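The idea can be sketched in a few lines. The version below is not Aider's implementation: it swaps tree-sitter for Python's standard-library `ast` module, so it handles Python source only, and the names `signatures` and `repo_map` are made up for this example.

```python
import ast


def signatures(source):
    """List class and function signatures found in Python source text.

    Assumption: uses the standard-library ast module as a stand-in for
    tree-sitter, so it is Python-only and not Aider's implementation.
    """
    lines = []

    def visit(node, depth):
        indent = "    " * depth
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                lines.append(f"{indent}class {child.name}:")
                visit(child, depth + 1)  # descend to pick up methods
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                args = ", ".join(a.arg for a in child.args.args)
                lines.append(f"{indent}def {child.name}({args})")

    visit(ast.parse(source), 0)
    return lines


def repo_map(files):
    """Build a signature-only overview from a {path: source} mapping."""
    parts = []
    for path in sorted(files):
        parts.append(f"{path}:")
        parts.extend("    " + sig for sig in signatures(files[path]))
    return "\n".join(parts)
```

Function bodies are dropped entirely, which is where the token savings come from: the model sees what exists and where, then reads only the files it needs.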

The SWE-agent paper from Princeton found that a purpose-built scrollable file viewer with 100-line windows outperformed whole-file reads in their experiments. The constraint was intentional: forcing the agent to navigate a file keeps context usage bounded and attention focused. A model reading a 2,000-line file when it needs lines 800 through 850 is burning context on noise.
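A windowed viewer of this kind is simple to sketch. The function below is an illustration under assumptions, not SWE-agent's actual tool: the name `view_window`, the header format, and the line-number prefix are all invented here.

```python
def view_window(text, start_line, window=100):
    """Return at most `window` numbered lines of text, starting at start_line (1-based)."""
    lines = text.splitlines()
    total = len(lines)
    # Clamp the requested start into the valid range of the file.
    start = max(1, min(start_line, max(total, 1)))
    end = min(total, start + window - 1)
    body = [f"{n}: {lines[n - 1]}" for n in range(start, end + 1)]
    # The header tells the model where it is and how much file remains.
    header = f"[lines {start}-{end} of {total}]"
    return "\n".join([header, *body])
```

For the 2,000-line case above, `view_window(text, 800)` returns a header plus lines 800 through 899, so the context cost is fixed regardless of file size.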

The Edit Format Problem

This is where coding agent design gets technically consequential. Three main approaches exist for file editing, with meaningfully different reliability profiles.

Whole-file replacement is straightforward to implement. The agent reads a file, produces a complete updated version, and writes it back. For a 500-line file with a three-line change, the model must generate all 500 lines, any of which can be subtly hallucinated. A corrupted line elsewhere in the file produces a bug unrelated to the intended change, and the model often cannot detect it.

Unified diff format appears to be the right answer. Standard patch-style diffs are token-efficient and express precise intent:

--- a/src/auth.py
+++ b/src/auth.py
@@ -47,2 +47,2 @@
 def verify_token(token: str) -> bool:
-    return jwt.decode(token, SECRET)
+    return jwt.decode(token, SECRET, algorithms=["HS256"])

The problem is that models produce malformed diffs at a rate that makes them unreliable in practice: wrong line numbers in the @@ header, incorrect context lines, off-by-one errors in hunk ranges. When a diff is malformed, the patch command either fails or, worse, applies incorrectly and silently corrupts the file. Silent corruption is the worst outcome because the agent continues operating on code that is already wrong.
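One way to turn some of these failures into explicit errors is to check each hunk's body against its header before applying anything. The sketch below does only that count check; `check_hunk` is a hypothetical helper, and it does not verify that context lines actually match the target file.

```python
import re

HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def check_hunk(hunk_lines):
    """Return None if the hunk body matches its @@ header, else an error string."""
    match = HUNK_HEADER.match(hunk_lines[0])
    if match is None:
        return "malformed hunk header"
    # In unified diff headers, an omitted count means 1.
    old_expected = int(match.group(2) or 1)
    new_expected = int(match.group(4) or 1)
    old_seen = new_seen = 0
    for line in hunk_lines[1:]:
        if line.startswith("\\"):
            continue  # "\ No newline at end of file" marker
        if line.startswith("-"):
            old_seen += 1
        elif line.startswith("+"):
            new_seen += 1
        else:
            # Context line: counts toward both the old and new side.
            old_seen += 1
            new_seen += 1
    if (old_seen, new_seen) != (old_expected, new_expected):
        return (
            f"header declares {old_expected} old / {new_expected} new lines, "
            f"body has {old_seen} old / {new_seen} new"
        )
    return None
```

A check like this catches miscounted hunk ranges, but a diff with a wrong starting line number or subtly wrong context still passes it, which is part of why the format stays fragile.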

Search-and-replace blocks are what the industry has largely converged on:

{
  "tool": "edit_file",
  "path": "src/auth.py",
  "old_string": "return jwt.decode(token, SECRET)",
  "new_string": "return jwt.decode(token, SECRET, algorithms=[\"HS256\"])"
}

The model specifies exact text to find and exact text to replace it with, requiring no line numbers. The applying layer does the search. If old_string is absent from the file, the operation fails explicitly with an error the model can read and correct, eliminating the silent corruption mode that makes malformed-diff failures so hard to diagnose. Claude Code uses this format; Aider settled on it after testing multiple alternatives; the SWE-agent shell-based editor implements a close variant.

Aider’s benchmarking across its supported edit formats showed the search/replace approach substantially outperforming unified diff on SWE-bench tasks. The likely reason: producing “here is the text to find, here is what to replace it with” aligns with how the model reasons about code changes. Producing a correctly-numbered unified diff requires simultaneous reasoning about line counts, context lines, and hunk headers, which is a harder generation task for the same underlying model.

The remaining edge case is ambiguous matches: when old_string appears more than once in the file. The solution is requiring the model to read the file before editing and provide enough surrounding context in old_string to make the match unique. This is solvable, but it makes the navigation step non-optional.
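The applying layer for this format is small enough to sketch in full. This is a minimal illustration, not the code any of the tools above actually ship; `EditError` and `apply_edit` are hypothetical names, and the error wording is invented.

```python
class EditError(Exception):
    """Raised with a message the model can read and act on."""


def apply_edit(content, old_string, new_string):
    """Replace exactly one occurrence of old_string, or fail loudly."""
    if old_string == "":
        raise EditError("old_string is empty")
    count = content.count(old_string)
    if count == 0:
        raise EditError(
            "old_string not found; re-read the file and copy the text exactly"
        )
    if count > 1:
        raise EditError(
            f"old_string matches {count} locations; "
            "include more surrounding context to make it unique"
        )
    return content.replace(old_string, new_string, 1)
```

Both failure modes raise instead of guessing, so a missing or ambiguous match never changes the file.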

The Agent-Computer Interface

The SWE-agent paper introduced the term “Agent-Computer Interface” (ACI) to describe the full layer of tools, schemas, and output formats between the LLM and the system it operates on. The analogy to GUIs is deliberate: a graphical interface is optimized for human perceptual and cognitive patterns; the ACI should be optimized for how language models process information and generate structured output.

The practical implications extend well beyond edit format. Tool description quality is load-bearing. The model decides which tool to call based almost entirely on the natural language description in the tool schema. A vague or ambiguous description produces wrong tool selection, and tool selection errors compound across a long task. The output format of each tool shapes the model’s ability to plan the next step: file reads that include line numbers make subsequent edits easier to specify; search results that include context lines let the model confirm it found the right location before committing to a change.

Tool error handling belongs here too. When old_string is not found in the target file, the tool should return a message that tells the model what happened and suggests checking the file contents first. When a bash command exits non-zero, the tool should capture and return both stdout and stderr. When a file path does not exist, the error message should distinguish between a wrong path and a file that genuinely has not been created yet. These details determine whether the agent can self-correct or spirals into repeated attempts at the same failed action.
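Two of these behaviors can be sketched with the Python standard library. The function names, the result dictionary, and the message wording are assumptions for illustration; the command is passed as an argument list rather than a shell string to keep the example portable.

```python
import subprocess
from pathlib import Path


def run_command(argv, timeout=60):
    """Run a command and return exit code, stdout, and stderr together.

    Raises subprocess.TimeoutExpired if the command runs past `timeout` seconds.
    """
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    return {
        "exit_code": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }


def describe_missing_path(path):
    """Explain a missing path: wrong directory versus a file not yet created."""
    target = Path(path)
    if target.exists():
        return None
    if not target.parent.exists():
        return (
            f"{path}: directory {target.parent} does not exist; "
            "the path is probably wrong"
        )
    return (
        f"{path}: directory exists but the file does not; "
        "it may not have been created yet"
    )
```

Returning stderr alongside a non-zero exit code matters because many tools print the actual diagnosis there while stdout stays empty.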

The Princeton team demonstrated that improving the ACI alone, without changing the underlying model, produced substantial gains on SWE-bench. This is an important empirical result: the scaffolding layer is not incidental plumbing around a capable model; it is a primary determinant of performance. A weaker model with a carefully designed ACI can outperform a stronger model with a poorly designed one.

Verification Closes the Loop

An agent that cannot observe the effects of its edits is generating changes it cannot evaluate. The most effective agents can execute the repository’s test suite after each edit:

edit file → run_tests() → read failure output → edit again

Agents with test execution access substantially outperform those without it on SWE-bench, which measures exactly this capability: produce a patch that makes failing tests pass, validated against the actual test suite. Reading “your change caused three test failures, here is the stack trace” and iterating is what makes multi-step repairs tractable. Without that feedback, the agent must reason from first principles about whether each edit is correct, which is considerably harder.

Finer-grained verification tools extend this further. Running a linter after an edit catches syntax errors before the test suite is invoked. Running a type checker catches a different class of problems. Each verification tool provides an additional feedback signal the agent can use to converge on a correct solution. The pattern is the same at every granularity: edit, observe the result, adjust.
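The staging can be sketched as a cheap check that gates an expensive one. In the example below, Python's built-in `compile` stands in for a linter, and `run_tests` is a hypothetical callable supplied by the caller that returns an `(ok, detail)` pair; neither is taken from any particular agent.

```python
def check_syntax(source, filename="<edited>"):
    """Return None if source compiles, else a one-line error message."""
    try:
        compile(source, filename, "exec")
    except SyntaxError as exc:
        return f"{filename}:{exc.lineno}: {exc.msg}"
    return None


def verify(source, run_tests):
    """Run the cheap check first; call the test runner only if it passes."""
    error = check_syntax(source)
    if error is not None:
        return {"stage": "syntax", "ok": False, "detail": error}
    # run_tests is a caller-supplied callable returning (ok, detail).
    ok, detail = run_tests()
    return {"stage": "tests", "ok": ok, "detail": detail}
```

Reporting which stage failed gives the agent a shorter path to the fix: a syntax error points at one line, while a test failure may require reading a stack trace.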

Implications

The progression from roughly 13% on SWE-bench Lite when Devin first benchmarked in early 2024, to over 70% on SWE-bench Verified for Claude Code and comparable systems by early 2025, reflects improvements at every layer: better base models, better edit formats, better navigation tooling, better verification integration. Benchmark scores compress these contributions into a single number; the engineering behind them is where the consequential decisions live.

For anyone building agent workflows, the implication is that tool design warrants serious attention alongside model selection. A search/replace edit tool with clear error reporting, a compressed repo map for codebase orientation, grep for targeted navigation, and a test runner for verification will produce more reliable results than swapping between frontier models while leaving a poorly designed ACI in place.

Raschka’s article organizes these components clearly and is worth reading as a map of the full system. The point it implies throughout is that each component encodes an interface contract between the model and the environment, and the quality of those contracts compounds across a long multi-step task. Getting the edit format right carries measurable performance consequences at benchmark scale, which puts it in the same design conversation as model selection and context strategy, not in the category of implementation details to be settled later.