Tool Design Is the Hidden Variable in Coding Agent Performance

The basic architecture of a coding agent is not complicated. An LLM receives a task, calls a tool, observes the result, and loops until the task is done or the context fills up. Sebastian Raschka’s breakdown of coding agent components covers this loop clearly, and the components he identifies are real: the planning layer, tool execution, context management, and verification.

But what the survey view misses is that these components are not equally important, and they don’t contribute equally to failure. In practice, the design of the tools themselves, especially the file editing tool, is where most of the interesting engineering happens and where the differences between agents become concrete.

The Agent Loop in Practice

The standard loop most coding agents implement is a ReAct-style cycle: the model reasons about the current state, decides on an action, executes a tool call, and uses the result to inform the next step. The loop runs until the model emits a terminal response or hits a resource limit.

Here’s what that looks like in simplified pseudocode:

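messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": task},
]

for _ in range(MAX_STEPS):  # resource limit: stop even if the model never finishes
    response = llm.complete(messages, tools=TOOLS)
    messages.append(response)  # record the assistant turn, including the final one

    if response.stop_reason == "end_turn":
        break  # terminal response: the model considers the task done

    # Run every requested tool call and feed the results back to the model.
    tool_results = [
        execute_tool(tool_call.name, tool_call.input)
        for tool_call in response.tool_calls
    ]
    messages.append({"role": "tool", "content": tool_results})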

Every major coding agent, including Aider, Claude Code, Cursor’s agent mode, OpenHands, and SWE-agent, implements some version of this. The loop itself is not a differentiator. The tools are.
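To make the loop concrete, here is a runnable toy version. Everything in it is a stand-in chosen for illustration: `ScriptedModel` replays canned responses instead of calling an LLM, `read_file` reads from an in-memory dictionary, and the `Response` and `ToolCall` shapes are assumptions rather than any vendor's API.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    input: dict


@dataclass
class Response:
    stop_reason: str
    text: str = ""
    tool_calls: list = field(default_factory=list)


class ScriptedModel:
    """Hypothetical stand-in for an LLM client: replays canned responses."""

    def __init__(self, script):
        self._script = iter(script)

    def complete(self, messages, tools):
        return next(self._script)


def read_file(path: str) -> str:
    # Toy in-memory "filesystem" so the example has no side effects.
    files = {"src/server.ts": "const timeout = 5000;\n"}
    return files.get(path, f"error: {path} not found")


TOOLS = {"read_file": read_file}


def run_agent(model, task, max_steps=10):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        response = model.complete(messages, tools=TOOLS)
        messages.append({"role": "assistant", "content": response})
        if response.stop_reason == "end_turn":
            return response.text, messages
        # Execute each requested tool and hand the results back to the model.
        results = [TOOLS[call.name](**call.input) for call in response.tool_calls]
        messages.append({"role": "tool", "content": results})
    return "stopped: step limit reached", messages


model = ScriptedModel([
    Response("tool_use", tool_calls=[ToolCall("read_file", {"path": "src/server.ts"})]),
    Response("end_turn", text="The timeout is 5000 ms."),
])
answer, transcript = run_agent(model, "What is the timeout?")
print(answer)
```

The loop body is about ten lines. Everything that distinguishes one agent from another lives in what `TOOLS` contains and in what those tools return.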

The Edit Format Problem

The most consequential design decision in a coding agent is how it edits files. This sounds mundane, but it determines how often the agent succeeds on non-trivial tasks.

There are four main approaches in active use:

Whole-file rewrites. The agent reads a file and writes it back entirely. Simple to implement, easy for the model to get right, but burns context budget fast on large files and produces noisy diffs that are hard to review.

Unified diffs. The model outputs standard diff format:

--- a/src/server.ts
+++ b/src/server.ts
@@ -42,1 +42,1 @@
-  const timeout = 5000;
+  const timeout = 10000;

Compact and reviewable, but models make mistakes with line numbers and context lines, leading to diffs that fail to apply cleanly.

SEARCH/REPLACE blocks. Used by Aider, this format requires the model to output an exact string to find and the replacement:

<<<<<<< SEARCH
const timeout = 5000;
=======
const timeout = 10000;
>>>>>>> REPLACE

The key property is that the search string must match the file exactly. If the model hallucinates whitespace or a slightly different variable name, the edit fails. Aider addresses this with fuzzy matching fallbacks, but the fundamental brittleness remains whenever the model drifts from what is actually on disk.

str_replace with exact matching. This is the approach Claude Code uses. The tool schema takes an old_string and new_string. The tool applies the replacement only if old_string exists verbatim in the file. Ambiguous replacements, where the string appears multiple times, fail with an error that forces the model to provide enough surrounding context to uniquely identify the target location.
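The exact-match rule can be sketched in a few lines. This is an illustration of the behavior described above, not Claude Code's actual implementation: the function name, the return shape, and the wording of the error messages are all assumptions.

```python
def str_replace(content: str, old_string: str, new_string: str) -> tuple[str, str]:
    """Return (new_content, message). Content is unchanged when the edit is rejected."""
    if old_string == "":
        return content, "error: old_string must not be empty"
    count = content.count(old_string)
    if count == 0:
        return content, "error: old_string not found; re-read the file and copy the text exactly"
    if count > 1:
        return content, (
            f"error: old_string matches {count} locations; "
            "include more surrounding text to make it unique"
        )
    return content.replace(old_string, new_string, 1), "ok"


source = "const retries = 5000;\nconst timeout = 5000;\n"

# Ambiguous: "5000" appears twice, so the edit is rejected with guidance.
_, ambiguous_msg = str_replace(source, "5000", "10000")

# Unique: enough context pins down a single location.
updated, ok_msg = str_replace(source, "timeout = 5000", "timeout = 10000")
```

Note that both failure paths return a message that tells the model what to do next. The rejection is only useful if the model can act on it.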

The difference between these formats is not cosmetic. The SWE-agent paper from Princeton (Yang et al., 2024) introduced the term “Agent-Computer Interface” (ACI) to describe this layer between the model and the filesystem, and demonstrated empirically that ACI design dramatically affects task success rates. Their custom interface, which included an interactive file editor with a visible line-number window and informative error messages designed for LLM recovery, outperformed a naive bash-based interface by a substantial margin on the SWE-bench benchmark, using the same underlying model.

The implication is that tool design is a first-class engineering problem, not scaffolding you bolt on after choosing a model.
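One way to build an error message "designed for LLM recovery" is to validate an edit before committing it and to report exactly what went wrong. The sketch below is a simplified illustration of that idea and does not reproduce SWE-agent's code. Using Python's `ast.parse` as the check, the `guarded_write` name, and the message wording are assumptions.

```python
import ast


def guarded_write(original: str, proposed: str) -> tuple[str, str]:
    """Accept `proposed` only if it parses as Python; otherwise keep `original`."""
    try:
        ast.parse(proposed)
    except SyntaxError as exc:
        message = (
            f"edit rejected: syntax error at line {exc.lineno}: {exc.msg}. "
            "The file was not modified; fix the edit and retry."
        )
        return original, message
    return proposed, "edit applied"


good = "def f():\n    return 1\n"
broken = "def f(:\n    return 1\n"

kept, rejection = guarded_write(good, broken)
applied, confirmation = guarded_write(good, "def f():\n    return 2\n")
```

The design choice is that a bad edit never reaches disk, so the model's next turn starts from a known-good file plus a specific complaint, rather than from a broken file and a vague failure.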

Codebase Navigation

Before editing, an agent needs to find the right files. A real codebase has thousands of files and millions of tokens of source code; a context window fits maybe a few hundred kilobytes. Agents handle this navigation problem differently.

Aider’s approach is a repo map: a compressed, tree-sitter-generated summary of the entire codebase that shows function signatures, class definitions, and call relationships without full implementations. The repo map fits in a fraction of the context budget and gives the model enough information to reason about what to read next. This is essentially static analysis used as a context compression tool.
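The idea of a signature-only summary can be shown with Python's standard `ast` module in place of tree-sitter. This is a deliberately reduced sketch and not Aider's algorithm: it handles Python only, lists top-level definitions and methods, and omits the call relationships a real repo map includes. The names `signature` and `summarize_module` are hypothetical.

```python
import ast


def signature(node) -> str:
    # Positional parameters only; defaults, *args and annotations are omitted for brevity.
    params = ", ".join(arg.arg for arg in node.args.args)
    return f"def {node.name}({params})"


def summarize_module(path: str, source: str) -> str:
    """Render the top-level classes and functions of one module as signature lines."""
    functions = (ast.FunctionDef, ast.AsyncFunctionDef)
    lines = [path + ":"]
    for node in ast.parse(source).body:
        if isinstance(node, functions):
            lines.append("  " + signature(node))
        elif isinstance(node, ast.ClassDef):
            lines.append(f"  class {node.name}")
            for item in node.body:
                if isinstance(item, functions):
                    lines.append("    " + signature(item))
    return "\n".join(lines)


example = '''
class Server:
    def start(self, port):
        self.port = port

def load_config(path):
    return {}
'''
repo_map = summarize_module("server.py", example)
print(repo_map)
```

The function bodies never appear in the output, which is the point: the model sees what exists and where, and pays for implementations only when it decides to read a file.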

Claude Code uses a more reactive approach: glob patterns, grep, and selective reads. Rather than pre-building a codebase summary, it relies on the model to issue targeted searches. This works well when the model knows what to look for and degrades when it doesn’t, because the search-read-search loop accumulates context quickly before any editing begins.
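A search tool built for this style of navigation has to protect the context window from its own results. The sketch below assumes an in-memory mapping of paths to file contents so that it stays self-contained. The `grep` function, its `max_hits` parameter, and the truncation message are hypothetical, not any agent's real tool schema.

```python
import re


def grep(files: dict[str, str], pattern: str, max_hits: int = 20) -> list[str]:
    """Search in-memory files; return 'path:line: text' hits, capped at max_hits."""
    regex = re.compile(pattern)
    hits = []
    for path in sorted(files):
        for number, line in enumerate(files[path].splitlines(), start=1):
            if regex.search(line):
                if len(hits) == max_hits:
                    # Tell the model the list is incomplete instead of flooding the context.
                    hits.append("... more matches omitted; narrow the pattern")
                    return hits
                hits.append(f"{path}:{number}: {line.strip()}")
    return hits


files = {
    "a.py": "def login():\n    pass\n",
    "b.py": "x = 1\nlogin()\n",
}
all_hits = grep(files, r"login")
capped = grep(files, r"login", max_hits=1)
```

Capping the output bounds the cost of a single broad search, but it does not remove the underlying problem: each refinement is another round trip that adds to the history.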

Cursor goes in a different direction with its background indexer: an embeddings-based semantic index that supports retrieval-augmented search. The agent can query for “authentication logic” and get back relevant file chunks without knowing filenames. This comes at operational cost, since the index needs to stay current, but handles large codebases where symbol names are opaque or the relevant code is spread across many small files.
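The retrieval mechanics (embed the chunks, embed the query, rank by similarity) can be sketched without a model. The `embed` function below is a toy that counts words, so it only matches on shared vocabulary. A real index would use a learned embedding model, which is what lets a conceptual query match code that uses different words. Nothing here reflects Cursor's implementation, and the chunk descriptions are invented for the example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts. A real index would use a learned model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(index: dict[str, Counter], query: str, k: int = 2) -> list[str]:
    """Return the k chunk names most similar to the query."""
    q = embed(query)
    return sorted(index, key=lambda name: cosine(q, index[name]), reverse=True)[:k]


chunks = {
    "auth/session.py": "verify user password and issue authentication token",
    "billing/invoice.py": "compute invoice totals and tax",
    "util/strings.py": "string padding helpers",
}
index = {name: embed(text) for name, text in chunks.items()}  # built once, reused per query
top = search(index, "authentication logic", k=1)
```

The operational cost mentioned above shows up in the `index` line: every file change invalidates some entries, and something has to recompute them.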

None of these is universally better. The repo-map approach is excellent when the agent needs a global view of the codebase before deciding where to make changes. Reactive search is cheaper when the task is localized and the model can navigate by filename or symbol name. Semantic retrieval pays off when the codebase is large and the query is conceptual rather than syntactic.

Verification and the Self-Repair Loop

A coding agent that only edits files and never checks its work is not useful for serious tasks. Most production agents include some form of verification: run the test suite, invoke a type checker, execute a linter, or run the modified code and check exit codes.

The verification loop looks roughly like this: edit, execute, observe the output, and if it contains errors, re-enter the planning phase with the error as new context. This is what makes agents genuinely useful for debugging tasks, since they can iterate on a failure rather than producing a single-shot answer.
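A minimal sketch of that cycle follows. The model is replaced by a scripted `propose_fix` callable, and the check is a Python one-liner run in a subprocess. Both are stand-ins, and the function names and the `max_attempts` limit are assumptions.

```python
import subprocess
import sys


def run_check(command: list[str], timeout: float = 60.0) -> tuple[bool, str]:
    """Run a verification command; return (passed, combined output)."""
    proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    return proc.returncode == 0, proc.stdout + proc.stderr


def repair_loop(propose_fix, make_command, max_attempts: int = 3):
    """Edit, execute, observe; feed failures back until the check passes or attempts run out."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        code = propose_fix(feedback)  # stand-in for the model producing an edit
        passed, output = run_check(make_command(code))
        if passed:
            return attempt, code
        feedback = output  # the error output becomes context for the next attempt
    return None, None


# Scripted "model": the first proposal fails its check, the second passes.
proposals = iter(["assert 1 + 1 == 3", "assert 1 + 1 == 2"])
attempt, code = repair_loop(
    propose_fix=lambda feedback: next(proposals),
    make_command=lambda code: [sys.executable, "-c", code],
)
```

In a real agent `make_command` would build a test runner or type checker invocation, and `feedback` would be appended to the message history rather than passed as an argument.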

The fragility is in error interpretation. When a compiler emits cascading template errors or a test framework outputs a stack trace involving framework internals, the model needs to correctly attribute the root cause to the code it just wrote. Models handle this well for common languages and common error patterns, and poorly for obscure toolchains or heavily abstracted error output where the failure site is many frames removed from the symptom.
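One mitigation is to pre-process error output so the frames the agent can act on stand out. The sketch below filters a Python traceback down to frames in files the agent just edited. It is a heuristic offered as an assumption, not a technique attributed to any particular agent, and the regular expression only covers the standard CPython traceback format.

```python
import re

FRAME = re.compile(r'^\s*File "(?P<path>[^"]+)", line (?P<line>\d+), in (?P<func>\S+)')


def frames_in_edited_files(traceback_text: str, edited: set[str]) -> list[str]:
    """Keep only traceback frames located in files the agent just modified."""
    kept = []
    for line in traceback_text.splitlines():
        match = FRAME.match(line)
        if match and match["path"] in edited:
            kept.append(f'{match["path"]}:{match["line"]} in {match["func"]}')
    return kept


trace = '''Traceback (most recent call last):
  File "/venv/lib/framework/runner.py", line 88, in dispatch
    handler(request)
  File "app/views.py", line 12, in show_user
    return render(user.name)
  File "/venv/lib/framework/render.py", line 40, in render
    raise TypeError("expected str")
TypeError: expected str
'''
focused = frames_in_edited_files(trace, {"app/views.py"})
```

Here the exception is raised inside the framework, but the only frame the agent controls is `app/views.py:12`, which is where the bad argument originates.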

Some agents implement smarter verification by running targeted rather than full test suites. Rather than running every test on every edit, which is expensive in a multi-iteration loop, the agent narrows the run to tests covering modified files. This is a practical tradeoff between confidence and context cost.
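A simple version of test narrowing maps each modified module to a test file by naming convention. The `test_<module>.py` convention and the fallback to the full suite are assumptions made for this sketch. A more faithful implementation would use coverage data to find the tests that actually exercise the changed code.

```python
from pathlib import PurePosixPath


def select_tests(modified: list[str], all_tests: list[str]) -> list[str]:
    """Pick tests whose filename is test_<module>.py for some modified module."""
    wanted = {f"test_{PurePosixPath(path).stem}.py" for path in modified}
    selected = [test for test in all_tests if PurePosixPath(test).name in wanted]
    # Fall back to the full suite when the convention finds nothing.
    return selected or list(all_tests)


tests = ["tests/test_server.py", "tests/test_auth.py"]
narrowed = select_tests(["src/server.py"], tests)
fallback = select_tests(["src/unknown.py"], tests)
```

The fallback matters: silently running zero tests would report success without verifying anything.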

Context Pressure

The longest-running unsolved problem in coding agent design is context accumulation. The agent loop builds up messages with each iteration: tool calls, tool results, intermediate reasoning, file contents. A complex multi-file refactor can exhaust even a 200k-token context window before completing.

The main strategies for managing this:

Sliding window drops the oldest messages. Cheap to implement but risks losing critical earlier context, particularly the initial task description or prior file reads that informed later decisions.

Summarization periodically compresses older history into a condensed representation. More expensive but preserves semantic content across long sessions.

Subagents spawn a fresh context window for bounded subtasks and report back results. Claude Code’s Agent tool does this explicitly. The main agent maintains high-level state while delegating detail work to subagents with clean context windows, and the subagents never see the accumulation of prior turns.

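Prompt caching allows expensive system prompts and stable file contents to be cached across turns, reducing both latency and token cost without any truncation. Anthropic’s prompt caching API makes this practical for agents that repeatedly reference the same large files. Unlike the other three strategies, caching does not free any space in the context window; it only makes a long, stable prefix cheaper to resend, so it complements truncation and delegation rather than replacing them.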
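The first two strategies can be combined in one small routine: pin the messages that must survive, evict the oldest of the rest, and leave a summary in their place. The sketch below makes several assumptions: a token estimate of four characters per token, a caller-supplied `summarize` function standing in for a model call, and a `pinned` count of two (system prompt and task). A plain sliding window is the special case where `summarize` returns a fixed placeholder.

```python
def estimate_tokens(message: dict) -> int:
    # Rough assumption: about four characters per token.
    return max(1, len(message["content"]) // 4)


def compact(messages: list[dict], budget: int, summarize, pinned: int = 2) -> list[dict]:
    """Keep the first `pinned` messages; evict the oldest of the rest until under budget."""
    head, tail = messages[:pinned], messages[pinned:]
    dropped = []
    while tail and sum(map(estimate_tokens, head + tail)) > budget:
        dropped.append(tail.pop(0))  # oldest unpinned message goes first
    if not dropped:
        return messages
    # The note itself costs tokens; this toy version does not re-check the budget.
    note = {"role": "user", "content": "Summary of earlier steps: " + summarize(dropped)}
    return head + [note] + tail


history = [
    {"role": "system", "content": "You are a coding agent."},
    {"role": "user", "content": "Raise the timeout to 10 seconds."},
    {"role": "tool", "content": "x" * 400},
    {"role": "tool", "content": "y" * 400},
    {"role": "assistant", "content": "Edited src/server.ts."},
]
compacted = compact(
    history,
    budget=150,
    summarize=lambda dropped: f"{len(dropped)} tool result(s) omitted",
)
```

Pinning the head addresses the failure mode named above for sliding windows: the task description is never the message that gets evicted.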

The subagent pattern is architecturally interesting because it mirrors how large software projects are actually organized: a coordinator maintains the overall plan while delegating bounded tasks to specialists with narrower context. The tradeoff is that subagents cannot see the main conversation history, so the coordinator must explicitly summarize everything relevant when spawning them, which itself costs tokens.
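The isolation and its cost are both visible in a small sketch. The names `run_subagent` and `delegate`, the `model_call` callable, and the message shapes are hypothetical; this shows the pattern, not Claude Code's implementation.

```python
def run_subagent(model_call, brief: str) -> str:
    """Run a bounded subtask in a fresh message list and return only its final report."""
    messages = [{"role": "user", "content": brief}]  # no access to the coordinator's history
    return model_call(messages)


def delegate(coordinator_history: list[dict], model_call, subtask: str, context_summary: str) -> None:
    # The coordinator must spell out everything relevant; this brief is the token cost.
    brief = f"Context: {context_summary}\nTask: {subtask}"
    report = run_subagent(model_call, brief)
    # Only the short report enters the coordinator's context, not the subagent's turns.
    coordinator_history.append({"role": "tool", "content": report})


seen = []


def fake_model(messages):
    # Scripted stand-in for an LLM call; records what the subagent was shown.
    seen.append(list(messages))
    return "Found 3 call sites of parse_config in src/."


history = [{"role": "user", "content": "Rename parse_config to load_config."}]
delegate(history, fake_model, "List call sites of parse_config.", "Python repo, sources under src/.")
```

However many tool calls the subagent makes internally, the coordinator's history grows by exactly one message.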

What the Survey View Leaves Out

Raschka’s article gives you the right vocabulary for thinking about coding agents: there is a loop, there are tools, there is context management, and there is a verification step. That framing is accurate.

What it doesn’t capture is how deeply these layers interact. The edit format sets the context cost of each change: whole-file rewrites are expensive and str_replace edits are not. That cost determines how many verification iterations fit within a fixed context budget, which in turn determines the success rate on tasks that need more than one attempt. The codebase navigation strategy determines how much context is consumed before editing begins. Everything compounds.
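A back-of-the-envelope calculation shows the compounding. Apart from the 200k-token window mentioned earlier, every number below is an assumption chosen for illustration, not a measurement.

```python
def affordable_iterations(budget: int, navigation: int, edit_cost: int, verify_cost: int) -> int:
    """How many edit-plus-verify rounds fit after navigation has consumed its share."""
    remaining = budget - navigation
    return max(0, remaining // (edit_cost + verify_cost))


BUDGET = 200_000     # context window in tokens
NAVIGATION = 40_000  # assumed: searches and file reads before the first edit
VERIFY = 3_000       # assumed: test output fed back per round

# Assumed per-edit costs: rewriting a large file versus a targeted replacement.
whole_file = affordable_iterations(BUDGET, NAVIGATION, edit_cost=12_000, verify_cost=VERIFY)
targeted = affordable_iterations(BUDGET, NAVIGATION, edit_cost=500, verify_cost=VERIFY)
print(whole_file, targeted)
```

Under these assumptions the whole-file format affords 10 rounds and the targeted format 45. Doubling the navigation cost to 80,000 tokens would cut those to 8 and 34, which is how a choice made in one layer limits another.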

The agents that perform best on benchmarks like SWE-bench are not necessarily using better base models. Comparisons that hold the model fixed and vary only the interface, such as the SWE-agent results discussed above, show that scaffolding design explains a significant portion of the performance gap between agents. That is the insight worth internalizing before building or integrating a coding agent: the model is largely a fixed input, but the tool schemas, edit formats, navigation strategy, and verification loop are the variables you actually control.