The Scaffolding Problem: How Tool Design Shapes Coding Agent Behavior

Sebastian Raschka recently published a thorough breakdown of coding agent components that got significant traction on Hacker News. The article is a good survey, but the part that deserves more attention is the implication buried in the architecture diagrams: the model is one variable among many, and often not the most important one. The tooling, the edit format, the context strategy, and the verification loop together determine whether an agent produces usable output or drifts into a failure mode. This post goes deeper on those specifics.

The Core Loop Every Coding Agent Runs

Stripped down, every coding agent runs the same loop:

history = []
while True:
    context = gather_relevant_files()
    action = model.predict(system_prompt + history + context)
    if action == "done":  # stop before trying to execute a non-tool action
        break
    result = execute_tool(action)
    history.append((action, result))

What varies is what goes into gather_relevant_files(), what tools are available in execute_tool(), how history is managed as it grows, and what happens after an edit is applied. These choices compound over a multi-turn task. A bad edit format means the model generates malformed patches. Poor file navigation means irrelevant files bloat the context. Missing verification means errors propagate silently.
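To make the loop concrete, here is a runnable toy version. It is a sketch under stated assumptions: `run_agent`, `scripted_model`, and the dictionary-shaped actions are hypothetical names invented for illustration, the "model" is a fixed script rather than an LLM, and a `max_turns` cap is added so the loop always terminates.

```python
def run_agent(model, tools, task, max_turns=10):
    """Run the agent loop until the model emits 'done' or max_turns is reached."""
    history = [("task", task)]
    for _ in range(max_turns):
        action = model(history)
        if action["tool"] == "done":
            break
        # Dispatch to the named tool and record what came back.
        result = tools[action["tool"]](**action["args"])
        history.append((action, result))
    return history


def scripted_model(history):
    """Hypothetical stand-in for an LLM: replays a fixed list of actions."""
    script = [
        {"tool": "read_file", "args": {"path": "app.py"}},
        {"tool": "done", "args": {}},
    ]
    return script[len(history) - 1]


files = {"app.py": "print('hi')\n"}
tools = {"read_file": lambda path: files[path]}
history = run_agent(scripted_model, tools, "inspect app.py")
```

Everything interesting in a real agent happens inside `model` and `tools`; the loop itself stays this small.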

The ACI Layer: Where Most of the Engineering Lives

The SWE-agent paper from Princeton (Yang et al., 2024) coined the term Agent-Computer Interface (ACI) to describe the layer between the model and the environment. The finding was that the same underlying model, given different tool schemas and interface designs, produces benchmark scores that vary by 15-20 percentage points on SWE-bench. That is a larger effect than switching between models of similar capability.

The tool schema design is where this effect originates. Consider the basic problem of editing a file. There are three common approaches:

Whole-file replacement: The model reads a file and writes back the entire modified content. Simple to implement and parse. Fails badly on files over a few hundred lines because models tend to drop, elide, or subtly alter code they were supposed to leave unchanged, and the approach wastes context budget on reproducing that unchanged code.

Unified diff: The model generates a patch in standard diff format. Compact and expressive, but models generate syntactically invalid diffs frequently, particularly around line number accounting and context lines. Applying a malformed diff requires a fuzzy-apply algorithm, which introduces its own failure modes.

Search/replace: The model specifies an exact string to find and the replacement content. Claude Code uses this as its primary edit mechanism, with old_string and new_string fields. The key design constraint is that old_string must match exactly and be unique in the file. This forces the model to reproduce existing code accurately, which in practice means it reads more carefully before editing.
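The exact-and-unique constraint on search/replace is easy to sketch. The helper below is a hypothetical illustration operating on an in-memory string, not Claude Code's actual implementation; only the `old_string` and `new_string` field names come from the description above.

```python
def apply_edit(content, old_string, new_string):
    """Replace old_string with new_string only if it occurs exactly once."""
    if not old_string:
        raise ValueError("old_string must not be empty")
    count = content.count(old_string)
    if count == 0:
        # The model misremembered the file; it should re-read before retrying.
        raise ValueError("old_string not found in file")
    if count > 1:
        # Ambiguous target; the model should include more surrounding lines.
        raise ValueError(f"old_string matches {count} locations, expected 1")
    return content.replace(old_string, new_string, 1)
```

Rejecting ambiguous or missing matches turns a silent bad edit into an explicit error message the model can react to on the next turn.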

Aider uses a text-level variant of search/replace where the model emits structured blocks in its response:

path/to/file.py
<<<<<<< SEARCH
def compute_total(items):
    return sum(items)
=======
def compute_total(items, tax_rate=0.0):
    return sum(items) * (1 + tax_rate)
>>>>>>> REPLACE

Aider’s approach predates widespread tool-use support in APIs and works well with models that generate coherent text but are unreliable with structured JSON tool calls. The tradeoff is that parsing is fragile; the model must produce exact fence syntax, and any deviation breaks the application step.
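The parsing fragility is visible in even a small parser. The following is a sketch, assuming the block layout shown above (a path line, then the SEARCH, divider, and REPLACE markers each on their own line); `BLOCK_RE` and `parse_blocks` are hypothetical names, and this is not Aider's actual parser.

```python
import re

# One block: a path line, the SEARCH marker, the text to find,
# a divider, the replacement text, and the REPLACE marker.
BLOCK_RE = re.compile(
    r"^(?P<path>[^\n]+)\n"
    r"<<<<<<< SEARCH\n"
    r"(?P<search>.*?)"
    r"^=======\n"
    r"(?P<replace>.*?)"
    r"^>>>>>>> REPLACE[ \t]*$",
    re.MULTILINE | re.DOTALL,
)


def parse_blocks(reply):
    """Return (path, search_text, replace_text) for each well-formed block."""
    return [
        (m["path"].strip(), m["search"], m["replace"])
        for m in BLOCK_RE.finditer(reply)
    ]
```

A reply whose closing marker is one character short yields no blocks at all, which is the failure mode described above: the edit is dropped rather than partially applied.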

Cursor takes a different direction: it uses a small secondary model fine-tuned specifically to apply diffs. The primary model generates intent; the apply model translates that intent into a valid patch. This separates concerns cleanly but adds infrastructure complexity and a second model cost on every edit.

Paul Gauthier’s benchmark work on Aider has documented this extensively. Edit format selection moves scores on Aider’s own code-editing benchmarks by roughly 5-15 points independent of model, which is consistent with what the SWE-agent paper found through different methods.

Navigation: Deciding What the Model Sees

A large codebase contains more code than fits in any context window. Navigation strategy determines which subset the model sees, and a bad subset leads to confident but wrong edits.

The four main approaches used in production agents:

Grep-first navigation: The agent issues search calls to find relevant symbols before reading files. Claude Code defaults to this pattern: search for the function name, identify which files contain it, then read those files. This is cheap and effective for targeted changes but degrades on tasks that require broad codebase understanding.

LSP-backed semantic search: Cursor maintains a live language server index. The agent can query for all references to a symbol, find the definition of a type, or navigate import graphs semantically rather than through text search. This is more powerful but requires running a language server alongside the agent, which adds complexity and limits portability.

Repository maps: Aider generates a compressed map of the entire repository using ctags or tree-sitter, extracting class names, function signatures, and file structure into a token-efficient summary. This map (typically 2,000-8,000 tokens depending on repo size) stays in context across the entire task. The model has global orientation without needing to read every file. The cost is the map tokens on every turn.

Embedding search: Some agents, including variants of Devin and Copilot Workspace, embed the codebase and retrieve semantically relevant chunks. This handles cases where the relevant code is structurally distant from the search query. The tradeoff is infrastructure overhead and the risk of retrieving from a stale index when the code has changed since it was embedded.
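Of the four approaches, the repository map is the simplest to illustrate. The sketch below uses Python's standard `ast` module as a stand-in for ctags or tree-sitter, so it only handles Python source; `summarize_source` and `build_repo_map` are hypothetical names, and this is not Aider's actual map format.

```python
import ast


def _signature(node):
    """Render a function node as 'def name(arg1, arg2)'."""
    args = ", ".join(a.arg for a in node.args.args)
    return f"def {node.name}({args})"


def summarize_source(path, source):
    """List top-level classes, their methods, and top-level functions."""
    funcs = (ast.FunctionDef, ast.AsyncFunctionDef)
    lines = [f"{path}:"]
    for node in ast.parse(source).body:
        if isinstance(node, funcs):
            lines.append(f"  {_signature(node)}")
        elif isinstance(node, ast.ClassDef):
            lines.append(f"  class {node.name}")
            for item in node.body:
                if isinstance(item, funcs):
                    lines.append(f"    {_signature(item)}")
    return "\n".join(lines)


def build_repo_map(files):
    """files maps path -> source text; returns one compact summary string."""
    return "\n".join(summarize_source(p, s) for p, s in sorted(files.items()))
```

Function bodies are discarded entirely, which is where the token savings come from: the model sees what exists and where, and reads full files only when it needs the details.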

None of these approaches is dominant across all task types. Grep-first is fast and sufficient for surgical edits. Repo maps handle exploratory refactors better. Embedding search handles the case where you do not know what you are looking for. Most sophisticated agents combine strategies.
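For comparison, grep-first navigation needs almost no machinery. This is a minimal sketch assuming a local checkout of Python files; `grep_symbol` is a hypothetical helper, whereas real agents typically shell out to a search tool such as ripgrep.

```python
import re
from pathlib import Path


def grep_symbol(root, symbol, glob="*.py"):
    """Map each file under root to the (line_number, line) pairs mentioning symbol."""
    pattern = re.compile(rf"\b{re.escape(symbol)}\b")
    hits = {}
    for path in sorted(Path(root).rglob(glob)):
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable entries
        matches = [
            (number, line.strip())
            for number, line in enumerate(text.splitlines(), 1)
            if pattern.search(line)
        ]
        if matches:
            hits[str(path.relative_to(root))] = matches
    return hits
```

The result tells the agent which files to read next, which is the whole strategy: search for the symbol, read the files that mention it, then edit.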

Verification Loops Close the Quality Gap

The difference between a coding agent that is occasionally useful and one that can run autonomously on a ticket comes down to verification. After an edit is applied, what happens?

The spectrum runs from nothing (apply and move on) through syntax checking and linting, up to full test execution with iterative repair. The last option, where failing test output is fed back into the loop as a new observation, mirrors how SWE-bench scores a submission (by running the repository's tests against the patch) and is what most benchmark leaders implement.

Claude Code handles this through Bash tool access. The model can run npm test, pytest, cargo test, or any other command and see the output. When a test fails, the failure message becomes part of the conversation history and the model attempts a repair. This is prompt-orchestrated rather than hard-coded: the system prompt instructs the model to run tests after edits when a test suite is present and to read error output carefully before retrying.
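The mechanics of turning a test run into an observation can be sketched with the standard library. `run_check` is a hypothetical helper, not part of Claude Code; the example invokes the current Python interpreter so it runs anywhere, where a real agent would pass a command such as pytest.

```python
import subprocess
import sys


def run_check(command, timeout=120):
    """Run a command and return a pass/fail flag plus its combined output."""
    try:
        proc = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return {"passed": False, "output": f"timed out after {timeout}s"}
    return {
        "passed": proc.returncode == 0,
        "output": (proc.stdout + proc.stderr).strip(),
    }


# The returned dict is what would be appended to the conversation history.
observation = run_check([sys.executable, "-c", "print('ok')"])
```

The timeout matters as much as the output capture: a hanging test suite otherwise stalls the whole agent.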

Devin’s architecture emphasizes long-horizon repair more aggressively, running for many turns on a failing test suite and using web search to look up unfamiliar error messages. The capability is genuine but comes with a failure mode: the agent can spin in a repair loop, making incremental changes that do not converge, consuming significant time and cost.

The verification design question is not just “does the agent run tests” but also “how does it decide when to stop.” A model that runs tests until they pass is correct in theory but requires some circuit-breaker for cases where the tests are flawed, the task is misspecified, or the repair loop is diverging.
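One way to express such a circuit-breaker is a repair loop with two stop conditions: a fixed budget of fixes, and a check for the same failure output recurring. This is a sketch under those assumptions; `repair_loop`, `run_tests`, and `attempt_fix` are hypothetical names, and identical output is only a rough proxy for non-convergence.

```python
def repair_loop(run_tests, attempt_fix, max_fixes=5, max_repeats=2):
    """Alternate testing and fixing; return (status, number_of_fixes_applied).

    run_tests() returns (passed, output); attempt_fix(output) applies one repair.
    """
    seen = {}
    fixes = 0
    while True:
        passed, output = run_tests()
        if passed:
            return "passed", fixes
        # Count how often this exact failure has appeared.
        seen[output] = seen.get(output, 0) + 1
        if seen[output] > max_repeats:
            return "stalled", fixes  # repairs are not changing the outcome
        if fixes >= max_fixes:
            return "budget_exhausted", fixes
        attempt_fix(output)
        fixes += 1
```

A "stalled" or "budget_exhausted" result is the point at which an agent should hand back to a human rather than keep spending.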

Context Management as the Unsolved Problem

As a task runs across many turns, the conversation history grows. Tool outputs, file contents, and intermediate edits accumulate. Most agents handle this through some combination of truncation (dropping oldest messages), tool output summarization, and selective re-injection of important context.

Claude Code pins the original task description and truncates verbose tool outputs (bash commands that produce large outputs are trimmed to head and tail). Aider keeps the repo map in context and manages file contents as explicit attachments that can be added or removed. Devin uses summarization more aggressively for long tasks.
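Both techniques, head-and-tail trimming and pinning the task, fit in a few lines. The helpers below are hypothetical sketches that count lines and messages rather than tokens, and they do not reflect any specific agent's implementation.

```python
def trim_output(text, head=20, tail=20):
    """Keep the first `head` and last `tail` lines of a long tool output."""
    lines = text.splitlines()
    if len(lines) <= head + tail:
        return text
    omitted = len(lines) - head - tail
    marker = f"... [{omitted} lines omitted] ..."
    return "\n".join(lines[:head] + [marker] + lines[len(lines) - tail:])


def truncate_history(messages, max_messages):
    """Keep the first message (the task) pinned plus the most recent others."""
    if len(messages) <= max_messages:
        return list(messages)
    keep = max_messages - 1
    recent = messages[len(messages) - keep:] if keep > 0 else []
    return [messages[0]] + recent
```

For example, `truncate_history(["task", "a", "b", "c", "d"], 3)` returns `["task", "c", "d"]`: the task survives, but anything established in messages "a" and "b" is gone.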

The hard case is a task that requires information established early in the conversation to make a decision late in the task. Truncation-with-recency-bias loses exactly this kind of early context. There is no fully satisfying solution to this in current systems; it is one of the primary reasons coding agents struggle more on long multi-file refactors than on targeted single-file fixes.

What This Means in Practice

Raschka’s article is worth reading as a map of the territory. The deeper takeaway is that evaluating a coding agent by the underlying model name misses most of the interesting variation. Two agents using the same model but different edit formats, navigation strategies, and verification loops will produce substantially different results on real tasks.

For anyone building on top of these systems, the implication is that the system prompt and tool schema design are the primary levers. The SWE-agent research provides concrete evidence for this and is worth reading alongside Raschka’s overview. Aider’s edit format leaderboard posts show what ablations look like in practice.

The model matters. The scaffolding matters more than people typically assume.