
The Scaffolding Decides: Why Coding Agent Tool Design Matters More Than Model Choice

Source: simonwillison

The core loop of a coding agent fits in a few lines of pseudocode:

while not done:
    action = llm.next_action(context)
    result = execute(action)
    context.append(action, result)

The model never touches the filesystem directly. It emits a structured description of what it wants to do; the scaffolding executes it, captures the output, and feeds it back as a new message. This is the ReAct pattern from Yao et al. (2022), and every serious coding agent (Claude Code, Aider, Cursor, Copilot Workspace) runs some variant of it.
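To make the division of labor concrete, here is a minimal runnable sketch of that loop. Everything in it is hypothetical: the Action type, the run_agent function, the "done" sentinel, and the scripted stand-in for a model are illustrations, not any real agent's API.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Action:
    tool: str                                   # name of the tool the model wants to call
    args: dict = field(default_factory=dict)    # arguments for that tool


def run_agent(next_action: Callable[[list], Action],
              tools: dict[str, Callable[..., str]],
              max_turns: int = 20) -> list:
    """Run the loop until the model emits a 'done' action or turns run out."""
    context: list = []
    for _ in range(max_turns):
        action = next_action(context)           # the model only describes the step
        if action.tool == "done":
            break
        try:
            result = tools[action.tool](**action.args)   # the scaffolding executes it
        except Exception as exc:                # failures go back to the model as text
            result = f"error: {exc!r}"
        context.append((action, result))
    return context


# Scripted stand-in for a model (an assumption for the demo): read a file, then stop.
def scripted_model(context: list) -> Action:
    if not context:
        return Action("read_file", {"path": "notes.txt"})
    return Action("done")


fake_files = {"notes.txt": "hello"}
history = run_agent(scripted_model, {"read_file": lambda path: fake_files[path]})
```

Note that an unknown tool name or a crashing tool becomes an error string in the context rather than an exception in the loop, so the model gets a chance to react to it.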

Simon Willison’s guide to how coding agents work covers the loop clearly. What that framing tends to understate is how much the specific design of the tools, not the model behind them, determines whether the agent succeeds on a realistic task. The loop is the same. The engineering decisions inside the scaffolding are where the variance lives.

The Agent-Computer Interface

The 2024 SWE-agent paper from Princeton introduced the term “Agent-Computer Interface” (ACI) to name this design layer explicitly. The paper’s central finding was striking: changes to tool descriptions and output formatting produced larger swings in benchmark performance than changing the underlying model. The same GPT-4, given a better-designed ACI, substantially outperformed itself given a generic one.

The mechanism is straightforward. Tool descriptions are not documentation for a developer reading source code. They are text injected into every prompt, read by the model at every inference call throughout a session. “Read a file” and “Read a file at an absolute path; always provide offset and limit for files longer than 200 lines; re-read before editing” produce observably different behavior over a 30-turn session. The description shapes what the model attempts, what it avoids, and how it recovers from errors.
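A small sketch shows why the wording matters mechanically. The dictionary layout and the render_tools and build_prompt helpers below are assumptions for illustration, not a specific vendor's tool schema; the two descriptions are the ones quoted above.

```python
terse_tool = {
    "name": "read_file",
    "description": "Read a file.",
}
detailed_tool = {
    "name": "read_file",
    "description": (
        "Read a file at an absolute path; always provide offset and limit "
        "for files longer than 200 lines; re-read before editing."
    ),
}


def render_tools(tools: list[dict]) -> str:
    """Render tool definitions the way they would appear inside a prompt."""
    return "\n".join(f"- {t['name']}: {t['description']}" for t in tools)


def build_prompt(tools: list[dict], history: list[str]) -> str:
    # The tool text is re-sent on every inference call, not just the first.
    return "Tools:\n" + render_tools(tools) + "\n\nHistory:\n" + "\n".join(history)


# Thirty turns means the model reads the description thirty times.
prompts = [build_prompt([detailed_tool], [f"turn {i}"]) for i in range(30)]
```

Seen this way, a tool description is closer to a standing instruction than to a docstring.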

SWE-agent found that file viewers should emit line numbers so the model can reference positions, search tools should return surrounding context rather than just file paths, and edit tools should operate on line ranges. Each of these reads like a minor implementation detail. Cumulatively they determine whether a model that can reason about code can also act on it reliably.
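The first two of those findings can be sketched in a few lines. This is an illustration of the idea rather than SWE-agent's own code; the function names, parameters, and output layout are assumptions.

```python
def view_file(text: str, offset: int = 1, limit: int = 100) -> str:
    """Return `limit` lines starting at 1-based line `offset`, each prefixed with its number."""
    lines = text.splitlines()
    window = lines[offset - 1 : offset - 1 + limit]
    return "\n".join(f"{offset + i:>5}  {line}" for i, line in enumerate(window))


def search_with_context(text: str, needle: str, context: int = 1) -> list[str]:
    """Return each matching line with `context` lines on both sides, line-numbered."""
    lines = text.splitlines()
    hits = []
    for i, line in enumerate(lines):
        if needle in line:
            lo = max(0, i - context)
            hi = min(len(lines), i + context + 1)
            hits.append("\n".join(f"{n + 1:>5}  {lines[n]}" for n in range(lo, hi)))
    return hits


source = "import os\n\ndef main():\n    print(os.getcwd())\n"
print(view_file(source, offset=3, limit=2))
print(search_with_context(source, "getcwd"))
```

Both functions return positions the model can quote back in a later edit call, which is the point of numbering the output.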

Edit Formats: The Clearest Illustration

The choice of how an agent edits files is the most consequential single ACI decision, and different systems have landed on meaningfully different approaches.

Full-file rewrite is the simplest: the agent reads a file, outputs the complete new version, and the scaffolding writes it back. There is no ambiguity about edit location. The problem is cost and reliability. A 500-line file requires 500 lines of model output for a one-character change, running to several thousand tokens. More importantly, the model has to reproduce every untouched line from memory: sections it never read get hallucinated, and a file it last saw 15 turns ago gets rewritten from stale content. This approach works acceptably for small files and serves as a fallback, but it does not scale.

Unified diff is token-efficient and familiar to models trained on enormous quantities of diff output from open-source repositories. The model emits standard --- / +++ hunks; the scaffolding applies them with something equivalent to patch. The failure mode is that LLMs hallucinate line numbers. A single wrong character in a hunk header causes the entire patch to fail silently or apply to the wrong location. Aider documented this problem extensively and had to build a repair step into its scaffolding to fix common formatting mistakes before application. GitHub Copilot Workspace uses unified diff, which requires careful post-processing.
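One way to catch a bad hunk before it is applied is to check the header against both the hunk body and the file. The sketch below is a hypothetical pre-check, not Aider's repair step or the behavior of patch; check_hunk and its messages are made up for illustration, and it ignores edge cases such as zero-length ranges.

```python
import re

HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def check_hunk(hunk: str, original: str) -> list[str]:
    """Return a list of problems found in one unified-diff hunk (empty if none)."""
    header, *body = hunk.splitlines()
    m = HUNK_RE.match(header)
    if not m:
        return ["malformed hunk header"]
    old_start = int(m.group(1))
    old_count = int(m.group(2) or 1)    # an omitted count means 1
    new_count = int(m.group(4) or 1)

    # ' ' and '-' lines come from the old file; ' ' and '+' lines form the new one.
    old_lines = [line[1:] for line in body if line[:1] in (" ", "-")]
    new_lines = [line[1:] for line in body if line[:1] in (" ", "+")]

    problems = []
    if len(old_lines) != old_count or len(new_lines) != new_count:
        problems.append("line counts in header do not match hunk body")
    actual = original.splitlines()[old_start - 1 : old_start - 1 + len(old_lines)]
    if actual != old_lines:
        problems.append(f"context does not match file at line {old_start}")
    return problems


original = "a\nb\nc\nd\n"
good = "@@ -2,2 +2,2 @@\n b\n-c\n+C"
bad = "@@ -3,2 +3,2 @@\n b\n-c\n+C"    # same body, start line off by one
print(check_hunk(good, original))
print(check_hunk(bad, original))
```

The two hunks differ by a single character in the header, which is exactly the kind of mistake that needs to surface as an error instead of a misapplied patch.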

Search-and-replace blocks eliminate line numbers entirely by having the model quote the exact text it wants to replace. Aider’s format uses custom markers:

<<<<<<< SEARCH
def calculate_total(items):
    return sum(items)
=======
def calculate_total(items, tax_rate=0.0):
    subtotal = sum(items)
    return subtotal * (1 + tax_rate)
>>>>>>> REPLACE

Aider applies exact match first, then falls back to fuzzy matching via difflib.SequenceMatcher. The fuzzy fallback solves the case where the model quotes text with minor whitespace differences. It also introduces a failure mode: in files with repeated patterns, the fuzzy match can select the wrong location without signaling that anything went wrong.
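The silent-ambiguity problem is easy to reproduce. The following is a simplified sketch of exact-then-fuzzy matching built on difflib.SequenceMatcher, not Aider's actual implementation; find_block, the sliding-window strategy, and the 0.8 threshold are assumptions.

```python
from difflib import SequenceMatcher


def find_block(file_text: str, search: str, threshold: float = 0.8):
    """Return (start_line_index, how) for the best match of `search`, or None."""
    lines = file_text.splitlines()
    target = search.splitlines()
    n = len(target)

    # 1. Exact match on whole lines.
    for i in range(len(lines) - n + 1):
        if lines[i : i + n] == target:
            return i, "exact"

    # 2. Fuzzy fallback: score every window of the same height, keep the best.
    best_i, best_score = None, 0.0
    for i in range(len(lines) - n + 1):
        window = "\n".join(lines[i : i + n])
        score = SequenceMatcher(None, window, search).ratio()
        if score > best_score:          # ties keep the earlier window
            best_i, best_score = i, score
    if best_i is not None and best_score >= threshold:
        return best_i, "fuzzy"
    return None


file_text = "def a():\n    return total\n\ndef b():\n    return total\n"
search = "    return  total"            # stray double space, so no exact match
print(find_block(file_text, search))    # picks the line in a(), says nothing about b()
```

Both return lines score identically, the first one wins, and nothing in the return value tells the caller that an equally good candidate existed three lines further down.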

Claude Code’s str_replace_editor tool offers a different contract. It accepts old_string and new_string as JSON parameters and, unless replace_all is set, requires old_string to appear exactly once in the file. If the string is not found, the tool returns an explicit error and the model retries with more context. If the string appears more than once and replace_all is false, the tool rejects the call and asks the model to disambiguate. There is no silent failure path:

{
  "tool": "str_replace_editor",
  "old_string": "function calculateTotal(items) {\n  return items.reduce((sum, item) => sum + item.price, 0);\n}",
  "new_string": "function calculateTotal(items, taxRate = 0) {\n  const subtotal = items.reduce((sum, item) => sum + item.price, 0);\n  return subtotal * (1 + taxRate);\n}"
}

Exact-match-or-reject is better engineering than fuzzy fallback not because it succeeds more often on the first try, but because its failures are always visible. The model gets an error, reasons about what went wrong, and tries again with a longer context string. Fuzzy matching succeeds silently at the wrong location, producing a broken file that may not fail until tests run several turns later.
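The contract is short enough to sketch in full. This is an illustration of the exact-match-or-reject rule as described above, not Claude Code's source; the EditError class and the error wording are assumptions.

```python
class EditError(Exception):
    """Raised with a message meant to be shown to the model verbatim."""


def str_replace(text: str, old_string: str, new_string: str,
                replace_all: bool = False) -> str:
    """Replace old_string with new_string, refusing missing or ambiguous matches."""
    count = text.count(old_string)
    if count == 0:
        raise EditError("old_string not found; re-read the file and quote it exactly")
    if count > 1 and not replace_all:
        raise EditError(
            f"old_string matches {count} locations; include more surrounding "
            "lines to make it unique, or set replace_all"
        )
    return text.replace(old_string, new_string)


code = "def a():\n    return total\n\ndef b():\n    return total\n"
try:
    str_replace(code, "    return total", "    return total + 1")
except EditError as err:
    print(err)      # the ambiguity is reported instead of guessed at

# Adding one line of context makes the target unique.
fixed = str_replace(code, "def b():\n    return total", "def b():\n    return total + 1")
```

Every branch either performs the edit the model asked for or returns text the model can act on.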

The apply-model approach, which Cursor uses for Instant Apply, splits the work in two: the primary reasoning model describes the change at a high level; a separate, smaller model generates the actual file edit. This keeps the reasoning model focused on understanding the task rather than producing syntactically correct output for a specific file. The tradeoff is added latency and coordination complexity between two inference calls.

Aider publishes edit format benchmarks showing that the winning format varies by model. What works best for Claude does not necessarily work best for GPT-4o or Gemini 2.5. Edit format is a per-model engineering decision, not a universal standard.

Context Window Arithmetic

The context window is the only working memory an agent has, and it fills faster than most people expect. Production system prompts consume 5,000 to 10,000 tokens before any task starts. Tool definitions add more. A verbose bash command can dump 20,000 to 50,000 tokens in a single result. A single large source file runs 2,000 to 5,000 tokens.
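A back-of-the-envelope budget makes the point. The per-item figures below are the upper ends of the ranges just quoted; the 200,000-token window and the 3,000 tokens for tool definitions are assumptions chosen for illustration.

```python
WINDOW = 200_000          # assumed context window size, not a figure from above
SYSTEM_PROMPT = 10_000    # upper end of the 5,000 to 10,000 range
TOOL_DEFS = 3_000         # assumed; the text only says tool definitions "add more"
BASH_DUMP = 50_000        # upper end of one verbose bash result
SOURCE_FILE = 5_000       # upper end of one large source file


def remaining(window: int, fixed: int, bash_dumps: int, files_read: int) -> int:
    """Tokens left after fixed overhead plus the given number of tool results."""
    return window - fixed - bash_dumps * BASH_DUMP - files_read * SOURCE_FILE


fixed = SYSTEM_PROMPT + TOOL_DEFS
left = remaining(WINDOW, fixed, bash_dumps=3, files_read=6)
# 200,000 - 13,000 - 150,000 - 30,000 = 7,000
print(left)
```

Under these assumptions, three untruncated command outputs and six file reads leave about 7,000 tokens, which is less than the fixed overhead the session started with.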

Liu et al. (2023) documented the lost-in-the-middle effect: models perform measurably worse on information placed in the middle of long contexts compared to the beginning or end. This has direct operational consequences. Constraints introduced mid-conversation degrade in reliability over long sessions. Content loaded at session start in a file like CLAUDE.md retains attention throughout. Instructions added at turn 15 of a 40-turn session are less reliably followed than the same instructions in the system prompt.

The practical implication for the agent loop is that the model must re-read the relevant file section immediately before editing, even if it read the same file five turns ago. Earlier edits may have already modified it. An agent that edits from a stale mental model will produce diffs that fail to apply or produce broken output.
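Scaffolding can enforce the re-read rule rather than hoping the model follows it. The sketch below is one hypothetical way to do that, not something the systems named here are documented to do: it fingerprints a file at read time and refuses an edit if the file has changed since, or has not been read at all. FileView and its in-memory file dictionary are illustrative.

```python
import hashlib


class FileView:
    """Tracks what the agent last saw of each file so stale edits can be refused."""

    def __init__(self, files: dict[str, str]):
        self.files = files                  # in-memory stand-in for a filesystem
        self.seen: dict[str, str] = {}      # path -> digest of content at last read

    @staticmethod
    def _digest(text: str) -> str:
        return hashlib.sha256(text.encode()).hexdigest()

    def read(self, path: str) -> str:
        self.seen[path] = self._digest(self.files[path])
        return self.files[path]

    def edit(self, path: str, old: str, new: str) -> str:
        current = self.files[path]
        if self.seen.get(path) != self._digest(current):
            return "error: file not read or changed since last read; re-read before editing"
        if old not in current:
            return "error: old text not found"
        self.files[path] = current.replace(old, new, 1)
        del self.seen[path]                 # the edit itself invalidates the earlier read
        return "ok"


fv = FileView({"a.py": "x = 1\n"})
print(fv.edit("a.py", "x = 1", "x = 2"))    # refused: never read
fv.read("a.py")
print(fv.edit("a.py", "x = 1", "x = 2"))    # accepted
print(fv.edit("a.py", "x = 2", "x = 3"))    # refused: must re-read after the edit
```

Forcing a fresh read after every edit costs tokens, so a real system would likely scope this to the edited region; the sketch trades that efficiency for simplicity.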

What the Benchmarks Show

SWE-bench measures whether an agent’s patch causes the associated tests to pass on 2,294 real GitHub issues. Early GPT-4 baselines with minimal scaffolding scored 2 to 4 percent. Current state-of-the-art systems score 50 to 60 percent on the verified subset. The same models in simpler scaffolding frameworks score 18 to 22 percent. The gap between the model’s ceiling and its floor is largely determined by tool design: edit format, output truncation, context management, and the verification loop that runs tests and feeds failures back.

The verification loop matters because it is what distinguishes a coding agent from an autocomplete tool. The agent makes an edit, runs tests, reads the failure output, reasons about what went wrong, and edits again. That cycle requires a shell execution tool that captures output even on failure, an edit tool that fails loudly on bad inputs, and a context management strategy that keeps enough room for the failure message to be read. All three are scaffolding problems, not model problems.
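The first and third of those requirements can be sketched with the standard library alone. This is a minimal illustration, assuming a character-based cap and a keep-the-tail truncation policy; run_command and its defaults are hypothetical, not any agent's shell tool.

```python
import subprocess
import sys


def run_command(argv: list[str], max_chars: int = 4_000, timeout: int = 60) -> str:
    """Run a command and return its exit code plus output, even when it fails."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return f"exit_code: timeout after {timeout}s"
    except OSError as exc:                  # e.g. the executable does not exist
        return f"exit_code: not started ({exc})"
    output = proc.stdout + proc.stderr
    if len(output) > max_chars:
        # Keep the tail: test runners usually print the failure summary last.
        dropped = len(output) - max_chars
        output = f"[{dropped} characters truncated]\n" + output[-max_chars:]
    return f"exit_code: {proc.returncode}\n{output}"


# A failing command still yields a readable result instead of an exception.
report = run_command([sys.executable, "-c", "import sys; print('1 failed'); sys.exit(1)"])
print(report)
```

A non-zero exit code is ordinary data here, so the failure message reaches the model on the next turn, and the cap keeps one noisy command from consuming the window.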

When evaluating or building with coding agents, the model is the most visible variable and often the least important one to optimize. The tool schema, the edit format, the output truncation strategy, and the session initialization content collectively determine more of the outcome. The scaffolding is the product.
