The Agent-Computer Interface: Why Scaffolding Explains More Than Model Choice in Coding Agents
Source: simonwillison
Coding agents are built on a deceptively simple foundation: a loop. The model receives context, picks an action, the scaffolding executes that action, the result goes back into context, and the cycle repeats until the task is done or the model gives up. Simon Willison’s guide to agentic engineering patterns lays this out clearly, and it is a useful starting point. What the loop description leaves open is the harder question: why do agents built on the same underlying models perform so differently from each other?
The answer sits in the scaffolding. Princeton’s SWE-agent paper introduced the term Agent-Computer Interface (ACI) to describe this precisely. The insight is that an LLM agent interacting with a computer needs a well-designed interface just as much as a human user does. Poor tool design, ambiguous tool descriptions, and output formats that waste tokens all degrade agent performance independently of how capable the underlying model is. On SWE-bench Verified, a benchmark of 2,294 real GitHub issues from well-maintained Python repositories, top systems with well-engineered scaffolding achieve 49-60% resolve rates as of early 2026. The same models in simpler frameworks sit around 18-22%. Scaffolding quality explains most of that performance spread.
The Loop
Every coding agent runs a variant of the pattern formalised by the ReAct paper (2022):
while not done:
action = llm.next_action(context) # returns a structured tool call
result = execute(action)
context.append(action, result)
The model never touches the filesystem directly. It emits a JSON object describing what it wants to do; the scaffolding dispatches that call, captures the output, and appends both the call and the result to the conversation. With Anthropic’s API, a completed tool call produces a stop_reason: "tool_use" and a tool_use content block; the scaffolding wraps the result as a tool_result block and makes the next API call. Each iteration adds at minimum two entries to conversation history, which means context fills up at a predictable rate throughout a session.
The ReAct insight worth keeping in mind: the model’s verbal reasoning before a tool call, the chain-of-thought trace where it says “I need to find where authentication is handled,” is not decoration. It improves the accuracy of the subsequent tool call. This is why Claude Code streams thinking in real time; the output you see is performing real work on the next action’s quality.
What Tool Design Controls
The ACI framing makes explicit something easy to miss: the description field in a tool’s JSON schema gets read at every inference call. It shapes model behaviour at every step of every session. “Read a file” and “Read a file at an absolute path; use offset and limit to read large files in sections; always re-read before editing” produce observably different agent behaviour over a 30-turn session.
The SWE-agent paper identified concrete properties of well-designed tools for code editing: file viewers should emit line numbers so the model can reference specific positions, search tools should return surrounding context rather than just file paths, edit tools should operate on line ranges, and shell execution should capture output even on failure, because the error message is what the model needs to self-correct.
Claude Code takes a different approach, relying heavily on direct bash execution and trusting a capable model in a real Unix environment rather than providing a tightly constrained ACI. Both approaches achieve strong results. The ACI approach is more structured and predictable; the bash-heavy approach is more flexible and handles edge cases without scaffolding changes.
Navigation: Four Strategies
How an agent finds the right code to change is one of the most consequential scaffolding decisions, and different tools make different bets here.
Iterative grep-and-glob is Claude Code’s default. Start from an anchor (an error message, a failing test, a file name), trace outward through imports and references. No infrastructure required, works on any language. The cost is roughly 15-20 tool calls before the model has enough context to edit confidently on a moderately complex codebase.
Repository maps are Aider’s approach. At session start, Aider uses tree-sitter to parse the entire codebase and extract function signatures, class definitions, and cross-file references, producing a compact text map typically 1,000-8,000 tokens that is injected into every prompt. The map is dynamically sized using a PageRank-style algorithm that prioritises recently mentioned or edited files as conversation history grows. The limitation is that tree-sitter gives syntax-level information only, not semantic, so it cannot distinguish a thin wrapper from a substantive implementation.
Embedding-based semantic search is what Cursor uses with its @codebase command. Source files are chunked, embedded using a text-embedding model, and stored in a vector index. Queries retrieve chunks by cosine similarity, which finds conceptually related code even when naming differs. The trade-offs are a synchronized vector index to maintain, false positives from domain-vocabulary overlap, and added retrieval latency.
LSP queries give exact go-to-definition and find-all-references results with no false positives. Claude Code exposes an LSP tool for typed languages; Cursor runs language servers continuously since the editor already requires them. LSP fails on mixed-language repos or non-standard setups, falling back to grep. For TypeScript and Go on a well-configured project, it is substantially more accurate than grep on shared identifiers.
Production agents are converging on combinations of these strategies. Each has different failure modes, and a fallback chain that degrades gracefully handles more real-world cases than any single approach.
Editing: Three Approaches
String replacement is Claude Code’s primary edit mechanism. The model emits an old_string and new_string; the scaffolding does a literal string search and replace. If old_string is not found, the tool returns an explicit error and the model retries with more surrounding context. If old_string appears more than once and replace_all is false, the tool rejects the call, demanding disambiguation. This is deterministic and auditable, but brittle: a single incorrect character breaks the match, and imprecise model recall of content read several turns ago causes failures.
SEARCH/REPLACE blocks with fuzzy matching is Aider’s approach. Scaffolding tries exact match first; on failure it falls back to Levenshtein distance via difflib.SequenceMatcher. This handles minor whitespace differences and slightly misremembered variable names. The cost is non-determinism: fuzzy matching can select the wrong location in files with repeated patterns. Aider benchmarks edit formats against different models at aider.chat/docs/leaderboards; the winning format varies by model, which is itself an ACI insight, tool design interacts with specific model capabilities rather than generalising cleanly.
The apply model architecture is what Cursor uses. The primary reasoning model describes the change at a high level; a separate smaller model generates the actual file edit. This keeps the reasoning model focused on logic rather than the mechanics of code generation.
One discipline applies to all three: the model should re-read the relevant section of the file immediately before generating an edit, not rely on content it read ten turns ago. Files change during a session as earlier edits land. An old_string built from stale memory produces near-misses that fail on match. Claude Code’s Read tool accepts offset and limit parameters specifically to make reading a narrow section of a large file cheap.
Context Window as Working Memory
The context window is the agent’s only working memory. System prompts for production agents consume 5,000-10,000 tokens before a task starts. Tool definitions and behavioral instructions add more. Every tool call adds at least two entries.
The lost-in-the-middle effect from Liu et al. (2023) has a direct operational implication: models perform measurably worse on information placed in the middle of long contexts compared to the beginning or end. CLAUDE.md and system prompts load at session start, at the highest-attention position. Constraints introduced mid-conversation as natural language land in the middle and degrade in reliability over long sessions.
When Claude Code’s context approaches its limit, it compacts: the LLM summarises the conversation, the session restarts with that summary, and CLAUDE.md is re-injected. Mid-session natural language constraints do not survive in their original form. This is a concrete argument for encoding constraints in CLAUDE.md or in PreToolUse hooks rather than in conversational instructions, because those two mechanisms survive compaction and session boundaries while conversational instructions do not.
The hook distinction is worth spelling out. A CLAUDE.md instruction is advisory; the model follows it most of the time, especially early in a session. A PreToolUse hook that exits non-zero is enforced regardless of the model’s reasoning:
#!/bin/bash
FILE=$(python3 -c "import sys,json; print(json.load(sys.stdin).get('file_path',''))")
if echo "$FILE" | grep -q '/migrations/'; then
echo 'Blocked: migrations directory requires explicit confirmation'
exit 1
fi
For hard constraints in autonomous or CI contexts, this is the reliable mechanism. Natural language prohibition in a system prompt will hold for twenty turns and then slip.
What This Means When Building
If you are building or evaluating a coding agent, the productive frame is not which model is best but what the scaffolding does well. Open-source agents like SWE-agent and Aider are roughly 20% model interaction and 80% scaffolding: process management, state tracking, tool routing, error propagation, cost tracking, and interruption handling.
The performance gap between naive bash access and a carefully designed ACI on SWE-bench is a scaffolding gap. The quality of your tool descriptions, the precision of your edit format, how faithfully you propagate error output back to the model, and the position of critical information in the context window are all engineering decisions with measurable performance consequences. They compound across a session in ways that individual failures make hard to attribute. That compounding is what separates a tool that resolves 20% of issues from one that resolves 55%.