The Scaffolding Is Most of the Agent

Simon Willison published a thorough guide on how coding agents work that covers the full stack from tool schemas to context compaction. The core insight buried in the technical details is worth stating plainly: in a coding agent, the model is roughly 20% of the code. The scaffolding around it is everything else.

That scaffolding is doing far more work than it appears to from the outside.

The Loop

Every coding agent, regardless of vendor, runs a variation of the same while-loop:

while not done:
    action = llm.next_action(context)   # structured tool call JSON
    result = execute(action)            # bash, file read, grep, etc.
    context.append(action, result)      # grows conversation history

The model never touches the filesystem directly. It emits structured JSON describing what it wants, the scaffolding executes the operation, captures the output, and feeds it back. The loop exits when the model returns a response with no tool calls attached. Every iteration appends at minimum two entries to the conversation history: the model’s tool call, and the tool’s result. After twenty iterations, you have forty additional messages sitting in context on top of everything else.

This architecture comes from the ReAct paper (2022), which interleaved verbal reasoning with concrete actions. The model generates a chain of thought, calls a tool, reads the result, thinks again. Modern coding agents inherit this implicitly: the model’s prose output before a tool call is its reasoning trace, which actively shapes the quality of what it does next. The streaming output you see while an agent works is not just for user reassurance; it matters to the model too.

How Agents Edit Files

File editing is where scaffolding complexity concentrates. “Write a file” sounds trivial, but the engineering choices here determine whether an agent is reliable or frustrating.

Full-file rewrite is the simplest approach. The model outputs the entire file contents, the scaffolding writes it. It fails on files longer than a few hundred lines because the model starts hallucinating unchanged sections, and token costs scale directly with file size. It works for new file creation and short files.

String replacement is Claude Code’s primary strategy. The model provides an exact old_string and a new_string; the scaffolding does a literal find-and-replace:

{
  "file_path": "/absolute/path/to/file.py",
  "old_string": "def calculate_total(items):\n    return sum(item.price for item in items)",
  "new_string": "def calculate_total(items, tax_rate=0.0):\n    subtotal = sum(item.price for item in items)\n    return subtotal * (1 + tax_rate)"
}

If old_string is not found, the tool returns an explicit error and the model retries. If it appears more than once and replace_all is false, the call is rejected. This is deterministic and precise, but brittle: one wrong character in old_string breaks it. A model that reconstructs the target text from memory rather than re-reading the file will produce slightly wrong values, and the tool will refuse. The correct discipline is always read before edit, every time, even if the model believes it already knows the contents.

SEARCH/REPLACE blocks are Aider’s approach, using a fenced format that the scaffolding parses:

<<<<<<< SEARCH
def old_function(x):
    return x * 2
=======
def old_function(x, multiplier=2):
    return x * multiplier
>>>>>>> REPLACE

Aider tries exact match first, then falls back to Python’s difflib.SequenceMatcher for fuzzy matching. It also supports unified diffs and whole-file rewrites, and tracks which format is producing valid edits, switching automatically when one starts failing. This is a meaningful layer of defensive engineering.

Unified diffs map naturally to developer mental models and work with standard tools like patch. The problem is that models hallucinate line numbers at a non-trivial rate and get context lines slightly wrong. Aider has published benchmarks showing significant variation by model family on diff quality.

Dual-model apply (Cursor’s “Instant Apply”) separates reasoning from generation: the primary model produces a high-level description of the change, and a smaller, faster specialized model generates the actual file edit. The reasoning model stays focused; the apply model handles the mechanical accuracy requirement.

None of these strategies is universally correct. Each trades off simplicity, token efficiency, reliability across file sizes, and robustness to model drift.

Context Is the Central Constraint

Every architectural decision in a coding agent is downstream of the context window. It is the agent’s only working memory. Whatever is not in context does not exist for that agent at that moment.

A 200,000-token window sounds generous until you run the arithmetic. A system prompt for a production agent consumes 5,000-10,000 tokens before any task starts. A single medium-sized source file is 2,000-5,000 tokens. Ten file reads uses a quarter of the budget. A verbose dependency tree or long test run can dump 20,000-50,000 tokens in a single tool result.

Agents handle this through several mechanisms. Output truncation cuts bash results at a character limit, inserting a marker so the model knows it saw a partial result. The design choice of where to truncate matters more than it sounds: test runners typically put failure details at the end. An agent that truncates from the end discards the most useful signal.

Selective reading helps: Claude Code’s Read tool accepts offset and limit parameters. Reading a 3,000-line file to edit line 2,847 wastes context; reading lines 2,830-2,870 is precise. Aider tracks running token count and shrinks its repository map as context fills, trading codebase visibility for task history. The Anthropic API’s prompt caching marks stable context prefixes as cacheable, reducing both cost and latency across turns.

When context approaches the limit entirely, Claude Code triggers compaction: the LLM summarizes the conversation so far, and the session restarts with that summary as the new system prompt. CLAUDE.md content is re-injected afterward. But conversational constraints stated mid-session may not survive compaction in precise form.

This connects to the lost-in-the-middle effect (Liu et al., 2023): models with long contexts perform reliably when relevant documents are at the beginning or end of the context window and substantially worse when they are in the middle. Instructions placed in CLAUDE.md at session start occupy the highest-reliability position. A constraint stated mid-session lands in the middle of a growing context and receives less reliable attention as the session progresses.

Before a coding agent can change anything, it has to find the right code. Navigation strategy is one of the largest architectural differentiators between agents.

The iterative grep-and-glob approach, used by Claude Code by default, starts from an anchor (a failing test, an error message, a filename) and traces outward through imports and references. Fast, no indexing, language-agnostic. On SWE-bench, top systems average 20-30 tool calls per resolved issue, with a significant fraction being navigation rather than editing.

Aider builds a repository map using tree-sitter to parse every file and extract function names, class definitions, and method signatures with file paths and line numbers. For a 50,000-line codebase this typically produces 1,000-8,000 tokens, and Aider dynamically resizes it as conversation history grows using a PageRank-style relevance algorithm that prioritizes recently-touched files.

Cursor integrates Language Server Protocol queries directly: go-to-definition, find-all-references, and type information from tsserver, rust-analyzer, or pylsp. LSP results are exact with no false positives from string matching. Claude Code exposes this as an explicit tool too. The limitation is that it requires a working, configured language server, which is not always the case.

Embedding-based semantic search (Cursor’s @codebase, GitHub Copilot) chunks files, embeds them, and retrieves top-K results by cosine similarity. This finds conceptually related code even when naming conventions differ, at the cost of index infrastructure and freshness concerns.

The SWE-Bench Number

SWE-bench Verified benchmarks agents against 2,294 real GitHub issues from well-maintained Python repositories. Top agents with good scaffolding hit 49-60%. The same models in simpler frameworks land at 18-22%. The Princeton SWE-agent paper introduced the “Agent-Computer Interface” concept and quantified this: changes to tool descriptions and output formatting produced large performance swings independent of model selection.

Tool description wording matters because the model reads it to decide how and when to call the tool. A description that says “Read file contents” and one that says “Read the contents of a file at a given path. Always read a file before editing it. Use offset and limit parameters for large files” produce measurably different behavior from the same underlying model.

The scaffolding handles process management, state tracking, tool routing, interruption handling, cost tracking, and context compaction. The model decides what to do next. Both matter, but the gap between good and bad scaffolding is larger than the gap between good and great models, at least for code tasks on current benchmarks.

Building your own coding agent is worth doing for the understanding it forces. The loop is simple to implement. The engineering that makes it reliable over long sessions, against large codebases, with recoverable errors and predictable context usage, is where the real work is.

The Scaffolding Is Most of the Agent

The Loop

How Agents Edit Files

Context Is the Central Constraint

Codebase Navigation

The SWE-Bench Number