
The Scaffolding Is the Product: How Coding Agents Actually Work

Source: simonwillison

Simon Willison recently published a thorough guide on agentic engineering patterns covering how coding agents work. It is worth reading in full. But there is one thing the overview leaves implicit that I want to make explicit: the LLM at the center of these systems is nearly a commodity. The scaffolding wrapped around it is where the actual engineering happens, and it accounts for most of the performance difference you see between agents.

SWE-bench Verified, the standard benchmark for evaluating coding agents on real GitHub issues, bears this out. Top models paired with good scaffolding hit around 49%. The same models in simpler frameworks land at 18-22%. That is not a small gap, and it is not explained by the model.

The Loop

Every coding agent runs the same fundamental cycle:

while not done:
    context = gather_context()        # files read, prior tool outputs
    action  = llm.next_action(context) # structured tool call JSON
    result  = execute(action)          # bash, file edit, grep, etc.
    context.append(action, result)     # grows the conversation history

The LLM never touches the filesystem or runs commands directly. It emits JSON describing what it wants to do, and the scaffolding executes that, captures the output, and feeds it back. This is the same architecture whether you are looking at Claude Code, Aider, or GitHub Copilot’s agent mode. The differences are entirely in how the scaffolding is built.
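
The dispatch step can be sketched in a few lines. This is a hedged illustration, not any agent's actual code: the tool names (`bash`, `read`) and the JSON envelope keys (`tool`, `input`) are hypothetical stand-ins for whatever schema a real scaffolding defines.

```python
import json
import subprocess

# Hypothetical tool handlers; a real agent registers many more.
def run_bash(args):
    proc = subprocess.run(args["command"], shell=True,
                          capture_output=True, text=True, timeout=120)
    return proc.stdout + proc.stderr

def read_file(args):
    with open(args["file_path"]) as f:
        return f.read()

HANDLERS = {"bash": run_bash, "read": read_file}

def execute(action_json):
    """Dispatch one structured tool call emitted by the LLM.
    Errors go back into context as text, so the LLM can retry."""
    action = json.loads(action_json)
    handler = HANDLERS.get(action["tool"])
    if handler is None:
        return f"error: unknown tool {action['tool']!r}"
    return handler(action["input"])
```

The key design point is that failures are returned as strings rather than raised: everything the LLM sees, including its own mistakes, arrives through the same context-append path.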

How Files Get Edited

There are three dominant strategies for how an agent applies code changes, and each involves real tradeoffs.

Full-file rewrite: The LLM outputs the entire file and the scaffolding writes it atomically. Simple to implement, but wasteful on tokens, and on large files the model may silently drop or alter code far from the intended change. Some agents keep this as a fallback when more surgical strategies fail.
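
Writing atomically matters because a crash mid-write would otherwise leave a corrupted file on disk. A minimal sketch of the write-temp-then-rename pattern (`write_file_atomically` is a hypothetical helper name):

```python
import os
import tempfile

def write_file_atomically(path, content):
    """Write the LLM's full-file output so readers never observe a
    half-written file: write to a temp file in the same directory,
    then rename over the target. os.replace is atomic on POSIX."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(content)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```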

String replacement: The LLM specifies an exact old_string and a new_string. The scaffolding finds the old text and replaces it. Claude Code’s Edit tool works this way:

{
  "file_path": "/absolute/path/to/file",
  "old_string": "exact text to find",
  "new_string": "replacement text",
  "replace_all": false
}

The scaffolding does a literal indexOf, replaces, and writes the file back. If old_string is not found, it returns an error so the LLM can retry. This approach is simple and precise, but it is fragile: if the LLM slightly misquotes the old string, the edit fails, and whitespace differences kill it. If the target string appears more than once in the file, the tool rejects the call entirely and requires the LLM to quote more surrounding context to disambiguate.
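
The scaffolding side of such a tool fits in a dozen lines. This is a sketch of the behavior described above, not Claude Code's actual implementation:

```python
def apply_edit(file_path, old_string, new_string, replace_all=False):
    """Apply a string-replacement edit: literal match, error on
    not-found, error on ambiguity unless replace_all is set."""
    with open(file_path) as f:
        content = f.read()
    count = content.count(old_string)
    if count == 0:
        return "error: old_string not found; re-read the file and retry"
    if count > 1 and not replace_all:
        return ("error: old_string occurs %d times; include more "
                "surrounding context, or set replace_all" % count)
    content = content.replace(old_string, new_string,
                              -1 if replace_all else 1)
    with open(file_path, "w") as f:
        f.write(content)
    return "ok"
```

Note that both failure modes come back as error strings rather than exceptions, so they flow into the LLM's context as ordinary tool results it can react to.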

Unified diffs and SEARCH/REPLACE blocks: Aider’s primary format uses custom markers:

<<<<<<< SEARCH
old code here
=======
new code here
>>>>>>> REPLACE

The scaffolding applies fuzzy matching if the exact string is not found, using Levenshtein distance to handle minor whitespace or indentation differences. This is more resilient than str-replace, but requires more complex scaffolding code and can apply incorrect edits if the fuzzy match picks the wrong location.
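
The fuzzy fallback can be sketched as follows. One hedge: the standard library has no Levenshtein function, so this sketch uses `difflib.SequenceMatcher` as a stand-in similarity measure; the exact-first-then-windowed-fuzzy structure is the part that matters.

```python
import difflib

def apply_search_replace(content, search, replace, threshold=0.9):
    """Apply a SEARCH/REPLACE block: try an exact match first, then
    slide a window over the file's lines and take the best fuzzy match
    above a similarity threshold."""
    if search in content:
        return content.replace(search, replace, 1)
    lines = content.splitlines(keepends=True)
    n = len(search.splitlines())
    best_score, best_span = 0.0, None
    for i in range(len(lines) - n + 1):
        window = "".join(lines[i:i + n])
        score = difflib.SequenceMatcher(None, search, window).ratio()
        if score > best_score:
            best_score, best_span = score, (i, i + n)
    if best_span is None or best_score < threshold:
        raise ValueError("no sufficiently close match for SEARCH block")
    i, j = best_span
    return "".join(lines[:i]) + replace + "".join(lines[j:])
```

The threshold is the tradeoff knob: set it too low and the wrong-location failure mode described above becomes likely; set it too high and minor indentation drift makes edits fail just as str-replace does.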

None of these strategies is clearly dominant. The right choice depends on how reliably the LLM quotes file content and how much you are willing to complicate the scaffolding.

Codebase Navigation

When an agent starts a task on a codebase it has never seen, it has no index, no map, no knowledge of what is where. How it builds that knowledge is another major differentiator.

Grep and glob: Claude Code and, to some extent, Aider rely on the LLM issuing grep and glob tool calls to discover the codebase iteratively. The agent starts at a known point (the failing test, the error message, the file the user mentioned) and traces outward through imports and references. This works reliably for most tasks but can take many turns on a large codebase.

Repository map: Aider builds a compact representation of the entire repo at startup using tree-sitter to extract all function, class, and method signatures across every file. This map, typically 1-8k tokens, is included in every prompt. The LLM can see what exists without reading every file. Aider adjusts map size dynamically based on remaining context budget, shrinking it as conversation history grows.
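
As a single-language stand-in for that tree-sitter pass, here is a sketch that extracts signatures from one Python file using the standard `ast` module (a real repo map covers every language tree-sitter has a grammar for, and ranks entries by reference count):

```python
import ast

def map_python_file(path):
    """Extract function, method, and class signatures from one Python
    file: the per-file unit of the repo map described above."""
    with open(path) as f:
        tree = ast.parse(f.read())
    entries = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            entries.append(f"def {node.name}({args})")
        elif isinstance(node, ast.ClassDef):
            entries.append(f"class {node.name}")
    return entries
```

Concatenating these entries across every file, prefixed by file path, yields the few-thousand-token map the LLM sees in each prompt.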

Embedding search: Cursor and GitHub Copilot chunk source files, embed them using a text-embedding model, and do cosine similarity search at query time. The top-K relevant chunks go into context. This finds conceptually related code even when the naming differs from what you searched for. The downside is infrastructure overhead and occasional false positives.
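
The retrieval step reduces to chunking plus cosine similarity over embedding vectors. In this sketch the vectors are plain lists of floats; a real system would obtain them from an embedding model API:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def chunk(text, max_lines=20):
    """Split a source file into fixed-size line chunks for embedding.
    Real systems often chunk on syntactic boundaries instead."""
    lines = text.splitlines()
    return ["\n".join(lines[i:i + max_lines])
            for i in range(0, len(lines), max_lines)]

def top_k(query_vec, chunk_vecs, k=3):
    """Return indices of the k chunks most similar to the query."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]
```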

LSP integration: Language Server Protocol queries give agents go-to-definition, find-references, and type information without regex search. Claude Code exposes an LSP tool for this. For typed languages, this is far more precise than grep, and it avoids the false positives you get from searching for common identifiers.

Context Is the Hard Problem

Conversation history grows with every tool call and result. A bash command that dumps 10,000 lines of test output fills context fast. Managing this is where scaffolding design gets genuinely difficult.

The strategies in use:

  • Output truncation: bash output is cut at a character limit (Claude Code defaults to roughly 30k characters per result). A truncation marker is inserted so the LLM knows it is seeing a partial result.
  • Selective reading: instead of reading entire files, agents use head/tail equivalents or line-offset parameters. Claude Code’s Read tool accepts offset and limit to read a slice of a file.
  • Conversation compaction: when the context approaches the model’s limit, Claude Code triggers a compaction step. It calls the LLM to summarize the conversation so far, then restarts with that summary standing in for the earlier history. The task continues, but earlier tool results are collapsed into prose.
  • Token budgeting: Aider tracks the running token count and reduces the repo map size as context fills, trading codebase visibility for task history.
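
The first strategy on that list is the simplest to show. A sketch of head-plus-tail truncation with a marker, using the rough 30k-character cap mentioned above:

```python
MAX_OUTPUT_CHARS = 30_000  # the rough per-result cap cited above

def truncate_output(output, limit=MAX_OUTPUT_CHARS):
    """Cut oversized tool output, keeping the head and tail and
    inserting a marker so the LLM knows the result is partial."""
    if len(output) <= limit:
        return output
    keep = limit // 2
    omitted = len(output) - 2 * keep
    return (output[:keep]
            + f"\n... [{omitted} characters truncated] ...\n"
            + output[-keep:])
```

Keeping both ends is a deliberate choice: test runners and compilers tend to put the summary (pass/fail counts, the final error) at the very end of their output.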

The tension here is real. More context means the LLM has better information. Less context means the conversation can continue longer before compaction or truncation. Every scaffolding makes different bets about where that line should be.

Shell Execution Models

When an agent runs bash commands, there are two approaches.

Persistent shell: a single bash process lives for the session. Working directory, environment variables, and shell state persist between calls. Claude Code uses this model. The LLM can cd into a directory and subsequent commands run there. The risk is state pollution between unrelated commands, and commands that produce interactive prompts will hang, so the scaffolding enforces timeouts.

Per-command subshell: each command runs in a fresh subprocess. Working directory must be re-established each time. Simpler to implement and easier to reason about, but the LLM cannot build up shell state across calls.
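
The per-command model is a few lines of `subprocess`, and it also shows where the timeout enforcement mentioned above lives. A hedged sketch (`run_command` is a hypothetical helper, not any agent's API):

```python
import subprocess

def run_command(command, cwd, timeout=120):
    """Per-command subshell: each call gets a fresh process, so the
    working directory must be passed explicitly every time. The
    timeout guards against commands that block on interactive input."""
    try:
        proc = subprocess.run(command, shell=True, cwd=cwd,
                              capture_output=True, text=True,
                              timeout=timeout)
    except subprocess.TimeoutExpired:
        return "error: command timed out after %ds" % timeout
    return proc.stdout + proc.stderr
```

A persistent shell needs considerably more machinery than this: a long-lived process, sentinel markers to detect command completion, and cleanup logic for the state-pollution cases described above.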

Security is handled at the scaffolding level, not by the LLM. Claude Code’s approval layer flags commands like rm -rf or git push --force for explicit user confirmation before execution. The LLM cannot bypass this by issuing a sufficiently confident tool call. This is the right place for that check.
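
A minimal sketch of that scaffolding-level check. The patterns here are illustrative, not any actual agent's blocklist, and a real approval layer pairs them with a user-editable allowlist:

```python
import re

# Hypothetical danger patterns for demonstration only.
DANGEROUS = [
    r"\brm\s+(-[a-zA-Z]*r[a-zA-Z]*f|-[a-zA-Z]*f[a-zA-Z]*r)\b",
    r"\bgit\s+push\s+.*--force\b",
    r"\bchmod\s+-R\s+777\b",
]

def needs_approval(command):
    """Flag destructive commands for explicit user confirmation
    before they ever reach a shell."""
    return any(re.search(p, command) for p in DANGEROUS)
```

The check runs on the scaffolding's side of the boundary, between the tool call and the shell, which is why no amount of model confidence can route around it.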

The Scaffolding Insight

When you pick a coding agent, you are mostly picking a scaffolding. The model inside can often be swapped. Aider supports dozens of models through litellm. Claude Code can target different Claude model variants. What does not change is the edit strategy, the context management approach, the navigation toolset, and the shell execution model.

This is worth understanding before evaluating agents on benchmark numbers alone. A benchmark result reflects a specific scaffolding paired with a specific model. When Anthropic reports Claude 3.5 Sonnet at 49% on SWE-bench Verified, that number includes their scaffolding choices. The same model in a different scaffolding would land somewhere else.

The practical consequence for anyone building on top of these systems: if an agent is failing at a class of tasks, the first question is whether the issue is in the model’s reasoning or in the scaffolding’s tool design. Often it is the latter, and that is fixable without changing the model at all.
