The Scaffolding Is the Product: What Actually Happens Inside a Coding Agent
Source: simonwillison
The basic idea behind a coding agent is not complicated. Give a language model access to tools, let it call them, read the output, and keep going until the task is done. Simon Willison’s guide on agentic engineering patterns lays out this core loop clearly. What the guide wisely emphasizes is that the loop itself is nearly trivial to implement. Everything interesting is in what surrounds it.
The Loop Itself
The foundation is what the 2022 ReAct paper (Reason + Act) formalized: interleave reasoning traces with action execution. The model thinks about what to do, calls a tool, receives the result, thinks again, and continues. In practice, modern coding agents don’t use free-form text parsing to extract tool calls. They rely on native function-calling APIs, where the model returns a structured object specifying which tool to invoke and with what arguments.
{
  "type": "tool_use",
  "name": "read_file",
  "input": {
    "path": "src/parser/lexer.py"
  }
}
The scaffolding code executes the tool, captures stdout, stderr, and exit codes, then appends the result back into the conversation as a tool result message. The model sees a growing transcript of actions and observations. At each step it decides whether to keep working or declare the task complete.
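That loop can be sketched in a few dozen lines of Python. This is a hedged illustration, not any particular product's implementation: `call_model` is a stub standing in for a real LLM API call, and the transcript format is deliberately simplified.

```python
# Minimal sketch of the agent loop: model proposes an action, scaffolding
# executes it, the observation goes back into the transcript, repeat.

import subprocess

def run_bash(command: str) -> str:
    """Execute a shell command, capturing stdout, stderr, and the exit code."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return f"exit={result.returncode}\nstdout={result.stdout}\nstderr={result.stderr}"

TOOLS = {"bash": run_bash}

def call_model(transcript):
    # Stub: a real implementation would send the transcript to an LLM API
    # and parse its structured tool-call response.
    if not any(m["role"] == "tool_result" for m in transcript):
        return {"type": "tool_use", "name": "bash", "input": {"command": "echo hello"}}
    return {"type": "done", "text": "Task complete."}

def agent_loop(task: str, max_steps: int = 10) -> str:
    transcript = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = call_model(transcript)
        if action["type"] == "done":
            return action["text"]
        # Execute the requested tool and append the observation.
        output = TOOLS[action["name"]](**action["input"])
        transcript.append({"role": "assistant", "content": action})
        transcript.append({"role": "tool_result", "content": output})
    return "Step budget exhausted."
```

Note the `max_steps` cap: even this toy version needs a budget, because nothing else guarantees the loop terminates.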
At the API level, this is almost embarrassingly simple. The complexity is everything else.
Tool Design Is Architecture
Which tools you expose to the model, and how you describe them, shapes agent behavior as much as the underlying model does. SWE-agent, the research system from Princeton that helped define how agents approach real GitHub issues, introduced the concept of an Agent-Computer Interface (ACI) as a deliberate design layer. The analogy to HCI is intentional: just as human-computer interfaces need affordances matched to human cognition, agent-computer interfaces need affordances matched to model cognition.
Concretely, this means:
- File viewing tools that show line numbers, so the model can reference specific locations in subsequent edits
- Search tools that return surrounding context rather than just file paths
- Edit tools that operate on line ranges rather than requiring full file rewrites
- Shell execution that captures output even when commands fail, since the error message is usually what the model needs
SWE-agent found that seemingly minor differences in tool design produced large swings in performance on SWE-bench, the benchmark derived from 2294 real GitHub issues. Their ACI-optimized tooling improved resolve rates by several percentage points over naive bash access.
Claude Code takes a different approach: it leans heavily on direct bash execution rather than specialized tools, trusting that a capable model in a real Unix environment can figure out how to navigate a codebase. Both philosophies work. The bash-heavy approach is simpler to build and more flexible; the ACI approach gives more structured affordances and more predictable navigation patterns. The tradeoff is between flexibility and consistency.
What both approaches share is intentionality. Tool schemas are documentation the model reads at inference time. A vaguely described tool produces vaguely correct behavior. A well-described tool with clear parameter semantics and explicit descriptions of what the output format will look like consistently outperforms an equivalent tool with sloppy documentation.
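As an illustration of what "tool schemas are documentation" means in practice, here is a hypothetical `read_file` definition in the style of JSON-Schema-based function-calling APIs. The field names and descriptions are illustrative, not any vendor's exact format; the point is that output format and failure behavior are spelled out where the model will read them.

```python
# A well-described tool: the model learns the output format and the error
# convention from the schema itself, before it ever calls the tool.

read_file_tool = {
    "name": "read_file",
    "description": (
        "Read a file and return its contents with 1-based line numbers, "
        "one line per row, in the form 'N: text'. Returns a string "
        "starting with 'ERROR:' if the path does not exist."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "path": {
                "type": "string",
                "description": "Path relative to the repository root.",
            },
            "start_line": {
                "type": "integer",
                "description": "First line to return (default 1).",
            },
        },
        "required": ["path"],
    },
}
```

A sloppy version of the same tool ("Reads a file") leaves the model to guess whether output is numbered, how errors surface, and what the path is relative to — and it will guess inconsistently.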
The Context Window Is the Real Bottleneck
A fresh checkout of a moderately sized project might have 100,000 lines of code. Even the largest context windows available today cannot hold all of that alongside conversation history and tool outputs. Coding agents have to solve a retrieval problem before they can solve the coding problem.
The strategies break down into a few categories.
Selective reading. The agent reads directory listings, finds relevant files through search, and loads only what it needs. This works well for well-structured codebases with clear module boundaries. It breaks down on tangled code where understanding one file requires understanding five others.
Semantic search. Some systems embed the codebase and retrieve chunks by similarity to the current task description. This helps when relevant code is not in an obvious location. The downside is that embedding-based retrieval struggles with syntactic specifics: a search for “user authentication” might miss a function called validate_credentials.
Tree-sitter parsing. Using a real parser to extract function signatures, class definitions, and import graphs lets the agent build a structural map of the codebase without reading every line. The agent can then navigate to relevant definitions on demand. This is how Aider implements its repository mapping feature, using tree-sitter grammars to produce a compact representation of code structure that fits in the context window.
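Tree-sitter itself requires per-language grammars, but the underlying idea can be sketched with Python's standard-library `ast` module: keep signatures and definitions, discard bodies. This is a toy illustration of the repo-map concept, not Aider's actual implementation.

```python
# Toy structural map: extract one-line summaries of top-level definitions
# so the agent can navigate to code on demand without reading every line.

import ast

def repo_map_entry(source: str, filename: str) -> list[str]:
    """Return compact one-line summaries of the definitions in one file."""
    tree = ast.parse(source)
    lines = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            lines.append(f"{filename}:{node.lineno} def {node.name}({args})")
        elif isinstance(node, ast.ClassDef):
            lines.append(f"{filename}:{node.lineno} class {node.name}")
    return lines
```

A thousand-line module collapses to a handful of `file:line signature` entries, which is the kind of compression that lets a structural map of a whole repository fit in context.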
Conversation pruning. As the conversation grows, old tool outputs get summarized or dropped. The model loses access to things it read earlier, which can cause it to re-read files or forget decisions it made several turns ago. Managing this gracefully is one of the harder engineering problems in building reliable agents. A naive implementation will prune aggressively and cause the agent to loop; a conservative one will hit context limits on long tasks.
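One hedged sketch of a pruning policy: keep the most recent turns intact and collapse older, oversized tool outputs to a stub that still tells the model the action happened and can be repeated. The message format and thresholds here are illustrative assumptions, not any system's actual strategy.

```python
# Collapse old tool outputs to stubs instead of dropping them outright,
# so the model retains a record of what it already did.

def prune_transcript(messages, keep_recent=6, stub_limit=120):
    pruned = []
    for i, msg in enumerate(messages):
        old = i < len(messages) - keep_recent
        if old and msg["role"] == "tool_result" and len(msg["content"]) > stub_limit:
            pruned.append({
                "role": "tool_result",
                "content": msg["content"][:stub_limit] + " …[truncated; re-read if needed]",
            })
        else:
            pruned.append(msg)
    return pruned
```

The "re-read if needed" marker matters: dropping an observation silently is how agents end up re-deriving decisions they already made.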
The “lost in the middle” research from Stanford and UC Berkeley is relevant here. LLMs perform measurably worse on information placed in the middle of long contexts compared to the beginning or end. This means that how tool results are ordered in the conversation matters, not just whether they are present. Critical information should appear near the most recent turn, not buried in the middle of a long transcript.
Error Recovery as a First-Class Concern
A coding agent that can only succeed when every step works is not useful. Real codebases have tests that fail for reasons unrelated to the change, linters that complain about style, and build systems that behave differently across environments.
The agent loop handles simple errors naturally: when a bash command returns a non-zero exit code, the output goes back into the context as an observation, and the model can try a different approach. What matters is whether the model correctly distinguishes a recoverable error from an unrecoverable one, and whether it recognizes when to stop trying.
Unconstrained retry loops are a real failure mode. An agent that keeps attempting variations on the same wrong approach can burn through tokens and time without making progress. Systems like Claude Code address this by keeping humans in the loop for certain categories of decisions. When the agent is uncertain, it asks. This represents a genuine design choice between full autonomy and interactive autonomy. The systems that work best in practice tend toward the interactive end, at least for current model capabilities.
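A cheap guard against this failure mode is to count near-identical failed attempts and escalate instead of retrying forever. A minimal sketch — the action-signature scheme and threshold are assumptions, not how any specific product implements this:

```python
# Track repeated failures of the same action and escalate to the human
# once the agent keeps hitting the same wall.

from collections import Counter

class RetryGuard:
    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self.failures = Counter()

    def record_failure(self, action_signature: str) -> str:
        """Return 'retry' while under budget, 'ask_human' once the same
        action has failed max_repeats times."""
        self.failures[action_signature] += 1
        if self.failures[action_signature] >= self.max_repeats:
            return "ask_human"
        return "retry"
```

In practice the signature might be a normalized command string or a hash of the proposed diff; the point is only that "same failure, again" becomes a signal the scaffolding can act on.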
The research framing here is the distinction between oracle agents, which have access to some correct-answer signal during evaluation, and deployed agents, which must decide on their own when the task is done. SWE-bench measures performance with the oracle available (a test suite that passes or fails). Real deployments must handle the messier case where success criteria are implicit and sometimes contradictory.
What the Scaffolding Actually Does
The model is one component of a coding agent. The scaffolding around it handles:
- Process management. Running bash commands, managing working directories, handling timeouts for commands that hang, killing processes that produce too much output.
- State management. Tracking which files have been modified, maintaining a diff of changes made so far, optionally managing git operations like staging and committing intermediate work.
- Tool routing. Parsing tool call responses, dispatching to the right handler, formatting results back into the conversation in a way the model expects.
- Interruption handling. Letting the user interrupt a long-running agent, inspect the current state, and decide whether to continue or redirect. This is non-trivial to implement correctly without either losing context or presenting a confusing mid-execution state to the user.
- Cost tracking. Counting tokens, estimating costs, enforcing budgets to prevent runaway executions.
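The process-management item can be made concrete with a short sketch: run a command under a timeout, capture output even on failure, and cap its size. The limits and return shape here are illustrative choices, not a prescribed interface.

```python
# Run a command safely for an agent: enforce a timeout, keep output on
# failure (the error is usually what the model needs), and cap output
# size so a chatty command cannot flood the context window.

import subprocess

def run_command(cmd: list[str], timeout: int = 60, max_bytes: int = 10_000) -> dict:
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return {"exit_code": None, "output": f"ERROR: timed out after {timeout}s"}
    combined = (proc.stdout + proc.stderr)[:max_bytes]
    return {"exit_code": proc.returncode, "output": combined.decode(errors="replace")}
```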
Open source systems like SWE-agent and Aider make this scaffolding visible. Looking at their source shows how much engineering sits between “the model has tools” and “the agent works reliably.” The code that actually talks to the model is maybe 20% of the total. The scaffolding is the rest, and it determines the user experience far more than the model itself does.
This is worth sitting with. When a coding agent produces a wrong answer or takes a confusing action, the instinct is to blame the model. Often the problem is the scaffolding: a tool whose output format the model misread, a context pruning strategy that dropped a critical observation, an error message that was swallowed rather than returned. Model improvements help, but they do not fix scaffolding bugs.
Where the Real Limits Are
Model capability is obviously a ceiling. Agents built on weaker models fail more often, make more logical errors in multi-step reasoning, and produce lower-quality code. The SWE-bench Verified leaderboard makes this concrete: the gap between frontier models and models from two years prior is substantial. As of early 2026, top systems resolve somewhere between 50 and 60 percent of benchmark issues, which sounds impressive until you note that these are curated, well-specified issues from well-maintained Python repositories with comprehensive test suites.
Real software projects are messier. Issues are underspecified. Codebases have unusual conventions that are not written down anywhere. Test suites are flaky. An agent that performs well on SWE-bench still needs meaningful supervision on production code.
The more fundamental constraint is that coding agents operate by reading and writing text. They do not execute code to understand it; they read it. When behavior is difficult to infer from static analysis, which is common, the agent has to use execution as a probe: running a minimal test case to observe actual behavior before writing a fix. The best systems employ this deliberately. Less sophisticated ones skip the observation step and guess, which produces code that looks plausible and is wrong in ways that only manifest at runtime.
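The execution-as-probe pattern can be sketched as a small helper that runs a snippet and reports what actually happened, which the agent reads before attempting a fix. A hedged Python illustration, assuming the agent can execute arbitrary snippets in a sandbox:

```python
# Before editing, observe actual behavior: run a minimal snippet and
# report whether it raised (and what) or returned normally.

import subprocess
import sys

def probe(snippet: str, timeout: int = 10) -> str:
    """Run a small Python snippet and return its observed behavior."""
    proc = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True, text=True, timeout=timeout,
    )
    if proc.returncode != 0:
        return f"raised: {proc.stderr.strip().splitlines()[-1]}"
    return f"returned normally: {proc.stdout.strip()}"
```

An agent debugging the hypothetical expired-token bug could probe the failing call first, confirm it raises `KeyError` rather than returning `None`, and only then write the fix — instead of guessing from the source text.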
What This Means for Building With Agents
If you are building on top of coding agent infrastructure rather than building the infrastructure itself, the practical implications are clear.
Task specification quality matters more than almost anything else. An agent given a vague task will make assumptions, and some will be wrong. “Fix the login bug” produces something different from “The authenticate_user function in auth/session.py raises a KeyError when the session token is expired rather than returning None. Fix it to match the return type in the docstring.” The second form constrains the search space and tells the agent exactly where to look.
Smaller, well-scoped tasks succeed more reliably than large open-ended ones. The agent loop accumulates context; a task that requires 20 tool calls will have a much more polluted context window than one that requires 5. Longer tasks also have more surface area for errors to compound.
The choice of which tools to expose affects which strategies the agent can pursue. A sandboxed environment where the agent cannot run arbitrary shell commands will be safer but less capable. That tradeoff is real and worth making deliberately, not by accident.
Simon Willison’s framing of coding agents as a tool-use loop is the right mental model to start with. The interesting questions are all about what fills in around that loop: which tools, how the context is managed, how errors are handled, and how much human oversight stays in the picture. Those choices determine whether an agent is a useful collaborator or an expensive way to generate plausible-looking wrong answers.