The Scaffolding Is the Product: What Building a Coding Agent Actually Requires
Source: simonwillison
The loop is about 40 lines of Python. Simon Willison’s guide to how coding agents work makes this clear, and it is worth sitting with for a moment, because it changes how you think about what you are actually building when you build a coding agent.
Call the model, check if it returned a tool call, execute the tool, append the result, repeat. That structure is the entirety of the agentic loop. Everything that makes a coding agent useful, or not, lives in what you write around that loop: the tools, the descriptions of those tools, the system prompt, the context management strategy, the permission controls, and the stopping conditions. The misleading thing about the “coding agents are just tool loops” framing is that it suggests the hard part is mechanical. It is not. The hard part is closer to API design than systems programming.
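That loop can be sketched in a few lines. Everything here is illustrative: `client.create` stands in for whatever LLM API you use, and `tool_impls` is a plain dict mapping tool names to Python callables; the message shape follows Anthropic's tool-use format, but any API that returns structured tool calls fits.

```python
# A minimal sketch of the agentic loop: call the model, execute any tool
# calls, append the results, repeat until the model stops asking for tools.

def run_agent(client, tool_impls, messages, max_turns=20):
    for _ in range(max_turns):
        response = client.create(messages=messages, tools=tool_impls)
        tool_calls = [b for b in response["content"] if b["type"] == "tool_use"]
        if not tool_calls:                    # no tool calls: the model is done
            return response
        messages.append({"role": "assistant", "content": response["content"]})
        results = []
        for call in tool_calls:
            output = tool_impls[call["name"]](**call["input"])   # execute the tool
            results.append({"type": "tool_result",
                            "tool_use_id": call["id"],
                            "content": output})
        messages.append({"role": "user", "content": results})
    raise RuntimeError("agent exceeded max_turns without finishing")
```

The entire rest of this piece is about what you hang off that skeleton.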
Tool Descriptions Are the Interface
When you define a tool for an LLM agent, the description field is not documentation. It is code. It specifies what the model will call the tool for, when it will prefer it over alternatives, and what assumptions it will make about the output format.
Here is what a weak tool description looks like:
{
  "name": "search_code",
  "description": "Search for text in code files"
}
And a stronger one:
{
  "name": "search_code",
  "description": "Search file contents using ripgrep. Use this to find function definitions, class names, or any specific string across the codebase. Prefer this over read_file when you don't know which file contains the information you need."
}
The first description produces a model that sometimes searches when it should read, and sometimes reaches for a shell grep command instead of the dedicated tool. The second gives the model a clear mental model of when and why to use this capability.
This is API design. You are designing an interface for a consumer that will call your API based entirely on the contract you describe in natural language. Getting this wrong produces subtle misbehavior that looks, from the outside, like the model making bad decisions. Usually it is not; the interface was unclear. This is one of the most underappreciated aspects of building a reliable coding agent, because the failure mode is not a crash or an exception, it is the model doing something slightly adjacent to what you wanted, consistently.
Anthropic’s tool use documentation shows the mechanics of how tool definitions reach the model, but the documentation cannot tell you how to write a description that produces reliable behavior across a range of inputs. That judgment comes from iteration.
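For concreteness, here is roughly what the stronger definition above looks like by the time it reaches the API: the description carries the behavioral contract, and a JSON Schema constrains the arguments the model may supply. The field names follow Anthropic's tool-use format; the `pattern` and `path` parameters are illustrative choices, not a fixed spec.

```python
# A complete tool definition: natural-language contract plus argument schema.
search_code_tool = {
    "name": "search_code",
    "description": (
        "Search file contents using ripgrep. Use this to find function "
        "definitions, class names, or any specific string across the "
        "codebase. Prefer this over read_file when you don't know which "
        "file contains the information you need."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "pattern": {"type": "string",
                        "description": "Regex to search for"},
            "path": {"type": "string",
                     "description": "Directory to search, relative to repo root"},
        },
        "required": ["pattern"],
    },
}
```

Note that the parameter descriptions are part of the same interface: they shape what the model passes in, just as the top-level description shapes when it calls the tool at all.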
The Stopping Condition Is a Product Decision
The loop terminates when the model produces a response with no tool calls. This works for the majority of tasks. The edge cases include: a model that emits a response with no tool calls and no completed output; a model that loops indefinitely on a failing test, trying the same fix with minor variations; a model that hits a dead end, has nothing useful to say, and produces a vague acknowledgment rather than asking for help.
Every production coding agent adds explicit stopping constraints on top of the model’s natural stopping behavior: a maximum number of turns, a token budget, a structural check on the output. Without these, an agent that gets stuck will terminate confusingly or loop until it exhausts the context window.
The decision reveals your assumptions about what the agent is for. Aider caps per-session token usage and surfaces it directly in the terminal, treating the agent as a cost-bounded tool and making the user responsible for staying within budget. Claude Code tracks context window usage and warns the user when approaching limits, treating context as a resource to manage rather than a hard budget to enforce. These are different philosophies about user control.
The maximum-iterations cap is the other common safeguard. If the agent has run 30 tool calls without producing a final response, something has gone wrong. Whether you stop it automatically or surface the state for human review is a design choice, and getting it wrong produces either an agent that silently loops on failing tests or one that interrupts you every time it needs more than a few steps. Neither is obviously correct; the right answer depends on how autonomous you want the agent to be and how much you trust it on your specific codebase.
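The constraints described above compose naturally into a small guard object consulted on every turn. This is a sketch, not any particular agent's implementation; the limits and the identical-call repeat check are illustrative defaults.

```python
# Explicit stopping constraints layered on the model's natural stopping
# behavior: turn cap, token budget, and a check for looping on the same
# tool call with the same arguments.

class StopGuard:
    def __init__(self, max_turns=30, token_budget=200_000):
        self.max_turns = max_turns
        self.token_budget = token_budget
        self.turns = 0
        self.tokens = 0
        self.last_call = None
        self.repeats = 0

    def check(self, tool_call, tokens_used):
        """Return a stop reason, or None to keep going."""
        self.turns += 1
        self.tokens += tokens_used
        if self.turns > self.max_turns:
            return "stop: turn limit reached"
        if self.tokens > self.token_budget:
            return "stop: token budget exhausted"
        if tool_call == self.last_call:          # same tool, same arguments
            self.repeats += 1
            if self.repeats >= 3:
                return "stop: looping on identical tool call"
        else:
            self.last_call, self.repeats = tool_call, 0
        return None
```

Whether a non-None result halts the agent outright or pauses it for human review is exactly the design choice described above.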
The System Prompt Is the Constitution
The system prompt is not configuration. It is the foundational specification for how the agent behaves in every ambiguous situation. When the model faces a question it was not explicitly prepared for, should it modify a file it was not asked about? Should it run a command with potential side effects? Should it ask for clarification or make a reasonable assumption? It defaults to whatever the system prompt says about how to handle these cases. A vague or incomplete system prompt produces inconsistent behavior that is hard to debug because the same instruction produces different results depending on the surrounding context.
Claude Code uses a CLAUDE.md file to inject project-specific context into the system prompt: architecture notes, conventions, constraints, file organization. The mechanism is a structural solution to the session-statefulness problem. An agent’s conversation history disappears at the end of a session. The CLAUDE.md persists. You offload long-term context into a document you maintain, rather than a process the agent repeats through tool calls at every session start.
The practical effect is significant. An agent that reads CLAUDE.md at startup already knows where the auth code lives, which test runner to use, and which files not to touch without asking. An agent without this reads the same files and discovers the same structure on every invocation. The exploration cost accumulates on any project with more than a few weeks of history.
This is also where you encode the agent’s risk posture: which operations require explicit confirmation, which paths are off-limits, what style conventions to follow. Leaving these unspecified means the model falls back on its general training, which may not match your expectations.
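Assembling the system prompt from a base specification plus persistent project notes can be sketched as follows. This is in the spirit of Claude Code's CLAUDE.md mechanism, not its actual implementation; the base prompt wording and section labels are invented for illustration.

```python
# Inject persistent, user-maintained project context into the system prompt
# so the agent does not rediscover the repo's structure every session.

from pathlib import Path

BASE_PROMPT = """You are a coding agent working in this repository.
Ask before modifying files you were not asked about.
Ask before running commands with side effects outside the repo."""

def build_system_prompt(repo_root):
    parts = [BASE_PROMPT]
    notes = Path(repo_root) / "CLAUDE.md"
    if notes.exists():                    # survives across sessions
        parts.append("Project notes (CLAUDE.md):\n" + notes.read_text())
    return "\n\n".join(parts)
```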
The Bash Tool Is Doing Most of the Work
Of all the tools in a typical coding agent, the shell executor matters most for output quality. It subsumes reading, searching, and writing as capabilities, but its more important function is closing the feedback loop.
A coding agent that can only write files is working without feedback. It writes a change, has no mechanism to verify correctness, and either declares success or asks you to check. An agent with shell access runs the tests, reads the error output, and adjusts. Paul Gauthier has documented this extensively in Aider’s benchmark methodology: test-driven agentic workflows on the SWE-bench evaluation suite consistently outperform write-only workflows by a substantial margin.
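A minimal shell tool that closes this feedback loop only needs to return the exit code along with captured output, so the model can read test failures and react. The timeout and truncation limits below are illustrative defaults; bounding output size matters because a noisy test run can otherwise swamp the context window.

```python
# Run a shell command and return exit code plus combined output,
# truncated to keep the result context-friendly.

import subprocess

def run_bash(command, cwd=None, timeout=120, max_output=10_000):
    try:
        proc = subprocess.run(command, shell=True, cwd=cwd,
                              capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "error: command timed out"
    output = (proc.stdout + proc.stderr)[:max_output]
    return f"exit code {proc.returncode}\n{output}"
```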
The shell tool is also where you make your most consequential trust decision. Full shell access in an isolated environment is fine. Full shell access on a developer’s machine, with write access to production configuration, is not. E2B and similar sandbox providers exist specifically to give the Bash tool somewhere safe to run, executing commands inside microVMs that can be terminated and discarded without affecting the host environment.
Aider makes a deliberate choice to limit shell exposure: by default it operates through file writes and git commits, using git as the undo mechanism. Less power, more predictability, and a cleaner audit trail. Whether that trade-off is correct depends on what you are building and how much automated verification you need.
Parallel Tool Calls Change the Latency Profile
The Anthropic and OpenAI APIs both support parallel tool calls, where the model emits multiple tool calls in a single response and the host executes them concurrently before returning all results in the next turn.
For I/O-heavy exploration phases, this matters. An agent reading five related files before making any changes can do so in one round trip instead of five sequential ones. On a task that requires reading ten or fifteen files to understand the scope of a change, the difference between sequential and parallel loading is measurable both in wall-clock time and in per-task cost.
Handling parallel tool calls correctly requires matching each result back to its corresponding tool call ID when appending to the conversation history. This is straightforward in the normal case and easy to get wrong when tools fail partially or return errors. Returning results out of order, or conflating results from two simultaneous reads, produces model behavior that can be difficult to diagnose because the symptom appears downstream, not at the point of the error.
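The bookkeeping can be sketched as below: each result carries its originating `tool_use_id`, and per-call error capture means one failing tool reports an error to the model instead of corrupting its siblings. The message shape follows Anthropic's format; `tool_impls` is the same hypothetical name-to-callable dict used earlier.

```python
# Execute parallel tool calls concurrently and match each result back to
# its tool_use_id, preserving order and isolating per-call failures.

from concurrent.futures import ThreadPoolExecutor

def execute_parallel(tool_calls, tool_impls):
    def run_one(call):
        try:
            content = tool_impls[call["name"]](**call["input"])
            is_error = False
        except Exception as exc:          # surface the failure to the model
            content, is_error = f"tool error: {exc}", True
        return {"type": "tool_result", "tool_use_id": call["id"],
                "content": content, "is_error": is_error}

    with ThreadPoolExecutor() as pool:
        results = list(pool.map(run_one, tool_calls))   # map preserves input order
    return {"role": "user", "content": results}
```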
Debugging Is Reading Traces, Not Logic
When a conventional program has a bug, you read a stack trace, add logging, or run a debugger. When a coding agent produces wrong output, you read the tool call transcript: what the model read, what it wrote, what commands it ran, what it decided to do at each step.
The transcript is the source of truth. The model’s text output between tool calls tells you what it was reasoning about, but the tool calls tell you what information it had and what it actually did with that information. The common failure pattern is a model that proceeded from a correct premise but with incomplete context. It fixed the thing it could see and missed the thing it could not. Reading the transcript at the point of divergence usually shows the model searching for something, not finding it, and proceeding on a false assumption rather than asking.
Tool design affects debuggability in a direct way. An agent that uses dedicated typed tools (read_file, search_code, write_file) produces a transcript you can scan quickly. An agent that routes everything through a general bash tool produces a transcript of shell commands, which is harder to parse when the session is long and the commands are non-obvious.
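A scannable transcript view is cheap to build: one line per tool call, one per result. This sketch assumes the same tool_use / tool_result message shape used throughout; the arrow formatting and preview length are arbitrary choices.

```python
# Render a conversation history as a one-line-per-step tool-call trace,
# the view you actually read when diagnosing a divergence.

def render_transcript(messages):
    lines = []
    for msg in messages:
        for block in msg.get("content", []):
            if isinstance(block, dict) and block.get("type") == "tool_use":
                lines.append(f"-> {block['name']}({block['input']})")
            elif isinstance(block, dict) and block.get("type") == "tool_result":
                preview = str(block["content"])[:60]   # truncate long outputs
                lines.append(f"<- {preview}")
    return "\n".join(lines)
```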
The Minimal Implementation
Building a minimal coding agent from scratch is the fastest way to internalize these trade-offs. The scaffolding for a working agent is 100 to 200 lines of code: an LLM client, a handful of tool implementations, and the loop. The Anthropic tool use documentation gets you to a running loop in under an hour.
What you learn quickly once it is running: tool descriptions matter far more than expected, stopping conditions require deliberate thought, and the first thing you want to add after the basics is a way to run tests and feed the output back into the context. From there, every subsequent improvement is a response to a specific failure you observed: a tool that was misused, a task that looped too long, a session that lost track of an earlier constraint.
The loop itself is the easy part. What you build around it is the actual product.