Context, Tools, and the Loop: The Real Mechanics Behind Coding Agents
Source: simonwillison
Simon Willison recently published a thorough guide on how coding agents work, and it’s worth using as a jumping-off point to go deeper on the mechanics that most explanations skip over. The high-level description of “LLM calls tools in a loop” is accurate but leaves out the parts that actually matter when you’re building or evaluating one of these systems.
The Loop Itself
The foundation is the ReAct pattern, described in the 2022 Yao et al. paper. The model reasons, then acts, then observes the result, then reasons again. In practice with modern tool-calling APIs, this looks like:
- The model receives a system prompt, conversation history, and a JSON schema describing available tools
- The model responds with either plain text or one or more tool calls
- The host executes those tool calls and appends the results to the conversation
- The model is invoked again with the updated context
- Repeat until the model produces a plain-text response with no tool calls
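The steps above can be sketched in a few lines. Everything here is a stand-in: `call_model` replaces a real LLM API call, and the single `read_file` tool is a hypothetical placeholder for a host-side implementation.

```python
# Minimal sketch of the tool-calling loop. call_model is a stub standing in
# for a real LLM API; it returns one tool call, then a plain-text answer.

def call_model(messages, tools):
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_calls": [{"id": "1", "name": "read_file",
                                "args": {"path": "README.md"}}]}
    return {"text": "The README describes the project."}

# Hypothetical tool implementations, keyed by name.
TOOLS = {"read_file": lambda args: f"<contents of {args['path']}>"}

def run_agent(user_request, tools=TOOLS):
    messages = [{"role": "user", "content": user_request}]
    while True:
        response = call_model(messages, tools)
        if "text" in response:                   # plain text ends the loop
            return response["text"]
        # Record the assistant's tool calls, then execute each one and
        # append the result so the next model call sees it.
        messages.append({"role": "assistant",
                         "tool_calls": response["tool_calls"]})
        for call in response["tool_calls"]:
            result = tools[call["name"]](call["args"])
            messages.append({"role": "tool", "tool_call_id": call["id"],
                             "content": result})

print(run_agent("Summarize the README"))
```

The host code never interprets what the model is doing; it only executes tool calls and feeds results back until the model stops asking for them.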
The Anthropic tool use API implements this with tool_use content blocks in the assistant turn and tool_result blocks in the subsequent user turn. OpenAI does essentially the same thing. The protocol is simple; what varies enormously is everything built around it.
Tool Design Is the Agent
The set of tools an agent exposes determines almost everything about its behavior. Compare Claude Code’s tool set with something like Aider’s approach.
Claude Code exposes tools like bash, read_file, write_file, edit_file, glob, and grep. The bash tool is particularly powerful because it lets the agent run arbitrary shell commands, which means it can invoke compilers, test runners, linters, package managers, and anything else on the system. This generality is also a risk surface, which is why the tool prompts users for confirmation on potentially destructive commands.
Aider takes a more constrained approach. Rather than a general bash tool, it uses structured file-editing tools that produce SEARCH/REPLACE blocks or unified diffs. The model outputs something like:
src/utils.py
<<<<<<< SEARCH
def process(items):
return [i for i in items]
=======
def process(items):
return [transform(i) for i in items]
>>>>>>> REPLACE
Aider then parses this and applies the edit. This breaks when the model gets the SEARCH block even slightly wrong, since the search text must match the file exactly, but it makes the editing operation explicit and reviewable. The tradeoff is reliability versus flexibility.
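Parsing and applying one of these blocks takes little code. This is a sketch of the mechanics, not Aider's actual parser, and it fails loudly in exactly the case described above: when the SEARCH text doesn't match the file exactly once.

```python
import re

# Matches one SEARCH/REPLACE block; DOTALL lets the groups span newlines.
BLOCK_RE = re.compile(
    r"<<<<<<< SEARCH\n(.*?)\n=======\n(.*?)\n>>>>>>> REPLACE", re.DOTALL)

def apply_block(file_text, block):
    match = BLOCK_RE.search(block)
    if not match:
        raise ValueError("malformed SEARCH/REPLACE block")
    search, replace = match.group(1), match.group(2)
    if file_text.count(search) != 1:
        raise ValueError("SEARCH text not found exactly once")
    return file_text.replace(search, replace)

original = "def process(items):\n    return [i for i in items]\n"
block = (
    "<<<<<<< SEARCH\n"
    "def process(items):\n"
    "    return [i for i in items]\n"
    "=======\n"
    "def process(items):\n"
    "    return [transform(i) for i in items]\n"
    ">>>>>>> REPLACE"
)
print(apply_block(original, block))
```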
The str_replace_editor approach used by Claude Code’s edit tool sits in between: the model specifies an exact old string and a new string, and the tool finds and replaces it. This fails if the old string doesn’t match exactly, but it’s more token-efficient than rewriting whole files and more reliable than generating diffs.
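The semantics of that exact-match edit can be stated in a few lines. This is a sketch of the behavior described above, not the actual Claude Code implementation; the key detail is rejecting ambiguous matches rather than silently editing the first one.

```python
def str_replace(text, old_str, new_str):
    # Succeed only when old_str appears exactly once, so the edit is
    # unambiguous. Zero matches and multiple matches are both errors
    # the model can read and react to.
    count = text.count(old_str)
    if count == 0:
        raise ValueError("old_str not found")
    if count > 1:
        raise ValueError(f"old_str matches {count} times; be more specific")
    return text.replace(old_str, new_str, 1)
```

The "exactly once" rule is what pushes the model to include enough surrounding context in `old_str` to make the edit site unique.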
The Context Window Is the Working Memory
This is where most explanations of coding agents undersell the difficulty. The context window is the agent’s only working memory. Everything the agent “knows” about a task at any point in time is whatever fits in that window: the original request, every file it has read, every tool call it has made, every result it has received.
For a small bug fix in a single file this is fine. For a refactor that touches twenty files in a 200,000-line codebase, it becomes the central engineering challenge.
Different agents handle this differently. Aider uses a repository map, a compact index of the codebase built using tree-sitter to extract function signatures, class names, and call relationships. The repo map gives the model a high-level overview of the whole codebase without reading every file, and Aider dynamically adjusts which files are included based on what’s relevant to the current task. This spends the model’s token budget on structure rather than raw file contents, which scales much better.
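The core idea of a repo map is signatures without bodies. Aider uses tree-sitter so it works across languages; the sketch below uses Python's stdlib `ast` module instead, so it only handles Python source, but it shows the shape of the output.

```python
import ast

# Build repo-map-style entries for one Python source file: class names and
# function signatures, no bodies. (Stdlib-ast stand-in for tree-sitter.)

def map_source(source):
    entries = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            entries.append(f"class {node.name}")
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            entries.append(f"def {node.name}({args})")
    return entries

src = """
class Cart:
    def add(self, item, qty=1):
        self.items.append((item, qty))

def checkout(cart):
    return sum(q for _, q in cart.items)
"""
print(map_source(src))
```

A few hundred of these one-line entries can describe a codebase that would take hundreds of thousands of tokens to include verbatim.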
Cursor takes a retrieval-augmented approach. It embeds code chunks semantically and retrieves the most relevant chunks into context when you ask a question. This works well for “find where X is implemented” queries but can miss the broader context that a human engineer would hold in their head.
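The mechanism is chunk, embed, then nearest-neighbour lookup at question time. Cursor uses neural embeddings; the toy sketch below substitutes bag-of-words cosine similarity so it runs without any model, purely to show the retrieval step.

```python
import math
from collections import Counter

def embed(text):
    # Toy "embedding": word counts. Real systems use a neural encoder.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical code chunks indexed ahead of time.
chunks = [
    "def parse_config(path): load the config file settings from disk",
    "def render_chart(data): draw a bar chart with matplotlib",
    "class UserSession: manages login tokens and expiry",
]
query = "where is the config file parsed"
best = max(chunks, key=lambda c: cosine(embed(query), embed(c)))
print(best)
```

The failure mode mentioned above falls out of this structure: retrieval surfaces the chunks most similar to the query, not the cross-cutting context that makes those chunks safe to change.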
Claude Code handles context overflow by summarizing older parts of the conversation, compressing them into a shorter representation that preserves the key facts. This loses information but lets the agent work on longer tasks without hitting a hard wall.
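Compaction of this kind reduces to a simple policy: when the transcript exceeds a token budget, replace the oldest turns with a summary turn and keep the recent ones verbatim. This is a sketch of that policy, not Claude Code's implementation; the summarizer is a stub where a real agent would make another LLM call.

```python
def rough_tokens(text):
    # Crude heuristic: roughly four characters per token.
    return len(text) // 4

def summarize(turns):
    # Stub; a real agent would ask the model to summarize these turns.
    return "[summary of %d earlier turns]" % len(turns)

def compact(messages, budget, keep_recent=4):
    total = sum(rough_tokens(m["content"]) for m in messages)
    if total <= budget or len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [{"role": "user", "content": summarize(old)}] + recent

history = [{"role": "user", "content": "tool output " * 200} for _ in range(8)]
print(len(compact(history, budget=500)))
```

The interesting engineering is in what the summary preserves: file paths, decisions made, and constraints discovered need to survive compaction, while raw tool output usually does not.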
None of these approaches is a complete solution. They’re all approximations of the mental model a human engineer builds up over years of working in a codebase.
The Interruption Problem
One of the harder design problems in coding agents is deciding when to stop and ask the user a question versus continuing autonomously. Get it wrong in one direction and you have an agent that asks for confirmation on every trivial step, which is annoying and slow. Get it wrong in the other direction and you have an agent that confidently deletes the wrong files or makes architectural decisions it shouldn’t.
Claude Code addresses this partly through tool-level confirmation: destructive operations require explicit approval, while read operations proceed silently. But the harder case is semantic: when the agent discovers mid-task that the request is ambiguous, or that fulfilling the request as stated would break something else.
The best agents I’ve used tend to handle this by front-loading clarification. Before starting a complex task they’ll ask the one or two questions that would most change their approach, rather than asking nothing and then discovering a showstopper halfway through.
This is a prompt engineering problem as much as an architecture problem. The system prompt for a coding agent has to teach the model when ambiguity is worth resolving versus when it’s fine to make a reasonable assumption and mention it in the response.
Multi-Step Execution and Side Effects
When an agent runs tests as part of its loop, it’s doing something humans do all the time: write code, run it, observe the output, adjust. The key insight is that the test runner output becomes tool call output that goes back into context, and the model can reason about it.
A concrete example: the agent writes a function, runs pytest, sees a failing assertion, reads the error message in context, understands what went wrong, edits the file, and runs the tests again. This loop can run several times before the agent is satisfied. For a human watching in real time this looks almost like watching someone code at the terminal.
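That write-test-fix cycle can be shown with stubbed components: `run_tests` stands in for invoking pytest, and `propose_fix` hard-codes the edit that a real agent would derive by reasoning over the error text.

```python
def run_tests(code):
    # Stand-in for a test runner: exec the code and check one assertion,
    # returning (passed, output) so failures carry a readable message.
    env = {}
    try:
        exec(code, env)
        assert env["add"](2, 3) == 5
        return (True, "1 passed")
    except AssertionError:
        return (False, "assert add(2, 3) == 5  ->  got %r" % env["add"](2, 3))

def propose_fix(code, error):
    # A real agent reasons over the error text; here the edit is hard-coded.
    return code.replace("a - b", "a + b")

code = "def add(a, b):\n    return a - b\n"
for attempt in range(3):            # bounded retries, as in real agents
    ok, output = run_tests(code)
    if ok:
        break
    code = propose_fix(code, output)
print(ok, output)
```

The bounded retry count is the important structural detail: without it, an agent that cannot fix the test spins forever instead of surfacing the failure.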
The problem is side effects. If the agent runs a migration script as part of testing its changes, that migration has now run. If the agent sends an HTTP request to an external API, that request happened. The tool loop makes it easy to accumulate side effects that are difficult to undo, which is why the most careful agent implementations sandbox execution or operate on branches rather than directly on main.
What Makes Some Agents Better Than Others
The core loop is nearly identical across all modern coding agents. The differences come from:
Tool granularity. Agents with well-designed, composable tools that give specific feedback fail more informatively. A tool that returns “command failed” is less useful than one that returns the full stderr output.
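The difference is visible in how a bash-style tool packages its result. A minimal sketch of the informative version, returning the exit code and full streams rather than a bare failure flag (illustrative only; real agents add sandboxing and stricter limits):

```python
import subprocess
import sys

def run_command(cmd, timeout=30):
    # Return exit code, stdout, and stderr so the model has concrete
    # error text to reason about instead of "command failed".
    proc = subprocess.run(cmd, shell=True, capture_output=True,
                          text=True, timeout=timeout)
    return {"exit_code": proc.returncode,
            "stdout": proc.stdout,
            "stderr": proc.stderr}

result = run_command(f'{sys.executable} -c "import nonexistent_module"')
print(result["exit_code"], result["stderr"].splitlines()[-1])
```

The stderr text here includes the exact missing module name, which is precisely the detail the model needs to decide its next tool call.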
Context curation. How the agent decides what to include in context as a task progresses determines whether it can maintain coherent reasoning over long tasks. Agents that dump everything into context hit walls quickly; agents that curate intelligently can sustain longer workflows.
System prompt quality. The instructions given to the model about how to use its tools, when to ask questions, and how to handle errors are the unsexy part that separates working agents from broken ones. You can see hints of this in Anthropic’s published model cards and in how agents behave when they hit edge cases.
Recovery behavior. What does the agent do when a tool call fails? When a test won’t pass after three attempts? The best agents recognize when they’re stuck and surface that to the user rather than spinning in a loop.
The underlying LLM matters too, but less than most people assume. A mediocre agent architecture with a strong model will still make bad tool use decisions. A well-designed agent with clear tools, good context management, and sensible interruption logic will outperform a naive wrapper around a more capable model.
For anyone building on top of these patterns, Anthropic’s agent documentation and the Claude Code source patterns are worth studying closely, not for the API calls, but for how the tool boundaries are drawn and what assumptions the system prompt encodes. That’s where the real design decisions live.