
The Tool Loop at the Heart of Every Coding Agent

Source: simonwillison

Simon Willison recently published a guide on how coding agents work as part of his agentic engineering patterns series. It is a solid primer on the mechanics. What I want to do here is go deeper on a few specific parts that I think are underexplained in most coverage of this topic: the structure of the tool-use loop, why context management is genuinely hard, and how different tools have made different trade-offs in solving it.

The Loop, Concretely

Every coding agent, regardless of vendor or interface, is built around the same fundamental pattern. The model produces a response. If that response contains a tool call, the host application executes the tool, appends the result to the message list, and calls the model again. This continues until the model produces a response with no tool calls, which is the signal that it is done.

In pseudocode:

# The core loop: call the model, execute any tool calls it emits,
# feed the results back, and repeat until a turn contains no tool calls.
while True:
    response = llm.chat(messages)
    if not response.tool_calls:
        print(response.text)  # no tools requested: this is the done signal
        break
    messages.append(response)  # record the assistant turn containing the calls
    for call in response.tool_calls:
        result = dispatch(call.name, call.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": result
        })

The model itself does nothing except produce text. All side effects (reading files, writing files, running commands) come from tools. The LLM reasons about what to do; the host application does it and reports back.

This separation matters more than it might seem. It means the “intelligence” and the “action” are cleanly decoupled. You can swap the model without changing the tool implementations. You can add new tools without retraining anything. The model learns what tools are available from a description in the system prompt or the API’s tool schema, and it generalizes from that.
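For concreteness, here is roughly what a tool declaration looks like: a name, a natural-language description, and a JSON Schema for the arguments. The sketch below follows the Anthropic-style `input_schema` shape; OpenAI's API uses the same idea with slightly different field names.

```python
# A tool is advertised to the model as data, not code. The model never sees
# the implementation, only this declaration, and it generalizes from the
# description to decide when and how to call the tool.
read_file_tool = {
    "name": "read_file",
    "description": "Read a file from disk and return its contents as text.",
    "input_schema": {
        "type": "object",
        "properties": {
            "path": {
                "type": "string",
                "description": "Path to the file, relative to the project root.",
            }
        },
        "required": ["path"],
    },
}
```

Nothing here is executable; the host application maps the name back to a real function when the model emits a call.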

The Standard Tool Set

Most coding agents converge on a similar core set of tools:

Tool                          Purpose
read_file(path)               Load file contents into context
write_file(path, content)     Write or overwrite a file
bash(command)                 Run arbitrary shell commands
list_directory(path)          Explore the directory tree
search_files(pattern, glob)   Ripgrep-style content search
web_fetch(url)                Read documentation or reference material

The bash tool is the most powerful and the most dangerous. It subsumes most of the others, since you can implement read_file as cat, write_file as a redirect, and search_files as grep -r. But a typed read_file tool is preferable because it makes the agent’s intent auditable. When you see a read_file call in a transcript, you know exactly what happened. A bash call with cat is less clear in an automated review pipeline.

Claude Code exposes both: dedicated structured tools for common operations, plus a general bash tool for everything else. Aider takes the opposite approach and does almost everything through shell commands, relying on the model to produce unified diffs that it then applies with patch. Each philosophy has merit, and the right choice depends on how much you trust the model’s output to be well-formed.

Stopping Conditions Are Underspecified

The loop terminates when the model produces a turn with no tool calls. But that condition is not always clean. A model might produce a partially reasoned response mid-task, with no tool calls but also no completed output. Or it might loop indefinitely on a failing test, trying the same fix repeatedly.
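One cheap guard against the repeated-fix failure mode is to watch for identical tool calls. A sketch, not any shipping agent's implementation:

```python
from collections import deque

class RepeatGuard:
    """Flags when the same (tool, arguments) pair is issued several times
    in a row, a cheap proxy for 'stuck retrying the same failing fix'."""

    def __init__(self, limit: int = 3):
        self.limit = limit
        self.recent = deque(maxlen=limit)  # sliding window of recent calls

    def check(self, name: str, arguments: dict) -> bool:
        # Normalize the arguments so dict ordering does not matter.
        key = (name, repr(sorted(arguments.items())))
        self.recent.append(key)
        # Stuck if the window is full and every entry is identical.
        return len(self.recent) == self.limit and len(set(self.recent)) == 1
```

A host application can use a positive result to interrupt the loop, or to inject a message telling the model to try a different approach.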

Practically, every serious coding agent adds a hard cap: a maximum number of iterations, a token budget, or both. Claude Code tracks context window usage and will warn the user before hitting the limit. Devin sets time budgets for tasks. Aider surfaces iteration counts and lets you interrupt.

The deeper problem is that the model has no reliable way to know when it has succeeded. It can run tests and check the output. It can re-read the files it modified. But verifying correctness is as hard as writing correct code in the first place, sometimes harder. The result is that agents tend to declare success based on shallow signals: tests pass, no error output, the diff looks right. This is fine for straightforward tasks and fragile for complex ones.
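The hard caps described above fold directly into the loop. This is a sketch under assumed interfaces: `llm`, `dispatch`, and `estimate_tokens` are stand-ins, not a real vendor API.

```python
MAX_TURNS = 25          # hard cap on loop iterations
TOKEN_BUDGET = 150_000  # stop before the context window fills

def run_agent(llm, messages, dispatch, estimate_tokens):
    """The tool loop with a turn cap and a token budget bolted on."""
    for _ in range(MAX_TURNS):
        if estimate_tokens(messages) > TOKEN_BUDGET:
            return "stopped: token budget exceeded"
        response = llm.chat(messages)
        if not response.tool_calls:
            return response.text  # the model's natural done signal
        messages.append(response)
        for call in response.tool_calls:
            result = dispatch(call.name, call.arguments)
            messages.append({"role": "tool",
                             "tool_call_id": call.id,
                             "content": result})
    return "stopped: turn limit reached"
```

The two explicit `stopped:` returns are the point: a bounded loop fails loudly, while an unbounded one fails by burning tokens.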

The Context Window Is the Real Problem

The tool loop is not complicated. What is complicated is making it work on a real codebase.

A mid-sized production codebase might have 500,000 lines of code. Even the largest context windows available today (Gemini 1.5 Pro's 2 million tokens, Claude's 200k) cannot hold all of it. And even if they could, filling the context with irrelevant code degrades model performance: the model attends to everything in its context, and noise hurts precision.

So every coding agent has to answer a harder question: given a task, which parts of the codebase should be loaded into context?

Aider’s Repo Map

Aider solves this with a “repo map”: a compact representation of every symbol in the codebase, generated by parsing files with tree-sitter. Instead of including full file contents, the repo map includes function signatures, class definitions, and their relationships. This gives the model a structural overview of the entire codebase in a few thousand tokens, which it can use to decide which files to request in full.

The repo map is regenerated on each invocation and fits even large codebases because it strips out function bodies. It is a clever trade-off: you lose implementation details, but you preserve the call graph and public interface of every module.
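Aider does this extraction with tree-sitter so it works across many languages. To illustrate the idea for a single Python file, the standard library's ast module is enough; this sketch is mine, not Aider's code.

```python
import ast

def file_signatures(source: str) -> list[str]:
    """Extract top-level function and class signatures, dropping bodies.

    This is the repo-map trade-off in miniature: implementation details
    are lost, but the public interface of the module is preserved in a
    handful of tokens.
    """
    tree = ast.parse(source)
    sigs = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            sigs.append(f"def {node.name}({args})")
        elif isinstance(node, ast.ClassDef):
            sigs.append(f"class {node.name}")
            for item in node.body:
                if isinstance(item, ast.FunctionDef):
                    args = ", ".join(a.arg for a in item.args.args)
                    sigs.append(f"    def {item.name}({args})")
    return sigs
```

Run over every file in a repository and concatenated, output like this gives the model a table of contents it can use to decide which files to request in full.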

Cursor’s Embedding Index

Cursor takes a retrieval-augmented approach. It builds a vector index of the codebase at indexing time, embedding file chunks using a code-aware embedding model. When you ask a question or give a task, relevant chunks are retrieved by similarity search and injected into the context automatically.

This works well for semantic queries: “find where we handle OAuth tokens” or “show me how errors are surfaced to the user”. It works less well for structural queries, such as “find all callers of this function”, where call graph traversal or grep is more reliable. Cursor supplements the embedding index with a traditional code search path for these cases.
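The retrieval flow itself is simple; the quality comes from the embedding model. The sketch below substitutes a toy bag-of-words vector for Cursor's learned code-aware embeddings, purely to make the chunk-rank-inject pipeline concrete.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a learned embedding: a bag-of-words vector.
    # Splitting on [a-z]+ also breaks snake_case identifiers into words,
    # which is what lets "oauth" match "refresh_oauth_token".
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: dict[str, str], k: int = 2) -> list[str]:
    """Return the ids of the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda cid: cosine(q, embed(chunks[cid])),
                    reverse=True)
    return ranked[:k]
```

In a real system the chunk vectors are computed once at indexing time and stored, which is exactly where the staleness problem discussed below comes from.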

Claude Code’s Selective Loading

Claude Code does not maintain a persistent index. Instead, it relies on the model to navigate the codebase through tools: list a directory, read a file, search for a pattern, repeat. The model builds its own understanding of the structure through exploration.

This approach requires more tool-call round trips at the start of a task. On a well-organized codebase with clear naming conventions and a good README, it works quickly. On a sprawling legacy codebase with inconsistent structure, the exploration phase can consume a significant fraction of the context budget before any meaningful work is done.

The trade-off is freshness versus speed. An embedding index can go stale between indexing runs. Selective loading is always current because it reads files as they exist on disk. For codebases under active development, that matters.

What the Loop Actually Looks Like in Practice

Here is a condensed transcript of what a coding agent does when asked to fix a bug in a medium-complexity codebase:

  1. list_directory("/") to understand the project layout
  2. read_file("README.md") to get a high-level description
  3. search_files("authentication", "*.py") to find relevant files
  4. read_file("src/auth/middleware.py") to read the candidate file
  5. read_file("src/auth/tokens.py") after spotting a related import
  6. bash("python -m pytest tests/test_auth.py -x") to see the current test failure
  7. write_file("src/auth/middleware.py", ...) to apply the fix
  8. bash("python -m pytest tests/test_auth.py -x") to verify
  9. Final text response: the fix is applied, tests pass

Nine steps, eight tool calls, two test runs: one to reproduce the failure, one to verify the fix. This is representative of a well-scoped bug fix. A feature addition might take forty steps. A refactor touching many files might take a hundred.

The Parallel Execution Problem

One limitation of the sequential loop is that it is slow when tasks have independent subtasks. Reading five files that do not depend on each other should be parallelizable, but the standard tool-use loop does one tool call per turn.

Some agent frameworks address this by allowing the model to emit multiple tool calls in a single turn, a pattern the Anthropic and OpenAI APIs both support. Instead of read-file-A, wait, read-file-B, wait, the model can emit both reads simultaneously and receive both results before its next turn. This can cut exploration latency significantly on tasks that require reading many independent files.

Not all models use this capability reliably. Getting a model to emit well-formed parallel tool calls consistently requires prompt engineering and sometimes fine-tuning. Claude Code supports parallel tool calls, and for I/O-heavy tasks, the speedup is measurable.
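On the host side, executing a batch of independent tool calls concurrently is straightforward for I/O-bound tools. A sketch, assuming a `dispatch` function whose implementations are thread-safe:

```python
from concurrent.futures import ThreadPoolExecutor

def run_tool_calls(calls: list[dict], dispatch) -> list[dict]:
    """Execute independent tool calls concurrently.

    Results are returned in the order the model emitted the calls, and
    each is tagged with its call id, since the API matches results back
    to calls by id.
    """
    with ThreadPoolExecutor(max_workers=8) as pool:
        # pool.map preserves input order regardless of completion order.
        results = list(pool.map(
            lambda c: dispatch(c["name"], c["arguments"]), calls))
    return [
        {"role": "tool", "tool_call_id": c["id"], "content": r}
        for c, r in zip(calls, results)
    ]
```

This only helps when the calls really are independent; a write followed by a read of the same file must still run sequentially, which is one reason models need prompting to use parallel calls judiciously.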

Where This Architecture Leads

The tool-use loop is not a temporary hack waiting to be replaced by something more sophisticated. It is a direct consequence of how LLMs work: they produce text, they do not execute code. The loop is the minimal architecture that gives a text-producing model the ability to affect the world.

What will improve is not the loop structure, but what happens inside it: better context selection, smarter stopping criteria, richer tool sets, and models that are more calibrated about when they are confident versus when they need to explore further. The agents that are most useful today are the ones that fail gracefully, tell you when they are stuck, and do not silently produce plausible-looking wrong code.

That last property is harder to engineer than it sounds, and it is where most of the real work in this space is still happening.
