The core loop of a coding agent is about eight lines of Python:
```python
while True:
    response = client.messages.create(model=model, tools=tools, messages=messages)
    messages.append({"role": "assistant", "content": response.content})
    if response.stop_reason == "end_turn":
        break
    tool_results = []
    for block in response.content:
        if block.type == "tool_use":
            result = dispatch_tool(block.name, block.input)
            tool_results.append({"type": "tool_result", "tool_use_id": block.id, "content": str(result)})
    messages.append({"role": "user", "content": tool_results})
```
Simon Willison’s guide on how coding agents work covers this clearly. The loop calls the model, appends its response to the conversation, checks if it’s done, executes any tool calls, appends the results, and repeats. Every tool call and result gets written into the same growing conversation history the model reads on each iteration. The model accumulates context until it emits a response with no tool calls, at which point the loop exits.
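The one piece the loop leaves undefined is dispatch_tool, which maps a tool name from the model to a local function. A minimal sketch, with hypothetical tool names and handlers:

```python
import subprocess

def read_file(path: str) -> str:
    with open(path) as f:
        return f.read()

def run_command(command: str) -> str:
    # Capture both streams so failures are visible to the model.
    proc = subprocess.run(command, shell=True, capture_output=True, text=True)
    return proc.stdout + proc.stderr

# Registry mapping tool names (as declared in `tools`) to handlers.
TOOL_HANDLERS = {
    "read_file": read_file,
    "run_command": run_command,
}

def dispatch_tool(name: str, tool_input: dict) -> str:
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        return f"Error: unknown tool {name!r}"
    try:
        return handler(**tool_input)
    except Exception as exc:
        # Exceptions become tool results, not crashes: the model reads
        # them on the next iteration and can change course.
        return f"Error: {type(exc).__name__}: {exc}"
```

Note that errors are returned as strings rather than raised; they flow back into the conversation like any other result, which matters for the error-recovery discussion later.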
This structure traces back to the ReAct pattern (Yao et al., 2022): Reason, Act, Observe, repeat. What the eight-line version doesn’t show is where the engineering work actually lives. Three problems separate a weekend prototype from something like Claude Code or Aider: file editing reliability, context window management, and error recovery that converges rather than spirals.
The Edit Format Problem
Every coding agent needs to modify files. The obvious approach is a write_file tool that accepts a full path and content. For small files, this works fine. For anything over a few hundred lines, it becomes expensive and unreliable. The model has to regenerate the entire file, costing tokens and introducing drift: minor changes to whitespace, comments, and variable names accumulate across regenerations.
Aider uses SEARCH/REPLACE blocks embedded in the model’s text output:
```
src/auth/session.ts
<<<<<<< SEARCH
const expiry = new Date(Date.now() + 3600);
=======
const expiry = new Date(Date.now() + 3600 * 1000);
>>>>>>> REPLACE
```
This is model-agnostic and readable in conversation history. The cost is that format adherence varies across models. Aider addresses this by selecting different edit formats per model: udiff for models that handle diffs reliably, whole for models where full-file replacement is cleaner, SEARCH/REPLACE as the middle ground for most cases.
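Applying one of these blocks is a plain string operation against the current file contents. A minimal sketch (not Aider's actual implementation, which adds tolerance for near-miss matches):

```python
def apply_search_replace(content: str, block: str) -> str:
    """Apply one SEARCH/REPLACE block to file content, failing loudly."""
    _, _, rest = block.partition("<<<<<<< SEARCH\n")
    search, _, rest = rest.partition("=======\n")
    replace, _, _ = rest.partition(">>>>>>> REPLACE")
    # Require exactly one match: zero means a stale or misquoted SEARCH
    # section, more than one means the edit location is ambiguous.
    count = content.count(search)
    if count != 1:
        raise ValueError(f"SEARCH text matched {count} times; need exactly 1")
    return content.replace(search, replace, 1)
```

The exactly-one-match requirement is the same constraint the structured-tool-call approach enforces, just expressed over text markers instead of parameters.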
Claude Code takes a different approach: structured tool calls with old_string and new_string parameters. The tool fails loudly if old_string isn’t found verbatim in the file, which forces the agent to read the file before editing. That constraint is load-bearing. It prevents edits based on stale assumptions about what the file currently contains, and it surfaces the read-before-write discipline that long-session reliability depends on.
Both approaches avoid line-number addressing. Line numbers are unstable across a long session: early edits shift everything below them. An edit that targets line 87 in the file the agent read twelve turns ago may now be targeting line 94, or may have moved to a different function entirely. String matching against current file contents sidesteps this problem regardless of prior session history.
The failure message design matters as much as the matching strategy. Compare:
```
Error: old_string not found in src/auth/session.ts.
Current lines 40-45:
const expiry = new Date(Date.now() + 3600000);
```
versus just “edit failed”. The first message gives the model enough information to self-correct on the next step. The second produces retry attempts with the same incorrect string until something else forces a state change.
Context Window as Process State
The architectural fact that distinguishes coding agents from other software: the context window is the only process state. The model has no variables, no heap, no call stack that persists between inference calls. Every tool call it makes, every result it receives, every prior decision exists as tokens in the conversation history that grows monotonically throughout the session.
This has a direct consequence for reliability. The “lost in the middle” paper (Liu et al., 2023) documented that LLMs recall information in the middle of long contexts significantly worse than information near the start or end. In a coding agent session, early tool results drift toward the middle as the session progresses. An agent will sometimes re-read files it already read thirty turns ago, because the earlier read no longer receives adequate attention weight.
The strategies for managing this diverge meaningfully between tools.
Claude Code applies summarization compaction at around 85% capacity: a secondary model call replaces the middle of the conversation history with a dense summary, preserving the system prompt and recent exchanges verbatim. This is lossy by design. It trades completeness for staying within the context limit without losing the original task instructions, which live at the beginning and receive the highest attention weight.
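The shape of that compaction step can be sketched as follows; the 85% threshold, the characters-per-token estimate, and the summarize callback are placeholders, not Claude Code's actual internals:

```python
def compact(messages: list, summarize, keep_recent: int = 6,
            limit: int = 200_000, threshold: float = 0.85) -> list:
    """Replace the middle of a conversation with a summary near the limit."""
    # Crude token estimate: roughly 4 characters per token.
    est_tokens = sum(len(str(m)) for m in messages) // 4
    if est_tokens < limit * threshold or len(messages) <= keep_recent + 1:
        return messages
    # Keep the opening message (task instructions) and the most recent
    # exchanges verbatim; everything between them gets summarized.
    head, middle, tail = messages[:1], messages[1:-keep_recent], messages[-keep_recent:]
    summary = summarize(middle)  # a secondary model call in a real system
    note = {"role": "user", "content": f"[Summary of earlier work]\n{summary}"}
    return head + [note] + tail
```

The lossy trade is explicit here: middle turns are gone after compaction, so anything the agent will need later has to live in the head, the tail, or external files.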
Aider externalizes structure through a repository map generated using tree-sitter parse output: every symbol with its file path and surrounding context, paid once at session start. This costs 5,000-15,000 tokens upfront on a medium project, but it means the agent doesn’t need to discover structure incrementally through tool calls, and it doesn’t need to re-read files because it already has a structural map.
Cursor maintains a continuous embedding index and retrieves semantically relevant chunks per prompt, keeping initial context proportional to the task’s semantic footprint rather than the full codebase. The failure mode here is that semantic retrieval misses architecturally related code that uses different terminology: a function named processPayment won’t surface when searching for billing, even if the two are tightly coupled.
The file system itself functions as external state storage under context pressure. Writing a PLAN.md, externalizing intermediate findings to scratch files, or structuring work as multiple checkpointed sessions are all valid strategies. The CLAUDE.md convention at project root functions as persistent context that survives compaction, because it re-enters the context window at position zero on every session start.
Error Recovery That Converges
Errors in the tool loop are just tool results. A failed edit produces a tool_result with is_error: true and enters the context window the same way a successful result does. The model reads it on the next inference call and decides how to proceed. This is where the architecture either converges toward a solution or spirals.
The spiral pattern is specific. The model attempts an edit, the edit fails due to a mismatched string, the model tries a slightly different version of the same edit, that also fails, the model tries a different approach that partially succeeds but leaves the file inconsistent. Each failure adds tokens. As the session grows longer, attention to the original error fades toward the middle of the context. Eventually the model starts making confident progress on subtasks while the original error persists unaddressed.
Mitigation strategies from production implementations:
Explicit retry budgets in the system prompt prevent indefinite spirals without requiring external orchestration. A constraint like “if the same approach fails twice, stop and describe the obstacle” forces escalation instead of repetition.
Requiring clean git state before starting any task means every session is recoverable by reverting. This is the minimal footprint principle applied to error recovery: side effects that can’t be undone are a liability in a system whose control flow is nondeterministic.
Syntax validation after file writes catches one common category of cascading error: an edit that introduces a parse error, which causes test output to fail with a confusing message, which causes the model to misdiagnose the problem as a dependency issue and start modifying unrelated configuration.
Where the Performance Gains Come From
SWE-bench measures AI agents solving real GitHub issues evaluated against project test suites. The original SWE-agent paper from Princeton NLP reported a 12.5% solve rate with a shell-access-centric approach. Current top entries on the Verified subset exceed 50%.
The gains aren’t primarily from model reasoning improvements. The agents that perform well are the ones managing the edit-run-observe cycle with the fewest unnecessary tokens, the clearest tool contracts, and the most explicit failure case handling. Agents with test execution capability substantially outperform file-modification-only agents at equivalent model capability, because running the tests after an edit provides the model with a ground-truth observation rather than forcing it to reason about correctness from static analysis alone.
Tool descriptions are load-bearing in ways that don’t show up in the loop code. A parameter named record_id_to_permanently_delete produces different model behavior than one named id, even when the underlying implementation is identical. The model reasons about confirmation and caution based on names and descriptions before it decides whether to call the tool. This isn’t a quirk. It’s the mechanism by which behavioral constraints get encoded into the agent’s planning process without additional orchestration logic.
The Model Context Protocol standardizes tool definitions across runtimes, so a tool server written once can be consumed from Claude Code, Cursor, or any MCP-compatible host. The compounding effect of well-designed tools scales across every agent that uses them.
The eight-line loop is the foundation. The discipline described in Simon Willison’s agentic engineering series is everything built around it.