
The Loop Is the Agent: What Actually Happens Inside a Coding Tool

Source: simonwillison

Every coding agent, underneath its interface, is a loop. The model generates output, some of that output is tool calls, scaffolding code executes those calls, results go back into the context, and the model runs again. Simon Willison’s guide on agentic engineering patterns explains this clearly. What the guide leaves as an exercise is what happens when you stress that loop against a real codebase with real failures, filling context windows and hitting edge cases that no system prompt anticipated.

This post digs into the mechanics: the actual API shape, why context management is the hard problem, and what separates a brittle agent from one that recovers gracefully.

The Tool Call Exchange

When a model like Claude decides to read a file, it does not read anything. It produces a structured JSON object inside its response that looks like this:

{
  "type": "tool_use",
  "id": "toolu_01XKpq7z",
  "name": "read_file",
  "input": {
    "path": "src/commands/ping.ts"
  }
}

The scaffolding, the code you write or that ships with a product like Claude Code, intercepts that, opens the file, and produces a tool result:

{
  "type": "tool_result",
  "tool_use_id": "toolu_01XKpq7z",
  "content": "import { SlashCommandBuilder } from 'discord.js';\n..."
}

That result goes into the next API call as part of the messages array. The model never “has” the file; it has the text of the file in its context window for as long as that window persists. This is a meaningful distinction. The model is not stateful in the way a running program is. Everything it knows about the current task lives in the token stream.

The Anthropic tool use documentation specifies this exchange format in detail. OpenAI’s function calling API uses a structurally similar pattern, which is why most agent scaffolding can be adapted between providers without rewriting the core logic.
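A minimal sketch of the scaffolding side of that exchange, using plain dicts shaped like the blocks above. The `handle_response` name and the errors-as-text convention are illustrative choices, not any SDK's API:

```python
def handle_response(content_blocks, tools):
    """Dispatch tool_use blocks from a model response and build the
    tool_result blocks that go back in the next user message."""
    results = []
    for block in content_blocks:
        if block.get("type") != "tool_use":
            continue  # plain text blocks pass through to the user
        handler = tools[block["name"]]
        try:
            output = handler(**block["input"])
        except Exception as exc:
            # Failures go back into context as text, so the model can react
            output = f"Error: {exc}"
        results.append({
            "type": "tool_result",
            "tool_use_id": block["id"],  # must match the tool_use id
            "content": output,
        })
    return results

# The next API call appends both sides of the exchange to the messages array:
#   {"role": "assistant", "content": content_blocks}
#   {"role": "user", "content": results}
```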

What Tools a Coding Agent Actually Needs

The minimal useful set for a coding agent is smaller than you might expect:

  • Read file: returns file contents, possibly with line numbers
  • Write file or str_replace: create or patch files
  • Bash/shell execution: run tests, build commands, grep, git
  • List directory: enumerate the filesystem
  • Search/grep: find patterns across files

Bash execution is the most powerful and the most dangerous tool in this set. Because it collapses everything into a single interface, an agent can run pytest, pipe output through jq, check git status, and install a dependency all through one tool. This flexibility is also why prompt injection is a real concern: if an agent reads a file that contains instructions disguised as user input, a naive implementation will follow them.

More capable agents add LSP integration, which gives them semantic information beyond text search: go-to-definition, find-all-references, type information. Cursor has invested heavily in this layer, and it matters for large codebases where a grep for a function name returns forty false positives from comments and test fixtures.

The Context Window Is the Bottleneck

Every file read, every shell output, every assistant turn accumulates in the context window. For a model with a 200k token window, that sounds generous until you start reading TypeScript declaration files, long test outputs, or the stdout of a verbose build system.

Coding agents deal with this in a few ways:

Truncation with markers. Bash output gets capped at some character limit, with a note that it was truncated. The agent has to decide whether to re-run with a more targeted command or proceed with partial information.

Summarization. When the context approaches its limit, the scaffolding can ask the model to produce a compact summary of what has happened so far, then start a fresh context with that summary prepended. Claude Code calls this “compaction.” It works reasonably well for sequential tasks but loses fine-grained detail that might matter later in the session, such as a specific error message from ten tool calls ago.

Retrieval. Some systems embed file contents and retrieve only relevant chunks based on the current task. This is more complex to implement but scales better to very large codebases. Sourcegraph’s Cody uses a retrieval layer rather than brute-force file inclusion.

The fundamental issue is that context is not memory. A human developer accumulates mental models across days of working in a codebase. An agent starts fresh every session and has to re-derive its understanding of the codebase from scratch, reading files it has read before, re-discovering patterns it identified last week. Persistent memory across sessions, where the agent stores structured notes about the codebase that survive between conversations, is an open problem that most tools handle poorly.

The Failure Modes That Matter

An agent that only works when nothing goes wrong is not useful. The interesting design question is how an agent handles the common failure cases.

Tool execution errors. When bash returns a non-zero exit code, the agent sees the stderr output in the tool result. A well-designed agent reads that output, reasons about what went wrong, and tries a different approach. A poorly designed one retries the same command, gets the same error, and loops until the scaffolding hits a maximum iteration limit.

Ambiguous task scope. If asked to “fix the tests,” an agent has to decide how far to go. Does it fix the test file? Modify the implementation? Add a new dependency? Agents that do not ask clarifying questions before starting tend to make confident, wrong assumptions and do significant work in the wrong direction before the user notices.

Infinite loops. Without a hard iteration cap, an agent can chase its tail indefinitely, each tool call producing output that suggests another tool call. Every production agent needs an explicit limit on the number of turns, distinct from the context window limit, and should surface the reason for stopping to the user rather than silently giving up.

Context poisoning. If an agent reads a malformed or adversarial file early in a session, that content sits in the context for the rest of the session. The model cannot ignore it the way a human might recognize and discount irrelevant information. This is one reason some agents sanitize file contents before inserting them into context.

The Scaffolding Is the Product

The model is a component. Most of the real engineering in building a coding agent happens in the scaffolding: the system prompt, the tool definitions, the loop control logic, the context management strategy, and the error handling.

This is underappreciated in discussions that treat “GPT-4” or “Claude” as the product and the surrounding code as plumbing. Two agents running the same underlying model with different scaffolding will produce dramatically different results on the same task. The system prompt alone, which sets the model’s persona, its available tools, and its rules for when to ask versus when to proceed, shapes behavior more than a model version bump in most cases.

The system prompt for Claude Code is public and worth reading in full. It is explicit about when the agent should stop and ask, what it should and should not modify without permission, and how it should communicate uncertainty. These constraints are load-bearing. Remove them and the agent becomes either timid to the point of uselessness or recklessly autonomous.

Building a coding agent from scratch, even a simple one, is a useful exercise. Wire up the Anthropic or OpenAI API with a bash tool and a file read/write tool, write a system prompt, and run the loop. Within a few hundred lines of code you have something that can make meaningful changes to a codebase. What you will also have is a clear view of every edge case that the polished products have spent months smoothing over.
