
The Agentic Loop Up Close: Context, Tools, and the Mechanics of Coding Agents

Source: Simon Willison

Every coding agent in production today is built around the same mechanical loop. The model receives a context window containing a task description and a set of tool definitions, decides which tool to call and with what arguments, receives the result back as a message, and decides what to do next. Simon Willison recently published a detailed guide on how coding agents work that covers this clearly, and it is worth using as a starting point for a deeper look at two things the overview treats briefly: the tool schema as the agent’s real architectural boundary, and context accumulation as the constraint that shapes everything else.

The Loop in Concrete Terms

The agent loop maps directly onto the structure of LLM APIs. In Claude’s tool use API, a conversation is a sequence of user and assistant turns with tool_use and tool_result blocks interleaved. A single iteration looks like this:

// The model emits a tool call
{
  "type": "tool_use",
  "id": "toolu_01ABC",
  "name": "Bash",
  "input": {
    "command": "cargo test auth -- --nocapture 2>&1 | head -50"
  }
}

// The scaffolding appends the result
{
  "type": "tool_result",
  "tool_use_id": "toolu_01ABC",
  "content": "running 3 tests\ntest auth::test_token_expiry ... FAILED\n..."
}

The model sees the test failure in its next turn and decides what to read, change, or run next. The model is not executing code or reading files; it is emitting structured objects that describe what it wants done, and the scaffolding performs the actual work and feeds results back into context. OpenAI formalized this pattern as function calling in 2023. Anthropic shipped tool use for Claude 2.1 later that year. Both use JSON Schema to describe tool parameters. The underlying mechanism is identical; what differs across agents is which tools are in the schema and how they are designed.
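The whole loop fits in a few dozen lines. Below is a minimal sketch in Python with the model stubbed out; `run_model`, `execute_tool`, and the message shapes are illustrative assumptions, not any particular SDK's API, but the control flow is the one every production agent shares:

```python
import subprocess

def run_model(messages, tools):
    """Stub standing in for an LLM API call. A real implementation would
    send `messages` and `tools` to a provider and return its response.
    Here the fake model asks for one command, then declares itself done."""
    if not any(m["role"] == "tool" for m in messages):
        return {"type": "tool_use", "id": "call_1", "name": "Bash",
                "input": {"command": "echo hello"}}
    return {"type": "text", "text": "done"}

def execute_tool(name, args):
    """The scaffolding, not the model, performs the actual work."""
    if name == "Bash":
        out = subprocess.run(args["command"], shell=True,
                             capture_output=True, text=True, timeout=30)
        return out.stdout + out.stderr
    raise ValueError(f"unknown tool: {name}")

def agent_loop(task, tools, max_turns=10):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = run_model(messages, tools)
        if reply["type"] != "tool_use":   # model chose to stop
            return reply["text"], messages
        result = execute_tool(reply["name"], reply["input"])
        messages.append({"role": "assistant", "content": reply})
        # The result re-enters context as a message: this is the only
        # channel through which the model perceives the world.
        messages.append({"role": "tool", "tool_use_id": reply["id"],
                         "content": result})
    return None, messages  # hit the turn cap without finishing

final, history = agent_loop("say hello", tools=[{"name": "Bash"}])
```

Everything the rest of this piece discusses, tool schemas, context accumulation, termination logic, lives inside this loop's handful of lines.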

The Schema Is the Interface

Each tool definition includes a name, a description, and a JSON Schema for its parameters. The model reads these at the start of every call and selects among them based on the current state of its reasoning. The quality of those definitions determines what the agent can perceive, what it can affect, and how precisely it can express intent.
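Concretely, a definition might look like the following. The shape (name, description, JSON Schema under an input schema key) follows the pattern both providers use; the specific Read schema here is illustrative, not copied from any product:

```python
# Hypothetical tool definition; the description is part of the interface,
# since it is the only documentation the model ever reads.
read_tool = {
    "name": "Read",
    "description": ("Read a file from disk. Returns at most `limit` lines "
                    "starting at `offset`, so large files can be fetched "
                    "incrementally instead of all at once."),
    "input_schema": {          # JSON Schema describing the parameters
        "type": "object",
        "properties": {
            "file_path": {"type": "string",
                          "description": "Absolute path to the file"},
            "offset": {"type": "integer", "minimum": 0,
                       "description": "Line number to start reading from"},
            "limit": {"type": "integer", "minimum": 1,
                      "description": "Maximum number of lines to return"},
        },
        "required": ["file_path"],
    },
}
```

Note that the optional `offset` and `limit` parameters, and the description's promise that more can be requested, are policy expressed as schema: they invite the model to read incrementally.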

Claude Code exposes a narrow, precise set: Read (with file_path, limit, and offset parameters), Write, Edit for targeted string replacement, Bash, Glob, Grep, WebFetch, and a few others. The granularity is deliberate. The distinction between Edit and Write is a meaningful architectural choice: a targeted edit keeps most of a file’s content out of the context window; a full rewrite requires reading and returning the entire file. A limit parameter on Read allows the model to fetch partial files and request more if needed. These design choices are not conveniences; they are context budget controls built into the interface.
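What the scaffolding behind a budget-controlled Read tool might do is simple to sketch. This is an assumed implementation, not Claude Code's; the key move is the trailing marker that tells the model how much it did not see, so it can decide whether to ask for more:

```python
def read_file(file_path: str, offset: int = 0, limit: int = 200) -> str:
    """Return at most `limit` lines starting at `offset`. If the file is
    longer, append a marker so the model knows output was elided and how
    to request the next window."""
    with open(file_path) as f:
        lines = f.readlines()
    window = lines[offset:offset + limit]
    text = "".join(window)
    remaining = len(lines) - (offset + len(window))
    if remaining > 0:
        next_offset = offset + len(window)
        text += (f"\n[... {remaining} more lines; "
                 f"call Read again with offset={next_offset}]")
    return text
```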

Aider takes a structurally different approach. Rather than a multi-turn tool loop, it presents the model with selected file contents and a request, then asks the model to output a unified diff. There are no tool calls in the API sense; the diff format is the output schema, and the scaffolding parses and applies it. A response might look like:

--- a/src/auth/middleware.ts
+++ b/src/auth/middleware.ts
@@ -14,3 +14,3 @@ export function authMiddleware(req, res, next) {
-  if (!token || isExpired(token)) {
+  if (!token || isExpired(token, Date.now())) {
     return res.status(401).json({ error: 'Unauthorized' });
   }

Aider trades the flexibility of an open loop for predictability and lower context overhead. A diff is compact where a series of read/edit/read/edit tool calls is verbose, and it is unambiguous about what changes. Aider’s repo map reflects the same philosophy: rather than letting the model discover file structure through tool calls, it builds a ctags-style overview of the entire repository (function signatures and class hierarchies, without file contents) and includes it in context from the start. The model gets an architectural map before any tool is invoked.
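The apply step on the scaffolding side can be sketched by reducing a hunk to a search/replace pair, which is one way such diffs can be applied; this minimal version assumes a single hunk and ignores the line-number header, where real scaffolding (Aider included) is far more defensive about fuzzy matches and offsets:

```python
def apply_hunk(source: str, diff: str) -> str:
    """Apply one unified-diff hunk: context and '-' lines form the text
    to find, context and '+' lines form its replacement."""
    search, replace = [], []
    for line in diff.splitlines():
        if line.startswith(("---", "+++", "@@")):
            continue                     # file and hunk headers
        elif line.startswith("-"):
            search.append(line[1:])      # removed: only in the old text
        elif line.startswith("+"):
            replace.append(line[1:])     # added: only in the new text
        else:
            body = line[1:] if line.startswith(" ") else line
            search.append(body)          # context appears on both sides
            replace.append(body)
    old, new = "\n".join(search), "\n".join(replace)
    if old not in source:
        raise ValueError("hunk does not apply cleanly")
    return source.replace(old, new, 1)
```

The `ValueError` branch is where the interesting engineering lives: when the model emits a hunk whose context does not match the file, the scaffolding must decide whether to retry, fuzz the match, or report the failure back into context.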

Cursor comes from the other direction. Before any model call, it uses embeddings to pull semantically relevant code into context. The model starts with more information but exercises less agency in gathering it; the tradeoff is between upfront relevance and the adaptive exploration the tool loop enables. When the retrieval surface matches what the model needs, this approach is faster and cheaper. When it misses something relevant, the model has no mechanism to discover the gap.

These are not just implementation details. They define the failure modes of each tool. An agent that discovers file structure through tool calls will make mistakes proportional to how well it formulates its search queries. An agent that starts with a precomputed map will miss things the map did not capture. An agent working from embedding retrieval may confidently work from code that was relevant in embedding space but not in call-graph terms. The right choice depends on task type and codebase structure.

Context Is the Real Constraint

Every tool call appends tokens to the context window. The model’s read of a file, the output of a test run, the result of a grep search: all of it accumulates. In a multi-file refactor or a debugging session that spans a codebase of moderate size, a 200k-token context window can be exhausted before the task completes.

Several factors make this worse than the raw numbers suggest. Models tend to re-read files rather than trust their own earlier reads, especially when about to write. Bash output is often verbose, and agents rarely truncate it intelligently before it reaches the model. Long compiler errors, dependency trees, and large grep results expand context faster than the task itself. Claude Code’s Read tool exposes limit and offset precisely to address this, but the model does not always use them conservatively when uncertain about what it will find.

Different scaffolds handle context pressure in different ways:

  • Aggressive truncation: Limit every tool result to a fixed number of lines or tokens, relying on the model to request more if needed. Simple to implement, but information loss is uncontrolled; the most important lines may be the ones cut.
  • Summarization: Replace detailed earlier turns with compressed summaries, sacrificing recall for extension of the session window. This works for tasks with clear phases but degrades on tasks where the model needs specific details from many steps back.
  • Stateless sub-tasks: Divide the task into discrete steps, run each in a fresh context, and pass only the output of each step forward. This is clean and predictable but loses the ability to course-correct based on observations from earlier steps.
  • Structural pre-summarization: Aider’s repo map approach, building a compact structural skeleton of the codebase upfront, reduces the tokens the model would otherwise spend on discovery while preserving the ability to request specific file contents.
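The first strategy can be made slightly less lossy by keeping both ends of the output rather than only the head, since compiler errors and test summaries often land at the tail. A sketch of such a truncation helper (the defaults and marker format are assumptions):

```python
def truncate_tool_result(text: str, max_lines: int = 60) -> str:
    """Keep the head and tail of a long tool result and mark the cut,
    so the model knows output was elided and roughly how much."""
    lines = text.splitlines()
    if len(lines) <= max_lines:
        return text
    head = lines[: max_lines // 2]
    tail = lines[-(max_lines - len(head)):]
    dropped = len(lines) - len(head) - len(tail)
    return "\n".join(head
                     + [f"[... {dropped} lines truncated ...]"]
                     + tail)
```

Even this refinement does not escape the stated problem: the most important lines may still be the ones in the middle, which is why production agents layer truncation with summarization rather than relying on either alone.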

No strategy dominates. Long-running context gives the model continuity across many files and tool calls and the ability to notice patterns that span steps. Truncated or summarized context is cheaper and avoids some failure modes around over-reliance on stale reads. Production agents typically combine strategies based on task phase, not because any single approach is insufficient but because the tradeoffs shift as the task progresses.

The Scaffolding Is Not the Packaging

One of the durable points in Willison’s piece is how much of a coding agent’s competence lives in the scaffolding rather than the model. The model contributes reasoning about what to change and why. The scaffolding contributes file I/O safety, tool output normalization, context budget management, error formatting, and loop termination logic.

This matters when evaluating or building these systems. A capable model paired with poor scaffolding will fail at structurally straightforward tasks. The model will read more context than it needs because the scaffolding does not compress tool results. It will get confused by ambiguous error messages because the scaffolding formats them inconsistently. It will fail to terminate because the scaffolding does not detect that the model has started repeating itself.

Swapping the model while keeping the scaffolding is, in most cases, a smaller change than the reverse. This is part of why Aider improved substantially as underlying models improved: the scaffolding handled the structural work, and each generation of model provided better reasoning over the same loop. The Anthropic agent SDK and similar libraries provide the loop machinery itself, but the tool definitions, context budget logic, and error handling belong to the developer. Getting those right matters as much as model selection.

This also means the choice of tools exposed to the model is an API design problem. Overlapping tool scopes produce ambiguity; the model picks between Edit and Write based on description semantics, and if those descriptions are unclear it will pick inconsistently. Parameters that conflate two concerns produce malformed calls. A tool that accepts either a file path or a glob pattern in the same field forces the model to guess the interpretation. These are the same problems that appear in any public API design, applied to an interface where the consumer is reasoning probabilistically.
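The path-or-glob conflation can be made concrete. Both schemas below are hypothetical; the point is that the second pair disambiguates intent structurally, so the model never has to guess an interpretation:

```python
# Ambiguous: one field, two interpretations. The model must guess whether
# "src/*.ts" is a literal path or a pattern, and will guess inconsistently.
ambiguous = {
    "name": "ReadFiles",
    "input_schema": {
        "type": "object",
        "properties": {"target": {"type": "string",
                                  "description": "A file path or glob"}},
        "required": ["target"],
    },
}

# Precise: one concern per tool, mirroring the Read/Glob split described
# above. The schema itself carries the disambiguation.
read_def = {"name": "Read",
            "input_schema": {"type": "object",
                             "properties": {"file_path": {"type": "string"}},
                             "required": ["file_path"]}}
glob_def = {"name": "Glob",
            "input_schema": {"type": "object",
                             "properties": {"pattern": {"type": "string"}},
                             "required": ["pattern"]}}
```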

What the Loop Tells You When It Fails

Coding agents fail predictably once you understand the loop. They lose track of state when context pressure forces summarization and critical details from earlier steps are compressed away. They make redundant tool calls when prior results were ambiguous or truncated. They drift from a stated plan when a tool result implied a different path but the model did not update its working approach, either because the context window was too full to attend to it or because the result was formatted in a way that obscured its significance.

For a misbehaving agent, the productive debugging question is about context: what did the model see, in what order, and what was absent from the picture. The answer to most agent failures is visible in the conversation history if you read it the way the model does, as an accumulation of observations and actions where each decision was made with only what came before it available.

The context window is program state, and the tool schema is the instruction set. Treating them as first-class design surfaces rather than configuration details is what separates scaffolding that extends a model’s capability from scaffolding that constrains it without the developer noticing.
