
The Loop at the Center: How Coding Agents Actually Work

Source: simonwillison

Simon Willison published a thorough guide on agentic engineering patterns that covers how coding agents work from the inside. It’s worth reading. But there’s a specific lens I keep coming back to when thinking about these systems: everything interesting about a coding agent is a consequence of one primitive, the tool-call loop, and understanding that loop clearly changes how you read agent behavior, debug agent failures, and design agent-adjacent tooling.

Let’s go through the mechanics.

The Loop Is the Agent

At its core, every coding agent is a while loop wrapped around an LLM API call. The structure is:

  1. Build a prompt with the current state (system prompt, conversation history, tool results so far)
  2. Send it to the model
  3. The model responds with either a final answer or a tool call
  4. If it’s a tool call, execute the tool, append the result to the conversation, and go back to step 1
  5. If it’s a final answer, stop and surface it to the user

This is sometimes called the ReAct pattern (Reasoning + Acting), described in a 2022 paper from Google. The model alternates between “thought” (chain-of-thought reasoning in its response) and “action” (a structured tool invocation). The tool result becomes an “observation” that feeds back into the next iteration.

In pseudocode:

def agent_loop(task: str, tools: list[Tool]) -> str:
    messages = [{"role": "user", "content": task}]

    while True:
        response = llm.complete(messages=messages, tools=tools)

        # Final answer: no tool requested, surface it and stop
        if response.stop_reason == "end_turn":
            return response.content

        # Model requested tool use: execute it and feed the result back
        tool_result = execute_tool(
            name=response.tool_name,
            input=response.tool_input,
        )

        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_result})

Every coding agent you’ve heard of (Claude Code, Aider, Cursor’s agent mode, GitHub Copilot Workspace, Devin) is this loop with different tools plugged in and different scaffolding around it.

What the Tools Actually Are

The tools a coding agent gets are what determine its effective capabilities. Common ones:

  • bash or execute_command: run shell commands in a sandboxed environment
  • read_file / write_file: file I/O with explicit paths
  • search or grep: content search across the codebase
  • glob / list_directory: filesystem navigation
  • web_fetch: pull in external documentation or specs

The tool descriptions matter as much as the implementations. The model reads these descriptions to decide which tool to call and how to call it. A vague description leads to misuse; an overly restrictive description leads to the model inventing workarounds through bash when a safer tool exists. Simon’s guide frames this well: the tool description is effectively an API contract between your scaffolding and the model.
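To make the "API contract" framing concrete, here is a sketch of a tool definition in the JSON-schema shape that Anthropic's tool-use API accepts. The name, description text, and schema are invented for the example:

```python
# A sketch of a tool definition. The model never sees the implementation,
# only this description, so the description is the contract.
read_file_tool = {
    "name": "read_file",
    "description": (
        "Read a UTF-8 text file and return its full contents. "
        "Use this before editing a file. Returns a structured error "
        "if the path does not exist or points to a directory."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "path": {
                "type": "string",
                "description": "File path relative to the project root",
            }
        },
        "required": ["path"],
    },
}
```

Note how the description tells the model when to use the tool and what failure looks like, not just what it does; that is the part that steers tool selection.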

There’s also the question of what the tool returns. A bash execution that returns a 40,000-character stack trace is not helpful. Good scaffolding truncates outputs, normalizes error messages, and structures results so the model can reason about them without blowing the context window.
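A minimal sketch of that truncation step, assuming a flat character budget (the limit is invented; real scaffolding would tune it per model and per tool):

```python
MAX_TOOL_OUTPUT = 4_000  # characters; an invented budget, tune per model

def clamp_tool_output(output: str, limit: int = MAX_TOOL_OUTPUT) -> str:
    """Keep the head and tail of an oversized tool result, eliding the middle.

    Stack traces put the entry point at the top and the failing frame at
    the bottom, so preserving both ends keeps the most useful signal.
    """
    if len(output) <= limit:
        return output
    half = limit // 2
    head, tail = output[:half], output[-half:]
    return f"{head}\n... [{len(output) - 2 * half} characters elided] ...\n{tail}"
```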

The Context Window Is the Process State

The context window is not just a memory limit. It’s the entire state of the agent’s execution. Everything the agent “knows” about a task is the accumulated text of the conversation so far: the original instruction, every tool call it made, every tool result it received, and every piece of reasoning it generated.

This has a few concrete implications.

First, agent loops are token-expensive. A reasonably complex task might involve 10-15 tool calls. If each turn adds 2,000 tokens of history, you burn through 20-30k tokens on the accumulated context before even counting the actual file contents you’re reading. On a model with a 200k token window (like Claude 3.5 Sonnet), this is comfortable. On older models with 8k or 16k limits, it was a hard constraint that forced different architectural tradeoffs.

Second, what you put in the system prompt matters a lot. Most coding agents front-load substantial context: tool descriptions, project structure, coding conventions, safety constraints. Anthropic’s published usage policies and the Claude Code documentation give glimpses of how much scaffolding goes into shaping model behavior before the first user message.

Third, context management is where agents diverge in quality. Naively appending every tool result to the conversation causes two problems: it consumes tokens fast, and it buries earlier reasoning under irrelevant noise. Sophisticated agents summarize completed subtasks, drop stale tool results, and use techniques like prompt caching (supported by Anthropic’s API via cache_control) to reduce the cost of repeated system prompt overhead across turns.
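The "drop stale tool results" part can be sketched in a few lines, assuming a hypothetical message shape where tool-result messages carry a `"kind": "tool_result"` marker (real agents usually summarize completed subtasks rather than blanking them, but the token arithmetic is the same):

```python
def compact_history(messages: list[dict], keep_recent: int = 4) -> list[dict]:
    """Replace all but the most recent tool results with short placeholders.

    Assumes a hypothetical message shape where tool-result messages carry
    a "kind": "tool_result" marker.
    """
    result_idxs = [
        i for i, m in enumerate(messages) if m.get("kind") == "tool_result"
    ]
    # result_idxs[:-keep_recent] is empty for short histories, so they
    # pass through unchanged.
    stale = set(result_idxs[:-keep_recent]) if keep_recent else set(result_idxs)
    return [
        {**m, "content": "[stale tool result dropped]"} if i in stale else m
        for i, m in enumerate(messages)
    ]
```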

# Example of using prompt caching for a stable system prompt
messages_with_cache = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": LARGE_SYSTEM_CONTEXT,  # project files, conventions, etc.
                "cache_control": {"type": "ephemeral"}
            },
            {
                "type": "text",
                "text": user_task
            }
        ]
    }
]

With prompt caching, the cached portion of the context is stored server-side for up to 5 minutes (Anthropic) and costs 0.1x the normal input token price on cache hits. For agents that run many short tasks against the same codebase context, this can cut costs by 60-70%.

Scaffolding and the Illusion of Autonomy

When a coding agent “decides” to read a file before editing it, that decision is partly the model’s reasoning and partly the scaffolding constraining which actions are available. Agents that only expose read_file, write_file, and bash will explore the codebase differently than agents that also expose a structured search_symbols tool backed by a real language server.

Aider, the open-source Python agent, makes an interesting architectural choice here: it sends the model a repository map (a compact representation of all symbols and their file locations) at the start of each session. This pre-loading reduces the need for exploratory tool calls and lets the model plan a full edit before starting execution. The tradeoff is upfront token cost; the benefit is fewer turns, faster edits, and less risk of the model losing the thread across many iterations.
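To make the idea concrete, here is a minimal repository-map sketch for Python files built on the standard `ast` module. Aider's real implementation uses tree-sitter across many languages and ranks symbols by relevance; this only shows the shape of the output:

```python
import ast
from pathlib import Path

def repo_map(root: str) -> str:
    """Build a compact symbol map: one line per file listing its symbols.

    A Python-only sketch of the repository-map idea, not Aider's actual
    algorithm.
    """
    lines = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files the parser can't handle
        symbols = [
            node.name
            for node in ast.walk(tree)
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
        ]
        if symbols:
            lines.append(f"{path.relative_to(root)}: {', '.join(symbols)}")
    return "\n".join(lines)
```

Sending this map once up front is the "upfront token cost" in the tradeoff above: the model can plan edits against file and symbol names without a round of exploratory `read_file` calls.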

Cursor takes a different approach with its IDE integration. It maintains an in-process index of the codebase and injects relevant context dynamically into each prompt based on what the model is working on, without requiring explicit tool calls for most navigation. The agent loop is still there, but the scaffolding does more pre-work so the model sees higher-signal context.

Where Agents Fail, and Why the Loop Explains It

Most coding agent failures trace back to one of three loop-level problems.

The first is context corruption. The model makes an early incorrect assumption, which gets baked into the conversation history, and every subsequent tool call builds on that false premise. Because the model’s “memory” is just the text of the conversation, it can’t easily discard a flawed assumption the way a human would step back and re-evaluate. Good scaffolding adds explicit re-evaluation checkpoints or gives the model a mechanism to flag uncertainty and restart a subtask.

The second is tool call avalanche. Without explicit limits, a model can chain together many small tool calls to gather information rather than committing to a plan. I’ve watched agents run 40+ file reads across a medium-sized codebase before making a single edit. Some agents cap the number of tool calls per task; others use a two-phase structure (explore, then execute) enforced at the scaffolding level.
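A per-task tool-call cap is straightforward to sketch; the default limit and the recovery message here are invented for illustration:

```python
class ToolBudget:
    """Cap the number of tool calls per task.

    When the budget runs out, the scaffolding stops executing tools and
    tells the model to commit to a plan with what it already knows.
    """

    def __init__(self, limit: int = 25):  # invented default
        self.limit = limit
        self.used = 0

    def charge(self) -> bool:
        """Consume one call from the budget; False means the cap is hit."""
        if self.used >= self.limit:
            return False
        self.used += 1
        return True

# A hypothetical message the scaffolding injects instead of a tool result
# once the budget is exhausted:
EXHAUSTED_MSG = (
    "Tool budget exhausted. Stop exploring and proceed with the "
    "information gathered so far."
)
```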

The third is hallucinated tool inputs. The model generates a tool call with plausible-looking but incorrect inputs, for example a file path that doesn’t exist, or a bash command that is syntactically valid but semantically wrong. Good tools return structured errors that the model can reason about. Bad tools return nothing or crash, which either silently poisons the context or terminates the loop early.
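Here is a sketch of what "structured errors" means in practice, using a hypothetical `{"ok": ..., "error": ...}` result shape rather than any real agent's protocol:

```python
import os

def run_read_file(path: str) -> dict:
    """Execute a read and return a structured result the model can act on.

    The {"ok": ..., "error"/"content": ...} shape is hypothetical, not
    any particular agent's protocol.
    """
    if not os.path.isfile(path):
        # A recoverable, descriptive error beats a crash or an empty
        # string: the model can act on the suggestion in its next turn.
        return {
            "ok": False,
            "error": f"No such file: {path}. Check the path with a "
                     "directory listing or search tool before retrying.",
        }
    with open(path, encoding="utf-8", errors="replace") as f:
        return {"ok": True, "content": f.read()}
```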

Building Mental Models

If you’re building tooling around coding agents, or just trying to understand why a particular agent did something surprising, the loop framing is the right mental model. Ask: what was in the context at that point? What tools were available? What would the model have reasonably inferred from the conversation history so far?

Agent behavior is not mysterious. It’s a function of the prompt (plus sampling noise), and the prompt is just the accumulated state of the loop. Work backwards from there and most behavior becomes legible, including the failures.

The Anthropic documentation on tool use and Simon Willison’s guide on agentic patterns are both good reference points if you want to go further. But the loop is where to start.
