
Why the Agent Loop Is a Distributed Systems Problem in Disguise

Source: simonwillison

Simon Willison’s guide on agentic engineering opens with a clear boundary: agentic engineering is a distinct engineering discipline, not an extension of prompt crafting. The framing is right, but the guide doesn’t dwell on why the loop creates a category of problem that prompt skill cannot address. That gap is worth exploring, because it connects agentic systems to a body of knowledge engineers already have.

The agent loop is a distributed system in the structural sense, not merely because it might run across multiple machines, but because it has the properties that make distributed systems hard: non-deterministic execution, partial failures that are difficult to observe, state that is implicit rather than explicit, and no guarantee of exactly-once semantics. Recognizing these properties is what separates an engineered agent from one that works most of the time and fails mysteriously the rest.

The Loop Itself

A minimal agent loop in Python using the Anthropic SDK looks like this:

import anthropic

client = anthropic.Anthropic()

tools = [
    {
        "name": "read_file",
        "description": "Read the contents of a file at a given path.",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Absolute file path"}
            },
            "required": ["path"]
        }
    }
]

# `task` is the user's instruction string; `execute_tool` dispatches tool
# calls to real implementations. Both are defined elsewhere.
messages = [{"role": "user", "content": task}]

while True:
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=4096,
        tools=tools,
        messages=messages
    )

    # Stop on anything other than a tool request: "end_turn" means the model
    # is finished; "max_tokens" and "stop_sequence" also end the loop. Breaking
    # only on "end_turn" would append an empty tool-result message when the
    # model is cut off mid-response.
    if response.stop_reason != "tool_use":
        break

    tool_results = []
    for block in response.content:
        if block.type == "tool_use":
            result = execute_tool(block.name, block.input)
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": result
            })

    messages.append({"role": "assistant", "content": response.content})
    messages.append({"role": "user", "content": tool_results})

The loop is about fifteen lines. The complexity lives in everything execute_tool does not show: whether the tool can fail, how it signals failure, whether it has side effects, and whether those side effects are reversible. These questions have no prompting analogs. They are software design questions.

The Context Window Is Mutable State

In ordinary programs, state lives in memory structures you control. You can inspect it, serialize it, and reason about it statically. In an agent loop, the state is the conversation history accumulated across turns in the context window. That history is the model’s working memory for the current task.

This has direct engineering consequences. The context has a fixed maximum size. A task that requires many tool calls will eventually approach the limit. When that happens, most frameworks either truncate older messages or run a summarization step. Both can lose information, and the model continues with an incomplete picture of what it was trying to accomplish. Claude Code’s context compaction approaches this by periodically summarizing prior tool use into a condensed representation, preserving the semantics of past work without carrying every raw tool result forward through the conversation.
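The compaction policy can be sketched as a small, testable function. This is not Claude Code's implementation, just a minimal illustration of the idea: when the history grows past a threshold, replace everything but the most recent turns with a single summary message. The `summarize` step would normally be another model call; here it is injected as a parameter so the policy itself can be tested.

```python
def compact(messages, summarize, keep_last=4, max_messages=20):
    """Return a compacted copy of `messages` once it exceeds `max_messages`.

    Older messages are replaced by one summary message produced by
    `summarize`; the most recent `keep_last` messages are kept verbatim
    so the model retains its immediate working context.
    """
    if len(messages) <= max_messages:
        return messages
    head, tail = messages[:-keep_last], messages[-keep_last:]
    summary = summarize(head)  # condensed representation of past work
    return [{
        "role": "user",
        "content": f"[Summary of earlier work]\n{summary}",
    }] + tail
```

A common refinement is to always carry the original task statement forward verbatim rather than letting it fall into the summarized span, since it is the one message the agent should never lose.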

The context window is also an isolation boundary. When you spawn a subagent via a fresh API call, that agent starts with whatever you pass explicitly. It has no ambient access to the parent’s accumulated state. This mirrors fork semantics: the child inherits only what you hand it. Any shared context must be serialized and passed deliberately.
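The fork semantics can be made concrete with a sketch of how a subagent's initial history is built. `shared_facts` is a hypothetical name for whatever state the parent chooses to hand down; the point is that nothing reaches the child except what is serialized here.

```python
import json

def subagent_messages(subtask, shared_facts):
    """Build the initial message list for a child agent.

    The child inherits nothing from the parent's conversation: only the
    explicitly serialized `shared_facts` and its own `subtask` appear.
    """
    context_block = json.dumps(shared_facts, indent=2)
    return [{
        "role": "user",
        "content": f"Context from parent agent:\n{context_block}\n\nTask: {subtask}",
    }]
```

The child's loop then runs `client.messages.create` against this fresh list, typically with a narrower tool set than the parent holds.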

The implication is that important decisions made mid-task should be recorded explicitly rather than left to survive through context attrition. If the agent decides partway through that one approach is better than another, and that decision will matter in a later step, it should be written somewhere durable or surfaced prominently in the message history rather than buried under intervening turns.
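One way to make decisions durable is to give the agent a tool for recording them. The `record_decision` tool below is hypothetical, not part of any SDK; it simply shows the shape such a tool might take, with an in-memory log standing in for a real persistent store.

```python
# Hypothetical tool definition: lets the agent persist choices that later
# steps depend on, so they survive summarization or truncation.
record_decision_tool = {
    "name": "record_decision",
    "description": "Record a decision that later steps depend on. "
                   "Call this whenever you choose between approaches.",
    "input_schema": {
        "type": "object",
        "properties": {
            "decision": {"type": "string", "description": "What was decided"},
            "reason": {"type": "string", "description": "Why, in one sentence"},
        },
        "required": ["decision", "reason"],
    },
}

decision_log = []  # durable store; a real system might append to a file

def record_decision(decision, reason):
    decision_log.append({"decision": decision, "reason": reason})
    return f"Recorded: {decision}"
```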

Tool Descriptions Are the API Contract

The description field in a tool definition is what the model uses to decide whether and how to call that tool. It is the primary interface between the engineer and the model’s decision process, and it is where most agent-specific logic errors originate.

A vague description creates a class of failure with no equivalent in traditional software. If search_code and read_file have overlapping descriptions, the model will sometimes choose the wrong one. Unlike a type mismatch, this failure will not surface immediately. The wrong tool may return a plausible-looking result that the model uses to continue, producing output that appears correct on the surface and fails at integration.

Parameter schemas carry the same weight. A parameter described as “options for the operation” produces unpredictable behavior because the model has to infer what “options” means from surrounding context. A parameter described as “comma-separated list of file extensions to include, e.g. py,ts,go” leaves no ambiguity.
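Written out as actual schema fragments, the contrast is stark. The first property forces the model to guess; the second leaves nothing to infer:

```python
# A vague parameter: the model must infer what "options" means from context.
vague = {
    "options": {
        "type": "string",
        "description": "options for the operation",
    }
}

# A precise parameter: format, examples, and constraints are all explicit.
precise = {
    "extensions": {
        "type": "string",
        "description": "Comma-separated list of file extensions to include, "
                       "e.g. py,ts,go. No dots, no spaces.",
    }
}
```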

Experienced practitioners treat tool schemas with the same rigor as public API design, because that is what they are. The conventions differ (natural language descriptions rather than type signatures), but the underlying goal is identical: eliminate the need for the client to guess.

Partial Failure and Silent Errors

When a function call fails in a deterministic program, you get an exception with a stack trace. The failure is loud and localized. When a tool call fails inside an agent loop, the model might retry with different parameters, might decide the tool is unavailable and pivot to an alternative, might misread an error message and draw a wrong conclusion that propagates through later steps, or might complete the task by generating plausible-sounding output based on incomplete information, with no signal that anything went wrong.

The last case is the hardest to defend against because the model will always produce text, including text that describes success when success did not happen. A tool that returns an empty string on failure creates this problem: the model may interpret the empty string as “the file exists but is empty” and continue from there. A tool that returns {"error": "file not found", "path": "/tmp/data.csv"} is unambiguous.

Tool return values are part of the engineering contract, not incidental output. Designing what failure looks like at the tool boundary is as important as designing what success looks like. The signal the tool emits on failure determines whether the agent can recover or whether it proceeds down a divergent path.
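This contract can be enforced in one place: the dispatch layer. The sketch below is one way to implement the `execute_tool` used in the loop earlier, under the assumption of a hypothetical `TOOL_IMPLS` registry mapping tool names to callables. Every outcome, including failure, becomes a structured JSON string rather than an empty string or an exception that escapes into the loop.

```python
import json

TOOL_IMPLS = {}  # hypothetical registry: tool name -> callable

def execute_tool(name, tool_input):
    """Run a tool and return a structured JSON string for every outcome.

    Failures are explicit and labeled, so the model cannot mistake a
    missing file for an empty one, or a crash for a completed call.
    """
    impl = TOOL_IMPLS.get(name)
    if impl is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        result = impl(**tool_input)
    except FileNotFoundError as e:
        return json.dumps({"error": "file not found", "path": e.filename})
    except Exception as e:
        return json.dumps({"error": type(e).__name__, "detail": str(e)})
    if result == "":
        # distinguish "ran and produced nothing" from a silent failure
        return json.dumps({"ok": True, "result": "", "note": "empty output"})
    return json.dumps({"ok": True, "result": result})
```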

Prompt Injection Scales with Depth

In a single-turn LLM call, prompt injection risk is bounded. There is one input, one output, one opportunity for injected content to redirect the model’s behavior.

In an agent loop, every tool result is a new injection surface. If an agent reads a web page as part of its task, that page’s content enters the conversation. If the page contains text structured to resemble system instructions, the model may follow them. This has been demonstrated against production coding agents and browser automation systems as reproducible attacks, not just theoretical concerns.

The risk compounds in multi-agent systems. An orchestrator that spawns worker agents passes each worker a subset of the overall task. If a worker reads injected content and is redirected into an unintended action, the orchestrator may have no visibility into what happened. The blast radius depends on what tools the worker holds. A worker with read-only access to a codebase is far less dangerous than one with write access and shell invocation.

The practical mitigations are architectural. Least-privilege tool sets limit what an injected instruction can cause. Explicit human confirmation steps before irreversible actions break the automatic execution chain. Treating tool results as untrusted input rather than trusted context changes how you reason about what the model is allowed to do with them. None of these are complete defenses, but all of them raise the cost of a successful attack.
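The confirmation-step mitigation can be sketched as a thin gate around tool execution. The tool names and the `confirm` hook below are illustrative, not from any framework: the point is that an irreversible action cannot complete without a signal from outside the model's control.

```python
# Tools whose effects cannot be undone; anything here requires confirmation.
IRREVERSIBLE = {"delete_file", "run_shell", "git_push"}

def guarded_execute(name, tool_input, execute, confirm):
    """Gate irreversible tools behind a human-in-the-loop check.

    `execute` performs the actual tool call; `confirm` is a callback
    (e.g. a terminal prompt) returning True to proceed, False to decline.
    """
    if name in IRREVERSIBLE and not confirm(name, tool_input):
        return '{"error": "action declined by user"}'
    return execute(name, tool_input)
```

Read-only tools pass through untouched, so the gate adds friction only where an injected instruction could do lasting damage.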

The Coordination Tax in Multi-Agent Systems

Parallel subagent execution is appealing because tasks that decompose into independent units should complete faster. An orchestrator identifies parallelizable subtasks, assigns each to a worker with a focused context, and assembles the results.

The cost is coordination. Subtasks that appear independent often have implicit dependencies that only surface at runtime. Two workers modifying files in the same directory can produce conflicts. Workers generating freeform prose create a burden on the orchestrator to parse and reconcile their results. The assembly step must understand what each worker produced, which means workers need structured outputs rather than narrative summaries.
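The structured-output requirement can be enforced at the orchestrator's assembly step. The key names below are illustrative; the pattern is what matters: workers return JSON matching an agreed shape, and anything else is rejected rather than interpreted.

```python
import json

# The contract every worker must satisfy; key names are illustrative.
REQUIRED_KEYS = {"subtask_id", "status", "files_changed", "summary"}

def parse_worker_result(raw):
    """Accept only output matching the agreed structure.

    Narrative prose or partial JSON is flagged as invalid, so assembly
    is parsing, not interpretation.
    """
    try:
        result = json.loads(raw)
    except json.JSONDecodeError:
        return {"status": "invalid", "detail": "worker returned non-JSON output"}
    missing = REQUIRED_KEYS - result.keys()
    if missing:
        return {"status": "invalid", "detail": f"missing keys: {sorted(missing)}"}
    return result
```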

The architectures that work well treat subagent invocations like remote procedure calls: defined inputs, defined outputs, no shared mutable state. LangGraph models coordination as an explicit state graph where transitions between agents are typed and defined. Anthropic’s tool use documentation encourages passing explicit context objects rather than assuming ambient state. Both approaches reflect the same observation: the context window boundary that looks like a limitation is also an isolation guarantee, and isolation is what makes parallel execution tractable.

The AutoGen framework from Microsoft Research pushes this further, modeling multi-agent conversations as explicit message-passing protocols between typed agents. The protocol is the coordination, and the protocol can be tested independently of the models that implement it.

The Engineering in Agentic Engineering

The word “engineering” carries specific weight in this context. Engineering implies predictable behavior under specified conditions, design for failure, and the ability to reason about a system before it runs rather than only after.

Agentic engineering borrows from API design, because tool schemas are interfaces. It borrows from distributed systems, because the loop has the failure modes of distributed execution. It borrows from security engineering, because untrusted data enters the decision process at every tool boundary. It does not borrow heavily from prompt craft, which concerns itself primarily with the quality of a single-turn exchange.

The productive engineering questions are: what does failure look like at each tool boundary, what state needs to survive a context summarization, what can an injected instruction cause, and where does the task require a human confirmation before proceeding. These questions have tractable answers. The agent loop is a program, and it can be designed with the same rigor as any other program, provided you treat it as one.
