The Distributed Systems Problems Hidden Inside Every Agent Loop
Source: simonwillison
Agentic engineering, as Simon Willison defines it in his guide to agentic engineering patterns, is the practice of building systems where a language model drives execution: the model reads context, decides what to do next, calls tools, processes results, and continues until the task is done. If you have built these systems, you have probably noticed that the engineering problems they introduce feel familiar. They are the same problems that show up in distributed systems, dressed differently.
The Loop Introduces State
A single LLM call is essentially a pure function. Given the same input, it produces similar output. Agents are not pure functions; they are processes. The context window accumulates state over the course of a run. Tool results get appended. Earlier instructions recede into the background. By step 15 of a complex task, the model is reasoning over a transcript that has grown substantially from where it started.
This is the same problem as managing mutable state in a concurrent system. The context is shared state, and every tool result is a write to that state. You do not have locks or transactions, but you do have sequencing. If a tool call produces output that contradicts an earlier part of the context, the model has to reconcile the conflict, and it does so probabilistically.
Strategies that work in distributed systems work here too. Keep state representations small and canonical. Summarize when possible rather than accumulating raw outputs. Design tools that return the minimum information needed for the next decision rather than full dumps of whatever they touched. The Anthropic documentation for building agents discusses context management explicitly, and the same considerations apply when using any provider, since the underlying constraint is universal: the context window is finite, and reasoning cost grows with length.
Tool Interfaces Are API Contracts
The tool definitions you write are API contracts between your application and the model. The model reads them at inference time and uses them to decide what to call and how. A vague tool description produces ambiguous calls; a poorly structured return format produces confused reasoning about what the call accomplished.
Consider two ways to expose a file-reading operation to an agent:
{
"name": "read_file",
"description": "Read a file",
"parameters": {
"path": {"type": "string"}
}
}
versus:
{
"name": "read_file",
"description": "Read the contents of a file at the given path. Returns the file contents as a UTF-8 string. Raises an error if the path does not exist or is not readable. Use this when you need to inspect the contents of a specific file.",
"parameters": {
"path": {
"type": "string",
"description": "Absolute or relative path to the file"
}
}
}
The second version tells the model what the tool does, what it returns, what can go wrong, and when to use it. That information is not for your benefit; you already know how the tool works. It is for the model. Investing time in tool descriptions is API design, and the same principles apply: be specific, handle errors gracefully, and make the interface reflect the mental model you want the caller to have.
Both LangChain’s tool documentation patterns and Anthropic’s tool use documentation emphasize description quality as a primary factor in call accuracy. The underlying reason is that the model is doing natural language matching between your description and its understanding of the task at hand, so imprecise language produces imprecise behavior.
Non-Determinism Compounds Over Steps
A model that does the right thing 95% of the time on each step is not a model that produces correct 10-step plans 95% of the time. If the steps were independent, you would have 0.95^10, which is about 60%. In practice, errors propagate. A wrong tool call in step 3 appends incorrect information to the context, and subsequent steps reason from that incorrect premise, so the compound failure rate is often worse than the independent calculation suggests.
This is the same reliability math that applies to distributed systems. Services with high individual reliability produce unreliable chains when composed without fault tolerance. The mitigations are also analogous: retry logic, checkpointing, idempotent operations where possible, and circuit breakers that stop the chain when something has clearly gone wrong.
For agents, checkpointing means saving the state of the context at key decision points so you can resume rather than restart from scratch. Idempotency means designing tools so that calling them twice with the same arguments does not produce worse outcomes than calling them once. Circuit breakers mean building explicit stopping conditions into your agent loop: if the model has called the same tool with the same arguments three times in a row, something has gone wrong and continuing will not fix it.
Prompt Injection Scales with Autonomy
One failure mode that has no direct analog in traditional distributed systems is prompt injection. When an agent fetches a URL, reads a file, or processes user-provided content, that content enters the context alongside your instructions. If the content contains instructions of its own, the model may follow them.
Simon Willison has documented many prompt injection examples in substantial detail over several years. The risk is directly proportional to what the agent can do. An agent that only generates text is a limited target; an agent that can send email, write files, or call external APIs is a much more valuable one to compromise. The mitigations are still an open research problem, but practical steps include treating all externally sourced content as untrusted input, designing tools with minimal blast radius, and keeping human oversight in the loop for irreversible actions.
This maps roughly to the principle of least privilege in traditional systems security. An agent that needs to read files to accomplish a task should not also have write access unless the task specifically requires it. The surface area you give an agent determines the damage radius of a successful injection.
Testing Requires a Different Mental Model
Unit tests assert specific outputs. Agentic systems are non-deterministic, so you cannot write assertions against specific outputs without brittle fragility. What you can test is distributions.
Build evaluation harnesses that run a task many times and measure the fraction of runs that produce acceptable outcomes. Log every tool call, every context append, and every model response. When a run fails, the trace tells you where it diverged from the correct path. This is closer to how you debug distributed systems, with structured tracing and log aggregation, than how you debug a function with a debugger.
The OpenAI evals framework and Anthropic’s guidance on model evaluation both reflect this orientation: you are measuring behavior over distributions, not asserting behavior on individual inputs. The evaluation suite is the test suite, and it runs against a sample of tasks rather than a fixed set of expected outputs.
What This Engineering Work Requires
Building agents is not a matter of writing a clever prompt and wiring up some tools. The loop introduces state management requirements. The tool interface is a real API that needs design investment. Non-determinism means your reliability thinking has to work in probabilities, not certainties. Prompt injection means your security model has to account for untrusted content in the model’s reasoning context.
None of these are insurmountable problems. Distributed systems engineers worked through analogous problems over decades and built frameworks and mental models that made them tractable. Agentic engineering is at an earlier stage, but the inheritance is clear. The instincts that make someone good at distributed systems work, skepticism about state, care about interface boundaries, explicit handling of failure cases, transfer directly to building agents that behave reliably in production.