Engineering Resilient Agent Systems: What Distributed Systems Got Right First
Source: simonwillison
Simon Willison’s guide to agentic engineering defines the discipline precisely: an agentic system is one where a language model drives a loop, calling tools and observing results until a goal is met. The framing that has stuck with me, though, is not the loop itself. It is the phrase “unreliable external dependency,” which several posts on the subject have used to describe the LLM at the core of any agent system. That phrase is where distributed systems thinking enters the picture, and it reframes nearly every hard problem in agentic engineering as a problem engineers have already solved in a different context.
Building microservices taught an entire generation of engineers how to design around unreliable components: remote calls fail, latency spikes, partial responses corrupt state, cascading failures take down whole systems. The solutions that emerged (circuit breakers, idempotent operations, explicit error budgets, structured tracing) are textbook now. What is less appreciated is how precisely these solutions transfer to the problems of building reliable agent systems.
The Circuit Breaker Pattern for Agent Loops
A circuit breaker in distributed systems monitors calls to a downstream service and, once failures exceed a threshold, stops making those calls for a period rather than continuing to hammer a degraded dependency. Martin Fowler's canonical description of the pattern has three states: closed (operating normally), open (failing fast), and half-open (probing for recovery).
Agent loops have an analogous failure mode: a model that has lost track of its task, or that is retrying a broken tool call, will continue spending tokens and API budget indefinitely without converging. The failure is not a hard crash but a degraded loop that produces no useful work. A simple circuit breaker applied to the agent loop protects against this:
```python
class AgentCircuitOpenError(Exception):
    """Raised when repeated tool failures trip the breaker."""

class AgentTurnBudgetExceeded(Exception):
    """Raised when the loop runs past its turn budget."""

class AgentCircuitBreaker:
    def __init__(self, max_tool_failures: int = 3, max_turns: int = 25):
        self.tool_failures = 0
        self.turn_count = 0
        self.max_tool_failures = max_tool_failures
        self.max_turns = max_turns

    def record_tool_failure(self, tool_name: str, error: str):
        self.tool_failures += 1
        if self.tool_failures >= self.max_tool_failures:
            raise AgentCircuitOpenError(
                f"Circuit opened after {self.tool_failures} tool failures. "
                f"Last failure: {tool_name}: {error}"
            )

    def record_turn(self):
        self.turn_count += 1
        if self.turn_count >= self.max_turns:
            raise AgentTurnBudgetExceeded(
                f"Agent exceeded {self.max_turns} turns without completing the task"
            )
```
This is deliberately simple. The thresholds should be configurable per task type: a research agent that reads many documents needs a higher turn budget than a code review agent with three well-scoped tools. The point is not the specific numbers but the principle of treating runaway model behavior as a circuit condition rather than waiting for a token or cost limit to terminate the run externally.
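To make the per-task-type configurability concrete, here is a sketch of the breaker wired into a minimal loop. Everything in it is illustrative rather than taken from the source: `run_turn` stands in for a function that executes one model turn plus its tool calls, the `TASK_BUDGETS` table is a hypothetical configuration, and the breaker class is a condensed version of the one above.

```python
class AgentCircuitOpenError(Exception):
    pass

class AgentTurnBudgetExceeded(Exception):
    pass

class AgentCircuitBreaker:
    # Condensed version of the breaker sketched above.
    def __init__(self, max_tool_failures: int = 3, max_turns: int = 25):
        self.tool_failures = 0
        self.turn_count = 0
        self.max_tool_failures = max_tool_failures
        self.max_turns = max_turns

    def record_tool_failure(self, tool_name: str, error: str):
        self.tool_failures += 1
        if self.tool_failures >= self.max_tool_failures:
            raise AgentCircuitOpenError(f"{tool_name}: {error}")

    def record_turn(self):
        self.turn_count += 1
        if self.turn_count >= self.max_turns:
            raise AgentTurnBudgetExceeded(f"exceeded {self.max_turns} turns")

# Hypothetical per-task budgets: a research agent gets a larger turn budget
# than a tightly scoped code review agent.
TASK_BUDGETS = {
    "research": {"max_tool_failures": 5, "max_turns": 60},
    "code_review": {"max_tool_failures": 3, "max_turns": 15},
}

def run_agent(task: str, run_turn, breaker: AgentCircuitBreaker) -> dict:
    """Drive the loop until the model signals completion or a circuit trips."""
    state = {"task": task, "done": False}
    while not state["done"]:
        breaker.record_turn()           # budget check before each turn
        result = run_turn(state)        # one model turn + its tool calls
        for failure in result.get("tool_failures", []):
            breaker.record_tool_failure(failure["tool"], failure["error"])
        state["done"] = result.get("done", False)
    return state
```

The breaker lives outside the model loop on purpose: the termination decision is made by deterministic code, not by the model's own judgment about whether it is done.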
The AutoGPT failure modes in 2023 were largely the absence of this kind of guard. Agents would enter planning spirals, repeatedly decomposing tasks they could have executed directly, consuming context and API credits until external intervention. The architecture assumed the model would self-terminate at the right point. The circuit breaker pattern does not make that assumption.
Idempotency in Tool Design
In distributed systems, idempotent operations can be safely retried without side effects: a PUT to update a resource is idempotent; a POST to create a new record is not. Designing APIs for idempotency is standard practice because retries are inevitable in unreliable networks.
The same principle applies to agent tools, but with an asymmetry the agent has no awareness of. When a tool call fails, the model does not know whether the operation completed before the failure or not. If write_file times out after writing half a file, the agent sees an error and may retry, resulting in a corrupt file or a duplicate. If send_notification fails after the notification was sent, a retry sends two notifications.
The engineering response is to design tools with explicit idempotency where the stakes are high:
```python
def write_file(path: str, content: str, idempotency_key: str | None = None) -> dict:
    """
    Write content to a file. If idempotency_key is provided and a write
    with this key has already completed, returns the previous result
    without re-writing. Use for operations the agent may retry.
    """
    if idempotency_key:
        if previous := _check_idempotency_store(idempotency_key):
            return {"status": "already_completed", "result": previous}
    result = {"status": "written", "path": path, "bytes": len(content)}
    _do_write(path, content)
    if idempotency_key:
        _record_idempotency(idempotency_key, result)
    return result
```
This is more verbose than most agent tool implementations, but it reflects the reality that the model retries on ambiguous failures more often than most developers expect. Anthropic’s tool use documentation notes that models will often retry a tool with modified input after an error, which means your tool may receive multiple calls for what the model believes is one logical operation. Designing for this is cheaper than debugging the consequences of not designing for it.
The minimal footprint principle that appears throughout Willison’s guide and Anthropic’s multi-agent documentation is, at its core, an idempotency argument: prefer reversible operations over irreversible ones, because reversible operations can be safely retried, inspected, and undone if the model made the wrong call.
Error Budget Thinking Applied to Agent Reliability
Site reliability engineering introduced the error budget concept: a service with a 99.9% uptime SLO has a budget of 43 minutes of downtime per month. When the budget is exhausted, new deployments stop until reliability improves. This framing shifts the question from “is this reliable enough?” to “how much unreliability are we accepting and where?”
Agent systems benefit from the same framing, applied differently. An agent that accomplishes its task correctly on 75% of runs might be entirely acceptable for a low-stakes internal tool and completely unacceptable for a customer-facing workflow. The engineering question is not “how do we make the agent perfect?” but “what is the reliability budget for this task, and what is our strategy when the budget is exceeded?”
This has practical implications for how you handle agent failures in production. Rather than logging a failure and moving on, a budget-aware design tracks the failure rate against a defined acceptable threshold and escalates when the threshold is crossed. For Discord bot development, this might mean routing to a simpler fallback response when the agent-based handler has been failing more than expected over the last hour, rather than continuing to attempt a full agentic workflow that is clearly degraded.
Distributed Tracing as the Observability Model
The observability model for distributed systems is the trace: a structured record of all calls, their timing, their inputs and outputs, and their causal relationships. Tools like Jaeger, Zipkin, and the OpenTelemetry standard exist because logs alone are insufficient to diagnose failures in systems where a single user request fans out across many services. You need to reconstruct causality.
Agent runs require the same model. A flat log of model API calls and tool invocations tells you what happened but not why, and correlating a bad final output with the specific tool call that set the agent off course requires reconstructing a causal chain from sequential log entries. Products like LangSmith and Weights & Biases Weave exist specifically because the distributed tracing mental model is the right one for agent observability.
The minimum viable implementation without these tools is a trace object that propagates through the agent loop and each tool call:
```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field

@dataclass
class Span:
    id: str
    name: str
    inputs: dict
    parent_trace: str
    status: str = "pending"
    error: str | None = None
    duration_ms: float = 0.0

@dataclass
class AgentTrace:
    trace_id: str
    spans: list[Span] = field(default_factory=list)

    def span(self, name: str, inputs: dict):
        return span_context(self, name, inputs)

@contextmanager
def span_context(trace: AgentTrace, name: str, inputs: dict):
    span = Span(id=str(uuid.uuid4()), name=name, inputs=inputs,
                parent_trace=trace.trace_id)
    start = time.monotonic()
    try:
        yield span
        span.status = "ok"
    except Exception as e:
        span.status = "error"
        span.error = str(e)
        raise
    finally:
        span.duration_ms = (time.monotonic() - start) * 1000
        trace.spans.append(span)
```
Every model call and every tool invocation should generate a span within the same trace. When a run produces a bad result, you reconstruct the full causal sequence from the trace rather than searching through linear logs. This is the same reason distributed systems engineers instrument at the call site rather than relying on centralized logging.
Explicit Error Types Across Agent Boundaries
In distributed systems, the distinction between transient and permanent errors matters because it determines retry strategy. A 503 Service Unavailable is often retryable; a 400 Bad Request for malformed input is not. Systems that treat all errors identically either retry things that should not be retried or give up on things that would have succeeded on the next attempt.
This same distinction matters in agent tool design, but most tool implementations return unstructured error strings that force the model to infer which category applies. A tool that returns "Error: file not found" versus one that returns {"error_type": "not_found", "recoverable": true, "suggestion": "Use glob_files to find the correct path"} will produce substantially different model behavior on retry, because the structured response gives the model explicit information about what to do next rather than requiring it to guess.
Anthropic's guidance on error handling in tool use notes that returning structured errors in tool results, rather than raising exceptions that collapse the tool call, allows the model to respond intelligently to failures. The distributed systems analogy makes clear why: you are designing the error protocol for a service the model will call autonomously, and the quality of the error information determines the quality of the model's recovery behavior.
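One way to apply this uniformly is a wrapper that converts tool exceptions into structured results. The sketch below is illustrative: the `error_type`/`recoverable`/`suggestion` fields mirror the example above, and the mapping from exception class to category is an assumption you would tune per tool.

```python
import functools

def structured_errors(func):
    """Return errors as data the model can act on, instead of raising."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return {"ok": True, "result": func(*args, **kwargs)}
        except FileNotFoundError as e:
            return {"ok": False, "error_type": "not_found", "recoverable": True,
                    "suggestion": "Use glob_files to find the correct path",
                    "detail": str(e)}
        except PermissionError as e:
            return {"ok": False, "error_type": "permission_denied",
                    "recoverable": False, "detail": str(e)}
        except TimeoutError as e:
            return {"ok": False, "error_type": "timeout", "recoverable": True,
                    "suggestion": "Retry the same call once", "detail": str(e)}
    return wrapper

@structured_errors
def read_file(path: str) -> str:
    with open(path) as f:
        return f.read()
```

The wrapper also enforces the transient/permanent distinction in one place: `recoverable: false` on a permission error tells the model not to burn turns retrying something that will never succeed.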
What This Framing Gets Right
Framing agentic engineering through distributed systems thinking does not resolve every hard problem. The non-determinism of model behavior, the context window as finite working memory, the prompt injection vulnerability surface: these have no direct equivalent in traditional service-oriented architecture. The patterns that make microservices resilient are necessary but not sufficient.
What the framing provides is a starting point that most engineers already have mental models for. An engineer who has implemented circuit breakers, designed idempotent APIs, and thought carefully about error budgets has most of the intuitions needed to build reliable agent systems. The transfer is not perfect, but it is close enough that reaching for distributed systems patterns as defaults will produce better first drafts than approaching agentic systems as an entirely novel engineering problem.
Willison’s characterization of agentic engineering as “the engineering discipline you apply to systems where a probabilistic reasoning process drives real side effects” is correct. Distributed systems engineers have been managing probabilistic failure modes driving real side effects for decades. The vocabulary translates, the tooling transfers, and the instincts carry over more directly than most AI-focused writing acknowledges.