What Event-Driven Engineering Already Knows About Agent Reliability
Source: simonwillison
The tool loop at the center of every LLM agent is an event loop. When an agent calls a tool, it emits an event; when the tool result returns, the model handles it; context updates, side effects occur, another event may be emitted. The cycle continues until no more events are pending.
Simon Willison’s guide on agentic engineering defines the field as what emerges when you add a feedback loop to an LLM call. That definition is precise and useful. The loop itself, though, is not novel technology. The engineering implications of building reliable systems around event-processing loops that consume external inputs and produce side effects have been studied carefully in message-driven and event-driven architecture for decades. The framework community is now rediscovering, from first principles, most of what makes agents hard to build reliably.
The Loop as Event Consumer
The minimal agent loop in Python:
while True:
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=4096,
        tools=tools,
        messages=messages
    )
    if response.stop_reason == "end_turn":
        break
    tool_results = execute_tool_calls(response.content)
    messages.append({"role": "assistant", "content": response.content})
    messages.append({"role": "user", "content": tool_results})
Compare this to a standard event consumer loop, whether for a Kafka consumer or a Discord bot gateway handler: receive message, dispatch to handler, handle side effects, acknowledge receipt, wait for next message. The mechanics are the same. An input arrives, processing occurs, the consumer advances to the next input. The handler here is a language model rather than a function, but the loop structure is identical.
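The structural parallel can be made concrete. The sketch below is a generic in-memory consumer loop, not any particular broker's API; the queue, handler table, and message shape are illustrative:

```python
from collections import deque

# A generic event-consumer loop, for comparison with the agent loop above:
# receive, dispatch, process with side effects, acknowledge, repeat.
def run_consumer(queue, handlers):
    processed = []
    while queue:
        message = queue.popleft()                 # receive the next input
        handler = handlers[message["type"]]       # dispatch to a handler
        result = handler(message["payload"])      # process, with side effects
        processed.append(message["type"])         # "acknowledge" receipt
        if result is not None:
            queue.append(result)                  # handler may emit a new event
    return processed

queue = deque([{"type": "order", "payload": 2}])
handlers = {
    "order": lambda n: {"type": "invoice", "payload": n * 10},
    "invoice": lambda n: None,
}
print(run_consumer(queue, handlers))  # → ['order', 'invoice']
```

Swap the handler lookup for a model call that decides what to do next, and this becomes the agent loop above.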
This similarity is not just a useful analogy. It means that years of accumulated engineering practice around event-driven reliability applies directly, and frameworks that ignore that history are going to relearn its lessons through production failures.
The Context Window Is an Append-Only Event Log
In event-driven architecture, event sourcing is the pattern of storing system state as an ordered log of events rather than as mutable current state. To find current state, you replay the log. The pattern carries specific engineering implications: the log grows indefinitely, it needs periodic compaction, replaying it is more expensive than reading current state, but it gives you a complete audit trail and the ability to reconstruct state at any point.
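The pattern fits in a few lines. This is a toy sketch; the event names and fold function are illustrative, not any real event store's API:

```python
# Event sourcing in miniature: state is never stored directly, only
# derived by replaying an append-only log from the beginning.
def replay(log, apply_event, initial):
    state = initial
    for event in log:                      # full replay: costlier than a
        state = apply_event(state, event)  # snapshot, but reconstructible
    return state                           # at any point in history

log = [("deposit", 100), ("withdraw", 30), ("deposit", 5)]
apply_event = lambda bal, e: bal + e[1] if e[0] == "deposit" else bal - e[1]
print(replay(log, apply_event, 0))  # → 75
```

Truncating the log to `log[:2]` and replaying reconstructs the state as it stood mid-history, which is exactly the audit property the pattern buys.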
The context window in an agent run is an event-sourced log. Every tool call, every tool result, every model response is appended to the conversation history. The model’s current understanding is derived by processing the full log at each turn. This is expensive: at 200k tokens, the model processes roughly 150,000 words of history on every forward pass. But it gives you exactly what event sourcing gives you: full auditability and the ability to understand precisely what the model had available when it made each decision.
Context compaction is log compaction. When Claude Code’s context approaches roughly 85% capacity, a secondary model call summarizes the accumulated history into a dense representation, and the middle portion of the log is replaced with that summary. Kafka’s log compaction works by a different mechanism, leaving the newest segment untouched and rewriting older segments to retain only the latest value per key. The agent version is lossier because there are no discrete keys; the model has to summarize by judgment rather than deduplication. But the trade-off is the same: compaction reduces storage and processing cost at the expense of precision.
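A minimal sketch of the agent side of this, assuming `summarize` is a secondary model call; the 85% threshold mirrors the behavior described above, and `count_tokens`, the message shapes, and the head/tail sizes are illustrative assumptions:

```python
# Replace the middle of an over-budget message log with a summary,
# keeping the earliest and most recent turns verbatim.
def compact(messages, count_tokens, summarize, limit,
            threshold=0.85, keep_head=2, keep_tail=4):
    total = sum(count_tokens(m) for m in messages)
    if total < limit * threshold:
        return messages                            # under budget: no-op
    head = messages[:keep_head]                    # earliest turns, verbatim
    tail = messages[-keep_tail:]                   # most recent turns, verbatim
    middle = messages[keep_head:-keep_tail]
    # Lossy, judgment-based compaction: no keys to deduplicate on.
    summary = {"role": "user", "content": summarize(middle)}
    return head + [summary] + tail
```

The no-op branch matters: compaction should trigger on a threshold, not on every turn, or the summarization calls themselves dominate cost.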
Liu et al.’s Lost in the Middle paper demonstrated that LLM recall degrades significantly for content in the center of long contexts. This is the event-sourcing equivalent of log bloat: the information is technically present, but the system’s effective ability to use it degrades with log length. Compaction is the engineering response to both problems.
Error Handling: Dead Letters and Infinite Retries
In event-driven systems, the unhandled error pattern has a name: a dead letter queue. When a consumer fails to process a message after N retries, the message routes to a dead letter queue for inspection. Without this pattern, two failure modes emerge: the consumer retries indefinitely, blocking the queue, or the error is silently dropped.
Agents face the same two failure modes without an equivalent mechanism. The infinite retry mode: the model encounters a FileNotFoundError, tries to find the file using a different path, hits the same error, tries yet another approach. Without a retry bound at the orchestration layer, this loop consumes the entire context budget on a single failed operation. The silent drop mode: error messages accumulate in the context, and the model, deep into a long run, stops weighting them appropriately due to the lost-in-the-middle effect.
The ReAct framework from Yao et al. (2022) helped because it makes reasoning explicit before each tool call. When an error occurs, the model writes out its interpretation before deciding on a corrective action, making the error visible in the trace even when recovery fails. But explicit reasoning does not eliminate the need for retry bounds and escalation policies. Those need to exist at the orchestration layer, not inside the model’s context. A tool description that ends with “if this fails, do not retry more than twice and report the error to the user” is encoding dead-letter behavior into the handler specification.
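What that looks like at the orchestration layer can be sketched directly. The `execute` callable, record shapes, and retry count here are illustrative, not any framework's API:

```python
# Dead-letter semantics for tool calls: bound retries outside the model's
# context, and route persistent failures to an inspectable record instead
# of letting the loop consume its context budget on one failed operation.
def call_with_dead_letter(execute, call, dead_letters, max_retries=2):
    last_error = None
    for _ in range(max_retries + 1):
        try:
            return {"ok": True, "result": execute(call)}
        except Exception as exc:       # broad catch, for illustration only
            last_error = str(exc)
    # Retries exhausted: stop, record, and surface the failure explicitly.
    dead_letters.append({"call": call, "error": last_error,
                         "attempts": max_retries + 1})
    return {"ok": False, "error": last_error}
```

The returned `{"ok": False, ...}` result goes back into the context as a terminal fact, which is easier for the model to act on than a fourth identical traceback.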
Idempotency and Partial Completion
Event-driven systems built on at-least-once delivery semantics require idempotent handlers. If a message can be delivered twice, processing it twice must produce the same result as processing it once.
Agentic systems face the same requirement when subagents are involved. When an orchestrator spawns a subagent that writes files and opens a pull request, then fails before returning a clean response, the orchestrator faces the standard distributed systems dilemma: retry risks duplicate actions; no retry risks treating partial execution as complete failure. Without idempotency keys or a completion manifest, there is no safe choice.
The structured completion manifest pattern addresses this:
{
  "files_modified": ["src/auth.py", "tests/test_auth.py"],
  "external_actions": [{"type": "pull_request", "url": "https://...", "number": 42}],
  "tests_passed": true,
  "verification_output": "All 23 tests passed"
}
Before retrying, the orchestrator reads the manifest to determine what has already been done. This is the same pattern Stripe uses for idempotency keys: record intent before action, check the record before retrying. The agent version writes state to a file rather than to a database, but the semantics are identical.
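The orchestrator's side of that check might look like the following sketch, where `run_subagent` and the manifest path are hypothetical stand-ins:

```python
import json
import os

# Idempotent retry: consult the completion manifest before re-running a
# subagent, so a retry after partial execution does not duplicate actions.
def retry_subagent(task, run_subagent, manifest_path="manifest.json"):
    done = {}
    if os.path.exists(manifest_path):
        with open(manifest_path) as f:
            done = json.load(f)        # record of the prior run's progress
    if done.get("tests_passed"):       # prior run actually completed
        return done
    # Otherwise pass the partial progress through, so the subagent can
    # skip side effects already performed (e.g. an already-opened PR).
    return run_subagent(task, already_done=done)
```

Record intent before action, check the record before retrying: the same discipline as an idempotency key, with a file standing in for the database.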
The Probabilistic Handler
What distinguishes an agent loop from a standard event loop is the nature of the handler. In a normal event consumer, the handler is deterministic: given the same input, it produces the same output and side effects. In an agent loop, the handler is a language model: given the same input, it will probably produce similar outputs across runs, but probabilistic sampling means identical inputs do not guarantee identical outputs.
This changes the engineering discipline in one specific way: how you validate correctness. Deterministic systems can be exhaustively tested on specific inputs. Probabilistic systems need evaluation across distributions. The emerging practice borrows techniques such as LLM-as-judge grading and golden-trace comparison: run the agent on representative tasks, measure the fraction of runs producing acceptable outcomes, and calibrate that measurement against human-labeled ground truth. This comes from ML evaluation practice, not software testing, and it is a genuine discontinuity from what engineers building event-driven systems have learned.
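A minimal harness for that style of evaluation, assuming `run_agent` and `is_acceptable` are stand-ins for a real agent invocation and a grader (human-labeled or LLM-as-judge):

```python
import random

# Distributional evaluation: run the agent several times per task and
# report the fraction of acceptable outcomes, rather than asserting a
# single deterministic output.
def pass_rate(run_agent, is_acceptable, task, trials=20, seed=0):
    rng = random.Random(seed)          # seeded for a reproducible harness
    passes = sum(
        1 for _ in range(trials) if is_acceptable(run_agent(task, rng))
    )
    return passes / trials
```

The output is a rate, not a boolean, and regressions show up as the rate moving, which is why these runs belong in CI as tracked metrics rather than hard assertions.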
The rest of the discipline transfers directly. Context management, error handling, idempotency patterns, observability requirements: these are event-driven systems problems, and the event-driven systems community has already worked out the solutions. Agentic frameworks that treat all of this as new territory will arrive at the same answers, just more slowly and after more production incidents.
Willison’s framing of agentic engineering as a distinct discipline is correct. The distinctiveness lies in what is new: the probabilistic handler, the evaluation methodology, the context-as-working-memory constraint. Almost everything else, the backpressure patterns, the idempotency requirements, the observability infrastructure, is applied event-driven engineering. The sooner the agentic tooling ecosystem acknowledges that lineage, the less of it needs to be reinvented.