
Agentic Engineering Is Distributed Systems With One New Problem

Source: simonwillison

Most of the engineering challenges in agentic systems already have names. Simon Willison’s guide to agentic engineering defines the discipline cleanly: an agent is a system where an LLM takes actions, observes results, and decides what to do next. That feedback loop introduces a cluster of problems that feel unfamiliar but are largely recombinations of problems distributed systems engineers have been solving for decades. Seeing where the patterns come from makes the one genuinely new challenge clearer.

The Loop Is a Saga

The canonical agentic loop (tool call, observe result, reason about next action, repeat) is structurally a saga. The saga pattern was formalized in the late 1980s as a way to handle long-running distributed transactions without holding locks. A saga is a sequence of local transactions with compensating actions for rollback. A banking transfer breaks into “debit account A, credit account B” rather than one distributed atomic operation, with “re-credit account A” as the compensation if step two fails.

An agentic loop has the same shape. Each tool call is a local operation. The sequence is coordinated not by a transaction manager but by the model’s judgment. Partial failure (the agent completes steps one through five, then fails at step six with the world in an intermediate state) is exactly the problem saga coordinators exist to handle.
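The banking example can be sketched as a compensation log: each completed step registers an undo action, and a failure triggers rollback in reverse order, the way a saga coordinator would. This is a minimal illustration, not any particular framework’s API.

```python
class SagaLog:
    """Stack of compensating actions for completed saga steps."""
    def __init__(self):
        self.compensations = []

    def record(self, description, undo_fn):
        self.compensations.append((description, undo_fn))

    def rollback(self):
        # Undo completed steps in reverse order, newest first.
        while self.compensations:
            _description, undo_fn = self.compensations.pop()
            undo_fn()

ledger = {"A": 100, "B": 50}
saga = SagaLog()

# Step one: debit account A and register its compensation.
ledger["A"] -= 30
saga.record("debit A", lambda: ledger.update(A=ledger["A"] + 30))

# Step two (credit B) fails, so roll back everything completed so far,
# re-crediting account A.
saga.rollback()
```

The same structure applies to agent tools: a tool that creates a branch registers “delete the branch” as its compensation, and the coordinator, whether code or model, runs the stack on failure.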

The engineering implications follow from this parallel. Tools should be idempotent where possible. Destructive actions need compensation mechanisms, or human confirmation before they execute. Checkpointing intermediate state means a failed agent can resume without replaying from scratch:

import anthropic

client = anthropic.Anthropic()
# Assumes `tools`, `dispatch_tool`, and the checkpoint helpers
# (load_checkpoint, save_checkpoint, clear_checkpoint) are defined elsewhere.

def run_agent_with_checkpoint(task, checkpoint_path):
    # Resume from a saved checkpoint if one exists; otherwise start fresh.
    state = load_checkpoint(checkpoint_path) or {
        "messages": [{"role": "user", "content": task}],
        "completed_steps": []
    }

    while True:
        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=4096,
            tools=tools,
            messages=state["messages"]
        )

        if response.stop_reason == "end_turn":
            # Task finished: discard the checkpoint and return the final text.
            clear_checkpoint(checkpoint_path)
            return response.content[-1].text

        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = dispatch_tool(block.name, block.input)
                state["completed_steps"].append({
                    "tool": block.name,
                    "input": block.input,
                    "result": result
                })
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result
                })

        state["messages"].append({"role": "assistant", "content": response.content})
        state["messages"].append({"role": "user", "content": tool_results})
        # Persist after every turn so a crash resumes here, not from scratch.
        save_checkpoint(checkpoint_path, state)

This checkpoint/resume pattern appears in Anthropic’s guidance on building effective agents and mirrors what Temporal, AWS Step Functions, and similar workflow engines do for distributed sagas. The vocabulary transfers directly.

Context Management Is Log Compaction

Every turn through the agentic loop appends data to the model’s context window: the original task, each tool call, each result, each reasoning trace. The context window is an append-only event log, and like any append-only log, it has a growth problem.

The strategies for managing it come from the same place. Summarization of completed steps is log compaction: you compress older entries into a summary record and discard the originals. Storing retrieved documents in an external vector store rather than inline is externalizing state, the same pattern Kafka Connect uses when it offloads large payloads to object storage rather than encoding them in the log. A sliding window that discards old tool results is a ring buffer with a configurable depth.
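The summarization strategy reduces to a small compaction pass over the message list: keep the original task, collapse older turns into one summary record, and retain only the most recent turns verbatim. This is a sketch; the `summarize` helper (in practice often a cheaper model call) and the thresholds are illustrative.

```python
def compact_context(messages, keep_recent=6, summarize=None):
    """Replace all but the most recent turns with one summary record."""
    if len(messages) <= keep_recent + 1:
        return messages  # nothing worth compacting yet
    # Everything between the original task and the recent window gets compacted.
    head, tail = messages[1:-keep_recent], messages[-keep_recent:]
    summary = summarize(head) if summarize else f"[{len(head)} earlier turns compacted]"
    return [
        messages[0],  # always keep the original task
        {"role": "user", "content": f"Summary of earlier steps: {summary}"},
        *tail,
    ]
```

Run before each model call, this behaves like a compaction job on a log: old segments are replaced by a compacted record, and the recent window acts as the ring buffer of full-fidelity entries.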

Claude’s context window sits at 200,000 tokens, which delays rather than eliminates the problem. A complex agent retrieving multi-page documents across a dozen steps will still exhaust it, and even before hitting the hard limit, model performance tends to degrade as context grows. Models attending over increasingly large windows of prior results do not always make better decisions. The management strategies are real requirements, not edge cases you address later.

Prompt Injection Is SSRF

Server-Side Request Forgery works by getting a server to make an outbound request to an attacker-controlled URL, which redirects to an internal service the attacker otherwise cannot reach. The server acts on external input without adequately constraining what that input can cause it to do.

Prompt injection has the same structure. When an agent reads a webpage, processes a document, or retrieves database content, that content is external input. If it contains text formatted to look like instructions, a naive agent may follow those instructions, because the model treats them as part of its operational context rather than as data to be processed.

Willison has documented this vulnerability extensively, and the mitigations follow the same logic as SSRF defenses: treat external content as untrusted regardless of the channel it arrives through, restrict what the agent is permitted to do (particularly around network access and write operations), and log all tool calls for audit. The Model Context Protocol, released by Anthropic in late 2024, attempts to standardize tool interfaces partly as a way to formalize permission boundaries. These are standard practices in web security applied to a new surface.
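The “restrict what the agent is permitted to do” mitigation looks much like a standard SSRF guard on an agent’s fetch tool: allow only known hosts, and refuse anything that resolves to an internal address. The allowlist below is hypothetical; the checks themselves are the classic SSRF defenses.

```python
import ipaddress
import socket
from urllib.parse import urlparse

ALLOWED_HOSTS = {"docs.example.com", "api.example.com"}  # hypothetical allowlist

def check_fetch_target(url):
    """Validate a URL before the agent's fetch tool is allowed to request it."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise PermissionError(f"scheme not allowed: {parsed.scheme}")
    host = parsed.hostname or ""
    if host not in ALLOWED_HOSTS:
        raise PermissionError(f"host not on allowlist: {host}")
    # Resolve and reject private/internal addresses, the classic SSRF pivot.
    for info in socket.getaddrinfo(host, None):
        addr = ipaddress.ip_address(info[4][0])
        if addr.is_private or addr.is_loopback or addr.is_link_local:
            raise PermissionError(f"{host} resolves to internal address: {addr}")
    return url
```

Deny-by-default is the point: the agent cannot be talked into fetching a cloud metadata endpoint or an internal admin panel, no matter what instructions arrive in retrieved content.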

Minimal Footprint Is Least Privilege

The principle Willison calls minimal footprint (agents should request only the permissions they need, prefer reversible actions, and avoid accumulating resources) is the principle of least privilege, which Saltzer and Schroeder articulated in 1975: every process should have access to only what it needs to complete its task.

Applying it to agents looks like this: a code review agent has read access to the repository, not write access. A customer support agent can read tickets and post replies, not delete accounts or access billing. A file analysis agent operates on a sandboxed directory, not the full filesystem. The argument for these constraints is stronger for agents than for traditional software precisely because the agent’s behavior is not fully enumerable by the programmer. An agent restricted to reversible, low-footprint actions is bounded in the damage it can cause when it behaves unexpectedly.
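Enforced in code, this is a deny-by-default allowlist between the model and the tool registry. The role and tool names below are illustrative, not a real product’s permission model.

```python
# Each agent role gets an explicit set of tools; anything else is refused.
AGENT_PERMISSIONS = {
    "code_reviewer": {"read_file", "list_directory", "post_review_comment"},
    "support_agent": {"read_ticket", "post_reply"},
}

def dispatch_scoped(role, tool_name, tool_input, registry):
    """Execute a tool call only if the role's allowlist permits it."""
    allowed = AGENT_PERMISSIONS.get(role, set())
    if tool_name not in allowed:
        # Deny by default: absence from the allowlist means no access.
        raise PermissionError(f"{role} may not call {tool_name}")
    return registry[tool_name](**tool_input)
```

The check lives outside the model, so a prompt-injected instruction to call a forbidden tool fails at dispatch regardless of what the model decides.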

The One Genuinely New Problem

The patterns above are borrowed. The genuinely new engineering challenge in agentic systems is that the decision function is probabilistic and not fully inspectable.

A traditional distributed system has a deterministic orchestrator. You can enumerate the state transitions. You can write tests that cover every branch. You can reason exhaustively about what the system will do. When it fails, the failure has a traceable cause in the code.

An LLM orchestrator makes decisions by predicting token sequences. Given the same input twice, it may produce different tool calls. Given a novel input it was never tested against, it may produce behavior you did not anticipate. The space of inputs it might encounter in production is far larger than any test suite can cover.

This breaks the standard testing methodology. Unit tests that assert specific tool sequences are brittle against non-deterministic decision-making. What replaces them is evaluation against outcomes: you measure whether the agent produced a correct result, not whether it followed a specific path. This requires building evaluation infrastructure, sometimes using another LLM as a judge, and statistical monitoring in production rather than binary error rates. A 2% degradation in task completion quality looks nothing like a null pointer exception.
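An outcome-based evaluation harness can be as simple as running the agent over a task set and grading each result with a checker, reporting a rate rather than a pass/fail on tool sequences. The checkers here are plain predicates; in practice they are often another LLM acting as judge. `run_agent` is a placeholder for whatever entry point the agent exposes.

```python
def evaluate(run_agent, tasks):
    """tasks: list of (prompt, checker) pairs, where checker(result) -> bool.

    Returns the fraction of tasks whose outcome the checker accepted,
    regardless of which tool-call path the agent took to get there.
    """
    passed = 0
    for prompt, checker in tasks:
        result = run_agent(prompt)
        if checker(result):
            passed += 1
    return passed / len(tasks)
```

Tracked over time, this number is what surfaces a 2% quality regression that no exception-based alerting would ever catch.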

Observability infrastructure for agents is still catching up. Platforms like LangSmith, Arize, and Weights and Biases have added agent tracing, but the tooling is not as mature as what is available for traditional distributed services. Structured traces that capture every tool call, its inputs and outputs, the model’s reasoning trace, and cost attribution per run are necessary for debugging unexpected agent behavior. Without them, investigating a production failure means reconstructing events from incomplete logs, which is the situation distributed systems engineers spent years trying to escape.
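A minimal version of such a structured trace is a JSON record per tool call, capturing inputs, outputs, timing, and a run identifier for later correlation. The field names are illustrative, not any particular platform’s schema.

```python
import json
import time
import uuid

def trace_tool_call(log_file, run_id, tool_name, tool_input, fn):
    """Execute fn(tool_input) and append a structured trace record."""
    start = time.monotonic()
    result = fn(tool_input)
    record = {
        "run_id": run_id,                 # correlates all spans of one agent run
        "span_id": str(uuid.uuid4()),
        "tool": tool_name,
        "input": tool_input,
        "output": result,
        "duration_s": round(time.monotonic() - start, 4),
    }
    log_file.write(json.dumps(record) + "\n")  # one JSON object per line
    return result
```

Wrapped around every dispatch, this yields exactly the replayable record the section describes: when an agent behaves unexpectedly, the investigation starts from the trace, not from guesswork.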

The Shape of the Discipline

Agentic engineering is not a departure from existing software engineering practice. It is an application of distributed systems design, API design, and security engineering to systems where the orchestrator happens to be a language model. Most of the patterns have prior art. Most of the failure modes have names.

The discipline is specific because the combination is specific. Systems with non-deterministic, probabilistic decision functions at their center require different verification strategies, different observability approaches, and different assumptions about what testing can guarantee. Knowing the borrowed patterns lets you move faster on the parts that are already solved. Understanding the genuinely new challenge prevents mistaking what is solved for what is not.
