The dominant conversation about LLM agents focuses on which model to use. That is the wrong frame. Model capability matters at the margins, but what determines whether an agent is reliable is the system surrounding the model: how context is constructed, how tools are defined, how failures are handled, and how long-horizon state is preserved. This is the core claim in Simon Willison’s agentic engineering guide, and it is worth unpacking in full.
What the Agent Loop Actually Is
The foundational pattern is the ReAct loop, described by Yao et al. in 2022. The model receives a task and a set of available tools. It reasons about what to do, emits a tool call, receives the result, updates its reasoning, and decides whether to call another tool or return a final answer. The loop continues until termination.
Written out as code, the structure is straightforward:
```python
def agent_loop(task: str, tools: list[Tool], model) -> str:
    messages = [{"role": "user", "content": task}]
    while True:
        response = model.complete(messages=messages, tools=tools)
        if response.stop_reason == "end_turn":
            return response.content
        # Model requested a tool call
        tool_call = response.tool_call
        result = dispatch(tool_call, tools)
        messages.append({"role": "assistant", "content": response.content,
                         "tool_calls": [tool_call]})
        messages.append({"role": "tool", "tool_call_id": tool_call.id,
                         "content": result})
```
This is the entire structure. The loop runs until the model emits a stop signal. Every agent framework from LangGraph to AutoGen to Anthropic’s tool use API is a variation on this pattern, with added orchestration, parallelism, multi-agent routing, or memory layers on top. The loop itself is not the interesting engineering problem. What surrounds it is.
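The loop leaves `dispatch` undefined. A minimal sketch of it, assuming each tool is a named Python callable (the `Tool` and `ToolCall` shapes here are illustrative, not any particular API's), might look like:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str
    fn: Callable[..., str]

@dataclass
class ToolCall:
    id: str
    name: str
    arguments: dict

def dispatch(tool_call: ToolCall, tools: list[Tool]) -> str:
    """Route a model-issued tool call to the matching tool function."""
    by_name = {t.name: t for t in tools}
    tool = by_name.get(tool_call.name)
    if tool is None:
        # Return the error as the tool result rather than raising:
        # the string goes back into context, where the model can react to it.
        return f"error: unknown tool '{tool_call.name}'"
    try:
        return tool.fn(**tool_call.arguments)
    except Exception as exc:
        return f"error: {type(exc).__name__}: {exc}"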
Why the Scaffolding Matters More Than the Model
A capable model in a poorly designed scaffold will fail in ways that a less capable model in a well-designed scaffold will not. The reasons are structural. The model operates on whatever appears in its context window; it cannot reason about what is absent. Its tool calls are only as good as the tool definitions it receives. Its error recovery is limited by whether the scaffolding surfaces actionable failure information or swallows exceptions silently.
Consider tool design. A model given a filesystem tool with a mode parameter covering read, write, delete, and list will use it inconsistently. The same model given separate read_file, write_file, and list_directory tools with accurate descriptions will use each correctly. The model did not change; the scaffold did. The precision of a tool’s description is one of the highest-leverage variables in agent reliability, and it is entirely under the developer’s control.
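To make the contrast concrete, here is a hypothetical version of both designs in the JSON-schema style that most chat-completion APIs accept; the names and fields are illustrative:

```python
# One overloaded tool: the model must infer which mode applies, and the
# description cannot state a precise contract for any single operation.
filesystem_tool = {
    "name": "filesystem",
    "description": "Perform a filesystem operation.",
    "parameters": {
        "type": "object",
        "properties": {
            "mode": {"type": "string", "enum": ["read", "write", "delete", "list"]},
            "path": {"type": "string"},
            "content": {"type": "string"},
        },
        "required": ["mode", "path"],
    },
}

# Separate tools: each description states exactly one contract.
read_file = {
    "name": "read_file",
    "description": "Return the full text content of the file at `path`.",
    "parameters": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
}

list_directory = {
    "name": "list_directory",
    "description": "Return the names of the entries in the directory at `path`.",
    "parameters": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
}
```

The second design costs nothing in model capability and buys precision in every tool description the model reads.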
The same logic applies to error handling. If a tool call fails and the scaffold returns a generic exception string, the model has little basis for recovery. If the scaffold returns a structured error with context, the model can adapt. The intelligence of the loop is a product of both the model’s reasoning and the quality of information the scaffold feeds back into it.
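A sketch of the structured-error approach, as a wrapper the scaffold might place around tool execution (the field names and the `suggestion` text are illustrative):

```python
import json

def run_tool(fn, **kwargs) -> str:
    """Run a tool, converting failures into structured, actionable JSON
    instead of an opaque exception string."""
    try:
        return fn(**kwargs)
    except FileNotFoundError as exc:
        return json.dumps({
            "error": "FileNotFoundError",
            "detail": str(exc),
            "suggestion": "Check the path with list_directory before retrying.",
        })
    except Exception as exc:
        # Fallback: at minimum, name the failure class and its message.
        return json.dumps({
            "error": type(exc).__name__,
            "detail": str(exc),
        })
```

The model receiving the first payload can plausibly recover by listing the directory; a model receiving the bare string "Exception occurred" cannot.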
Context as a Managed Resource
The context window is a bounded resource, and managing it well is the central operational challenge of agentic engineering. An agent working on a long-horizon task accumulates tool results, intermediate reasoning, and prior model outputs in its context on every turn. Without intervention, context grows monotonically until it hits the window limit or costs become prohibitive.
This maps cleanly onto OS memory management. The context window is physical memory. Long-term storage (vector databases, external files, summarized history) is swap space. The scaffolding is the memory manager. An agent with no context management strategy is a process that never frees memory; it runs fine until it does not, and the failure is abrupt.
The practical patterns look like their OS equivalents. Summarization compresses old context the way an OS compresses pages under memory pressure. Retrieval augmentation fetches relevant fragments on demand, analogous to demand paging. Conversation windowing discards history beyond a fixed length, matching a fixed-size ring buffer. None of these are free: summarization loses detail, retrieval introduces latency and relevance errors, windowing breaks tasks that require reference to earlier steps. Every context management strategy is a tradeoff between token cost and information fidelity, and the scaffolding must make that tradeoff explicit rather than letting the context window fill until the API returns a context length error.
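A token-budget variant of windowing can be sketched as follows. The four-characters-per-token estimate is a rough heuristic, and a real scaffold would use a proper tokenizer and summarize the dropped span rather than discard it:

```python
def window_messages(messages: list[dict], max_tokens: int,
                    estimate=lambda m: len(m["content"]) // 4) -> list[dict]:
    """Keep the first message (the task) plus as many recent messages as
    fit under the token budget; drop the middle of the history."""
    if not messages:
        return []
    head, tail = messages[0], messages[1:]
    budget = max_tokens - estimate(head)
    kept = []
    # Walk backward from the most recent message, keeping what fits.
    for msg in reversed(tail):
        cost = estimate(msg)
        if cost > budget:
            break
        budget -= cost
        kept.append(msg)
    return [head] + list(reversed(kept))
```

Pinning the task message while trimming the middle is one concrete instance of the fidelity tradeoff: the goal is never evicted, but intermediate results are.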
Distributed Systems Parallels
Agentic systems exhibit failure modes that practitioners of distributed systems will recognize immediately. A tool call is a remote procedure call over an unreliable channel. The model is a probabilistic executor, not a deterministic one. Multi-agent pipelines are distributed systems with all the attendant problems of partial failure, inconsistent state, and out-of-order delivery.
Idempotency is as important here as it is in service-to-service RPC. A tool that produces side effects on every call (sending an email, posting a message, writing a record) must be treated as a non-idempotent operation. If the scaffolding retries on failure without checking whether the prior attempt succeeded, it will produce duplicate effects. The scaffolding needs to distinguish between safe retries and operations that require human confirmation or deduplication logic before retrying.
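One common deduplication scheme is to derive an idempotency key from the call's arguments and record it only on success; the `send_fn` here is a hypothetical stand-in for a real email client:

```python
import hashlib

_sent: set[str] = set()

def send_email_idempotent(to: str, subject: str, body: str, send_fn) -> str:
    """Skip the send if an identical attempt has already succeeded."""
    key = hashlib.sha256(f"{to}|{subject}|{body}".encode()).hexdigest()
    if key in _sent:
        return "skipped: duplicate of a previously sent email"
    result = send_fn(to, subject, body)
    # Record the key only after success, so a failed attempt stays retryable.
    _sent.add(key)
    return result
```

In production the key set would live in durable storage shared by all retries, not in process memory.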
Retry with backoff is necessary but not sufficient. A model that receives a rate limit error will, if given no guidance, either halt or retry immediately. The scaffolding must enforce backoff, and more importantly, it must track retry budgets so that a single flaky tool call does not cause the agent to spin indefinitely.
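A sketch of backoff with a bounded attempt budget, assuming transient failures surface as `TimeoutError` or `ConnectionError` (the real exception types depend on the client library in use):

```python
import time

def call_with_retry(fn, max_attempts: int = 4, base_delay: float = 1.0,
                    sleep=time.sleep):
    """Exponential backoff with a hard attempt budget, so a single flaky
    tool call cannot spin the agent indefinitely."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts - 1:
                raise  # budget exhausted: surface the failure to the loop
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

The key design point is the final `raise`: when the budget runs out, the failure must propagate to the loop as information, not disappear into another retry.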
For multi-step workflows that involve state mutations across multiple systems, the Saga pattern applies directly. If an agent writes a database record, sends a notification, and updates a downstream service, and the third step fails, the scaffolding needs a compensation strategy for the first two. This is not a problem that model capability solves; it requires explicit design of compensating actions and a rollback policy in the scaffolding layer.
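The compensation machinery can be sketched generically: each step pairs an action with a compensating action, and any assumptions about the real systems involved live inside those callables:

```python
def run_saga(steps):
    """Run (action, compensate) pairs in order. If any action fails,
    run the compensations for the completed steps in reverse order,
    then re-raise so the caller sees the failure."""
    done = []
    try:
        for action, compensate in steps:
            action()
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):
            compensate()
        raise
```

The hard part is not this loop; it is designing compensations that are themselves safe to run after a partial failure, which is exactly the explicit rollback policy the scaffolding layer owes the agent.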
These are not exotic concerns. They are the standard discipline of distributed systems engineering, applied to a new kind of executor.
What Willison’s Guide Gets Right
Willison’s guide is useful because it approaches agentic engineering from the engineering side rather than the capabilities side. The emphasis on minimal footprint, granting agents only the permissions they need, is directly traceable to the principle of least privilege in security engineering. The emphasis on preferring reversible actions over irreversible ones is a variant of the same principle applied to operational risk.
The guide’s treatment of prompt injection is also grounded. Prompt injection is not primarily a model problem; it is an architectural one. When an agent reads a document or fetches a webpage, that content is untrusted input arriving through a channel the scaffolding treats as a tool result. There is no cryptographic boundary separating instruction from data, because the model processes both as tokens. The only practical mitigations are architectural: minimize the agent’s permissions, require human confirmation before high-consequence actions, and log tool results alongside subsequent model decisions so anomalous behavior is traceable after the fact.
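The human-confirmation mitigation can be enforced mechanically in the scaffold, outside the model's control. A minimal sketch, using a hypothetical allowlist of high-consequence tool names:

```python
HIGH_CONSEQUENCE = {"send_email", "delete_file", "post_message"}

def gated_dispatch(tool_name: str, execute, confirm) -> str:
    """Require explicit human confirmation before any high-consequence
    tool runs, regardless of what the model's context contains."""
    if tool_name in HIGH_CONSEQUENCE and not confirm(tool_name):
        return f"blocked: human declined {tool_name}"
    return execute()
```

Because the gate lives in the scaffold rather than the prompt, an injected instruction can request the action but cannot authorize it.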
The guide is also correct that observability is a first-class engineering requirement, not an afterthought. An agent that produces wrong results and no structured trace of its reasoning and tool calls is effectively undebuggable. You need every tool call, its inputs, its outputs, and the model’s next decision captured in structured form to have any basis for diagnosis.
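A trace format that satisfies this can be as simple as one JSON line per loop iteration; the field names here are illustrative:

```python
import json
import time

def log_step(trace_file, tool_name: str, arguments: dict,
             result: str, next_decision: str) -> None:
    """Append one loop iteration as a JSON line: the tool call, its
    inputs, its output, and the model's next decision."""
    record = {
        "ts": time.time(),
        "tool": tool_name,
        "arguments": arguments,
        "result": result,
        "next_decision": next_decision,
    }
    trace_file.write(json.dumps(record) + "\n")
```

JSON lines are greppable, diffable, and trivially replayable, which is what "structured form" buys you when a run goes wrong.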
Where Hard Problems Remain
Long-horizon reliability is the open problem that current scaffolding patterns do not fully address. An agent running a task that spans dozens of tool calls across an extended session accumulates context, makes errors, and has no intrinsic mechanism for detecting when its reasoning has drifted from the original goal. The scaffolding can impose checkpoints and summaries, but there is no principled solution to the problem of goal drift over long execution horizons. This is an area where the field is still working from empirical observation rather than established patterns.
Prompt injection at scale remains unsolved. The architectural mitigations reduce risk; they do not eliminate it. An agent with broad tool access and exposure to untrusted content is a target, and the attack surface grows with the agent’s capabilities. Efforts to train models to resist injection are ongoing but have not produced robust defenses against adaptive adversaries.
Cost is a practical constraint that engineering discussions underweight. Context accumulation across a long-running agent task can produce per-run costs that are difficult to predict at design time, because the number of tool calls is non-deterministic. A scaffolding layer with no cost controls will produce surprises in production. Token budgets, maximum loop iteration limits, and cost estimation based on observed task distributions all need to be part of the scaffolding design, not bolted on after the first unexpected invoice.
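These controls can be expressed as hard ceilings in the loop itself. A sketch, where `step` is a hypothetical function that runs one loop turn and reports its token usage:

```python
def guarded_loop(step, max_iterations: int = 25, max_tokens: int = 200_000):
    """Run loop turns under hard iteration and token ceilings.
    `step` returns (tokens_used, result), with result None until done."""
    spent = 0
    for _ in range(max_iterations):
        tokens, result = step()
        spent += tokens
        if result is not None:
            return result
        if spent >= max_tokens:
            raise RuntimeError(f"token budget exhausted: {spent} >= {max_tokens}")
    raise RuntimeError(f"iteration budget exhausted: {max_iterations}")
```

The specific ceilings are tuning parameters derived from observed task distributions; the point is that both limits exist before the first production run, not after the first invoice.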
The field is moving quickly, and the patterns are solidifying. But the core insight is already clear: the scaffolding is the software. The model is one component of a larger system, and the reliability of the whole is determined by the quality of every layer surrounding it.