
What the Agent Loop Obliges You to Build

Source: simonwillison

When you write an LLM call, you have one decision to make: what to put in the prompt. When you wrap that call in a loop that feeds results back in, you have dozens. The loop is four lines of code. The engineering obligations it creates are the whole discipline.

Simon Willison’s guide on agentic engineering patterns defines the boundary precisely: agentic engineering begins when a language model drives a loop, calling tools and observing results until a goal is met. The code for that loop is short enough to fit in a comment block. What makes it a discipline is what follows from it.
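The loop really is that short. A minimal sketch of the shape the guide describes, where `call_model` and the tool registry are hypothetical stand-ins for a real LLM client and real tools:

```python
# Minimal agent loop: the model drives control flow, calling tools and
# observing results, until it stops requesting tools. `call_model` and
# `tools` are hypothetical stand-ins, not a real API.
def run_agent(goal, call_model, tools):
    messages = [{"role": "user", "content": goal}]
    while True:
        reply = call_model(messages)           # model sees the full history
        if reply.get("tool") is None:
            return reply["content"]            # goal met: final answer
        result = tools[reply["tool"]](**reply["args"])
        messages.append({"role": "assistant", "content": str(reply)})
        messages.append({"role": "tool", "content": str(result)})
```

Everything the rest of this piece discusses is a consequence of what this loop leaves unmanaged: the unbounded `messages` list, the trust placed in tool results, and the absence of any stopping condition besides the model's own judgment.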

Those engineering obligations are not independent concerns. They share a root: you have delegated control flow to a probabilistic system that operates through side effects on shared state. Every obligation traces back to that single decision.

The State Obligation

In a traditional program, state lives in data structures you define, update deliberately, and can inspect at any point. In an agent loop, state lives in the context window. It is append-only, finite, and managed implicitly. Every tool call appends inputs and outputs to the message history. The model’s next decision depends on everything it has seen since the start of the session.

A modest coding task that reads three files, runs the tests, and applies a fix consumes tens of thousands of tokens before any substantive reasoning happens. For a non-trivial session, the window fills faster than most implementations account for.

Filling it creates a qualitative problem, not just a quantitative one. Research from Stanford and UC Berkeley documented the “Lost in the Middle” effect: LLM recall degrades for content positioned in the middle of long contexts, not just at the end. A model working through a long session can show attention drift from its own earlier decisions well before hitting the nominal context limit. The useful working memory of an agent is smaller than the advertised context window for anything but short sessions.

The engineering responses are well-understood: summarize older content and accept fidelity loss, externalize content to retrieval and design selection logic, or maintain explicit structured state the model reads and writes directly. The Letta/MemGPT project explored the third approach, treating the context window as virtual memory with explicit paging. Claude Code’s CLAUDE.md and Cursor’s .cursorrules are simpler versions of the same idea: key constraints re-injected at position zero to fight drift. None of these eliminates the state problem; they manage it at different points on the fidelity-complexity curve.
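Two of those responses, summarizing older content and re-injecting key constraints at position zero, can be sketched together. This is an illustrative shape, not any particular framework's API; `summarize` stands in for a model-backed summarizer:

```python
# Sketch of two mitigations: pin durable constraints at position zero to
# fight drift, and compact older history into a summary (accepting
# fidelity loss) once it outgrows a recency budget. `summarize` is a
# hypothetical model-backed summarizer.
def build_context(constraints, history, summarize, keep_recent=10):
    messages = [{"role": "system", "content": constraints}]  # position zero
    if len(history) > keep_recent:
        older, recent = history[:-keep_recent], history[-keep_recent:]
        messages.append({"role": "system",
                         "content": "Summary of earlier work: " + summarize(older)})
        messages.extend(recent)
    else:
        messages.extend(history)
    return messages
```

The `keep_recent` budget is where the fidelity-complexity tradeoff lives: a larger budget preserves more raw history at the cost of window space, a smaller one leans harder on the summarizer.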

The Contract Obligation

Every tool call the model makes is driven by a tool description. The description tells the model what the tool does, when to use it, what parameters it expects, and what it returns. It is the only interface between developer intent and model execution.

This is an API design problem, and it deserves API design rigor. Anthropic’s tool use documentation covers the mechanics. The principle behind the mechanics: tool descriptions are the precondition specifications for your agent’s operators. The vocabulary maps directly to STRIPS, the 1971 planning formalism from Fikes and Nilsson, where each operator had explicit preconditions the planner checked before application. LLMs do not verify preconditions formally; they reason about them probabilistically from the description. Vague descriptions produce vague precondition reasoning, which produces unpredictable tool application.

Concrete consequences follow. Use verb-noun names: write_file, search_codebase. Specify what a tool does not do, not just what it does. Use enum types to constrain parameter space rather than accepting free-form strings where structure is expected. Make return values self-describing. Parameter naming carries semantic weight at decision time; record_id_to_permanently_delete changes model behavior compared to id. The description is the contract; underspecify it and you get underspecified behavior at runtime.
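Applied to a deletion tool, those consequences might look like this. The shape follows common function-calling schema conventions; the specific tables and field names are illustrative:

```python
# Illustrative tool definition applying the guidance above: a verb-noun
# name, an enum-constrained parameter instead of a free-form string, a
# parameter name that carries the consequence at decision time, and an
# explicit statement of what the tool does NOT do.
delete_record = {
    "name": "delete_record",
    "description": (
        "Permanently delete a single record from the given table. "
        "Does NOT archive the record and does NOT cascade to related "
        "tables. Returns the id of the deleted record, or an error "
        "message if no such record exists."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "table": {
                "type": "string",
                "enum": ["users", "orders", "invoices"],  # constrain the space
                "description": "Which table to delete from.",
            },
            "record_id_to_permanently_delete": {
                "type": "string",
                "description": "Id of the record. Deletion is irreversible.",
            },
        },
        "required": ["table", "record_id_to_permanently_delete"],
    },
}
```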

The Reliability Obligation

A single LLM call with a known failure rate is a single risk. A pipeline of such calls compounds risk multiplicatively. If each step in a five-step sequence succeeds 90% of the time, the end-to-end reliability is roughly 59%. Real-world agent evaluations like SWE-bench confirm this arithmetic; headline per-task success rates look much better than pipeline reliability on multi-step problems.
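The compounding is just a product of per-step success rates, which is worth making explicit:

```python
# End-to-end reliability of a pipeline of independent steps is the
# product of the per-step success rates.
def pipeline_reliability(per_step_success, n_steps):
    return per_step_success ** n_steps

print(round(pipeline_reliability(0.90, 5), 3))  # → 0.59
print(round(pipeline_reliability(0.99, 5), 3))  # → 0.951
```

The second line shows the flip side: pushing per-step reliability from 90% to 99% moves five-step pipeline reliability from roughly 59% to 95%, which is why per-step hardening pays off disproportionately.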

Error propagation in agent loops has properties worth engineering against. A bad tool call result early in a session corrupts downstream reasoning. If the context window is the state, bad state persists until it is explicitly corrected or the session resets. Traditional debugging assumes you can reproduce a failure and inspect the state at the point of failure. Agent debugging requires reconstructing which step produced bad output and how it propagated through subsequent reasoning.

Circuit breakers, borrowed from distributed systems, address one failure mode: stopping the agent loop after N consecutive tool failures rather than letting it spiral into accumulating corrupted context. The underlying obligation is harder, though. You need to design for graceful degradation at each step, which requires knowing what the failure modes are before you encounter them in production. That knowledge comes from evaluation infrastructure, not from the code itself.
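The circuit breaker itself is simple to sketch; the names here are illustrative, and a production version would also track time windows and half-open states as distributed-systems breakers do:

```python
# Circuit breaker for an agent loop: trip after N consecutive tool
# failures rather than letting corrupted results accumulate in context.
# A success resets the count.
class CircuitBreaker:
    def __init__(self, max_consecutive_failures=3):
        self.max_failures = max_consecutive_failures
        self.consecutive_failures = 0

    def record(self, success):
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1

    @property
    def tripped(self):
        return self.consecutive_failures >= self.max_failures
```

The loop checks `tripped` before each iteration and hands control back to a human, or to a recovery routine, instead of continuing.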

The Security Obligation

Every tool that reads external content, whether files, web pages, API responses, or database results, introduces a surface for prompt injection. The agent’s job is to follow instructions. If instructions appear to come from a tool result rather than a system prompt, the model often follows them anyway. The InjecAgent benchmark found injection attacks succeeded against GPT-4-turbo roughly 24% of the time in single-agent settings. Multi-agent pipelines compound this: a successful injection at any node propagates downstream through whatever resources the compromised agent can reach.

The minimal footprint principle is both a security control and an architectural heuristic. If you can enumerate a subagent’s required permissions precisely, the task is well-scoped and the blast radius of a compromise is bounded. If you cannot enumerate them, the decomposition has coherence problems that will surface at runtime as well as in security analysis. Least privilege, reversible operations over irreversible ones, explicit scoping of what each agent can write and read from external sources: these are not agent-specific controls. They are the same controls that apply in any multi-component system with untrusted inputs. They are just more critical when the component making authorization decisions is probabilistic rather than deterministic.
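The enumeration exercise can be made mechanical: declare each subagent's allowed tools up front and refuse anything outside the list before execution. A minimal sketch with hypothetical agent and tool names:

```python
# Least-privilege tool gating: every subagent gets an explicit allowlist,
# checked before any tool executes. If you cannot write this table for a
# subagent, that is itself a signal the task is poorly scoped.
SUBAGENT_PERMISSIONS = {
    "doc_summarizer": {"read_file", "search_docs"},           # read-only
    "test_runner":    {"read_file", "run_tests"},             # no writes
    "code_fixer":     {"read_file", "write_file", "run_tests"},
}

def authorize(agent, tool):
    allowed = SUBAGENT_PERMISSIONS.get(agent, set())
    if tool not in allowed:
        raise PermissionError(f"{agent} may not call {tool}")
    return True
```

Note that the check runs in deterministic code outside the model, which is the point: the probabilistic component proposes, the deterministic gate disposes.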

The Evaluation Obligation

Unit tests assert exact outputs from exact inputs. Agentic systems are non-deterministic, so that framework does not apply directly. Running a task once and checking the result produces a data point, not a test. A single failure is not informative about the distribution of failures. A single success is not a guarantee of anything.

The practical approach involves golden traces: representative tasks with expected action sequences, checking that the right tool classes were called and the forbidden ones were not, with soft matching on ordering. Each scenario runs ten to twenty times, with pass rates measured across the distribution rather than inferred from a single run. LLM-as-judge, using a second model to assess the first’s reasoning, scales this evaluation when calibrated against human-labeled examples first; uncalibrated judges inherit the same biases as the system under evaluation. Anthropic’s guidance on evals suggests calibrating against at least fifty human-graded examples before trusting automated scoring.
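The pass-rate harness is small. A sketch, assuming a hypothetical `run_trace` callable that executes one agent run and returns the list of tools it invoked; the soft matching here is set membership rather than exact ordering:

```python
# Pass-rate evaluation over a golden scenario: run it many times and
# check each trace for required tool classes and the absence of
# forbidden ones, using set membership as the soft match. `run_trace`
# is a hypothetical callable returning the tools an agent invoked.
def pass_rate(run_trace, required, forbidden, runs=20):
    passes = 0
    for _ in range(runs):
        called = set(run_trace())
        if required <= called and not (forbidden & called):
            passes += 1
    return passes / runs
```

The return value is a distribution summary, not a boolean, which is the whole shift from unit testing: the question is not "did it pass" but "how often does it pass."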

Observability infrastructure matters because agentic failures often only become visible when inspecting the full decision sequence, not the final output. A response that looks correct may have been produced through a flawed sequence that generalizes incorrectly to different inputs. Tools like LangSmith and Weights and Biases Weave model agent runs as trees of annotatable spans rather than flat logs, which matches the structure of the problem. The minimum viable version is a trace decorator that propagates a trace_id through every model call and tool invocation, so you can reconstruct the causal chain when something goes wrong.
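That minimum viable decorator might look like the following. This is a sketch using a `contextvars` variable for propagation and an in-memory span list; a real system would emit spans to a trace store:

```python
# Minimal tracing: propagate one trace_id through every decorated model
# call and tool invocation, recording spans so the causal chain can be
# reconstructed after a failure. SPANS stands in for a real trace store.
import contextvars
import functools
import uuid

_trace_id = contextvars.ContextVar("trace_id", default=None)
SPANS = []

def traced(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        tid = _trace_id.get() or str(uuid.uuid4())  # reuse or start a trace
        _trace_id.set(tid)
        SPANS.append({"trace_id": tid, "span": fn.__name__})
        return fn(*args, **kwargs)
    return wrapper
```

Because nested calls inherit the same context variable, a tool invocation made from inside a model-call handler lands in the same trace, which is exactly the causal chain you need when reconstructing a failure.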

The Unifying Cause

What makes these obligations a discipline rather than an unrelated checklist is that they share a cause. You delegated control flow to a probabilistic system. Every obligation that follows is a different facet of managing that choice.

The state obligation exists because the probabilistic system needs context, and context has real constraints. The contract obligation exists because the system’s decisions are only as good as its interface specifications. The reliability obligation exists because probabilistic steps compose pessimistically. The security obligation exists because the system cannot distinguish data from instructions without engineering support. The evaluation obligation exists because the system’s behavior must be characterized empirically, not proven analytically.

Willison’s guide names this collection of concerns as a discipline. The naming matters because it positions agentic engineering correctly: not as a variant of prompt engineering and not as traditional software engineering with an LLM call added. It is a practice for building systems where a probabilistic reasoning process drives real side effects, and the engineering is the scaffolding that keeps those side effects within acceptable bounds. The loop commits you to that scaffolding. What you build with it determines what the system can reliably do.
