The Loop Is the Boundary: What Makes Agentic Engineering Its Own Discipline
Source: simonwillison
Most writing about AI agents focuses on capability: what they can do, which tasks they can automate, how far they can run without human intervention. Simon Willison’s guide on agentic engineering takes a different approach. It defines the boundary where “using an LLM” becomes “engineering with LLMs,” and that boundary is simpler than most descriptions suggest. It is a loop.
The minimal agentic system has three parts: a model, a set of tools that return structured output, and a loop that feeds tool results back into the model’s context. The code for that loop fits in a few lines:
def agent_loop(model, context):
    done = False
    while not done:
        response = model(context)
        if response.has_tool_call:
            result = execute_tool(response.tool_call)
            context.append(tool_result(result))
        else:
            done = True
    return response.content
That is the pattern. What makes this a discipline rather than a technique is everything the code omits: how tools fail, how context fills up, what “done” actually means, how results from untrusted external systems flow back through the model’s decision-making. The loop itself is trivial. The engineering scaffolding around it is the discipline.
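As an illustration of that scaffolding (hypothetical `model`, `tools`, and message shapes; not code from the guide), here is the same loop with a step budget, tool-failure capture, and an explicit stop condition:

```python
MAX_STEPS = 20  # budget: a confused model must not loop forever

def run_agent_guarded(model, context, tools):
    """The minimal loop plus some of the scaffolding it omits:
    a step cap, tool-failure capture, and an explicit stop reason."""
    for _ in range(MAX_STEPS):
        response = model(context)
        if not response.has_tool_call:
            return response.content  # the model decided it is done
        call = response.tool_call
        try:
            result = tools[call.name](**call.args)
            context.append({"role": "tool", "content": str(result)})
        except Exception as exc:
            # Feed the failure back so the model can retry or re-plan,
            # rather than crashing the whole run.
            context.append({"role": "tool", "content": f"ERROR: {exc}"})
    raise RuntimeError("step budget exhausted before completion")
```

Even this version dodges the hard questions (what counts as "done", which exceptions are retryable), but it makes the omissions visible as code.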
Why the Loop Is a Qualitative Change
Prompt engineering and agentic engineering differ in kind, not degree. A single prompt exchange has contained failure modes: the response is wrong, hallucinated, or poorly formatted, but the failure is local. Once you add a loop, errors compound. If each step in a five-step pipeline succeeds 90% of the time independently, the pipeline’s end-to-end reliability is roughly 59%. Real-world agent evaluations like SWE-bench confirm this arithmetic. The agents with the highest headline scores on individual tasks still fail frequently on multi-step tasks.
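The arithmetic behind that 59% figure is trivial but worth pinning down:

```python
def pipeline_reliability(step_success_rate: float, steps: int) -> float:
    """End-to-end success probability for independent sequential steps."""
    return step_success_rate ** steps

# Five steps at 90% each succeed end-to-end only ~59% of the time.
print(round(pipeline_reliability(0.90, 5), 3))  # 0.59
```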
More importantly, the loop hands control flow to a probabilistic system. In traditional software, you write the control flow and the model produces text. In an agentic system, the model decides which tool to call, when to stop, how to interpret ambiguous results. The developer’s job shifts from writing logic to engineering scaffolding that makes probabilistic control flow predictable enough to trust.
This is not a new problem. It is the same problem behavior trees solved in game AI: structured fallback logic that handles tool failure without relying on ad-hoc model heuristics. The ReAct pattern (Yao et al., 2022), which interleaves reasoning and acting and now underlies most agent frameworks, recapitulates decades-old ideas from STRIPS planning and expert systems, with LLMs replacing hand-coded operator preconditions. What LLMs changed was the knowledge-acquisition bottleneck: instead of encoding domain knowledge manually as rules, you can describe a tool’s behavior in natural language and a well-trained model will generalize correctly most of the time. “Most of the time” is where the engineering begins.
The Context Window Is State
The agent loop accumulates state in the context window. That is the entire state management model by default. There is no database, no external memory, no transaction log unless you build one; just an append-only list of messages, tool calls, and results that grows until it hits the model’s context limit.
This creates real architectural pressure. A modest session (reading three files, running a grep, editing a function, running tests) consumes several thousand tokens before any substantive reasoning. For a coding agent working on a non-trivial feature, the context fills faster than most implementations account for. Once it is full, you have three options: summarize older content and lose fidelity; externalize content to retrieval and rely on the model knowing what to fetch; or use explicit structured state objects that the model reads and writes, which is the approach Letta/MemGPT takes.
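The first option can be sketched in a few lines (hypothetical `token_count` and `summarize` callables; the split point is arbitrary):

```python
def compact_context(messages, token_count, limit, summarize):
    """Summarize-the-oldest-half strategy (option one above).
    `token_count` measures a message list; `summarize` is a hypothetical
    callable, e.g. a separate model call that condenses old turns."""
    if token_count(messages) <= limit:
        return messages                      # still fits: no-op
    split = len(messages) // 2
    old, recent = messages[:split], messages[split:]
    summary = {"role": "system", "content": summarize(old)}
    return [summary] + recent                # fidelity lost for `old`
```

The fidelity loss the text mentions is concrete here: everything in `old` survives only as whatever `summarize` chose to keep.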
The choice between these is not just about token efficiency. The “lost in the middle” effect, documented by Liu et al. at Stanford and UC Berkeley (2023), shows that LLM recall degrades for content positioned in the middle of long contexts. Long-running agents can exhibit attention drift from their own earlier decisions even within the nominal context limit. Context anchoring, the pattern of externalizing key decisions and constraints into a document that gets re-injected at position zero at session boundaries, is a direct engineering response to this. It maps precisely to Architecture Decision Records (ADRs): the same problem of preserving the reasoning behind decisions across time, solved at the session timescale instead of the codebase timescale. Rahul Garg’s writeup on context anchoring, published on Martin Fowler’s site, covers this pattern in detail. Claude Code’s CLAUDE.md and Cursor’s .cursorrules are implementations of the same idea in different packaging.
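The re-injection step itself is simple; a sketch, assuming a CLAUDE.md-style anchor file and chat-message dicts:

```python
def start_session(anchor_path, carried_history):
    """Context anchoring: at each session boundary, re-inject the
    decisions document at position zero, ahead of any carried-over
    history. `anchor_path` points at a CLAUDE.md-style file."""
    with open(anchor_path) as f:
        anchor = {"role": "system", "content": f.read()}
    return [anchor] + list(carried_history)
```

The engineering work is not this function; it is deciding what belongs in the anchor file and keeping it current as the session makes decisions.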
Tool Design Is API Design
Tool descriptions are the only interface through which an agent model understands its capabilities. A weak description is structurally equivalent to an underspecified API: the model fills the gaps with training-time priors that may not match your intent. This means tool design deserves the same care as public API design.
Concrete principles follow from this: use verb-noun tool names (search_codebase, write_file). Specify what a tool does not do, not just what it does. Use enum types to constrain parameter space rather than accepting free-form strings where structured input is expected. Make return values self-describing. The naming of parameters carries semantic weight; record_id_to_permanently_delete changes downstream model behavior compared to id.
The Anthropic tool use documentation covers schema design, but the deeper principle is that tool descriptions are the API contract between developer intent and model execution. Underspecified contracts produce inconsistent behavior at runtime, exactly as underspecified interfaces do in conventional software.
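As an illustration (a hypothetical tool definition, not from the guide), the principles above rendered in the JSON-style schema that tool-use APIs accept:

```python
# Verb-noun name, enum instead of a free-form string, an explicit
# statement of what the tool does NOT do, and a parameter name that
# carries its own warning.
delete_record_tool = {
    "name": "delete_record",
    "description": (
        "Permanently delete a single record by id. Does NOT archive, "
        "soft-delete, or cascade to related records; those require "
        "separate tools. Irreversible."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "record_id_to_permanently_delete": {
                "type": "string",
                "description": "Exact id of the record to destroy.",
            },
            "table": {
                "type": "string",
                # Enum constrains the parameter space rather than
                # accepting arbitrary table names.
                "enum": ["users", "orders", "audit_log"],
            },
        },
        "required": ["record_id_to_permanently_delete", "table"],
    },
}
```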
The Security Problem Is Structural
When an agent reads files, calls APIs, or browses external content, that content can contain instructions the model treats as directives. This is prompt injection, and the Willison guide calls it “the SQL injection of agent security,” which is apt because both are injection attacks that exploit the same ambiguity: the system cannot distinguish data from instructions without additional mechanisms.
What makes this particularly sharp in agentic systems is that the attack surface scales with the agent’s permissions and with the depth of the agent tree. In multi-agent systems, a successful injection at any node can propagate downstream through orchestration calls. The InjecAgent benchmark found GPT-4-turbo succeeded on prompt injection attacks roughly 24% of the time in single-agent settings. In a pipeline of three agents, the probability that at least one injection attempt lands is substantially higher.
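Here the compounding works in the attacker's favor: what matters is the chance that at least one attempt lands.

```python
def any_injection_probability(per_agent_rate: float, agents: int) -> float:
    """Probability that at least one of `agents` independent injection
    attempts succeeds, given a per-agent success rate."""
    return 1 - (1 - per_agent_rate) ** agents

# At InjecAgent's ~24% single-agent rate, a three-agent pipeline sees
# at least one successful injection with probability ~56%.
print(round(any_injection_probability(0.24, 3), 2))  # 0.56
```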
The engineering response is the minimal footprint principle: agents should request only the permissions required for the current task, prefer reversible operations over irreversible ones, and avoid storing sensitive data beyond immediate need. This is the principle of least privilege applied to agentic systems, and it doubles as an architectural heuristic. If you can enumerate a subagent’s required tool access precisely, the task is well-scoped. If you cannot enumerate it, the decomposition probably has coherence problems that will surface at runtime.
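A minimal sketch of that heuristic, with hypothetical subagent and tool names: each subagent declares its tool allowlist up front, and anything outside it is refused before execution.

```python
# Hypothetical per-subagent allowlists; being able to write this table
# at all is the sign the decomposition is well-scoped.
ALLOWED_TOOLS = {
    "test_runner": {"read_file", "run_tests"},
    "doc_writer":  {"read_file", "write_file"},
}

def execute_for(subagent: str, tool_name: str, tools: dict, **args):
    """Refuse any tool call outside the subagent's declared allowlist."""
    if tool_name not in ALLOWED_TOOLS.get(subagent, set()):
        raise PermissionError(f"{subagent} may not call {tool_name}")
    return tools[tool_name](**args)
```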
Evaluation Is Not Testing
Unit tests verify that deterministic code produces exact outputs from exact inputs. Agentic systems are non-deterministic, so the same framework does not apply. Running a task once and checking the result produces a data point, not a test. Running it ten to twenty times and measuring the distribution starts to approximate an evaluation.
The practical approach involves golden traces: representative scenarios with expected action sequences, checking that the right tool classes were used and the forbidden ones were not, with soft matching on ordering. LLM-as-judge, a second model assessing the first’s decisions, can scale this, provided you calibrate the judge against human-labeled examples first; uncalibrated judges have their own biases. Observability tooling like LangSmith and Weights and Biases Weave treats agent runs as annotatable trace trees rather than flat logs, which matters because agentic failures often only become visible when you inspect the full sequence of decisions, not just the final output.
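A sketch of that distribution-first approach, assuming a hypothetical `run_task` callable that performs one non-deterministic run and reports whether it passed and which tools it used:

```python
def evaluate(run_task, n: int, required_tools: set, forbidden_tools: set):
    """Run a task n times and measure the distribution rather than
    trusting a single sample. `run_task` returns (passed, tools_used)."""
    passes = 0
    trace_ok = 0
    for _ in range(n):
        passed, tools_used = run_task()
        passes += passed
        # Golden-trace check: required tool classes used, forbidden ones not.
        used = set(tools_used)
        if required_tools <= used and not (forbidden_tools & used):
            trace_ok += 1
    return {"pass_rate": passes / n, "trace_rate": trace_ok / n}
```

A run that produces the right answer via a forbidden tool still counts against `trace_rate`, which is the kind of failure a single pass/fail check never surfaces.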
This is the production boundary. Building an agent that demos well is not difficult. Building one that handles real workloads with bounded costs, understood failure modes, and defensible security properties requires treating evaluation as a first-class engineering concern from the start.
The core contribution of Willison’s guide is naming this collection of concerns (context management, tool design, prompt injection, evaluation methodology, error compounding) as a coherent discipline rather than a bag of tips. The loop is the boundary condition. Cross it, and you are no longer doing prompt engineering. You are building a probabilistic control system, and the tools for reasoning about it come from distributed systems, security engineering, and decades of prior work on planning and agent architectures that predate LLMs by fifty years. The vocabulary is new. The problems are not.