Simon Willison’s guide to agentic engineering puts a name and boundary around something that has been building for a couple of years: there is now a recognizable engineering discipline around building LLM-powered systems that act, not just generate. The question worth exploring is what, specifically, makes it a discipline rather than a craft skill or a prompting trick.
The answer is the loop.
The Loop as the Line
A single-turn LLM call is, in engineering terms, a function call with a wide output type. You send tokens in, you get tokens out. It can be unreliable, it can hallucinate, but the failure modes are contained to one exchange. Testing it is straightforward, even if evaluating quality is not.
The moment you connect that function call to a tool execution layer and feed the result back into the next call, everything changes. You have introduced state, side effects, and branching across time. The model is no longer doing autocomplete; it is making control flow decisions. It decides which tool to call, what arguments to pass, whether to retry a failed result or escalate to a different strategy. The software engineering problem is no longer “write a good prompt” but “build a reliable system whose control flow is implemented by a stochastic model.”
The minimal viable agentic system has three parts: a model, a set of tools that return structured output, and a loop that feeds tool results back into the model’s context. The ReAct pattern formalized this structure in 2022, interleaving model reasoning with action execution. Every production framework, from LangGraph to the OpenAI Agents SDK, is a variant on that structure. The novelty is in the engineering around it, not the loop itself.
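The three-part structure can be sketched in a few lines. This is a minimal illustration, not any framework's real API: model_call is a stand-in for whatever inference client you use, and the message shapes are invented for the sketch.

```python
import json

def run_agent(model_call, tools, user_message, max_turns=10):
    """Minimal ReAct-style loop: the model proposes tool calls, each
    result is appended back into context, and the loop continues until
    the model emits a final answer or the turn budget runs out.
    `model_call` and the response shape are placeholders, not a real API."""
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_turns):
        response = model_call(messages, tools)
        messages.append({"role": "assistant", "content": response})
        if response["type"] == "final":
            return response["text"]
        # The model chose a tool: execute it and feed the result back.
        tool = tools[response["tool"]]
        result = tool(**response["arguments"])
        messages.append({"role": "tool", "content": json.dumps(result)})
    raise RuntimeError("agent exceeded max_turns without finishing")
```

Everything else in the discipline, context management, tool design, evaluation, is engineering layered on top of this loop.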
Tool Design Is API Design
One of the clearest signs that agentic engineering is a real discipline is that it has developed opinions about tool design, and those opinions are specific.
Every tool description the model reads at inference time is an API contract. The model decides whether to call a tool, what arguments to pass, and how to interpret the output based entirely on the name, description, and schema you provide. A poorly described tool gets called at the wrong time, with the wrong arguments, or not at all.
The Anthropic tool use documentation notes that descriptions should cover intent, input structure, output contract, and triggering conditions. In practice, a description like “Review code for security issues” underspecifies all four. A description that says what it looks for, what format it returns findings in, what severity levels it distinguishes, and when it should be called produces measurably different model behavior. You can write unit tests against individual tools, but the harder problem is testing whether the model makes good decisions about when to use them.
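To make that concrete, here is what a fully specified description might look like, written in the shape Anthropic's tool-use API expects (name, description, input_schema). The tool itself and its output contract are hypothetical:

```python
# Hypothetical security-review tool spec in the Anthropic tool-use shape.
# The description covers intent, output contract, severity levels, and
# triggering conditions, not just "Review code for security issues".
security_review_tool = {
    "name": "security_review",
    "description": (
        "Scan a single source file for injection, authentication, and "
        "secrets-handling issues. Returns a JSON list of findings, each "
        "with 'line', 'severity' ('low'|'medium'|'high'), and 'detail'. "
        "Call this before approving any change that touches request "
        "parsing, auth, or credential storage; do not call it for "
        "documentation-only changes."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "file_path": {
                "type": "string",
                "description": "Path to the file to scan",
            },
        },
        "required": ["file_path"],
    },
}
```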
Granularity is a design decision with real tradeoffs. Fine-grained tools give the model precise control at the cost of more decisions per task. Coarse-grained tools reduce the decision surface but limit flexibility. The practical recommendation from Claude Code’s design is to start coarser and decompose only where you observe the model getting stuck.
Tool naming carries semantics too. A parameter named record_id_to_permanently_delete prompts the model to seek confirmation in a way that one named id does not. Naming communicates intent and risk, and the model internalizes both.
Context Window as Process State
Every tool result appended during an agent run occupies tokens. A moderately complex coding task reading twenty files at 300 lines each consumes roughly 120,000 tokens in file content alone, before system prompt, user messages, and model reasoning. Claude’s 200k-token context sounds large until you account for a multi-step task with substantial tool output at every step.
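The arithmetic behind that figure is worth making explicit. Assuming roughly 20 tokens per line of code, a common rule of thumb rather than an exact figure:

```python
# Back-of-envelope context budget for the scenario above.
files, lines_per_file, tokens_per_line = 20, 300, 20  # ~20 tokens/line is a rough heuristic
file_tokens = files * lines_per_file * tokens_per_line
context_limit = 200_000  # Claude-class context window

remaining = context_limit - file_tokens
print(file_tokens, remaining)  # 120000 tokens of file content, 80000 left
```

That remaining 80,000 tokens has to cover the system prompt, the conversation, every other tool result, and the model's own reasoning across the whole session.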
The engineering problem is that context pressure affects behavior well before the hard limit. The “lost in the middle” paper from Stanford and UC Berkeley documented that LLM recall degrades significantly for content positioned in the middle of long contexts. This means a file read in turn five of a 30-turn agent session may receive less effective attention in turn 25 than a fresh re-read would. Long sessions effectively produce recall degradation that is hard to observe without structured evaluation.
Three strategies exist. Summarization compresses prior context periodically, trading fidelity for token economy. Retrieval-augmented generation externalizes context and fetches relevant chunks on demand, which requires deciding what to retrieve and when. Structured state, as implemented by Letta (formerly MemGPT), treats the context window as virtual memory with explicit paging. Each approach has different complexity costs and failure modes.
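The summarization strategy can be sketched as a compaction pass over the transcript. The summarize and count_tokens helpers here are assumptions standing in for a model call and a tokenizer:

```python
def compact_context(messages, summarize, count_tokens, budget_tokens):
    """Summarization sketch: when the transcript exceeds the token budget,
    replace the oldest half with a model-written summary. `summarize` and
    `count_tokens` are assumed helpers, not a real API. Lossy by design:
    fidelity is traded for token economy."""
    total = sum(count_tokens(m["content"]) for m in messages)
    if total <= budget_tokens:
        return messages
    cut = len(messages) // 2
    summary = summarize(messages[:cut])
    return [{"role": "system",
             "content": f"Summary of earlier turns: {summary}"}] + messages[cut:]
```

The design question each strategy answers differently is who decides what gets forgotten: here it is a fixed policy (oldest half), in RAG it is the retriever, and in structured-state systems it is the model itself.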
Context anchoring, described by Rahul Garg on Martin Fowler’s site, takes a simpler approach: maintain a living markdown document with decisions, constraints, and current scope, and re-inject it at transition points. Claude Code’s CLAUDE.md and Cursor’s .cursorrules are both implementations of this idea, taking advantage of high attention weight at position zero in the context.
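Mechanically, anchoring is a small amount of code. A sketch, with CONTEXT.md standing in for whatever living document a given tool uses (CLAUDE.md, .cursorrules, or a project-specific file):

```python
def with_anchor(messages, anchor_path="CONTEXT.md"):
    """Context-anchoring sketch: re-inject a living notes file at the
    front of the transcript, where positional attention is strongest.
    CONTEXT.md is a hypothetical stand-in for CLAUDE.md or .cursorrules."""
    try:
        with open(anchor_path) as f:
            anchor = f.read()
    except FileNotFoundError:
        return messages  # no anchor file: transcript passes through unchanged
    return [{"role": "system", "content": anchor}] + messages
```

The engineering content is in maintaining the document, deciding which decisions and constraints earn a place in it, not in the injection itself.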
Multi-Agent Systems Are Distributed Systems
Spawning a subagent is structurally identical to an RPC call with non-idempotent side effects. The orchestrator blocks waiting on a remote call that may fail mid-execution after side effects have already occurred. File writes may have happened. A GitHub PR may have been opened. A Slack message may have been sent. Most agent frameworks have not built the failure semantics to match the distribution they implement.
Error compounding is the concrete consequence. If each agent step in a five-step pipeline succeeds 90% of the time, end-to-end success probability is roughly 59%. This is not a hypothetical: the SWE-bench benchmark, which measures agents against real GitHub issues, has been a useful reminder that error rates compound across real tasks in ways that single-step evaluations hide.
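The arithmetic generalizes, and it is worth running in both directions: given per-step reliability, what is the end-to-end rate, and given a target, what per-step reliability does it demand?

```python
def end_to_end(per_step, steps):
    """Success probability when every step must succeed independently."""
    return per_step ** steps

def required_per_step(target, steps):
    """Per-step reliability needed to hit an end-to-end target."""
    return target ** (1 / steps)

print(end_to_end(0.90, 5))         # ~0.59: the five-step pipeline above
print(required_per_step(0.95, 5))  # ~0.99: 95% end-to-end demands ~99% per step
```

The second number is the uncomfortable one: hitting production-grade end-to-end reliability requires per-step reliability well beyond what current models deliver unaided, which is why checkpointing, retries, and human escalation are structural requirements rather than polish.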
Distributed systems patterns apply directly. Idempotency keys, Stripe’s approach for payment operations, address the retry-after-failure problem for external write operations. The saga pattern with compensating actions, which LangGraph has partial checkpointing support for, handles multi-step workflows where intermediate state must be rolled back on failure. Write-ahead logging provides durability guarantees that most frameworks do not currently implement.
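The idempotency-key pattern is straightforward to sketch. Here the key is derived from the operation and its arguments and the completed-operations store is an in-memory dict; a production system would persist that store and typically use client-generated keys, as Stripe does:

```python
import hashlib
import json

_completed = {}  # durable store in a real system; in-memory for this sketch

def idempotent_call(operation, args, execute):
    """Idempotency-key sketch: derive a key from the operation and its
    arguments, record the result, and skip re-execution on retry. This
    turns 'did the PR get opened before the crash?' into a lookup."""
    payload = json.dumps([operation, args], sort_keys=True).encode()
    key = hashlib.sha256(payload).hexdigest()
    if key in _completed:
        return _completed[key]  # retry path: no second side effect
    result = execute(**args)
    _completed[key] = result
    return result
```

With this in place, an orchestrator that crashes mid-pipeline can simply re-run from the top: already-completed write operations return their recorded results instead of firing again.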
Prompt injection gets worse in this context. In a single-agent system, a successful injection in a retrieved document can cause the model to execute attacker-controlled instructions. In a multi-agent system, that injected instruction can propagate downstream. The InjecAgent benchmark found that attacks against GPT-4-turbo succeeded roughly 24% of the time under single-agent conditions; in chained pipelines, the rate compounds at each hop. Tool restriction is the most robust defense: a subagent with only read access cannot act on a shell injection regardless of what the model decides. Schema-validating subagent output before using it to drive further orchestration breaks the path from untrusted data to control flow that injections exploit.
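A validation gate can be as simple as refusing anything that does not parse into an expected shape. The schema below is hypothetical; jsonschema or pydantic would be the production choice, but the principle fits in a dozen lines:

```python
import json

def validate_subagent_output(raw):
    """Schema-validation sketch: subagent output must parse as JSON and
    match an exact expected shape before the orchestrator acts on it.
    Injected prose, extra fields, or unexpected values are all rejected.
    The {'status', 'files_changed'} schema is invented for illustration."""
    data = json.loads(raw)  # raises ValueError on non-JSON, including injected text
    if not isinstance(data, dict) or set(data) != {"status", "files_changed"}:
        raise ValueError("unexpected shape")
    if data["status"] not in {"ok", "failed"}:
        raise ValueError("unexpected status value")
    if not isinstance(data["files_changed"], list) or \
            not all(isinstance(f, str) for f in data["files_changed"]):
        raise ValueError("files_changed must be a list of strings")
    return data
```

The strictness is the point: an injection that persuades a subagent to append instructions to its output produces a shape violation, not a downstream action.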
Evaluation Is Not Testing
Unit tests for individual tools are necessary but insufficient: they verify that each tool behaves correctly, not that the model makes good decisions about when and how to use it across realistic multi-step scenarios. That is not a property you can assert with a deterministic test.
The dominant approach in the field is golden traces: representative scenarios with known-correct action sequences that you compare against. LLM-as-judge evaluation, where a second model assesses the first’s decisions, has become common and introduces its own reliability questions, particularly when evaluator and evaluated model share similar failure modes. Observability tooling like LangSmith and Weights & Biases Weave treat agent runs as annotatable traces rather than flat log streams, which at least makes the problem visible.
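The crudest form of golden-trace scoring is position-by-position comparison of tool-call sequences. This sketch uses invented tool names; real evaluation harnesses typically allow reordering, partial-credit argument matching, or an LLM judge for steps that cannot be compared exactly:

```python
def score_trace(actual, golden):
    """Golden-trace sketch: fraction of positions where the agent's
    tool-call sequence matches a known-good sequence. Exact positional
    match is the crudest possible scoring; it is a floor, not a method."""
    matches = sum(1 for a, g in zip(actual, golden) if a == g)
    return matches / max(len(golden), 1)

golden = [("read_file", "auth.py"), ("security_review", "auth.py"), ("write_report",)]
actual = [("read_file", "auth.py"), ("write_report",)]  # skipped the review step
```

Even this crude score surfaces the failure that matters here: the agent skipped the review step entirely, which no unit test on the review tool itself would ever catch.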
What the Discipline Demands
Agentic engineering is following the arc of distributed systems engineering. It started as something people bolted together without a shared vocabulary, accumulated failure modes that were surprising the first time and predictable the second, and is now developing patterns, tooling, and evaluation methodology. Willison’s framing of it as a distinct discipline is accurate and useful, because it sets expectations for what the work involves.
Building an agent that demos well is not difficult. Building one that handles real workloads reliably, with bounded costs, defensible security properties, and failure modes you understand, is the engineering problem. The loop is the starting point, not the destination.