Agentic Engineering Is an Architecture Problem, Not a Prompt Problem
Source: simonwillison
The word “agent” is doing a lot of work in the AI industry right now. Every framework ships agents. Every product demo ends with one completing some impressive multi-step task. Simon Willison’s guide to agentic engineering patterns cuts through the noise with a definition worth anchoring to: agentic engineering is the practice of building systems where a language model takes actions, observes the results, and uses those observations to decide what to do next.
That feedback loop is the whole thing, and everything else follows from it.
What Changes When You Add a Loop
A standard LLM call is a function: input goes in, output comes out. You send a prompt, you get text back. The model has no memory of previous calls, cannot affect the outside world, and terminates after one pass. This is adequate for a large class of problems.
An agentic system is different in kind, not just degree. The model can call tools, see what those tools return, and then decide to call more tools or produce a final answer. The execution path is not predetermined. A single user request might result in three tool calls, or thirty, depending on what the model encounters along the way.
This is what Willison means by the model being “in the loop.” It is not just that the model has tools available; it is that the model’s outputs (tool invocations) produce inputs (tool results) that feed back into subsequent model calls. The loop continues until the model determines it has finished.
Here is the minimal version of this in Python, using the Anthropic SDK:
```python
import anthropic

client = anthropic.Anthropic()

tools = [
    {
        "name": "read_file",
        "description": "Read the contents of a file from disk",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    }
]


def dispatch_tool(name, tool_input):
    # Minimal dispatcher for the single tool defined above.
    if name == "read_file":
        with open(tool_input["path"]) as f:
            return f.read()
    return f"Unknown tool: {name}"


messages = [{"role": "user", "content": "Summarize the contents of config.yaml"}]

while True:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        tools=tools,
        messages=messages,
    )
    if response.stop_reason == "end_turn":
        print(response.content[-1].text)
        break
    tool_results = []
    for block in response.content:
        if block.type == "tool_use":
            result = dispatch_tool(block.name, block.input)
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": result,
            })
    messages.append({"role": "assistant", "content": response.content})
    messages.append({"role": "user", "content": tool_results})
```
The `while True` at the center is not decoration; it is the architecture. The model runs, potentially calls tools, sees results, runs again. This pattern has a name in the research literature: ReAct (Reason + Act), introduced in a 2022 paper by Yao et al., a collaboration between Princeton and Google Brain. The paper demonstrated that interleaving reasoning traces with tool actions significantly improved performance on knowledge-intensive tasks compared to either approach in isolation. Most modern agentic systems descend from this pattern, whether or not their authors know it by name.
The Engineering Challenges That Follow from the Loop
Once the loop is recognized as the fundamental primitive, the engineering challenges come into focus.
Context accumulates. Every tool result appended to the message history grows the context window. A task requiring twenty tool calls will have a much larger context than one requiring two. This affects latency, cost, and accuracy, since models attend differently over long contexts. Practical agentic systems need strategies for context management: summarization passes, sliding windows, or structured state that lives outside the context entirely. The Claude API’s extended context window shifts where the ceiling sits, but it does not eliminate the pressure.
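One of those strategies can be sketched in a few lines: keep recent turns intact but truncate tool results from older turns before each model call. The `MAX_RESULT_CHARS` cap and the truncation approach here are illustrative choices, not part of any SDK.

```python
# Sketch: cap context growth by truncating tool results in older turns.
MAX_RESULT_CHARS = 500

def compact_history(messages, keep_recent=2):
    """Truncate tool results in all but the most recent messages."""
    cutoff = len(messages) - keep_recent
    for msg in messages[:cutoff]:
        if msg["role"] != "user" or not isinstance(msg["content"], list):
            continue
        for block in msg["content"]:
            if isinstance(block, dict) and block.get("type") == "tool_result":
                content = block.get("content", "")
                if isinstance(content, str) and len(content) > MAX_RESULT_CHARS:
                    block["content"] = content[:MAX_RESULT_CHARS] + "\n[truncated]"
    return messages
```

A summarization pass would replace the truncation with another model call; the structural point is the same: something must shrink old context before it crowds out new context.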
Errors compound. In a single-shot LLM call, a wrong answer is just a wrong answer. In an agentic loop, a wrong action produces a misleading tool result, which causes the next action to be wrong. A system with 90% per-step accuracy has roughly 35% end-to-end accuracy after ten steps. This is not solvable with better prompts alone; it requires architectural decisions like validation checkpoints, retry logic with varied strategies, and human-in-the-loop interruptions for high-stakes actions.
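The 35% figure is nothing more than per-step accuracy compounded, assuming each step must succeed independently:

```python
# End-to-end success probability when every step must succeed.
def end_to_end_accuracy(per_step: float, steps: int) -> float:
    return per_step ** steps

print(round(end_to_end_accuracy(0.90, 10), 3))  # 0.349
```

The same arithmetic explains why validation checkpoints help: catching and correcting a bad step resets the chain instead of letting the error propagate through every step after it.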
Tool design is load-bearing. The quality of the tools the model can call matters as much as the quality of the model itself. A poorly designed tool, one with ambiguous parameters, error messages the model cannot interpret, or side effects it cannot anticipate, will degrade the whole system. Tools need to be designed with the model as the caller in mind. That means rich, unambiguous descriptions, predictable return shapes, and explicit error signaling. Willison has emphasized this point repeatedly in his writing: the interface between the model and the world is where most agentic systems actually fail.
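What "designed with the model as the caller in mind" looks like in practice is mostly schema and description work. The hypothetical `query_orders` tool below illustrates the idea: a precise description that states limits and the error convention, constrained parameter types, and per-field documentation.

```python
# Illustrative tool definition; the name "query_orders" and its fields are
# hypothetical, used only to show the design principles.
query_orders_tool = {
    "name": "query_orders",
    "description": (
        "Look up orders for a customer, newest first, returning at most "
        "`limit` orders. On failure, returns a JSON object with an 'error' "
        "field describing what went wrong and how to correct the call."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "customer_id": {
                "type": "string",
                "description": "Customer ID, e.g. 'cus_12345'. Required.",
            },
            "limit": {
                "type": "integer",
                "minimum": 1,
                "maximum": 50,
                "description": "Maximum number of orders to return (default 10).",
            },
        },
        "required": ["customer_id"],
    },
}
```

The explicit error convention matters as much as the happy path: a model that receives "error: unknown customer_id; IDs look like 'cus_12345'" can recover, while one that receives a bare stack trace usually cannot.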
Costs are non-deterministic. A human typing a query costs nothing extra per word. A model autonomously running a loop of tool calls can, if the loop runs longer than expected, incur significant API costs. A system that worked fine in testing with five-step loops might encounter a task requiring fifty steps in production. Agentic systems need explicit loop limits, cost budgets, and real-time monitoring that traditional request-response applications do not require.
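A minimal sketch of those guardrails, assuming a `run_step` callable that stands in for one model call plus tool dispatch and reports back (done, tokens_used); the specific limits are illustrative:

```python
# Guardrails around an agentic loop: an iteration cap and a token budget.
MAX_STEPS = 25
TOKEN_BUDGET = 200_000

def run_agent(run_step):
    tokens_spent = 0
    for step in range(MAX_STEPS):
        done, tokens_used = run_step()
        tokens_spent += tokens_used
        if done:
            return "completed", step + 1, tokens_spent
        if tokens_spent >= TOKEN_BUDGET:
            return "budget_exceeded", step + 1, tokens_spent
    return "step_limit_reached", MAX_STEPS, tokens_spent
```

Returning a status rather than raising lets the caller decide whether a budget overrun means "fail the task" or "escalate to a human," which is usually a product decision, not a library decision.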
Orchestrators and Subagents
One pattern that emerges from these constraints is the orchestrator/subagent split. Rather than a single agent managing everything, a high-level orchestrator model plans and delegates, while specialized subagent models execute specific subtasks. The orchestrator might never call external tools directly; it calls other agents as its tools.
This mirrors how software systems decompose responsibility in general. A focused subagent doing code review has a smaller, more coherent context than a generalist agent doing code review, web search, email drafting, and database queries simultaneously. Specialization improves reliability at the cost of coordination complexity.
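The shape of the split can be sketched in a few lines: the orchestrator's only "tools" are other agents, each wrapped behind a specialized system prompt. The subagent names and the `call_model` placeholder here are illustrative, not any framework's API.

```python
def call_model(system_prompt: str, task: str) -> str:
    # Placeholder for one LLM call made with a specialized system prompt.
    return f"[{system_prompt.split('.')[0]}] handled: {task}"

SUBAGENTS = {
    "code_reviewer": "You review code diffs for bugs and style issues.",
    "web_researcher": "You answer questions by searching and citing sources.",
}

def dispatch_subagent(name: str, task: str) -> str:
    """The orchestrator sees each subagent as a tool with one argument."""
    if name not in SUBAGENTS:
        return f"Unknown subagent '{name}'. Available: {sorted(SUBAGENTS)}"
    return call_model(SUBAGENTS[name], task)
```

Note that the unknown-subagent case returns a corrective message rather than raising: the orchestrator model, like any tool caller, needs errors it can read and act on.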
Frameworks like LangGraph and Anthropic’s own agent patterns documentation provide primitives for managing multi-agent coordination. More agents means more API calls, more contexts to manage, and more potential for miscommunication between layers. The fundamental tension between capability and complexity does not disappear with better tooling; it just becomes more manageable.
Where Agentic Systems Sit Relative to Older Automation
A cron job is not an agent. A deterministic script is not an agent. A state machine that follows a fixed graph of transitions is not an agent, even if it calls an LLM at each node.
The defining property is that the transition function is the model’s judgment, not a programmer’s enumeration. The model decides which tool to call, with what arguments, and whether to continue or stop. This is both the source of capability and the source of difficulty. You gain the ability to handle situations the programmer did not anticipate; you lose the ability to exhaustively reason about what the system will do in all cases.
This is not a reason to avoid the pattern. Plenty of valuable software operates with non-deterministic control flow: web browsers, operating system schedulers, database query planners. Building reliable systems with judgment-driven control flow requires different verification strategies than building deterministic pipelines. You need observability into the model’s reasoning, structured logging of every tool call and result, and tested failure modes rather than just tested happy paths.
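The structured-logging requirement can be met by wrapping tool dispatch so that every call and result becomes one auditable record. The JSON-lines format below is an illustrative choice, as is the convention of surfacing exceptions to the model as readable error strings rather than crashing the loop.

```python
import json
import time

def logged_dispatch(dispatch, name, tool_input, log):
    """Wrap a tool dispatcher so every call produces a structured log record."""
    record = {"ts": time.time(), "tool": name, "input": tool_input}
    try:
        result = dispatch(name, tool_input)
        record["status"] = "ok"
        record["result_preview"] = str(result)[:200]
    except Exception as exc:
        record["status"] = "error"
        record["error"] = repr(exc)
        result = f"Tool error: {exc!r}"  # surfaced to the model, not swallowed
    log.append(json.dumps(record))
    return result
```

Running every tool call through a wrapper like this is also where you test failure modes: feed it a tool that raises, and verify the loop degrades the way you intended rather than the way the stack trace decided.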
The Framing That Makes It a Discipline
Willison’s framing of agentic engineering as a discipline is useful because it treats the engineering as the hard part. The model is a component. Building reliable systems around that component, systems that handle failures gracefully, stay within budget, maintain coherent state, and produce auditable outputs, is the actual work.
The term distinguishes this from prompt engineering (crafting inputs to get better outputs from one call) and from fine-tuning (modifying model weights). Agentic engineering is the practice of building the architecture around the model’s judgment: the tools, the loops, the checkpoints, the observability, and the failure modes.
For anyone building event-driven systems, whether Discord bots or background job processors, the mental model translates fairly directly. You already think about event handlers, state management, and error recovery across asynchronous steps. An agentic loop is an event loop where one of the event handlers is an LLM. The loop infrastructure, the retry logic, the context plumbing, and the failure handling are familiar territory. What is new is that the decision function inside the loop is no longer code you wrote; it is a model making judgment calls based on what it has seen so far.
That shift in where the logic lives is what makes agentic engineering its own thing, and what makes understanding the feedback loop the prerequisite for building these systems well.