
The Engineering Layer Beneath Every AI Agent

Source: Simon Willison

There is a line worth drawing between prompting an LLM and engineering an agentic system. Prompting is about coaxing a single good output from a model. Agentic engineering is about building the scaffolding that lets a model take a sequence of actions, use external tools, and complete tasks that span multiple steps over time. Simon Willison’s guide to agentic engineering patterns makes this distinction explicit: the model is not the system, it is a component inside the system. Once you internalize that framing, the entire shape of the problem changes.

The Agent Loop

At the core of every agentic system is a loop. The model receives a context window, produces output that may include tool calls, the scaffolding executes those calls, appends the results to the context, and hands control back to the model. This continues until the model signals completion or the system hits a stopping condition.
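The loop can be sketched in a few lines. This is a minimal illustration, not any particular framework's API: `call_model` is a stub standing in for a real model client, and the `echo` tool is a placeholder.

```python
# Minimal agent loop: model proposes tool calls, scaffolding executes them,
# results are appended to context, control returns to the model.

def call_model(context):
    """Stub model: requests one tool call, then signals completion."""
    if any(msg["role"] == "tool" for msg in context):
        return {"type": "done", "content": "task complete"}
    return {"type": "tool_call", "name": "echo", "args": {"text": "hello"}}

TOOLS = {"echo": lambda text: f"echo: {text}"}

def run_agent(task, max_steps=10):
    context = [{"role": "user", "content": task}]
    for _ in range(max_steps):                # hard stop: never loop forever
        output = call_model(context)
        if output["type"] == "done":          # model signals completion
            return output["content"]
        result = TOOLS[output["name"]](**output["args"])      # execute call
        context.append({"role": "tool", "content": result})   # feed back
    raise RuntimeError("agent exceeded step budget")
```

The `max_steps` budget is the stopping condition the paragraph above mentions: without it, a confused model can loop indefinitely.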

This pattern has been formalized in different ways. The ReAct paper from 2022 described it as Reason + Act cycles, where the model alternates between reasoning traces and action steps. More recent frameworks like LangGraph implement this loop as an explicit state machine with edges representing possible transitions between reasoning steps and tool invocations.

The loop looks deceptively simple on a diagram. In practice it surfaces every hard problem in distributed systems: you need retry logic for failed tool calls, timeout handling for slow external APIs, error propagation that gives the model enough context to recover, and some notion of when to give up entirely. A model that encounters an unexpected error mid-task and silently loops is not a demo failure. It is an engineering failure, and it is preventable with the same discipline you would apply to any long-running process.

The Context Window Is Your Process State

One useful way to reason about agentic systems is to treat the context window as process state. Everything the agent knows about its current task, every tool result, every prior reasoning step, lives in that window. When the window fills, you have to make decisions about what to evict, summarize, or offload to external storage.

This is structurally similar to the decisions operating systems make under memory pressure, except the cost is measured in tokens rather than page faults, and the eviction policy directly affects the quality of the agent’s decisions. Summarizing prior steps to save tokens can cause the agent to lose track of constraints it agreed to earlier in the run. Keeping raw tool output verbatim burns context budget fast, especially for tools that return large JSON payloads from external APIs.

Good agentic engineering means designing tool outputs to be compact and semantically rich, placing durable constraints early in the system prompt where they are hardest to crowd out, and building context management logic that understands task structure rather than operating purely on token count. None of this is automatic. It requires deliberate design.
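One shape that task-aware context management can take is shown below. The `pinned` flag and the whitespace token counter are illustrative assumptions, not a real tokenizer or framework feature: durable constraints are marked as non-evictable, and old tool output is dropped first when the budget is exceeded.

```python
def count_tokens(msg):
    # Crude stand-in for a real tokenizer; real systems should use the
    # model's own token counting.
    return len(msg["content"].split())

def trim_context(messages, budget):
    """Evict the oldest unpinned tool results until the token budget is met.
    Pinned messages (system prompt, agreed constraints) survive verbatim."""
    total = sum(count_tokens(m) for m in messages)
    kept = []
    for m in messages:  # walk oldest-first
        if total > budget and not m.get("pinned") and m["role"] == "tool":
            total -= count_tokens(m)
            continue  # evict this tool result
        kept.append(m)
    return kept
```

The key design choice is that eviction respects task structure (what is pinned, what is stale tool output) rather than blindly truncating from the top.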

Tool Design Is API Design

If the context window is process state, tool definitions are the API surface that the model calls into. The Anthropic tool use documentation makes this concrete: tool descriptions are the primary interface between the model and your system. The model cannot inspect implementation; it infers everything about when and how to call a tool from the name, description, and parameter schema you provide.

This means tool design requires the same care as public API design. A tool named get_data with a vague description will be called incorrectly. A tool that returns errors as plain strings rather than structured objects gives the model less to work with when deciding whether to retry or escalate. A tool with fifteen optional parameters is asking the model to make decisions it does not have enough context to make well.

The discipline here is close to interface design in systems programming. You define boundaries, encode invariants in types and schemas, and write documentation that is precise rather than exhaustive. The difference is that your documentation is being read by a statistical model that will fill gaps with plausible-sounding inference, so ambiguity has a more unpredictable failure mode than it does with human API consumers. When a human misreads a vague parameter name, they ask a clarifying question or read the source. A model invents a reasonable-sounding interpretation and proceeds.

A concrete example: if you have a tool that deletes records, naming the parameter id rather than record_id_to_permanently_delete is not just a style choice. It affects how cautiously the model will reach for that tool when the context is ambiguous. The model’s behavior is shaped by every word in your schema.
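That deletion tool might be defined like this, using the name/description/input-schema shape that Anthropic's tool use API expects. The specific tool and its wording are illustrative; the point is that every field is model-facing documentation.

```python
# A tool definition where the verbose parameter name and explicit
# description do real safety work: they are the only interface the
# model sees, so caution has to be encoded in the words themselves.
delete_record_tool = {
    "name": "delete_record",
    "description": (
        "Permanently delete a single record from the customer database. "
        "This action cannot be undone. Only call this after the user has "
        "explicitly confirmed which record should be removed."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "record_id_to_permanently_delete": {
                "type": "string",
                "description": "Exact ID of the record to delete.",
            }
        },
        "required": ["record_id_to_permanently_delete"],
    },
}
```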

Prompt Injection: The Attack Surface That Scales With Capability

Agentic systems that interact with external content inherit a structural security problem. When a model processes user-supplied documents, web pages, emails, or database records, any of that content can carry instructions designed to redirect the model’s behavior. This is prompt injection, and its severity scales directly with how much capability the agent has.

A single-turn chatbot that processes a malicious document can be tricked into producing bad output. An agent with tool access that processes the same document can be tricked into calling tools with attacker-controlled parameters, exfiltrating data to external endpoints, or modifying state in ways the legitimate user never authorized. The blast radius grows with every tool you add.

Simon Willison has written about this as one of the hardest unsolved problems in LLM security. The core difficulty is structural: models process instructions and data in the same channel, the context window, so there is no hardware or OS-level separation between “trust this” and “analyze this but do not obey it.” Defenses exist, including confirmation steps before high-impact tool calls, privilege separation between agent roles, and explicit skepticism instructions baked into the system prompt, but none of them are complete. They reduce the attack surface; they do not eliminate it.
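The confirmation-step defense can be sketched as a gate in the scaffolding, outside the model's control. The tool names and the `confirm` callback here are assumptions standing in for whatever approval channel a real deployment uses (a UI prompt, a review queue).

```python
# Tools whose effects are hard to reverse require out-of-band approval.
HIGH_IMPACT = {"delete_record", "send_email", "transfer_funds"}

def execute_tool(name, args, tools, confirm):
    """Run a tool call, but route high-impact tools through a human
    confirmation step first. The refusal is returned as structured data,
    so the model sees why the call was blocked."""
    if name in HIGH_IMPACT and not confirm(name, args):
        return {"ok": False, "error": f"'{name}' requires user confirmation"}
    return {"ok": True, "result": tools[name](**args)}
```

Because the gate lives in the scaffolding rather than the prompt, injected instructions cannot talk the model out of it; at worst they can trigger a confirmation request the user then declines.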

This is worth taking seriously before deploying agents into production workflows. An agent that reads emails and has permission to send replies represents a meaningfully larger attack surface than a chatbot, and the risk scales with the breadth of tool access rather than the sophistication of the model.

Observability Is Not Optional

With traditional software, you can attach a debugger, inspect the stack, and follow execution step by step. With an agentic system, the reasoning that led to a particular tool call is interleaved with the model’s output as tokens, not as a call stack you can pause and inspect.

The practical implication is that structured logging at every tool boundary is load-bearing infrastructure, not a nice-to-have. Every tool call, its parameters, its result, and the model output that preceded it needs to be captured with a consistent trace ID that spans the entire run. Tools like LangSmith provide this as a managed layer for LangChain-based systems. Anthropic’s prompt workbench offers tracing for Claude-based workflows. Both are valuable, but the underlying discipline applies regardless of framework: when an agent does something unexpected in production, you need enough recorded state to reconstruct why.

This is the same principle as distributed tracing in microservices architectures. Requests cross asynchronous boundaries and you cannot reconstruct causality from logs alone without propagating correlation identifiers. In an agentic system, the “service boundary” is the model inference call, and the correlation context is the thread of reasoning you need to follow after the fact.
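The framework-agnostic version of that discipline is small. This sketch emits one structured record per tool boundary, correlated by a trace ID that spans the whole run; the field names are illustrative, not a standard.

```python
import json
import time
import uuid

def new_trace_id():
    """One ID per agent run, propagated through every log record."""
    return uuid.uuid4().hex

def log_tool_call(trace_id, step, name, args, result, log=print):
    """Emit a structured JSON record at the tool boundary, so an entire
    run can be reconstructed after the fact by filtering on trace_id."""
    log(json.dumps({
        "trace_id": trace_id,
        "step": step,
        "ts": time.time(),
        "tool": name,
        "args": args,
        "result": result,
    }))
```

In production the `log` callable would write to whatever structured logging pipeline you already run; the point is that every record carries the correlation identifier.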

Why the Word “Engineering” Is Earned

Agentic engineering is not prompt engineering with more steps. Prompt engineering is about getting good outputs from a model. Agentic engineering is about building reliable, secure, and observable software systems that happen to include a model as a component. The model brings capabilities that would be hard to construct any other way; the engineering layer is what makes those capabilities deployable.

Systems that fail in production because the error handling path was never designed, or that get compromised because the injection surface was not considered, are not model failures. They are engineering failures, and the disciplines that prevent them, interface design, state management, observability, security review, are the same ones that apply to any other software system. The model is new. The engineering is not.
