Agentic Engineering Is a Real Discipline, Not Just Prompting With Extra Steps
Source: simonwillison
Simon Willison recently published a guide defining what he calls “agentic engineering” as a coherent practice. The framing is useful because the field has been badly served by vague terminology, and getting the vocabulary right matters when you are trying to build something that actually works.
The core idea is straightforward: agentic systems are programs where an LLM drives a loop, calling tools and acting on results until some goal is met. That sounds simple. The engineering challenges hidden inside that loop are not.
What Makes It a Discipline
Prompt engineering is about crafting inputs to get better outputs from a model in a single exchange. Agentic engineering is about orchestrating sequences of model calls, tool invocations, and state transitions across many steps, each of which can fail independently. The failure modes compound.
Consider the difference in verification complexity. A single prompt either produces a useful response or it does not. An agent running a ten-step workflow can fail at step three in a way that produces plausible-looking output at step ten. Catching that requires reasoning about the entire trace, not just the final response.
This is why agentic engineering borrows more from distributed systems thinking than from NLP. You are designing for partial failures, for retries, for state consistency across steps. The model is less like a function and more like an unreliable RPC endpoint that nonetheless has to coordinate real side effects.
The Agent Loop
Most agentic frameworks converge on a variation of the ReAct pattern introduced by Yao et al. in 2022: Reasoning and Acting interleaved. The model generates a thought, selects a tool call, observes the result, generates another thought, and so on until it produces a final answer or hits a stopping condition.
In practice the loop looks like this:
```python
def agent_loop(model, context):
    done = False
    while not done:
        response = model(context)
        if response.has_tool_call:
            result = execute_tool(response.tool_call)
            context.append(tool_result(result))
        else:
            done = True
    return response.content
```
That is the skeleton. The real work is in everything surrounding it: how the context grows, when to truncate it, how tool errors are represented and surfaced back to the model, and what “done” actually means for a given task.
The context window is not just a buffer. It is the process state. Everything the agent knows about what has happened, what it has tried, and what it is supposed to be doing lives in that window. When the window fills up, you lose state. When the state is corrupt or misleading, the agent’s subsequent decisions degrade in ways that are often subtle and hard to detect at runtime.
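One way to make "context is process state" concrete is to track the token budget explicitly instead of letting the window grow unbounded. The sketch below is illustrative and not from Willison's guide; `count_tokens` is a stand-in for a real tokenizer, approximating tokens as whitespace-separated words.

```python
def count_tokens(text: str) -> int:
    # Stand-in for a real tokenizer (a provider tokenizer would be used in practice).
    return len(text.split())

class Context:
    """A message list with a hard token ceiling: the agent's process state."""

    def __init__(self, budget: int):
        self.budget = budget
        self.messages: list[dict] = []

    def tokens_used(self) -> int:
        return sum(count_tokens(m["content"]) for m in self.messages)

    def append(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})
        if self.tokens_used() > self.budget:
            # In a real agent this would trigger summarization or truncation
            # rather than failing outright.
            raise RuntimeError("context budget exceeded; compact before continuing")

ctx = Context(budget=50)
ctx.append("system", "You are a helpful agent.")
ctx.append("tool", "query returned 42 rows")
print(ctx.tokens_used())  # → 9
```

Treating the budget as a hard limit forces the overflow decision to happen at a defined point, rather than silently losing state when the provider truncates for you.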
Tool Design as API Contract
Willison’s framing emphasizes that tool descriptions are not documentation; they are the primary interface through which the model understands what it can do. A badly described tool is like a function with a misleading name and no type signature. The model will misuse it.
Consider a tool named run_query with the description “Runs a database query.” The model has no way to know whether this tool is read-only, whether it auto-commits, what the connection limits are, or what error format to expect. A better description constrains the model’s behavior before the call happens:
```json
{
  "name": "run_read_query",
  "description": "Execute a read-only SQL SELECT against the analytics database. Returns at most 1000 rows as JSON. Raises an error for any non-SELECT statement. Do not use for writes or schema changes.",
  "parameters": {
    "sql": {
      "type": "string",
      "description": "A valid SELECT statement. Must not include semicolons or multiple statements."
    }
  }
}
```
The constraint in the description is doing real work. The model cannot read your intent; it reads the schema and the description. Every ambiguity in those gets interpreted probabilistically, which means inconsistently.
This is the part of agentic engineering that most directly resembles API design, and it deserves the same rigor. Breaking a tool’s contract silently is as bad as changing a REST endpoint’s response shape without versioning it.
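The description constrains the model, but the server side should enforce the same contract, since the model's compliance is probabilistic. A minimal sketch of what that enforcement might look like for the `run_read_query` tool above (the regex check is deliberately crude and would, for example, reject legitimate `WITH ... SELECT` queries):

```python
import re

def validate_read_query(sql: str) -> str:
    """Enforce the run_read_query contract before the query reaches the database:
    single statement, no semicolons, SELECT only."""
    stripped = sql.strip()
    if ";" in stripped:
        raise ValueError("multiple statements / semicolons not allowed")
    if not re.match(r"(?i)^select\b", stripped):
        raise ValueError("only SELECT statements are permitted")
    return stripped

validate_read_query("SELECT id FROM events LIMIT 10")  # passes
```

The point is the pairing: the description tells the model what the tool will do, and the validator makes that promise true even when the model ignores it.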
Context Management Under Pressure
Long-running agents accumulate context fast. A workflow that calls ten tools, each returning 500 tokens of output, has consumed 5000 tokens before the model has written a word of its own. Add the system prompt, the initial user request, and the interleaved model responses, and most real tasks push against practical context limits within a handful of steps.
The strategies for managing this are well understood but each carries trade-offs:
Summarization compresses old tool results into a condensed narrative. It reduces token count but loses fidelity. If the summary omits a detail the agent needs in step fifteen, the agent will either hallucinate a substitute or fail in a way that is hard to trace back to the compression.
Retrieval stores intermediate results outside the context window and pulls them back on demand via a search tool. This scales better but introduces a new dependency: the agent now has to know what it needs before it needs it, which requires some degree of planning the model may not do reliably.
Windowed truncation simply drops the oldest context. This is fine for some conversational tasks and catastrophic for tasks where early decisions constrain later ones.
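A minimal sketch of the windowed approach makes the trade-off visible: the system prompt and initial request are pinned, and the oldest intermediate turns are dropped first. The message shapes here are illustrative.

```python
def truncate(messages: list[dict], max_messages: int, pinned: int = 2) -> list[dict]:
    """Windowed truncation: keep the first `pinned` messages (system prompt,
    initial user request) and the most recent turns; drop the middle."""
    if len(messages) <= max_messages:
        return messages
    keep_tail = max_messages - pinned
    return messages[:pinned] + messages[-keep_tail:]

history = (
    [{"role": "system", "content": "agent rules"},
     {"role": "user", "content": "initial task"}]
    + [{"role": "tool", "content": f"step {i}"} for i in range(10)]
)
window = truncate(history, max_messages=6)
# The pinned head survives; only the four most recent tool results remain.
```

Note that steps 0 through 5 are simply gone. If step 2 established a constraint that matters at step 9, this strategy fails exactly the way the paragraph above describes.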
The right choice depends on the task structure, and knowing the task structure well enough to make that choice is itself part of the engineering work.
Prompt Injection and the Trust Problem
Agentic systems have a security property that single-turn LLM use does not: they act on content from external sources, and that content can contain instructions. Prompt injection is the attack vector where malicious text in the environment (a web page the agent fetches, a file it reads, a database row it retrieves) attempts to hijack the agent’s behavior by embedding instructions the model might follow.
This is not a hypothetical. Willison has documented real cases where agents browsing the web were redirected by injected instructions in HTML. In an agentic context where the model has access to email, files, or external APIs, the blast radius of a successful injection scales with the agent’s permissions.
The minimal footprint principle is the main defensive response: agents should request only the permissions they need for the current task, prefer reversible actions over irreversible ones, and confirm with humans before taking actions with broad side effects. This is standard least-privilege thinking applied to a context where the principal is a language model whose behavior under adversarial input is probabilistic, not deterministic.
No purely technical mitigation fully solves this. The current state of the art is defense in depth: careful tool scoping, sandboxing tool execution where possible, logging all tool calls for audit, and designing approval flows for high-impact operations.
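Two of those layers, audit logging and approval gates, fit naturally into the tool-dispatch path. The sketch below is illustrative, not from Willison's guide; the tool names and the `dispatch` registry are hypothetical stand-ins.

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent.audit")

# Illustrative set of tools considered high-impact for this agent.
HIGH_IMPACT = {"send_email", "delete_file", "execute_write_query"}

def dispatch(tool_name: str, args: dict) -> dict:
    # Stand-in registry; a real system routes to actual tool implementations.
    return {"ok": True, "tool": tool_name}

def execute_with_gate(tool_name: str, args: dict, approve=input) -> dict:
    """Log every tool call for audit; require human approval for high-impact
    tools. `approve` is injectable so tests and batch runs can supply a policy."""
    log.info("tool_call %s", json.dumps({"tool": tool_name, "args": args}))
    if tool_name in HIGH_IMPACT:
        answer = approve(f"Allow {tool_name} with {args}? [y/N] ")
        if answer.strip().lower() != "y":
            # Surfaced back to the model as a tool result, not swallowed.
            return {"error": "denied by operator"}
    return dispatch(tool_name, args)
```

The denial is returned as a structured error rather than raised, so the model sees the refusal in its context and can adjust, instead of the loop dying mid-task.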
Multi-Agent Coordination
The next layer of complexity is systems where multiple agents work together. A common pattern is the supervisor-worker architecture: one orchestrating agent decomposes a task and delegates subtasks to specialized worker agents, then synthesizes their results.
This scales well for parallelizable work. A research agent can fan out to several worker agents searching different sources simultaneously, then merge the results. The wall-clock time for the task can be much lower than a single-agent linear approach.
But it introduces new failure modes. The orchestrator has to trust the worker agents’ outputs, and those outputs arrive as text in the orchestrator’s context window. There is no type system enforcing that a worker’s response matches the schema the orchestrator expects. Error propagation across agent boundaries is poorly defined in most current frameworks. When a worker fails partway through, the orchestrator may not receive a signal that cleanly represents what happened.
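The absence of a type system at the agent boundary can be partially compensated for by treating every worker reply as untrusted input and validating it before the orchestrator acts on it. A minimal sketch, with illustrative field names:

```python
import json

def parse_worker_result(raw: str, required: set[str]) -> dict:
    """Parse a worker agent's reply and check the fields the orchestrator
    depends on, instead of assuming well-formed output arrived."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"worker returned non-JSON output: {e}") from e
    missing = required - data.keys()
    if missing:
        raise ValueError(f"worker response missing fields: {sorted(missing)}")
    return data

result = parse_worker_result(
    '{"source": "web", "summary": "three relevant findings"}',
    required={"source", "summary"},
)
```

This does not solve error propagation, but it converts "the worker's output silently didn't match expectations" into an explicit, loggable failure at the boundary where it occurred.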
Frameworks like AutoGen, CrewAI, and smolagents each make different architectural bets about how to handle these problems, but none has converged on a solution that feels stable. The distributed systems analogy holds: you are essentially doing RPC between unreliable, stateful services, and the standard answers (idempotent operations, explicit error types, structured logging) apply.
What This Means in Practice
For anyone building agentic systems today, the practical takeaways from Willison’s framing are concrete. Design tools the way you design APIs: versioned, typed, with explicit contracts. Treat context as a first-class resource with a budget, not an unbounded accumulator. Log every tool call with its inputs, outputs, and the model’s reasoning trace. Build approval gates for irreversible operations.
Agentic engineering is not a new category of magic. It is software engineering applied to a system component, the language model, that has unusual properties: it is powerful and flexible but also nondeterministic, context-sensitive, and vulnerable to input manipulation. The discipline is about building the scaffolding around that component carefully enough that the system as a whole behaves reliably.
The guides Willison is assembling at simonwillison.net are among the more grounded resources available on this right now. Most writing on agents either oversells autonomous capability or gets lost in framework comparisons. The focus on patterns and failure modes is where the useful engineering knowledge actually lives.