
The Primary Lever in Agentic Engineering Changes at Every Level

Source: simonwillison

The mental model most teams bring to their first agentic system is the same one that served them in a chat interface: when output quality is low, improve the prompt. That model works fine for Level 1 work (stateless text generation) and holds up reasonably well for Level 2 (single tool invocations). But somewhere around the fifth step of a multi-step chain, the primary source of failures shifts away from language quality and toward software engineering discipline, and teams that don’t notice the transition spend months optimizing prompts for a class of failures that prompts cannot fix.

Simon Willison’s guide on agentic engineering patterns is a systematic attempt to define the discipline, and worth reading for the taxonomy alone. What it prompted me to think about more carefully is where the leverage lives at each level, because the answer is different at each one.

What the levels actually represent

A useful taxonomy maps agency onto five levels, roughly analogous to SAE’s driving automation levels:

  • Level 1 — Stateless generation: One prompt, one response. Engineering is prompt design and output parsing.
  • Level 2 — Single tool use: The model can call functions, invoke APIs, read or write external state. Anthropic’s tool_use content blocks and OpenAI’s function calling are the primitives.
  • Level 3 — Multi-step planning: The model reasons over a sequence of actions, executes them serially, and uses output from earlier steps to inform later ones. The ReAct pattern (Yao et al., 2022) is the dominant structure here.
  • Level 4 — Persistent memory: The agent stores and retrieves state across sessions, typically using vector databases or structured stores.
  • Level 5 — Multi-agent coordination: Multiple LLM instances with distinct roles, passing work between each other. Frameworks like LangGraph, AutoGen, and CrewAI live at this level.

The level matters not because it labels capability but because it determines what engineering problem you are actually solving.

The phase transition at Level 3

At Levels 1 and 2, prompt quality is the dominant variable. Better instructions, better examples, clearer tool descriptions, cleaner output schemas — these determine whether the system works. Language engineering is the appropriate lever because the model’s language handling is the rate-limiting factor.

At Level 3, this breaks down due to basic probability arithmetic. With a 95% per-step success rate (optimistic for anything involving external state), a 10-step chain succeeds roughly 60% of the time. A 20-step chain: around 36%. A 50-step chain: about 8%. The model hasn’t become less capable; the error budget has shrunk because each step introduces an independent opportunity for silent failure that feeds forward into the next step.
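The arithmetic is easy to sanity-check, assuming independent per-step failures:

```typescript
// Probability that an n-step chain completes when every step
// independently succeeds with probability p.
function chainSuccess(p: number, steps: number): number {
  return Math.pow(p, steps);
}

console.log(chainSuccess(0.95, 10).toFixed(2)); // 0.60
console.log(chainSuccess(0.95, 20).toFixed(2)); // 0.36
console.log(chainSuccess(0.95, 50).toFixed(2)); // 0.08
```

Independence is a simplification — in practice errors feed forward, which usually makes the real numbers worse, not better.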

The failure modes that matter most at this level are not language problems. The MAST taxonomy (NeurIPS 2025) identified the four most failure-predictive modes in multi-step agents: unawareness of termination condition, premature termination, loss of conversation history, and drifting off-task. The structural mitigation for these is a workflow state machine with explicit named states and transitions, not a better system prompt.

A minimal pattern:

export type StepStatus =
  | "pending" | "running" | "passed" | "failed" | "skipped" | "rolled-back";

export interface WorkflowStep {
  label: string;
  command: string;
  rollback?: string;
  timeoutMs?: number;
}

export interface StepResult {
  step: WorkflowStep;
  status: StepStatus;
  output?: string;
}

export interface WorkflowRun {
  id: string;
  steps: WorkflowStep[];
  results: StepResult[];
  status: "running" | "completed" | "failed" | "rolled-back";
}

On step failure, an executor walks backward through completed steps and runs rollback commands in reverse. That is a minimum viable pattern. LangGraph’s checkpointing and Temporal’s durable execution provide this at framework level with crash recovery and exactly-once semantics.
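The rollback walk itself is a few lines. This is a sketch, not production code: the executor here is a stub that records commands instead of running them, and the step shape is simplified from the interfaces above.

```typescript
interface Step {
  label: string;
  command: string;
  rollback?: string;
}

// Hypothetical executor: a real system would shell out or call a tool;
// this stub just records what ran, for illustration.
const executed: string[] = [];
const runCommand = (cmd: string): void => { executed.push(cmd); };

// On failure at index failedAt, walk the completed steps in reverse
// and run each one's rollback command, most recent first.
function rollBack(steps: Step[], failedAt: number): void {
  for (let i = failedAt - 1; i >= 0; i--) {
    const rb = steps[i].rollback;
    if (rb !== undefined) runCommand(rb);
  }
}

const steps: Step[] = [
  { label: "migrate", command: "db:migrate", rollback: "db:rollback" },
  { label: "deploy", command: "deploy", rollback: "undeploy" },
  { label: "smoke", command: "smoke-test" },
];
rollBack(steps, 2); // step index 2 ("smoke") failed
console.log(executed); // ["undeploy", "db:rollback"]
```

Note the ordering: rollbacks run newest-first, mirroring how database migrations unwind.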

The ReWOO paper (Xu et al., 2023) makes a useful quantitative argument here: structural interventions (state machines, verification steps, explicit planning structure) yielded roughly 53% improvement in multi-step reliability in their evaluation, against roughly 16% from prompt engineering alone.

Tool use is three problems, not one

Level 2 feels straightforward but contains a subtlety that becomes expensive at Level 3. Using a tool correctly involves: selecting the right tool, forming valid arguments for it, and scheduling the call at the right position in the dependency graph. Most frameworks and most mental models collapse these into a single reasoning step, but they fail independently.

The OpenEnv Calendar Gym benchmark (Meta + Hugging Face, 2026) showed roughly 90% success on tasks where parameters were provided explicitly, dropping to around 40% on tasks described in natural language. More than half of the failures were argument formation errors: malformed datetimes ("2026-01-15 09:30:00" instead of RFC 3339 "2026-01-15T09:30:00-05:00"), missing required fields, incorrect nesting. The model selected the right tool; it could not translate the natural language description into a valid API call.

This is a grounding problem, not a reasoning problem, and the fix is different: explicit schema validation before tool execution, structured parsing of natural language into typed parameters, and clearer error feedback when a call fails validation. Anthropic explicitly frames tool descriptions as a prompt engineering problem, and cleaner descriptions do reduce wrong-tool selection, but they do relatively little for argument formation errors on fields the model has never seen.
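A sketch of what pre-execution validation looks like. The field names and error format here are hypothetical, not a real calendar API; the point is that malformed datetimes get caught and returned as structured feedback before the call ever leaves the agent.

```typescript
interface CreateEventArgs {
  title: string;
  start: string; // RFC 3339, e.g. "2026-01-15T09:30:00-05:00"
}

// Accepts "YYYY-MM-DDTHH:MM:SS" with optional fraction and a required
// "Z" or numeric UTC offset.
const RFC3339 =
  /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})$/;

function validateCreateEvent(args: Partial<CreateEventArgs>): string[] {
  const errors: string[] = [];
  if (!args.title) errors.push("title: required");
  if (!args.start) errors.push("start: required");
  else if (!RFC3339.test(args.start))
    errors.push(`start: not RFC 3339 (got "${args.start}")`);
  return errors; // feed these back to the model as structured feedback
}

// The space-separated datetime from the benchmark's failure mode
// fails validation instead of failing silently downstream:
console.log(validateCreateEvent({ title: "standup", start: "2026-01-15 09:30:00" }));
```

In production this role is usually played by a schema library rather than hand-rolled checks, but the placement is the same: validate between the model's output and the tool's execution.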

Context engineering and where information lives

Andrej Karpathy coined “context engineering” around mid-2025 to describe what had become a distinct discipline: deciding what information goes into the context window and how. For agentic systems, this breaks into three layers.

The first is static injection: files like CLAUDE.md or .cursorrules that load unconditionally on every session. These are best kept as concise operational conventions, not documentation.

The second is semi-static indexing: tools like Aider’s repo map that auto-generate symbol indices from ctags or tree-sitter, giving the model function signatures and locations without loading full file contents. This saves significant tokens over a “load everything” approach.

The third is dynamic retrieval: tools that fetch context on demand (file read, grep, directory listing). The Model Context Protocol (Anthropic, November 2024) standardizes this as a client-server protocol for exposing tools, resources, and prompt templates to any compatible agent. It decouples what the agent can know from what gets loaded at startup.

The “lost in the middle” problem (Liu et al., 2023) remains practically relevant: model attention degrades for information positioned in the middle of a long context. Critical instructions should sit at the start (system prompt position) or end, not buried halfway through a 100K-token context.
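One structural response is position-aware assembly: pin critical constraints at the start and restate them at the end, leaving the bulky retrieved material in the middle where degraded attention costs the least. A minimal sketch, with illustrative strings:

```typescript
// Assemble a context window with critical instructions at both
// high-attention positions (start and end).
function assembleContext(critical: string, retrieved: string[]): string {
  return [critical, ...retrieved, `Reminder: ${critical}`].join("\n\n");
}

const ctx = assembleContext(
  "Never run destructive commands without confirmation.",
  ["<file contents>", "<grep results>"],
);
```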

The security surface expands with each level

At Level 1, the security concern is direct prompt injection: a user input that overrides the system prompt. At Level 2, indirect injection enters, where an adversary plants instructions in content the agent retrieves. The Greshake et al. paper (2023) documented this systematically. Willison’s dual-LLM quarantine pattern (2023) remains one of the cleaner structural responses: a restricted LLM processes untrusted content without action authority; a privileged LLM acts on sanitized summaries.
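The shape of the quarantine pattern, with both model calls stubbed (the function names here are stand-ins, not a real API). What matters is the structure: the quarantined call sees raw untrusted content but has no tool authority, and the privileged call, which does have tools, never sees the raw content at all.

```typescript
type Tool = (args: string) => string;

// Hypothetical stand-in for a restricted model call: processes
// untrusted content, returns data, can never invoke tools.
const quarantinedLLM = (untrusted: string): string =>
  `summary(${untrusted.length} chars)`;

// Hypothetical stand-in for the privileged model call: the only
// path to tool invocation, and it only ever sees sanitized data.
const privilegedLLM = (task: string, data: string, tools: Tool[]): string =>
  `acted on: ${data}`;

function handleRetrievedContent(
  task: string,
  untrusted: string,
  tools: Tool[],
): string {
  const sanitized = quarantinedLLM(untrusted); // treated as data, not instructions
  return privilegedLLM(task, sanitized, tools);
}
```

An injection in the retrieved content can at worst corrupt the summary; it cannot reach anything with action authority.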

At Level 5, injections propagate through agent trust chains. In a 3-hop pipeline where each hop has an 18% susceptibility rate, the cumulative probability of an injection reaching the orchestrator is roughly 45%. The missing abstraction across all current production frameworks is output provenance tracking: knowing which retrieved content influenced which agent output. Microsoft’s Spotlighting technique (structural delimiters around retrieved content) reduced indirect injection by roughly 95% on GPT-4 with a 1-3% accuracy cost on benign tasks, which is a reasonable trade for most production systems.
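The trust-chain arithmetic mirrors the error-budget math from Level 3, and the spotlighting idea reduces to a wrapper plus a system-prompt instruction. Delimiter tokens below are illustrative, not Microsoft's exact scheme:

```typescript
// Probability that at least one hop in an n-hop pipeline relays an
// injection, given independent per-hop susceptibility p.
function chainExposure(p: number, hops: number): number {
  return 1 - Math.pow(1 - p, hops);
}

console.log(chainExposure(0.18, 3).toFixed(2)); // 0.45

// Spotlighting-style wrapper: the accompanying system prompt would
// instruct the model to treat delimited content strictly as data.
function spotlight(retrieved: string): string {
  return `<<untrusted>>\n${retrieved}\n<<end-untrusted>>`;
}
```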

Evaluation lags capability

Standard unit tests work at Level 1. They degrade at Level 2 (the output is correct but the tool call graph was wrong) and are nearly useless at Level 3 (a 20-step run that arrives at the right answer via the wrong intermediate reasoning). What is needed at each level is different: trajectory testing at Level 2 (assert tool call structure, not just output), partial-credit evaluation at Level 3 (frameworks like AgentBench and the GAIA benchmark), memory poisoning tests at Level 4, and contract testing at Level 5.
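A trajectory assertion can be as simple as comparing the recorded tool-call sequence against an expected one. The trace shape below is hypothetical; real frameworks expose richer traces, but the testing move is the same: assert on the path, not just the destination.

```typescript
interface ToolCall {
  tool: string;
  args: Record<string, unknown>;
}

// Fail if the agent's tool-call sequence deviates from the expected one.
function assertTrajectory(trace: ToolCall[], expectedTools: string[]): void {
  const actual = trace.map((c) => c.tool);
  if (actual.join(",") !== expectedTools.join(","))
    throw new Error(`expected [${expectedTools}], got [${actual}]`);
}

// Passes: the agent searched before it wrote, as required.
assertTrajectory(
  [
    { tool: "search", args: { q: "invoice 4821" } },
    { tool: "write_file", args: { path: "report.md" } },
  ],
  ["search", "write_file"],
);
```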

Most production agentic systems today are tested primarily through manual end-to-end runs. The evaluation infrastructure lags capability infrastructure by a meaningful margin, which is a practical reason to stay conservative about which level of agency a given system needs.

Where this leaves the discipline

Agentic engineering as a discipline is still in the phase where the vocabulary is being established. Willison’s guide is useful for that vocabulary. The practical implication is that the engineers who will build reliable Level 3+ systems are the ones who bring distributed systems thinking to the problem: state machine design, idempotent operations, structured tracing, checkpoint recovery. These are not new skills; they are skills that apply to a new context.

Prompt engineering is still necessary. It just stops being sufficient around the time you need your tenth tool call to work correctly.
