· 6 min read ·

What Agentic Engineering Requires Beyond Tool Calling

Source: simonwillison

Simon Willison published a guide to agentic engineering patterns on March 15, 2026. The title poses the definitional question directly, which is overdue. The word “agent” has been applied to so many different LLM product configurations that it has lost precision. The technical core the term should describe is specific: a system where a language model can take actions, observe the results, and use those observations to decide what to do next. The loop defines the capability. Without it, you have a sophisticated text completion; with it, you have something that can pursue a goal across multiple steps.

What “Agentic” Actually Means

The minimal agentic system has four components: a model, a set of tools, a context window, and a loop. The model reads the context, produces either a final response or a tool call, the tool executes and returns a result, that result gets appended to the context, and the cycle repeats until the model signals completion. Everything else, the framework choices, multi-agent coordination, memory systems, builds on top of this structure.

The pattern was formalized by the ReAct paper (Yao et al., 2022), which demonstrated that interleaving explicit reasoning traces with action calls produced better task completion than either pure chain-of-thought reasoning or pure action sequences. A model that can narrate its observations between steps, reflecting on what just happened before deciding what to do next, handles uncertainty better than one committed to a fixed plan.

Tool calling became a shipping API feature in 2023. OpenAI added function calling to the GPT-4 API; Anthropic followed with tool use for Claude. The wire format is straightforward: you pass tool schemas alongside your messages, and the model either responds with text or a structured tool call for your application to execute. The model never runs code directly; it requests execution on its behalf.

{
  "type": "tool_use",
  "name": "read_file",
  "id": "toolu_01xyz",
  "input": {
    "path": "/src/main.py"
  }
}

Your application runs the tool, returns the result, and the loop continues. The engineering work is everything surrounding that exchange.

Tool Design as a First-Class Concern

The quality of an agentic system depends more on tool design than most people expect when they first build one. A model can only do what its tools permit, and a poorly specified tool is a consistent source of failure.

Good tool schemas are unambiguous in naming, thorough in their descriptions of edge cases, and return structured output the model can process without further inference. A tool that dumps raw HTML when the model expects structured data burns context on parsing attempts. A tool with an ambiguous name gets called in the wrong situation. A tool with no documentation for error states produces confusing behavior when those states occur.

Anthropic’s Model Context Protocol (MCP), released in late 2024, is an attempt to standardize the interface between models and tools. The protocol defines how tools expose their schemas, how resource references work, and how prompts can be templated. An MCP server can be consumed by any MCP-compatible client without custom integration code. By early 2026, the adoption across developer tooling is substantial: databases, code analysis platforms, and API services all ship MCP servers. What was previously scattered across bespoke LangChain integrations is converging toward a common interface.

This is significant infrastructure. Writing custom glue code for every tool integration was not sustainable as the number of tools per application grew. MCP addresses the same coordination problem that HTTP addressed for web services, with similar tradeoffs: a common protocol enables interoperability but constrains how you work within it.

The Reliability Problem

Running a single LLM call is straightforward. Running a loop that terminates correctly, completes its task, and handles unexpected tool outputs gracefully is a different kind of challenge.

The two failure modes that come up most often are divergence and over-confidence. A diverging agent gets stuck, calling the same tool repeatedly or pursuing a strategy that stopped working, unable to recognize it should stop or try something different. An over-confident agent finishes, formats a clean response, and the downstream errors surface later when the user acts on them.

These failure modes compound in multi-step systems. An error in step three of an eight-step task changes the information available to every subsequent step. The final response can look entirely plausible while resting on a faulty premise established several tool calls earlier. Traditional debugging tools are not well-suited to this; the failure is distributed across a probabilistic process.

Practical mitigations include setting explicit termination conditions in the system prompt, limiting the tool set to only what a given task requires, adding confirmation steps before destructive operations, sandboxing tool execution, and adding loop depth limits to catch runaway agents. None of these fully eliminates the problem. They reduce the probability and impact of failures. The work is closer to how you harden a distributed system than how you debug a deterministic function.

Prompt Injection and Security

Agentic systems face a security problem with no clean solution: if any tool can return content from untrusted sources, that content ends up in the model’s context. Willison has written about prompt injection at length, and the core argument is that this is a fundamental structural problem for agentic systems, not a bug to be patched. A web search tool can return a page containing instructions that redirect the model’s subsequent behavior. An email reading tool exposes the agent to messages crafted to hijack its actions.

Unlike SQL injection, there is no parameterization technique that resolves this. The model needs to read the content to determine whether it is adversarial, and by the time it reads it, the content is already influencing its processing. Defense in depth applies: sandboxed execution, minimal permissions, human confirmation for consequential actions, and treating all external content as untrusted at the tool design level.

This security framing matters for how you scope tool permissions. An agent reading files to answer questions about a codebase should not also have write access. An agent summarizing emails should not have access to send them. The principle of least privilege, a standard in conventional security engineering, applies here with extra weight because the model may be manipulated into using capabilities the developer never intended for that task.

The Scaffolding Question

Framework choices for agentic systems have proliferated since 2023. LangChain, LlamaIndex, AutoGen, CrewAI, and a range of newer alternatives each make different tradeoffs between flexibility and imposed structure. The choice affects how much of the agent’s control flow lives in your code versus emerging from the model’s decisions.

Willison’s documented approach with tools like LLM, his command-line LLM utility, leans toward minimal abstractions: write the loop explicitly, make tool calls visible, treat the model as a component rather than an orchestrator. This is a reasonable default while characterizing a system’s behavior. Opaque frameworks make debugging harder, and agentic systems have enough emergent behavior that you want the mechanics visible while you are learning the failure modes.

The case for more structure comes later, once you have characterized the behavior and want to standardize across a team or an application surface. Adding abstractions after you understand the system is easier than stripping them out once they have obscured something you need to see.

Where the Discipline Stands

Willison framing this as a guide with repeatable patterns is itself evidence that the field has matured. In 2023, most writing about agents was either academic or aspirational. The failure modes were not well-characterized; the tooling infrastructure was thin; production deployments were rare enough that the patterns had not been stress-tested.

By 2026, the patterns exist and are documented. Agentic engineering draws from distributed systems thinking (managing state across unreliable components), API design (tool schemas are interfaces that benefit from the same discipline you bring to public APIs), security engineering (external inputs are untrusted), and probabilistic systems testing (you characterize behavior statistically rather than by exhaustive path coverage). The novel part is the combination, and the specific ways those concerns interact when a language model is driving control flow decisions.

For anyone building anything with LLMs that involves more than a single call, this is the discipline where the engineering effort concentrates.

Was this interesting?