The framing in Simon Willison’s guide to agentic engineering patterns is a useful starting point: agentic engineering is the discipline of building systems where a language model takes multi-step actions to complete tasks, rather than producing a single response to a single prompt. The moment you introduce a loop, you are doing something qualitatively different from prompt engineering. But that framing, useful as it is, can give the impression that the core engineering challenge is enabling autonomy. The harder and more consequential problem is constraining it.
Autonomy Is a Dial, Not a Switch
Production agentic systems almost never live at either extreme of the autonomy spectrum. Fully scripted pipelines, where the developer controls every branch and the model is a text transformation component, are predictable but brittle. Fully autonomous systems, where the model takes arbitrary actions in the world without human checkpoints, are powerful but risky. Most useful systems occupy some position between those poles, and choosing that position deliberately is one of the central engineering decisions in building agentic software.
Four properties determine where a given action should sit on that spectrum:
- Reversibility: can the action be undone if the model made a wrong decision? Reading a file is fully reversible; dropping a database row may not be.
- Blast radius: if this action goes wrong, how much damage can it cause? Searching the web affects nothing; sending an email to a thousand users affects a lot.
- Predictability: how consistent is the model’s behavior on this class of action? Tasks with clear, constrained outputs are more predictable than tasks requiring nuanced judgment.
- Stakes: what is the cost of a wrong decision relative to the cost of an unnecessary confirmation prompt? For routine automation, interruptions are expensive; for irreversible actions, they are cheap.
A tool that runs a read-only database query sits at one end of this space: low stakes, zero blast radius, highly predictable, completely reversible. A tool that pushes a deployment or sends a message to an external user sits at the other end. Applying a single autonomy level across an entire tool set, whether by requiring confirmation for everything or for nothing, produces systems that are either tedious or dangerous.
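One way to make the four-property assessment concrete is to score each action class and derive a confirmation decision from the scores. The property encoding and thresholds below are illustrative, not a standard:

```python
from dataclasses import dataclass


@dataclass
class ActionProfile:
    reversible: bool    # can the action be undone?
    blast_radius: int   # 0 = affects nothing, 3 = affects many external parties
    predictable: bool   # is model behavior consistent on this action class?
    high_stakes: bool   # is a wrong decision costly relative to a prompt?


def needs_confirmation(p: ActionProfile) -> bool:
    """Conservative policy: confirm unless the action is clearly safe."""
    if not p.reversible or p.high_stakes:
        return True
    return p.blast_radius > 1 and not p.predictable


# A read-only database query: reversible, zero blast radius, predictable.
read_query = ActionProfile(reversible=True, blast_radius=0,
                           predictable=True, high_stakes=False)

# A bulk email send: irreversible and far-reaching.
bulk_email = ActionProfile(reversible=False, blast_radius=3,
                           predictable=True, high_stakes=True)
```

The point of writing the policy down as a function, rather than deciding per tool ad hoc, is that the policy itself becomes reviewable and testable.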
Designing the Interrupt Mechanism
The interrupt is the core engineering artifact for managing autonomy. It is the point where a running agent pauses, surfaces a decision to a human, and waits before proceeding. The structural mistake most teams make is treating this as an add-on: a special case inside individual tool implementations rather than a first-class part of the dispatch layer.
Here is a pattern that keeps confirmation logic separate from tool logic:
```python
from enum import Enum
from dataclasses import dataclass
from typing import Any, Callable, Optional


class RiskLevel(Enum):
    READ_ONLY = "read_only"
    REVERSIBLE = "reversible"
    DESTRUCTIVE = "destructive"
    EXTERNAL = "external"  # sends messages, makes API calls with side effects


@dataclass
class Tool:
    name: str
    description: str
    risk: RiskLevel
    schema: dict


# Takes the tool name and inputs, returns True if the operator approves.
ConfirmFn = Callable[[str, dict], bool]


def dispatch_with_interrupt(
    tool: Tool,
    inputs: dict,
    confirm_fn: Optional[ConfirmFn] = None,
) -> Any:
    requires_confirmation = tool.risk in (
        RiskLevel.DESTRUCTIVE,
        RiskLevel.EXTERNAL,
    )
    if requires_confirmation:
        # Fail closed: a consequential action with no confirmation
        # channel is refused rather than silently executed.
        if confirm_fn is None:
            return {"status": "skipped", "reason": "no confirmation channel"}
        if not confirm_fn(tool.name, inputs):
            return {"status": "skipped", "reason": "operator declined"}
    return execute_tool(tool, inputs)
```
The key structural decision is that confirm_fn is injected. In an interactive session, it sends a message and waits for a response. In an automated test harness, it simulates approval or rejection to exercise both paths. In a CI pipeline running within a known safe scope, it returns True for whitelisted operations and False for anything outside that scope. The same agent code runs in all three contexts; the autonomy policy changes based on how the dispatcher is configured.
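The three contexts can be sketched as three interchangeable confirmation functions. The helper and tool names below are illustrative:

```python
# Interactive session: ask the operator on the console.
def interactive_confirm(tool_name: str, inputs: dict) -> bool:
    answer = input(f"Allow {tool_name} with {inputs}? [y/N] ")
    return answer.strip().lower() == "y"


# Test harness: scripted decisions so both paths get exercised.
def make_scripted_confirm(decisions: list):
    it = iter(decisions)
    return lambda tool_name, inputs: next(it)


# CI pipeline: approve only a known-safe whitelist of operations.
SAFE_IN_CI = {"run_tests", "read_file", "query_readonly"}


def ci_confirm(tool_name: str, inputs: dict) -> bool:
    return tool_name in SAFE_IN_CI
```

Each of these satisfies the same `ConfirmFn` signature, so swapping the autonomy policy is a one-line change at dispatcher construction time.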
This pattern also makes the policy auditable. Every confirmed or declined action is logged at the dispatch layer with the tool name, inputs, and decision. When something goes wrong, you have a complete record of what the agent wanted to do and what was approved.
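Dispatch-layer logging falls out naturally if any confirmation function can be wrapped. A minimal sketch, with illustrative field names:

```python
import time


def with_audit_log(confirm_fn, log: list):
    """Wrap a confirmation function so every decision is recorded."""
    def audited(tool_name: str, inputs: dict) -> bool:
        approved = confirm_fn(tool_name, inputs)
        log.append({
            "ts": time.time(),
            "tool": tool_name,
            "inputs": inputs,
            "approved": approved,
        })
        return approved
    return audited


audit_trail: list = []
always_deny = with_audit_log(lambda name, inputs: False, audit_trail)
```

Because the wrapper sits at the dispatch layer, every tool gets audited the same way without any tool-specific logging code.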
The Minimal Footprint Principle
Anthropic’s documentation on building safe agentic systems describes what it calls the minimal footprint principle: agents should request only the permissions they need, avoid storing sensitive information beyond immediate needs, prefer reversible actions over irreversible ones, and err on the side of doing less and confirming when uncertain about the intended scope of a task.
This is sound engineering advice independent of any AI safety framing. A system with wide permissions is hard to test, hard to audit, and expensive to recover from when it misbehaves. Scoping permissions tightly reduces the behavior surface and makes failure modes more predictable.
In practice, this translates to a few concrete patterns. Read tools can be granted freely; write tools should be gated. Where a tool can show you what it would do before doing it, build that preview capability first and make it the default path. Log the model’s stated reason for calling a tool alongside the call itself, so audit trails are interpretable rather than just a sequence of actions.
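The preview-first pattern can be sketched as a tool with a mandatory dry-run method alongside the side-effecting one. The class and method names here are hypothetical:

```python
from dataclasses import dataclass, field


@dataclass
class FileStore:
    files: dict = field(default_factory=dict)

    def preview_write(self, path: str, content: str) -> str:
        """Describe the change without performing it; the default path."""
        verb = "overwrite" if path in self.files else "create"
        return f"would {verb} {path} ({len(content)} bytes)"

    def apply_write(self, path: str, content: str) -> None:
        """The gated, side-effecting path, called only after approval."""
        self.files[path] = content


store = FileStore()
plan = store.preview_write("notes.txt", "hello")
```

Surfacing `plan` to the operator at the interrupt point makes the confirmation prompt concrete: the human approves a specific described change, not an abstract tool name.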
For my own Discord bot work, this has meant separating planning tools from execution tools. The planning phase (searching, reading, analyzing) can run fully autonomously because the cost of a wrong decision is low. The execution phase (writing to databases, posting messages, modifying configuration) requires a confirmation step in production. The architecture makes the boundary explicit rather than relying on the model to know when to be cautious.
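In registry terms, that boundary is just a risk assignment plus an environment check. The tool names below are illustrative stand-ins for a Discord bot's tool set:

```python
# Planning tools: read-only, safe to run autonomously.
PLANNING_TOOLS = {"search_messages", "read_config", "summarize_thread"}

# Execution tools: side effects, gated behind confirmation in production.
EXECUTION_TOOLS = {"post_message", "write_db", "update_config"}


def autonomy_policy(tool_name: str, production: bool) -> str:
    """Return "auto" or "confirm" for a tool in a given environment."""
    if tool_name in PLANNING_TOOLS:
        return "auto"
    if tool_name in EXECUTION_TOOLS and production:
        return "confirm"
    return "auto"  # execution tools run freely in dev/test environments
```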
Trust as an Engineering Variable
One productive way to think about the autonomy dial is as a trust variable that changes over time. You start conservative, with confirmation required for most consequential actions. As the system demonstrates reliable behavior on a class of task, you relax the constraints for that task class.
This mirrors how trust works in human organizations. A new team member operates with narrow permissions and frequent check-ins; a proven one gets broader latitude. The difference with agentic systems is that you need explicit mechanisms to track evidence and update policies, rather than relying on informal social processes.
A practical approach: tag each tool call in your logs with an outcome label (whether the model did the right thing), accumulate outcomes by tool type and task category, and use that data to inform confirmation policy. Two hundred successful calls to send_notification in a specific workflow is meaningful evidence. Three incorrect calls in the last fifty is a signal to add a checkpoint. The evaluation infrastructure has to exist before you need it; retrofitting it after a production incident is significantly harder.
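The outcome-tagging idea might look like the following, where the thresholds (100 clean calls to relax, more than 2 recent errors to tighten) are illustrative defaults, not recommendations:

```python
from collections import defaultdict


class TrustTracker:
    """Accumulates labeled outcomes per tool and suggests a policy."""

    def __init__(self, min_calls: int = 100, max_recent_errors: int = 2):
        self.outcomes = defaultdict(list)  # tool name -> list of bools
        self.min_calls = min_calls
        self.max_recent_errors = max_recent_errors

    def record(self, tool: str, correct: bool) -> None:
        self.outcomes[tool].append(correct)

    def policy(self, tool: str) -> str:
        history = self.outcomes[tool]
        recent = history[-50:]
        if recent.count(False) > self.max_recent_errors:
            return "confirm"  # recent mistakes: add a checkpoint
        if len(history) >= self.min_calls and all(history):
            return "auto"     # long clean record: relax the gate
        return "confirm"      # default conservative
```

Note the asymmetry: errors tighten the policy immediately, while relaxation requires a long unbroken record. That matches the article's framing of trust as earned slowly and revoked quickly.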
The Evaluation Problem
Traditional software systems have deterministic control flow. You can trace every path through the code, test each branch, and have reasonable confidence about behavior at the boundaries. Agentic systems have stochastic control flow. The model decides what to do next, and while that decision is usually reasonable, it is not predictable the way a branching statement is.
This changes the testing contract substantially. Unit tests for individual tools are necessary but insufficient. You also need to test the agent’s decision-making about when and how to use those tools, which requires running the agent against realistic tasks and evaluating the quality of its action sequences, not just the outputs.
The dominant approach for this is golden traces: representative task scenarios with known-correct action sequences that you can compare against. LangSmith and Weights and Biases Weave have both built tooling specifically for agentic trace evaluation, treating each run as an annotatable sequence of spans rather than a flat log stream. LLM-as-judge evaluation, where a second model assesses the quality of the first model’s decisions, has also become common for cases where ground truth is hard to specify, though it introduces its own reliability questions, particularly when the evaluator and the evaluated model share similar biases.
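At its simplest, a golden-trace check compares the agent's sequence of tool calls against a known-good sequence. This is a deliberately minimal sketch; real trace-evaluation tooling also compares arguments and tolerates equivalent orderings:

```python
def score_trace(actual: list, golden: list) -> float:
    """Fraction of the golden action sequence matched, in order.

    Walks the golden sequence and counts steps that appear in the
    same relative order in the actual trace.
    """
    matched = 0
    pos = 0
    for step in golden:
        try:
            pos = actual.index(step, pos) + 1
            matched += 1
        except ValueError:
            continue
    return matched / len(golden) if golden else 1.0


golden = ["search_docs", "read_file", "draft_reply", "send_reply"]
actual = ["search_docs", "read_file", "read_file", "send_reply"]
```

Here the agent skipped the drafting step, so the trace scores below 1.0 even if its final output happened to be correct, which is exactly the fragility the next paragraph describes.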
The ReAct paper from 2022, which formalized interleaved reasoning and acting in LLM systems, was an early signal that this evaluation problem was going to be hard. You cannot just score the final output; you have to assess whether the reasoning and tool-use sequence was sound, because a correct final answer reached by bad intermediate steps is still a fragile system.
Calibration Is the Discipline
Simon Willison’s point that agentic engineering is a genuine engineering discipline, not just prompt optimization, is well-taken. But the discipline is not primarily about maximizing what a model can do autonomously. It is about choosing the right level of autonomy for each operation, building reliable mechanisms to enforce those choices, and accumulating evidence to make informed decisions about where to relax constraints as the system matures.
The systems that hold up in production are the ones that are autonomous where the risk and predictability profile supports it, and confirmatory where it does not. Getting that calibration right, and building the tooling to maintain it as the model, the tools, and the task distribution all evolve, is most of what agentic engineering involves.