
Agentic Engineering Is a New Discipline, Not a Prompt Trick

Source: simonwillison

The phrase “agentic engineering” has been circulating in technical circles for the past year, and like most terms that emerge from fast-moving fields, its meaning gets stretched by everyone who picks it up. Simon Willison has been building toward a precise definition in his ongoing engineering guides, and it’s worth taking seriously because the discipline really is distinct from what most engineers have done before.

The clearest framing: an agentic system is one where a language model drives a loop. The model receives context, decides on an action, executes that action through a tool, observes the result, and feeds the observation back into its next decision. This is categorically different from a single-shot prompt that generates text. The model is not a glorified autocomplete endpoint; it’s the control flow.

What Makes a System Agentic

The minimal viable agentic system has three parts: a model, a set of tools, and a loop that connects them. Tools can be anything that returns structured output: a web search, a code interpreter, a database query, a shell command, an HTTP request to an internal API. The model reads tool output the same way it reads everything else, as tokens in its context window.

The ReAct pattern, introduced by Yao et al. in 2022, formalized this as interleaved reasoning and acting: the model writes a “thought,” then an “action,” then receives an “observation,” then writes the next thought. Most production agent frameworks implement some variant of this, though they vary in how explicitly they enforce the structure. LangChain, LlamaIndex, and Anthropic’s own Claude tool use API all expose this pattern at different levels of abstraction.

A concrete example using the Anthropic SDK illustrates the loop:

import anthropic
import json

client = anthropic.Anthropic()

tools = [
    {
        "name": "run_sql",
        "description": "Execute a SQL query and return results as JSON",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "The SQL query to run"}
            },
            "required": ["query"]
        }
    }
]

messages = [{"role": "user", "content": "Which users signed up last week?"}]

while True:
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        tools=tools,
        messages=messages
    )

    # Done when the model stops requesting tools (this also catches
    # stop reasons like max_tokens instead of crashing on next() below).
    if response.stop_reason != "tool_use":
        print(response.content[0].text)
        break

    tool_use = next(b for b in response.content if b.type == "tool_use")
    result = execute_tool(tool_use.name, tool_use.input)  # your dispatch function

    # Append both the assistant turn and the observation so the model
    # sees the tool result on its next pass through the loop.
    messages.append({"role": "assistant", "content": response.content})
    messages.append({
        "role": "user",
        "content": [{"type": "tool_result", "tool_use_id": tool_use.id, "content": json.dumps(result)}]
    })

The loop continues until the model stops requesting tools. This is the heart of agentic engineering: you are not writing the logic that decides what to do next. The model is.

Why This Requires a New Engineering Discipline

Traditional software engineering is deterministic by default. A function with given inputs produces predictable outputs; you can write tests that assert exact return values. Debugging is mostly a matter of tracing execution paths.

Agentic systems break all of that. The model may choose different tools in different runs. It may ask a clarifying question mid-task on one invocation and silently proceed on another. It may hit a dead end after three tool calls and hallucinate a result rather than asking for help. These behaviors emerge from probability distributions, not explicit branches.

Willison’s framing of “agentic engineering” as a distinct discipline is useful precisely because it pushes back against the tendency to treat LLMs as black boxes that you just wire up and ship. The engineering challenge is to design systems where model nondeterminism is bounded, failures are observable, and outputs are verifiable without requiring a human in the loop on every step.

Some concrete problems this surfaces:

Context window management. Each tool result gets appended to the context. Long-running agents accumulate thousands of tokens of intermediate state. At some point you need strategies for summarization, selective retention, or context compression, and these strategies change what the model “remembers” about earlier steps.
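One workable strategy is selective retention: keep recent tool results verbatim and elide the bodies of older ones. A minimal sketch, with character counts standing in for real token counts and all budgets illustrative:

```python
# Illustrative selective-retention sketch. A production system would use
# the model's tokenizer and likely summarize rather than elide.

MAX_CHARS = 2000   # pretend context budget
KEEP_RECENT = 2    # tool results to keep verbatim

def compress_history(messages):
    total = sum(len(str(m["content"])) for m in messages)
    if total <= MAX_CHARS:
        return messages

    # Locate messages carrying tool_result blocks, oldest first.
    tool_idxs = [
        i for i, m in enumerate(messages)
        if isinstance(m["content"], list)
        and any(b.get("type") == "tool_result" for b in m["content"])
    ]
    compressed = [dict(m) for m in messages]
    for i in tool_idxs[:-KEEP_RECENT]:
        blocks = []
        for b in compressed[i]["content"]:
            if b.get("type") == "tool_result":
                size = len(str(b.get("content", "")))
                b = {**b, "content": f"[result elided: {size} chars]"}
            blocks.append(b)
        compressed[i]["content"] = blocks
    return compressed
```

Eliding changes what the model can recall about early steps, which is exactly the trade-off any compression strategy has to own.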

Tool design as API design. The description you write for each tool is effectively a prompt. A poorly described tool gets called incorrectly. If your run_sql tool’s description doesn’t mention that it only has read access, the model will eventually try to run an INSERT. Writing tool schemas is documentation work with real behavioral consequences.
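That failure mode can be closed off in code as well as in prose. The sketch below pairs a more explicit description for the article's run_sql tool with a guard that enforces the promise; both the wording and the guard are illustrative:

```python
# Illustrative only: a precise description plus a guard that enforces it.
# The description promises read-only access; enforce_read_only() backs
# the promise so a stray INSERT fails fast instead of mutating data.

run_sql_tool = {
    "name": "run_sql",
    "description": (
        "Execute a READ-ONLY SQL query (a single SELECT statement) and "
        "return results as JSON. INSERT, UPDATE, DELETE, and DDL "
        "statements are rejected."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string",
                      "description": "A single SELECT statement."}
        },
        "required": ["query"],
    },
}

def enforce_read_only(query: str) -> str:
    head = query.split(None, 1)[0].lower() if query.strip() else ""
    if head != "select":
        raise PermissionError(f"run_sql is read-only; got {head!r}")
    return query
```

The prefix check is deliberately naive; real enforcement belongs in database permissions. But stating the constraint in the description is what keeps the model from attempting the write in the first place.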

Error recovery. When a tool call fails, what does the model do? Does it retry with modified input, ask the user for help, or fabricate a result and continue? The answer depends on how you’ve framed errors in your system prompt and what the model has seen in training. You can influence this through prompt engineering, but you cannot fully control it, so your system needs to be able to detect when the model has gone off track.
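One convention that works in practice: never let a tool exception escape the loop. Return a structured error as the tool result so the model can observe it and decide what to do, and cap retries so a stuck agent fails loudly. A hedged sketch, with execute_tool standing in for the dispatch function from the earlier example:

```python
# Sketch: convert tool exceptions into structured results the model can
# read, and escalate after repeated failures instead of looping forever.

import json

MAX_TOOL_ERRORS = 3

def safe_tool_result(execute_tool, name, args, error_counts):
    try:
        result = execute_tool(name, args)
        return {"ok": True, "content": json.dumps(result), "is_error": False}
    except Exception as exc:
        error_counts[name] = error_counts.get(name, 0) + 1
        if error_counts[name] >= MAX_TOOL_ERRORS:
            # Surface to an outer supervisor rather than retrying again.
            raise RuntimeError(f"{name} failed {MAX_TOOL_ERRORS} times") from exc
        return {
            "ok": False,
            "content": json.dumps({"error": type(exc).__name__,
                                   "message": str(exc)}),
            "is_error": True,  # maps onto the API's is_error flag on tool_result
        }
```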

Evaluation is hard. Agentic tasks often have fuzzy success criteria. “Did the agent accomplish the goal?” is not always a question a unit test can answer. This has pushed teams toward building separate evaluation harnesses, sometimes using another LLM as a judge, which introduces its own reliability questions. Anthropic’s guidance on evals recommends starting with human-graded examples and using them to calibrate any automated scoring.
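The calibration step can be as simple as measuring agreement between an automated scorer and a small set of human labels before trusting it on unlabeled runs. A sketch with made-up transcripts and a deliberately naive heuristic judge (a real system might use an LLM here):

```python
# Calibration sketch with illustrative data. keyword_scorer stands in for
# an automated judge; human_graded is the hand-labeled set used to
# sanity-check it.

def keyword_scorer(transcript: str) -> bool:
    # Naive judge: pass if the final answer contains a number.
    return any(ch.isdigit() for ch in transcript)

human_graded = [
    ("Agent answered: 42 users signed up last week.", True),
    ("Agent answered: I could not access the database.", False),
    ("Agent answered: error code 500, retrying did not help.", False),
    ("Agent answered: several users signed up.", False),
]

def agreement(scorer, graded):
    hits = sum(scorer(text) == label for text, label in graded)
    return hits / len(graded)
```

Here the judge scores 0.75 agreement: it wrongly passes the "error code 500" transcript, which is precisely the kind of blind spot calibration is meant to expose before the scorer runs unsupervised.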

The Patterns That Have Stabilized

Despite the novelty, a few patterns have emerged as reliable across different frameworks and use cases.

Constrained tool sets. Giving an agent access to every possible action is a recipe for unexpected behavior. Scoped tool sets, limited to what the task actually requires, reduce the probability space the model navigates on each step. An agent answering questions about your documentation doesn’t need shell access.
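In code, this can be as simple as a per-task allowlist over a shared tool registry. The tool and task names below are illustrative:

```python
# Sketch: scope the tool set per task instead of handing every agent the
# full registry.

ALL_TOOLS = {
    "search_docs": {"name": "search_docs",
                    "description": "Search the documentation."},
    "run_sql":     {"name": "run_sql",
                    "description": "Read-only SQL queries."},
    "run_shell":   {"name": "run_shell",
                    "description": "Execute shell commands."},
}

TASK_TOOLSETS = {
    "docs_qa":   ["search_docs"],             # a docs bot never needs shell access
    "analytics": ["run_sql", "search_docs"],
}

def tools_for(task: str):
    allowed = TASK_TOOLSETS.get(task)
    if allowed is None:
        raise KeyError(f"no toolset defined for task {task!r}")
    return [ALL_TOOLS[name] for name in allowed]
```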

Human checkpoints. For high-stakes tasks, inserting explicit pause points where the model surfaces its plan before executing irreversible actions is worth the friction. This is not a limitation imposed by immature tooling; it’s a design choice that reflects the actual risk profile of the task.
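A sketch of such a gate, with the approval callback injected so it is testable (in practice it might be a Slack prompt or a CLI confirmation); the tool names are illustrative:

```python
# Sketch: tools flagged as irreversible surface the model's proposed
# action to a human approver before executing.

IRREVERSIBLE = {"send_email", "delete_records"}

def gated_execute(execute_tool, name, args, approve):
    if name in IRREVERSIBLE:
        plan = f"Agent wants to call {name} with {args}"
        if not approve(plan):
            return {"status": "rejected",
                    "detail": "human declined the action"}
    return execute_tool(name, args)
```

Routing the rejection back as a tool result, rather than killing the run, lets the model propose an alternative plan.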

Structured outputs for interop. When one agent feeds output to another, freeform text is fragile. Requiring JSON schemas at agent boundaries, using something like Pydantic on the receiving side, catches format drift early rather than propagating bad data through multi-step pipelines.
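The same idea works with nothing but the standard library if Pydantic isn't available. The sketch below validates a hypothetical report contract between two agents before the downstream step consumes it:

```python
# Stdlib-only boundary validation (the Pydantic version is shorter).
# SignupReport is a hypothetical contract between two agents.

import json
from dataclasses import dataclass

@dataclass
class SignupReport:
    week: str
    signup_count: int

def parse_report(raw: str) -> SignupReport:
    data = json.loads(raw)  # raises if the upstream agent emitted non-JSON
    if not isinstance(data.get("week"), str):
        raise ValueError("week must be a string")
    count = data.get("signup_count")
    if isinstance(count, bool) or not isinstance(count, int):
        raise ValueError("signup_count must be an integer")
    return SignupReport(week=data["week"], signup_count=count)
```

Failing here, at the boundary, beats discovering three steps later that one agent started emitting counts as strings.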

Logging the full trace. Every tool call, every model response, every intermediate message should be logged somewhere inspectable. Debugging an agentic system that failed silently three steps ago is much harder without a replay of what the model was “thinking” at each step.

Where This Is Going

The term “agentic engineering” will probably settle into the broader vocabulary of software engineering the way “distributed systems” did: a specialization with its own patterns, failure modes, and tooling, sitting inside the larger discipline rather than replacing it.

What Willison’s framing captures well is that the work here is genuinely engineering work. Prompt tuning matters, but so do system design, observability, failure-mode analysis, and eval infrastructure. The teams shipping reliable agents in production are not the ones who found a magic prompt; they’re the ones who treated the model as an unreliable external dependency and built the scaffolding to manage that unreliability systematically.

If you build Discord bots, the agentic pattern is immediately applicable: a bot that can loop over tool calls to answer multi-step questions, fetch data from several sources, and synthesize a coherent response is qualitatively more useful than one that makes a single API call per message. The engineering overhead is real, but so is the capability gap. Understanding the discipline is the prerequisite for building responsibly at that level.
