
The Specification Is the Software: Where Engineering Effort Goes in an Agentic System

Source: simonwillison

Simon Willison recently published a guide to agentic engineering patterns that does something genuinely useful: it names a discipline that has been accumulating failure modes and hard-won patterns for the better part of three years without a coherent vocabulary. The term “agentic engineering” is doing real work. It is not prompt engineering, it is not “using AI,” and it is not a specialization of traditional software development. It is a distinct engineering discipline with its own primary artifacts, its own quality assurance methods, and its own debugging workflow.

The most clarifying question you can ask about any new discipline is: where does the work actually go? What do you spend your time writing, verifying, and debugging? For agentic engineering, the answers are different enough from conventional software development that engineers encounter it expecting one set of challenges and find a completely different one.

The Implementation Code Is the Smallest Part

The agent loop itself is a handful of lines. In Python with the Anthropic SDK, a minimal agentic system looks roughly like this:

# Minimal agent loop: call the model, run any requested tools,
# feed the results back, repeat until the model stops asking for tools.
while True:
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=4096,  # required by the Messages API
        tools=tools,
        messages=messages,
    )
    messages.append({"role": "assistant", "content": response.content})
    if response.stop_reason != "tool_use":
        break  # the model is done; no further tool calls requested
    tool_results = execute_tool_calls(response.content)
    messages.append({"role": "user", "content": tool_results})

That is genuinely most of the agent loop. The same structure appears in LangGraph, the OpenAI Agents SDK, and AutoGen, with varying amounts of framework wrapper around it. The loop is not where the engineering work lives.

The work lives in what surrounds the loop. The tools you define, the descriptions you write for them, the system prompt that scopes the agent’s behavior, and the context management strategy that determines what information survives a long session. These artifacts are what determine whether the agent completes real tasks reliably or merely behaves coherently in demos and falls apart under production load.

This is the central shift that agentic engineering represents. In conventional software development, the implementation is the primary artifact. The code is what you reason about, test, debug, and iterate on. In agentic systems, control flow is delegated to a stochastic model. The model decides which tools to call, in what order, with what arguments. Your engineering effort moves upstream to specification, because specification is the only place where you can reliably influence the model’s decisions.

What You Write Instead: The Specification Layer

The specification layer in an agentic system has three components, and each one demands more care than it gets from developers encountering the discipline for the first time.

Tool descriptions. When you define a tool, you are writing an API contract for a consumer that cannot read source code, cannot inspect types at runtime, and makes invocation decisions based entirely on natural language. The model routes to your tool based on the name and description you provide. If the description is vague, the model will call the tool at the wrong time. If it conflates two concerns, the model will use it for both. If it omits the output contract, the model will make assumptions about what the return value means.

Concrete guidance from production systems: tool names should be verb-noun pairs that read naturally in a chain-of-thought trace (search_repository, run_test_suite, write_file). Descriptions should specify what the tool does, what it explicitly does not do, what format it returns, and any preconditions the model should verify before calling it. Adding description fields to individual parameters in your JSON schema performs measurably better across evaluations than a single tool-level description. The Anthropic tool use documentation covers the schema format; the engineering judgment about what to put in it is where the work actually is.
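A sketch of what that guidance looks like in practice, using the Anthropic tool-use schema format. The tool name, description text, and parameters here are illustrative, not taken from any real system:

```python
# Illustrative tool definition in the Anthropic tool-use schema format.
# Note the per-parameter "description" fields alongside the tool-level one.
search_repository = {
    "name": "search_repository",
    "description": (
        "Search the current repository's source files for a query string. "
        "Returns a JSON list of file paths with matching line numbers. "
        "Does NOT search dependencies or build artifacts. "
        "Verify you know the repository root before calling."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Literal string or regex to search for.",
            },
            "max_results": {
                "type": "integer",
                "description": "Upper bound on matches returned; defaults to 20.",
            },
        },
        "required": ["query"],
    },
}
```

The description states what the tool does, what it excludes, the return format, and a precondition, which is the checklist from the paragraph above made concrete.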

System prompts. In production agents, the system prompt is not a one-liner. It is a scoping document. It defines the agent’s capabilities, the boundaries of its authority, the format of its outputs, how it should handle ambiguity, and what it should do when it encounters a situation it was not designed for. A well-engineered system prompt reads like a specification for a junior engineer on their first week: not the implementation, but the constraints and priorities that shape every decision they will make.
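As a sketch of what a scoping-document prompt covers, here is a hypothetical example; the sections and rules are invented to illustrate the shape, not drawn from a production system:

```python
# A hypothetical scoping-style system prompt: constraints and priorities,
# not implementation. All section contents here are illustrative.
SYSTEM_PROMPT = """\
You are a code-maintenance agent for a single Git repository.

Capabilities: read files, run the test suite, write files inside src/.
Boundaries: never modify files outside src/; never commit or push.
Output format: end every task with a structured summary of files changed and tests run.
Ambiguity: if a request could mean two different changes, ask before editing.
Out of scope: refuse requests about infrastructure or credentials and say why.
"""
```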

The CLAUDE.md file in Claude Code and the .cursorrules format in Cursor are both institutionalized versions of this idea. They position high-priority context at the beginning of the system tier, where attention weight is highest. This is not a trick; it is the application of what the “lost in the middle” paper from Stanford and UC Berkeley documented about LLM recall degradation across long contexts. Specification that appears at position zero receives more consistent attention than the same content buried in conversation history.

Context architecture. The context window is the agent’s working memory. On any non-trivial task, you will fill it. Reading twenty files of moderate length, accumulating tool results across thirty turns, and maintaining the system prompt alongside conversation history adds up faster than the raw token count suggests, because context pressure degrades decision quality well before the hard limit. The architecture question is what survives as the session grows: which tool outputs get summarized, which get evicted, which get written to external memory and retrieved on demand.
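One of the simplest versions of that architecture decision can be sketched in a few lines: keep recent turns verbatim and truncate older tool output. The thresholds and the truncation strategy here are assumptions for illustration:

```python
# Minimal context-compaction sketch (thresholds are illustrative assumptions):
# keep the most recent turns verbatim, truncate long content in older ones.
def compact(messages, keep_recent=6, max_old_chars=500):
    compacted = []
    for i, msg in enumerate(messages):
        is_old = i < len(messages) - keep_recent
        content = msg["content"]
        if is_old and isinstance(content, str) and len(content) > max_old_chars:
            content = content[:max_old_chars] + "\n[...truncated tool output...]"
        compacted.append({"role": msg["role"], "content": content})
    return compacted
```

Real systems layer summarization and external memory on top of this, but the core question is the same: which turns earn their token cost as the session grows.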

The context anchoring pattern described by Rahul Garg on Martin Fowler’s site addresses one version of this problem: maintaining a living document that captures decisions, constraints, and current scope, periodically re-injected to counter attention drift. This is the agent-system analog of Architecture Decision Records, and it exists because engineers building long-running agents discovered independently that without it, the model loses track of constraints it was given early in the session.
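A minimal sketch of the re-injection mechanic, assuming a fixed interval and an invented anchor document; both are illustrative choices, not part of the pattern as published:

```python
# Context-anchoring sketch: re-inject a living "anchor" document every N
# turns so constraints set early stay inside the model's recent attention.
# The anchor contents and the interval are illustrative assumptions.
ANCHOR = (
    "PROJECT ANCHOR\n"
    "Decision: use SQLite, not Postgres (agreed turn 3).\n"
    "Constraint: no changes outside src/.\n"
    "Current scope: fix the date-parsing bug only."
)

def maybe_reanchor(messages, turn, every=10):
    if turn > 0 and turn % every == 0:
        messages.append({"role": "user", "content": ANCHOR})
    return messages
```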

How You Verify: Evaluation Replaces Unit Testing

Unit tests for individual tools are necessary, but they test the wrong thing. You can verify that your search_repository tool returns correctly structured results. What you cannot verify with a deterministic test is whether the model calls it at the right time, with reasonable query arguments, and correctly interprets a result that requires domain judgment to use.

The quality question in agentic systems is: does the model make good decisions? That is not a property you can assert with a traditional test. The dominant approach is golden traces: representative scenarios with known-correct action sequences that you compare against. An agent asked to identify and fix a specific class of bug should read the relevant files, not all files; should run tests after the fix, not before; should produce a structured summary, not a conversational response. Whether it does these things across a distribution of inputs is what evaluation measures.
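A toy version of a golden-trace check: extract the tool names the agent actually called, in order, and compare against the known-correct sequence for the scenario. The scenario and expected sequence below are invented for illustration; the message shapes follow the Anthropic tool-use format:

```python
# Golden-trace sketch: compare the agent's actual tool-call sequence
# against a known-correct sequence for a scenario. GOLDEN is illustrative.
def extract_tool_calls(trace):
    """Pull tool names, in order, from an Anthropic-style message trace."""
    calls = []
    for msg in trace:
        for block in msg.get("content", []):
            if isinstance(block, dict) and block.get("type") == "tool_use":
                calls.append(block["name"])
    return calls

GOLDEN = ["search_repository", "write_file", "run_test_suite"]

def matches_golden(trace):
    return extract_tool_calls(trace) == GOLDEN
```

Production evaluation harnesses score partial credit and argument quality rather than exact-match sequences, but the structure is the same: a reference trajectory and a comparison function.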

The SWE-bench benchmark, which measures agents against real GitHub issues, has been useful specifically because it forces honest accounting of how error rates compound across multi-step tasks. A model that succeeds 90% of the time at each step of a five-step pipeline succeeds end-to-end roughly 59% of the time. Single-step evaluations hide this. Your evaluation suite needs scenarios that are long enough to expose the compounding.
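The compounding arithmetic is simple enough to verify in one line:

```python
# Per-step success compounds multiplicatively across a pipeline:
# five steps at 90% each succeed end-to-end about 59% of the time.
def end_to_end_success(per_step: float, steps: int) -> float:
    return per_step ** steps

print(round(end_to_end_success(0.9, 5), 2))  # → 0.59
```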

How You Debug: Traces Replace Logs

When an agentic system produces a wrong result, the question is not where in the code the bug is. The bug is in a decision the model made: the wrong tool called, the wrong argument passed, a misinterpretation of a tool result that sent the subsequent reasoning down a wrong path.

Finding this requires reading the message trace. Every tool call, every result, every intermediate reasoning step is in the conversation history. Observability tools like LangSmith and Weights & Biases Weave treat agent runs as annotatable traces rather than flat log streams because that is the structure that matches the debugging workflow. You are not looking for an exception; you are reading a conversation to find where the reasoning diverged from what you intended.
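The core move, stripped of tooling, is flattening the message history into a readable decision log. A sketch, where the block shapes follow the Anthropic tool-use message format and the formatting choices are assumptions:

```python
# Trace-inspection sketch: render an agent's message history as a decision
# log. Block shapes follow the Anthropic tool-use message format; the
# truncation length and line formats are illustrative choices.
def trace_lines(messages):
    lines = []
    for turn, msg in enumerate(messages):
        content = msg["content"]
        if isinstance(content, str):
            lines.append(f"[{turn}] {msg['role']}: {content[:80]}")
            continue
        for block in content:
            kind = block.get("type")
            if kind == "text":
                lines.append(f"[{turn}] {msg['role']} said: {block['text'][:80]}")
            elif kind == "tool_use":
                lines.append(f"[{turn}] tool call: {block['name']}({block['input']})")
            elif kind == "tool_result":
                lines.append(f"[{turn}] tool result for {block['tool_use_id']}")
    return lines
```

Reading the output of something like this is the debugging workflow: you scan for the turn where the tool choice or argument diverged from what the specification intended.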

This changes what you instrument. Traditional logging captures state at explicit points you decided were interesting. Trace inspection exposes the full sequence of decisions. The trace is both the execution log and the specification audit: you can read it to verify whether the agent is following the behavioral constraints you specified or drifting away from them under context pressure.

The Underlying Principle

The pattern across tool design, context architecture, evaluation, and debugging is consistent: you have delegated control flow to a stochastic system, so your engineering effort moves to the places where you can still exert influence. Upstream, in the specifications that shape the model’s decisions. Downstream, in the evaluations and traces that tell you whether those decisions are actually sound.

Willison’s framing of agentic engineering as a discipline rather than a technique is correct because it captures this shift. A technique is a skill applied to a known problem class. A discipline is a body of practice, pattern, and tooling that has accumulated because a problem class turned out to be harder and more consequential than it first appeared. The agent loop is not hard to build. Getting it to behave reliably, securely, and observably across real workloads is the engineering problem, and that is where the discipline lives.
