Agentic engineering introduces a testing problem that has no clean analog in traditional software engineering. When your system’s core logic is a language model navigating a loop of tool calls, “did it work?” resists the deterministic precision that unit tests require.
Simon Willison’s agentic engineering patterns guide names this challenge explicitly: agentic tasks often have fuzzy success criteria, and the evaluation infrastructure for them is still maturing. The discipline is real, and the evaluation gap is one of the clearest markers of it.
Why Traditional Testing Breaks
A standard unit test calls a function with inputs and asserts a specific output. This works because functions are deterministic: given the same inputs, you get the same outputs, and assertions can check exact values.
Agentic systems break this in two separate ways.
First, model behavior is stochastic. Given the same task, the same context, and the same tool set, different invocations may choose different tool sequences, ask different clarifying questions, or produce semantically equivalent but textually different outputs. Assertions on exact values fail. Even at zero temperature, minor prompt changes or model updates can shift behavior in ways that pass all existing tests while introducing new failure modes.
Second, success is often not precisely defined. “Did the agent fix the bug?” is answerable by running the test suite. “Did the agent write a good summary?” requires judgment. Most real agentic tasks land somewhere between: clearly correct behaviors, clearly incorrect ones, and a wide middle ground that requires interpretation.
What Replaces Unit Tests
The evaluation approach that has stabilized in agentic engineering combines three things: behavioral benchmarks, LLM-as-judge scoring, and human-graded reference sets.
Behavioral benchmarks measure outcomes rather than intermediate states. SWE-bench is the leading example for coding agents: it measures whether an agent can resolve real GitHub issues by running the repository’s existing test suite against the agent’s changes. The project’s own test suite serves as the correctness oracle, sidestepping the problem of defining what “fixed” means from the outside. Top agents score 45 to 55 percent on the verified subset as of early 2026, a number that carries weight precisely because it measures actual code correctness rather than output appearance.
The catch is that behavioral benchmarks require investment to build. For task types with machine-checkable correctness criteria, you can construct them. For open-ended tasks, you cannot.
LLM-as-judge fills the gap. You run the agent, capture its output, and pass both the task description and output to a second model with an evaluation rubric. The second model scores or classifies the result. Anthropic’s eval documentation recommends this approach for tasks where success is semantically defined but not code-checkable, with the caveat that you need human-graded examples to validate the evaluator’s scoring against human judgment.
The reliability concern here is structural. Models trained on similar corpora can agree on wrong answers. The mitigation is to treat evaluator scores as approximate, calibrate them against human-graded reference examples, and monitor score distributions for drift after model updates or system prompt changes.
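Mechanically, the evaluator side reduces to two pieces: a rubric-bearing prompt and a strict parser for the judge's reply. A sketch where the rubric wording, the JSON shape, and both helper names are illustrative rather than any library's API:

```python
import json
import re

RUBRIC = (
    "Score the output against the task on a 1-5 scale. "
    'Reply with JSON: {"score": <int>, "reason": "<one sentence>"}'
)

def build_judge_prompt(task: str, output: str) -> str:
    # The judge sees only the task, the agent's output, and the rubric.
    return f"Task:\n{task}\n\nAgent output:\n{output}\n\n{RUBRIC}"

def parse_judge_reply(reply: str) -> dict:
    # Tolerate prose around the JSON, but fail loudly when no valid
    # score is found, so malformed judge replies surface for review
    # instead of silently scoring as zero.
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        raise ValueError("judge reply contained no JSON object")
    verdict = json.loads(match.group())
    if not 1 <= int(verdict["score"]) <= 5:
        raise ValueError("judge score out of range")
    return verdict
```

Keeping the parser strict matters more than it looks: coercing a malformed reply into a default score is exactly the kind of quiet drift the human-graded calibration is meant to catch.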
Human-graded reference sets anchor both approaches. A corpus of tasks with known-good and known-bad outcomes, graded by people who understand the domain, lets you measure your evaluation pipeline’s precision and recall. You run the automated pipeline against this corpus to detect when scoring has drifted from human judgment, and treat any detected drift as a trigger for targeted review.
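Once such a corpus exists, measuring the pipeline against it is ordinary bookkeeping. A minimal sketch, assuming both the human grades and the pipeline's verdicts have been reduced to good/bad labels keyed by task ID (the function name and data shape are assumptions for illustration):

```python
def pipeline_precision_recall(reference: dict, predicted: dict):
    # reference: task_id -> True if humans graded the outcome as good.
    # predicted: task_id -> True if the automated pipeline passed it.
    tp = sum(1 for t in reference if reference[t] and predicted[t])
    fp = sum(1 for t in reference if not reference[t] and predicted[t])
    fn = sum(1 for t in reference if reference[t] and not predicted[t])
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Low precision means the pipeline blesses outputs humans would reject; low recall means it fails outputs humans would accept. Both numbers are worth tracking separately, because the fixes differ.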
Observability Is Load-Bearing Infrastructure
In traditional software, debugging means reading logs, setting breakpoints, and tracing deterministic execution paths. In agentic systems, the execution state is the context window: a sequence of text messages. Debugging a failed run means reading that sequence to understand what the model believed at each decision point.
Full trace capture is not optional for any serious agentic deployment. Every tool call, its parameters, its result, and the model response that preceded it needs to be recorded with a consistent trace ID spanning the full run. LangSmith, Langfuse, and the OpenAI Agents SDK’s built-in tracing all model agent runs as distributed traces with parent and child spans. Each inference call is a span with token counts and stop reason; each tool invocation is a span with timing and output.
A minimal structured trace captures enough to reconstruct any run:
import time

# `dispatch_tool`, `client`, and `tools` are assumed to be defined
# elsewhere in the agent loop; the trace schema is the point here.

def traced_tool_dispatch(name: str, inputs: dict, run_trace: list) -> str:
    start = time.time()
    result = dispatch_tool(name, inputs)
    run_trace.append({
        "type": "tool_call",
        "tool": name,
        "inputs": inputs,
        "result": result,
        "duration_ms": round((time.time() - start) * 1000),
    })
    return result

def traced_inference(messages: list, run_trace: list):
    start = time.time()
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=8096,
        tools=tools,
        messages=messages,
    )
    run_trace.append({
        "type": "inference",
        "stop_reason": response.stop_reason,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
        "duration_ms": round((time.time() - start) * 1000),
    })
    return response
Token counts in the trace tell you where cost is accumulating. A run consuming 40,000 tokens is diagnosable from the trace: you can see which tool results contributed most to context growth and whether intermediate steps could be compressed earlier. Anthropic’s prompt caching, which reduces repeated stable context to 10 percent of normal token cost on cache hits, becomes more targeted when trace data shows what context recurs across runs.
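Given trace entries in the schema above, the cost question becomes a small aggregation. `summarize_trace` is a hypothetical helper; its dictionary keys match the trace records from the earlier snippet:

```python
def summarize_trace(run_trace: list) -> dict:
    # Split the trace into inference spans (which carry token counts)
    # and tool spans (whose results drive context growth).
    inference = [e for e in run_trace if e["type"] == "inference"]
    tools = [e for e in run_trace if e["type"] == "tool_call"]
    return {
        "total_input_tokens": sum(e["input_tokens"] for e in inference),
        "total_output_tokens": sum(e["output_tokens"] for e in inference),
        # Largest tool results by character count: the usual suspects
        # when a run's token consumption looks out of line.
        "largest_tool_results": sorted(
            ((t["tool"], len(str(t["result"]))) for t in tools),
            key=lambda pair: -pair[1],
        )[:3],
    }
```

A summary like this is usually enough to decide whether to truncate a tool's output, compress intermediate results earlier, or restructure the context so the stable prefix is cacheable.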
Detecting Regressions
After a system prompt change or model update, detecting behavioral regressions is the core challenge that traditional testing infrastructure does not address.
A prompt change can shift behavior in ways that pass all behavioral benchmarks while degrading output quality on task types those benchmarks do not cover. The response is to maintain an eval harness that runs a representative sample of production-like tasks after any significant change, scores them with your calibrated pipeline, and flags score distribution shifts for human review. This is structurally similar to how recommendation systems validate ranking model updates offline before shipping: run against representative samples, compare distributions, route outliers to review.
Coverage does not need to be exhaustive. It needs to cover common case types and the specific failure modes already encountered in production. The eval corpus grows from production failures that were consequential enough to review manually, which means coverage naturally expands in proportion to past failure patterns.
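The distribution comparison can start simple. A sketch, assuming 1-to-5 judge scores; the function name and thresholds are illustrative defaults, not a standard:

```python
from statistics import mean

def flag_regression(before: list, after: list,
                    mean_drop: float = 0.3,
                    low_score: int = 2,
                    low_rate_rise: float = 0.1) -> list:
    # Two cheap distribution checks: did the mean score fall, and did
    # the share of clearly bad outcomes grow? Flags go to human review.
    flags = []
    if mean(before) - mean(after) > mean_drop:
        flags.append("mean score dropped")
    low_before = sum(s <= low_score for s in before) / len(before)
    low_after = sum(s <= low_score for s in after) / len(after)
    if low_after - low_before > low_rate_rise:
        flags.append("low-score rate rose")
    return flags
```

Both checks are deliberately blunt. The point is to route suspicious shifts to a person, not to adjudicate them automatically.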
What This Means in Practice
The evaluation difficulty in agentic engineering reflects something real: the correctness criteria for agentic tasks are semantically richer than for traditional software, making the evaluation problem harder by design.
The engineering response is proportionality. For high-stakes tasks (production code changes, external API calls, data writes), invest in behavioral benchmarks with machine-checkable correctness criteria plus human-graded reference sets. For lower-stakes tasks (summaries, drafts, exploratory queries), a calibrated LLM-as-judge with periodic spot-checking is proportionate.
The discipline Willison describes requires this tiered evaluation thinking the same way traditional software engineering requires tiered testing: automated checks for the things you can precisely define, human judgment for the rest, and observability infrastructure to know when you need more of either.