Testing Agentic Systems: Why Your Existing Test Suite Is Not Enough
Source: simonwillison
Simon Willison’s guide to agentic engineering patterns covers a lot of ground, but one area where the implications run deeper than a quick read suggests is evaluation. Testing agentic systems requires a genuinely different approach from testing conventional software, and the gap is not one of tooling. It is structural.
Why Traditional Testing Fails Here
A standard test suite for a web API or a database library works by asserting exact outputs given known inputs. The underlying code is deterministic. You can enumerate the meaningful inputs, specify expected outputs, and get high confidence that the tested behaviors are correct.
Agentic systems break the fundamental assumption behind this. When a language model drives the control flow, the sequence of actions taken to accomplish a task is not fixed. The model decides which tools to call, in what order, with what arguments. A different run of the same system on the same input may produce the same final output through a different action sequence. One run might query a database first and then look up documentation; another might do those in the opposite order. Both might produce a correct final answer.
This creates a two-dimensional testing problem. You need to verify final output quality, but you also need to verify that the action sequence used to produce that output was sound. A correct answer reached through a flawed or risky intermediate process is not a reliable system; it is a system that got lucky.
The ReAct paper from Yao et al. (2022), which formalized interleaved reasoning and acting in LLM systems, observed this implicitly. Their evaluation had to score the quality of reasoning traces, not just final answers. That evaluative complexity is baked into the approach.
What Context Pressure Does to Correctness
Before getting to evaluation strategies, there is a correctness problem that affects agents running on longer tasks: recall degrades over the course of a context window.
The “Lost in the Middle” paper from researchers at Stanford and UC Berkeley showed that LLMs systematically recall information from the beginning and end of long contexts better than information from the middle. The effect was pronounced across models and task types. For a five-step agentic task, facts established in steps two and three are at higher risk of being forgotten or misattributed than facts established in step one or step five.
This means context length is not just a cost variable; it is a reliability variable. An agent that successfully answers a simple question with three tool calls may fail intermittently on a complex task requiring twelve, not because any individual tool is broken, but because the model loses track of constraints established early in the run. Testing only short task scenarios will not surface this.
A minimal harness for catching context-related regressions: run each test scenario at different context depths. Build a version that prepends synthetic tool-result history before the actual task, simulating a long-running session, and verify the agent still respects constraints established before the synthetic history. If it does not, the system has a context management problem that will manifest in production even if short tests pass.
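A sketch of that harness, assuming the agent consumes an OpenAI-style message list; `pad_with_synthetic_history` and the filler content are hypothetical names for illustration, and the commented-out `run_agent` call stands in for whatever entry point your system exposes:

```python
import copy

def pad_with_synthetic_history(messages: list[dict], n_turns: int) -> list[dict]:
    """Simulate a long-running session by inserting synthetic tool-call /
    tool-result turns between the constraint-setting first message and the
    real task, pushing the original constraints deep into the context."""
    filler = []
    for i in range(n_turns):
        filler.append({"role": "assistant", "content": f"Calling lookup_doc(page={i})"})
        filler.append({"role": "tool", "content": f"Result {i}: unrelated reference text."})
    padded = copy.deepcopy(messages)
    return padded[:1] + filler + padded[1:]

base = [
    {"role": "system", "content": "Constraint: never call execute_code."},
    {"role": "user", "content": "Summarize the changelog."},
]

# Re-run the same scenario at increasing depths and assert the agent still
# honors the constraint established in the first message at every depth.
for depth in (0, 20, 100):
    padded = pad_with_synthetic_history(base, depth)
    # result = run_agent(padded); assert "execute_code" not in result.tools_called
```

The key property is that the constraint sits at the start and the task at the end, so the synthetic filler lands exactly in the middle region where recall is weakest.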
Golden Traces as the Foundation
The most widely adopted evaluation approach for agentic systems is golden traces: a curated set of tasks with expected action sequences that define what a correct run looks like.
A golden trace for a research assistant agent might look like this:
{
  "task": "Find the current version of LangChain and summarize the last three changelog entries",
  "expected_actions": [
    {"tool": "web_search", "contains": "langchain changelog"},
    {"tool": "fetch_page", "contains": "changelog"},
    {"tool": "final_response", "schema": "ResponseSchema"}
  ],
  "forbidden_actions": [
    {"tool": "execute_code"}  # no reason to run code for this task
  ],
  "output_checks": [
    {"contains": "version"},
    {"min_length": 200}
  ]
}
The expected_actions list is a sequence of soft assertions: the agent should do approximately these things in approximately this order. “Approximately” is the operative word. Strict sequence matching over-constrains the agent; the model may legitimately accomplish the task with a different but valid sequence. The practical approach is to check that required tool classes were called, that forbidden tools were not called, and that the output meets structural criteria, without requiring exact parameter matching.
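A minimal checker implementing those soft assertions might look like the following. It assumes each action was logged as a dict with a `"tool"` key (matching the trace format above); it deliberately checks tool presence order-insensitively and skips exact parameter matching, per the reasoning above. `check_trace` is a hypothetical name, not part of any framework:

```python
def check_trace(golden: dict, actions: list[dict], output: str) -> list[str]:
    """Score one agent run against a golden trace with soft assertions:
    required tools present (order-insensitive), forbidden tools absent,
    and structural checks on the final output."""
    failures = []
    called = [a["tool"] for a in actions]
    for exp in golden.get("expected_actions", []):
        if exp["tool"] not in called:
            failures.append(f"required tool never called: {exp['tool']}")
    for bad in golden.get("forbidden_actions", []):
        if bad["tool"] in called:
            failures.append(f"forbidden tool called: {bad['tool']}")
    for chk in golden.get("output_checks", []):
        if "contains" in chk and chk["contains"] not in output:
            failures.append(f"output missing substring: {chk['contains']}")
        if "min_length" in chk and len(output) < chk["min_length"]:
            failures.append(f"output shorter than {chk['min_length']} chars")
    return failures
```

Returning a list of failure descriptions rather than a boolean makes the report readable when a run violates several constraints at once.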
Building golden traces takes time. A useful heuristic: start with your most important task categories, not your most common ones. A task the agent does a thousand times per day is less likely to go wrong than a task it does ten times per day with high stakes. Weight coverage toward the tails of your task distribution.
LLM-as-Judge: Where It Works and Where It Does Not
For many agentic tasks, specifying expected outputs precisely enough to be testable is not practical. Research questions, summarization tasks, and conversational responses do not have a single correct answer. The evaluator needs to exercise judgment, which is why using a second language model as a judge has become common.
LLM-as-judge works well when the evaluation criteria can be expressed clearly in natural language, when the tasks are diverse enough that manual grading would take too long, and when incorrectly flagging a good response is more acceptable than letting a bad response slip through. For a customer support agent, a judge model can reliably detect responses that are rude or factually wrong without requiring a human to review every interaction.
The failure modes are well-documented. Models share biases; a judge model from the same family as the agent model may have the same blind spots. A GPT-4-based judge scoring a GPT-4-based agent may consistently accept the same plausible-sounding wrong answers the agent produces, because both models find them convincing. The InjecAgent benchmark demonstrated a related problem: agents can be manipulated into producing outputs that appear correct to automated evaluation while actually taking adversarial actions.
The most reliable approach is to calibrate your judge model against a set of human-labeled examples. Anthropic’s guidance on evals recommends starting with at least 50 human-graded examples across the task distribution before trusting automated scoring. Run the judge against those examples, measure agreement, and use the disagreement cases to identify where the judge has systematic errors. This does not eliminate LLM-as-judge failures, but it makes them visible before they cause production problems.
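The calibration step itself needs very little code. A sketch, assuming verdicts have been collected as parallel lists of labels (`judge_agreement` is a hypothetical helper, not a library function):

```python
def judge_agreement(human_labels: list[str], judge_labels: list[str]):
    """Compare judge verdicts against human gold labels; return the raw
    agreement rate plus the indices of disagreements for manual review."""
    assert len(human_labels) == len(judge_labels), "label lists must align"
    disagreements = [
        i for i, (h, j) in enumerate(zip(human_labels, judge_labels)) if h != j
    ]
    agreement = 1 - len(disagreements) / len(human_labels)
    return agreement, disagreements
```

The returned indices matter more than the headline rate: reviewing the disagreement cases is how you find out whether the judge fails randomly or systematically on one category of task.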
Observability Infrastructure
Trace logging is the prerequisite for all of the above. Without a complete, structured record of each agent run, including every tool call, every model response, and every intermediate message, debugging failures is guesswork.
The production-grade tools for this have specialized to handle agentic workloads. LangSmith treats each agent run as a tree of annotatable spans, making it possible to review the model’s full reasoning for any run, compare runs on the same task, and attach human labels to specific steps. Weights and Biases Weave takes a similar approach and integrates with W&B’s broader experiment tracking infrastructure.
Both tools work by instrumenting the agent loop itself, not just final outputs. For systems not using LangChain, the minimal viable implementation is a logging decorator:
import time
import uuid
from functools import wraps

def trace_tool(tool_name: str):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            span_id = str(uuid.uuid4())
            start = time.time()
            # log_span_start / log_span_end are the project's structured
            # logging sinks; implement them against whatever log store you use.
            log_span_start(span_id, tool_name, kwargs)
            try:
                result = fn(*args, **kwargs)
                log_span_end(span_id, result, error=None, duration=time.time() - start)
                return result
            except Exception as e:
                log_span_end(span_id, None, error=str(e), duration=time.time() - start)
                raise
        return wrapper
    return decorator

@trace_tool("web_search")
def search_web(query: str) -> list[dict]:
    ...
The trace ID needs to propagate through the entire run so you can reconstruct the full sequence from logs. Every model API call, every tool invocation, every error should reference the same trace ID. Without this, correlating a bad final output with the specific tool call that caused it requires reading through linear log streams, which does not scale.
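One way to propagate the trace ID without threading it through every function signature is Python's standard-library `contextvars` module, which also survives across `async` task boundaries. A sketch, with `start_trace` and `log_event` as hypothetical names:

```python
import contextvars
import json
import uuid

# One ContextVar holds the current run's trace ID, visible to every function
# on the call path (including async tasks) without being passed explicitly.
_trace_id = contextvars.ContextVar("trace_id", default="untraced")

def start_trace() -> str:
    """Mint a trace ID at the top of an agent run."""
    tid = str(uuid.uuid4())
    _trace_id.set(tid)
    return tid

def log_event(event: str, **fields) -> str:
    """Emit one structured log line stamped with the current trace ID."""
    record = {"trace_id": _trace_id.get(), "event": event, **fields}
    line = json.dumps(record)
    print(line)  # swap for a real structured logger in production
    return line
```

Every model call and tool invocation logs through `log_event`, so filtering the log stream by one trace ID reconstructs a complete run.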
Regression Testing After Incidents
The most valuable golden traces are often built after production failures. When an agent takes a wrong action or produces a bad output, that scenario becomes a regression test: encode the task, the conditions, and the bad behavior pattern into a trace that would have caught the failure, then verify it passes after the fix.
This is the same discipline as test-driven bug fixes in conventional software, but applied to probabilistic systems. The important difference is that you cannot assert “the agent will never call this tool in this scenario”; you can only assert that it calls the wrong tool less than a threshold percentage of the time across repeated runs. Agentic regression tests are stochastic by nature. Running each scenario once is insufficient; a reasonable minimum is ten to twenty runs per scenario to catch failure modes that occur at moderate frequency.
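The threshold assertion can be a small wrapper around any scenario function. A sketch, where `run_scenario` is any callable returning True on a passing run (the names are illustrative):

```python
def stochastic_pass(run_scenario, n_runs: int = 20, max_failure_rate: float = 0.1):
    """Run a probabilistic scenario repeatedly and compare the observed
    failure rate to a threshold, instead of asserting on a single run."""
    failures = sum(1 for _ in range(n_runs) if not run_scenario())
    observed = failures / n_runs
    return observed <= max_failure_rate, observed
```

Returning the observed rate alongside the verdict lets the test report "failed 3/20 runs against a 10% threshold" rather than a bare assertion error, which is the number you actually want when deciding whether a fix moved the needle.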
For Discord bot development, where the agent might be handling multiple overlapping conversations with different context histories, the trace logging infrastructure pays for itself quickly. The first time an agent misreads a message because an earlier conversation’s context bled into the wrong session, having the full trace makes the bug obvious in minutes rather than hours.
The Infrastructure Investment Is Front-Loaded
The evaluation and observability infrastructure described here is not optional. Systems that ship without it are not more agile; they are less maintainable. When something goes wrong in a production agentic system without trace logging, the investigation is expensive enough that it frequently leads to adding the instrumentation that should have been there from the start, under more pressure than if it had been planned.
The technical debt is asymmetric. Adding golden traces after the system is in production means reconstructing what correct behavior looks like from memory and logs rather than from specification. Adding trace logging after an incident means trying to reproduce the conditions that caused the failure in a system that no longer has the state it had when things went wrong.
Willison’s framing of agentic engineering as a genuine discipline, requiring engineering rigor rather than just prompt iteration, applies most clearly here. The evaluation infrastructure is where that rigor either exists or does not. It is measurable, buildable, and largely independent of which model or framework you are using. Building it early is the lowest-overhead version of having it at all.