Testing Agents When the Path Is Variable and Only the Outcome Matters

Source: simonwillison

The previous posts in this series covered the feedback loop as architecture, the scaffolding that surrounds it, and the tool design decisions that determine whether an agent actually works. There is one more layer that separates a working agent from a production system: knowing whether it works at all, and having a rigorous way to verify that it still works after you change something.

This is the evaluation problem, and Simon Willison’s guide on agentic engineering patterns puts it clearly: the bottleneck for improving agents is usually evaluation quality, not model capability. Most teams hit this wall later than they should, after building substantial systems with no reliable way to measure whether a change made things better or worse.

Why Unit Tests Do Not Generalize

Unit testing assumes a deterministic mapping from input to output. You call a function with known arguments, assert the return value, and the test either passes or fails. The path through the code is implicit; what matters is the final result, and because the code is deterministic, the same input always produces the same path.

An agentic system breaks both halves of this. The execution path is variable: the same user request might resolve through three tool calls or eight, depending on what the model finds along the way, what the tools return, and how the context accumulates. Two successful runs of the same task may share nothing in terms of intermediate steps. And the “output” is often a natural-language artifact whose correctness cannot be determined by string equality.

You can write assertions about specific tool calls if you mock the model’s behavior, but now you are testing your mock, not your agent. You can test individual tools in isolation, and you should, but that is not the same as testing the agent that uses them. The agent’s behavior emerges from the interaction between the model, the tools, and the context, and none of those in isolation tells you whether the whole thing works.

The practical replacement for unit testing is outcome evaluation: you define what a correct outcome looks like for a given task, run the agent against that task, and judge whether the outcome meets the definition. The path the agent took to get there is a secondary concern, useful for debugging but not the primary verification target.
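As a concrete sketch of outcome evaluation, consider a hypothetical data-extraction task: the check below judges only the final artifact against a schema and expected values, ignoring the path taken. The field names and expected dict are illustrative, not from any real system.

```python
import json

# Illustrative schema for a hypothetical invoice-extraction task.
REQUIRED_FIELDS = {"company", "amount", "currency"}

def outcome_passes(agent_output: str, expected: dict) -> bool:
    """True iff the output parses, matches the schema, and contains the expected values."""
    try:
        data = json.loads(agent_output)
    except json.JSONDecodeError:
        return False
    if not REQUIRED_FIELDS.issubset(data):
        return False
    return all(data.get(k) == v for k, v in expected.items())

# Two runs that took entirely different tool-call paths both pass
# as long as the outcome matches:
print(outcome_passes('{"company": "Acme", "amount": 120, "currency": "USD"}',
                     {"amount": 120, "currency": "USD"}))  # True
```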

Two Layers: Outcomes and Trajectories

Outcome evaluation and trajectory analysis are distinct and complementary.

Outcome evaluation asks a simple question: did the agent complete the task correctly? For a coding task, did the produced code pass the test suite? For a research task, does the final answer contain accurate information and address what was asked? For a data extraction task, does the output match the schema and contain the expected values? Outcome evaluation is the primary signal for whether the system works.

Trajectory analysis asks a different question: how did the agent get there, and was the path reasonable? A correct outcome reached through an unnecessarily long sequence of tool calls, or through a sequence that happened to work on this example but would fail on a slight variation, is more fragile than a correct outcome reached through a coherent, minimal path. Trajectory analysis catches agents that are right for the wrong reasons, which matters when you are trying to generalize beyond your evaluation set.

Both layers require deliberate investment. Most teams start with outcome evaluation because it is more directly tied to the question of “does this work.” Trajectory analysis becomes important once outcome accuracy is high enough that further gains require understanding where the agent is being inefficient or lucky.
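Trajectory checks can start simple. The sketch below flags two cheap signals of a fragile run, an unusually long trajectory and an identical tool call repeated verbatim; the threshold and the step representation are assumptions for illustration, not a standard.

```python
def trajectory_flags(steps: list[tuple[str, dict]], max_steps: int = 10) -> list[str]:
    """Flag suspicious trajectories given (tool_name, input_args) steps."""
    flags = []
    if len(steps) > max_steps:
        flags.append(f"long trajectory: {len(steps)} steps")
    seen = set()
    for name, args in steps:
        # An identical call repeated verbatim suggests the agent ignored
        # or failed to use the first result.
        key = (name, tuple(sorted(args.items())))
        if key in seen:
            flags.append(f"repeated identical call: {name}({args})")
        seen.add(key)
    return flags
```

A correct outcome with a non-empty flag list is worth a human look even though the outcome eval passed.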

Benchmarks and What They Measure

The research community has developed several standardized benchmarks for evaluating agents on defined task distributions.

GAIA, released by Meta in 2023, contains 466 real-world questions that require multi-step reasoning, web browsing, and tool use, stratified across three difficulty levels. Human respondents scored about 92% while GPT-4 with plugins scored around 15%. That gap is instructive: the benchmark was designed specifically to resist brute-force LLM capability and to require genuine multi-step reasoning combined with tool use.

SWE-bench evaluates agents on GitHub issue resolution: given a real-world issue from an open-source repository, can the agent produce a patch that resolves it? The Verified subset filters out ambiguous or underspecified issues to make the benchmark more reliable. Claude 3.7 Sonnet reported 70.3% on SWE-bench Verified with a custom scaffold, which is a meaningful result for a benchmark that requires reading codebases, understanding issue reports, writing code, and running tests.

WebArena takes a different approach, using simulated web environments with realistic task structures, things like “find the cheapest flight from Boston to Seattle next Tuesday and book it.” This tests the grounding of agent behavior in realistic interface interactions rather than clean API calls.

These benchmarks matter for calibrating expectations and for comparing systems on defined axes. They are less useful for evaluating your specific agent on your specific task distribution, which is where domain-specific evaluation sets become essential.

LLM-as-Judge

When outcomes are natural-language artifacts, you need a way to evaluate them that scales beyond human review of every output. The approach that has become standard practice is LLM-as-judge: use a language model to evaluate the output of another language model against a rubric.

This was formalized in the MT-bench paper by Zheng et al. (2023), which studied how well language models could serve as evaluators for open-ended question answering. The paper found that pairwise or comparative judgments, “which of these two answers is better and why,” are substantially more reliable than absolute scoring, “rate this answer from 1 to 10.” Absolute scores suffer from rubric drift: what counts as a 7 versus an 8 varies across evaluations in ways that are hard to control.

There are known failure modes to design around. Verbosity bias: LLM judges tend to prefer longer, more confident-sounding answers regardless of accuracy, and may reward stylistic fluency over factual correctness. Self-enhancement bias: if the same model family produces the outputs and evaluates them, you may be measuring model self-consistency rather than actual quality. Position bias: some models prefer the first answer in a pairwise comparison simply because of its position.

The practical mitigations are to use a different model for evaluation than for generation, to use pairwise comparison rather than absolute scores wherever possible, to rotate the order of answers in pairwise comparisons and average the results, and to include adversarial examples in your evaluation set where you know the correct answer is the shorter or less confident-sounding one.
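The order-rotation mitigation can be sketched as follows. Here `call_judge` is any function that takes a prompt and returns "A" or "B"; in practice it would wrap a model from a different family than the generator. The prompt wording is illustrative.

```python
# Hypothetical judge prompt; real rubrics would be task-specific.
JUDGE_PROMPT = """Question: {question}

Answer A: {a}

Answer B: {b}

Which answer is better? Reply with exactly one letter: A or B."""

def pairwise_score(question: str, answer_1: str, answer_2: str, call_judge) -> float:
    """Fraction of orderings in which answer_1 wins: 0.0, 0.5, or 1.0."""
    wins = 0
    # First ordering: answer_1 presented as A.
    if call_judge(JUDGE_PROMPT.format(question=question, a=answer_1, b=answer_2)) == "A":
        wins += 1
    # Swapped ordering: answer_1 presented as B.
    if call_judge(JUDGE_PROMPT.format(question=question, a=answer_2, b=answer_1)) == "B":
        wins += 1
    return wins / 2
```

A purely position-biased judge that always answers "A" scores 0.5 under this scheme, which is exactly the "no signal" value: the rotation cancels the bias instead of letting it masquerade as a preference.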

Trace-Based Debugging

When an agent fails, you need to be able to reconstruct exactly what happened. This requires structured traces, not flat log files.

A complete trace for an agent run should capture, for each step: the tool name, the full input arguments, the full return value, and the latency from call to return. It should capture the full model response including any reasoning the model exposes, the input and output token counts, and the estimated cost. At the run level, it should capture the initial task, any configuration parameters, the total wall time, and whether the run succeeded or failed.

OpenTelemetry provides a vendor-neutral format for structured traces with parent-child span relationships, which maps naturally to the nested structure of an agent run with tool calls as child spans. Platforms like LangSmith, Braintrust, and Weights and Biases Weave all provide agent tracing built on top of or compatible with this format.

Here is a tracing wrapper for the Anthropic SDK agent loop that captures what you need for post-hoc debugging and evaluation:

import time
import uuid
from dataclasses import dataclass, field
from typing import Any
import anthropic

@dataclass
class ToolCallSpan:
    tool_name: str
    input_args: dict
    return_value: Any
    latency_ms: float
    error: str | None = None

@dataclass
class AgentTrace:
    run_id: str
    task: str
    steps: list[ToolCallSpan] = field(default_factory=list)
    model_responses: list[dict] = field(default_factory=list)
    total_input_tokens: int = 0
    total_output_tokens: int = 0
    final_output: str | None = None
    success: bool = False
    wall_time_ms: float = 0.0

def run_agent_with_trace(task: str, tools: list[dict], dispatch_tool) -> AgentTrace:
    client = anthropic.Anthropic()
    trace = AgentTrace(run_id=str(uuid.uuid4()), task=task)
    messages = [{"role": "user", "content": task}]
    run_start = time.monotonic()

    while True:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            tools=tools,
            messages=messages
        )

        trace.total_input_tokens += response.usage.input_tokens
        trace.total_output_tokens += response.usage.output_tokens
        trace.model_responses.append({
            "stop_reason": response.stop_reason,
            "content_types": [b.type for b in response.content],
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
        })

        if response.stop_reason == "end_turn":
            for block in response.content:
                if hasattr(block, "text"):
                    trace.final_output = block.text
                    break
            trace.success = True
            break

        if response.stop_reason != "tool_use":
            # max_tokens, refusal, etc.: stop and record the failure
            # rather than looping with no tool results to return
            break

        tool_results = []
        for block in response.content:
            if block.type != "tool_use":
                continue

            call_start = time.monotonic()
            error = None
            result = None
            try:
                result = dispatch_tool(block.name, block.input)
            except Exception as e:
                error = str(e)
                result = f"Error: {error}"

            latency_ms = (time.monotonic() - call_start) * 1000
            trace.steps.append(ToolCallSpan(
                tool_name=block.name,
                input_args=block.input,
                return_value=result,
                latency_ms=latency_ms,
                error=error,
            ))
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": str(result),
            })

        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})

    trace.wall_time_ms = (time.monotonic() - run_start) * 1000
    return trace

The AgentTrace object you get back is the unit of analysis for your evaluation pipeline. You can serialize it to JSON, store it alongside the expected outcome for a given task, and feed it to an LLM judge that evaluates both the final output and the trajectory. The latency on each tool call surfaces slowness that would be invisible in aggregate timing. The full input and return values for each tool call let you reconstruct exactly what the model saw at each decision point.

Eval-Driven Development

The evaluation problem has a development workflow implication that is easy to skip over: you should collect examples before you build.

The pattern that works in practice is to gather 20 to 50 representative examples of the task you want your agent to handle, with known correct outcomes, before writing a single line of agent code. Define success criteria for each example explicitly. Then build the agent and run the evaluation set after every significant change.

This sounds obvious, but most agent projects do the opposite. They build first, test manually on a handful of examples, demo successfully, and then discover at scale that the agent fails in ways that were never captured during development. Retrofitting an evaluation set onto a production system is harder than building one from the start, because you have to extract representative examples from production traffic rather than designing them deliberately.

The other benefit of starting with an evaluation set is that it clarifies what you are actually trying to build. Writing down explicit success criteria for 30 examples surfaces ambiguities in the task definition that would otherwise only appear after deployment. If you cannot write down what a correct outcome looks like for an example, you cannot evaluate it, and if you cannot evaluate it, you cannot reliably build toward it.
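The eval-first workflow described above can be sketched as data plus a runner: examples with explicit success criteria exist before the agent does, and the runner reports outcome accuracy after every change. The example fields and the substring check are assumptions for illustration; real criteria vary per task.

```python
# Hypothetical evaluation set; in practice this would be 20-50 examples
# loaded from a JSONL file checked into the repo.
EXAMPLES = [
    {"task": "Extract the invoice total from: 'Total due: $120.00'",
     "expected": "120.00"},
    {"task": "Extract the invoice total from: 'Amount payable is $88.50'",
     "expected": "88.50"},
]

def run_evals(agent_fn, examples) -> float:
    """Run every example through the agent and return outcome accuracy."""
    passed = 0
    for ex in examples:
        output = agent_fn(ex["task"])
        if ex["expected"] in output:  # illustrative check; real criteria vary
            passed += 1
    return passed / len(examples)
```

Running this after every prompt or tool change turns "does it seem better" into a number you can compare across commits.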

The Actual Bottleneck

The point in Willison’s guide that deserves more weight than it usually gets is that evaluation quality, not model capability, is the primary bottleneck for improving agents in production. This is a practical claim about where time is well spent.

If you have a reliable evaluation set and a systematic way to measure outcomes, you can iterate quickly: change a prompt, run evals, see whether accuracy went up or down, decide whether to keep the change. If you lack that infrastructure, every change is a guess, confirmed by manual inspection of a few examples and intuition about whether things seem better. That workflow does not scale to production systems, and it does not catch regressions.

The field has the benchmarks. The tooling for trace capture and LLM-as-judge evaluation is available. What most teams lack is the discipline to build the evaluation infrastructure before it becomes urgent, which is the same discipline that separates teams who write tests before shipping features from those who add tests after the first production incident.

Building that infrastructure early is the single highest-leverage investment in agentic system reliability, more durable than any prompt change and more general than any model upgrade.
