
Evaluating Agents Is a Different Problem at Every Level

Source: Hacker News

Bassim Eledath’s Levels of Agentic Engineering, which landed near the top of Hacker News in early March 2026 with 267 points and 128 comments, gives practitioners something genuinely useful: a shared vocabulary for describing how much autonomy an AI system has within an engineering workflow. The five-level framework maps from stateless prompt-response at Level 1 to multi-agent coordination at Level 5. It is a sensible map of the capability territory.

What the framework doesn’t address is how to verify that a system at a given level is behaving correctly. This is not a minor omission. The evaluation strategies that work at Level 1 fail in hard-to-detect ways at Level 3, and teams that don’t update their testing methodology as they increase agent autonomy tend to discover the gap through user complaints rather than test failures.

Why Standard Tests Break Down

Unit tests verify deterministic functions. Given input X, the function returns Y. You assert Y, the test passes. This model extends reasonably well to Level 1 agents: given a prompt, does the model return something semantically close to the expected response? Embedding-based comparison, golden output sets, and classification accuracy metrics all work here. The iteration loop is fast and the failure signal is clear.
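A minimal sketch of that Level 1 loop, using a toy bag-of-words vector as a stand-in for a real embedding model (the `embed` and `cosine` helpers here are illustrative, not from any particular library):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words vector; a real suite would call an embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def assert_semantically_close(output: str, golden: str, threshold: float = 0.7):
    score = cosine(embed(output), embed(golden))
    assert score >= threshold, f"similarity {score:.2f} below {threshold}"

# Golden-set check: a paraphrase of the expected answer should pass.
assert_semantically_close(
    "Paris is the capital of France",
    "The capital of France is Paris",
)
```

The threshold is the tunable part: too low and regressions slip through, too high and harmless paraphrases fail the build.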

The model starts to strain at Level 2, when tool calls enter the picture. A Level 2 agent might answer a question correctly while calling the wrong tool, or call the right tool with subtly malformed parameters that happen to succeed in the test environment. The output looks right. The path to the output was wrong. In production, slightly different inputs route through that same broken path and produce failures that the output-level test never flagged.

At Level 3, multi-step planning makes output-only evaluation nearly useless as a correctness signal. A 20-step agent run that ends with the correct final answer might have reasoned incorrectly through steps 8 through 14 and recovered by coincidence. Or it might have taken a path that works for this test case but fails on slight variations. Testing only the final output doesn’t distinguish between an agent that reliably solves a class of problems and one that happened to get this particular instance right.

Testing at Each Level

At Level 1, output quality evaluation scales reasonably well. RAGAS provides metrics for retrieval-augmented generation: faithfulness, answer relevancy, context precision. LLM-as-judge approaches, where a separate model evaluates generated outputs against rubrics, handle cases where semantic correctness can’t be captured by similarity scores alone. Tools like deepeval and promptfoo provide test suite infrastructure for this kind of evaluation, with regression testing across prompt changes built in.
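The LLM-as-judge pattern can be sketched as follows. `call_judge_model` stands in for whatever judge-model API a team uses; it is stubbed with a keyword heuristic here so the example runs offline:

```python
# Rubric-driven LLM-as-judge harness (sketch).
RUBRIC = (
    "Answer '1' if the response is faithful to the context and directly "
    "addresses the question, otherwise '0'."
)

def call_judge_model(prompt: str) -> str:
    # Stub: a production harness would send `prompt` to a separate model.
    return "1" if "30 days" in prompt.split("Answer:")[-1] else "0"

def judge(question: str, context: str, answer: str) -> bool:
    prompt = f"{RUBRIC}\n\nQuestion: {question}\nContext: {context}\nAnswer: {answer}"
    return call_judge_model(prompt).strip() == "1"

assert judge(
    "What is the refund window?",
    "The refund policy allows returns within 30 days.",
    "Purchases can be refunded within 30 days.",
)
```

The test-suite value is the shape, not the stub: a rubric, a separate evaluator, and a binary verdict you can assert on in CI.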

At Level 2, trajectory testing matters more than outcome testing. The right evaluation checks not just what the agent returned, but what tools it called, in what order, with what arguments. A mock tool framework that records all calls during a test run, then asserts over the call graph, lets you write tests like: “this query should invoke the database tool exactly once with these parameters, not the filesystem tool.” This is analogous to interaction testing in unit test frameworks, applied to tool call sequences:

class MockToolTracer:
    def __init__(self):
        self.calls = []

    def record(self, tool_name: str, args: dict):
        self.calls.append({"tool": tool_name, "args": args})

    def assert_called(self, tool_name: str, times: int = 1):
        actual = sum(1 for c in self.calls if c["tool"] == tool_name)
        assert actual == times, f"Expected {tool_name} called {times}x, got {actual}x"

    def assert_not_called(self, tool_name: str):
        self.assert_called(tool_name, times=0)

    def assert_args(self, tool_name: str, expected: dict):
        calls = [c for c in self.calls if c["tool"] == tool_name]
        assert calls, f"{tool_name} was never called"
        assert calls[-1]["args"] == expected, f"Args mismatch: expected {expected}, got {calls[-1]['args']}"

At Level 3, the most useful evaluation distinguishes between outcome correctness and process correctness: whether the agent reached the right end state, and whether it got there via a path that would generalize across similar inputs. AgentBench (Liu et al., 2023) provides structured evaluation across diverse multi-step task categories, including OS interaction, database operations, and code execution. Critically, it measures partial credit for correct intermediate states, which surfaces process failures that pure outcome evaluation misses entirely.
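The distinction can be made concrete with a partial-credit score over required intermediate steps, computed here as the longest common subsequence between the observed trajectory and an ordered checkpoint list. The scoring rule and step names are illustrative, in the spirit of AgentBench's intermediate-state credit rather than its actual implementation:

```python
def trajectory_score(observed, required):
    """Partial credit: longest common subsequence of required checkpoints."""
    m, n = len(observed), len(required)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if observed[i] == required[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n] / n

required = ["parse_query", "select_table", "run_sql", "format_answer"]

# A reliable run hits every checkpoint in order: full credit.
assert trajectory_score(required, required) == 1.0

# A lucky run that skipped table selection gets partial credit, not a pass.
lucky = ["parse_query", "run_sql", "format_answer"]
assert trajectory_score(lucky, required) == 0.75
```

An outcome-only test would score both runs identically; the trajectory score is what separates them.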

The GAIA benchmark takes a complementary approach: tasks requiring real-world tool use and multi-step reasoning, but with deterministic final answers. The correctness signal is clean; the evaluation pressure is on multi-step reliability rather than output quality. For teams building production Level 3 systems, domain-specific internal benchmarks structured like GAIA, with clear correct answers but requiring multiple steps to reach them, provide more informative signal than generic leaderboard metrics.
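A GAIA-style internal benchmark record can be as small as a question, a deterministic gold answer, and a minimum step count. The field names, scoring rule, and example task below are illustrative, not GAIA's actual format:

```python
from dataclasses import dataclass

@dataclass
class BenchmarkTask:
    question: str
    gold_answer: str   # deterministic, checked by exact match
    min_steps: int     # the task should require at least this many tool calls

def passed(task: BenchmarkTask, final_answer: str, steps_taken: int) -> bool:
    # Clean correctness signal: exact answer AND a genuinely multi-step run.
    return final_answer.strip() == task.gold_answer and steps_taken >= task.min_steps

task = BenchmarkTask(
    question="What was p95 API latency on 2026-02-01, per the metrics DB?",
    gold_answer="412ms",
    min_steps=3,  # e.g. resolve the date range, query the DB, format
)
assert passed(task, " 412ms ", steps_taken=4)
assert not passed(task, "412ms", steps_taken=1)  # right answer, trivial path
```

The `min_steps` field is the GAIA-like pressure: it rejects runs that shortcut the multi-step structure the benchmark is meant to exercise.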

At Level 4, with persistent memory, regression testing requires injecting known state and verifying retrieval behavior. Tests should assert not just that the agent recalled something correctly, but that stale or contradictory entries were appropriately weighted down. Memory poisoning tests, where incorrect information is written to the store and subsequent behavior is observed, reveal whether the retrieval strategy is robust or brittle.
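The shape of a memory poisoning test, written against a deliberately simplified store (the store and its confidence/recency weighting are illustrative, not a real memory library):

```python
class MemoryStore:
    """Toy memory store: entries are (timestamp, key, value, confidence)."""

    def __init__(self):
        self.entries = []

    def write(self, ts, key, value, confidence=1.0):
        self.entries.append((ts, key, value, confidence))

    def recall(self, key):
        matches = [e for e in self.entries if e[1] == key]
        # Weighting rule: prefer high confidence, break ties by recency.
        _, _, value, _ = max(matches, key=lambda e: (e[3], e[0]))
        return value

store = MemoryStore()
store.write(ts=1, key="user_plan", value="free")

# Poisoning test: inject a contradictory low-confidence entry, then check
# that retrieval still returns the trusted fact.
store.write(ts=2, key="user_plan", value="enterprise", confidence=0.2)
assert store.recall("user_plan") == "free"
```

A brittle retrieval strategy, such as always returning the newest entry, fails this test immediately, which is exactly the signal you want before the store grows real user state.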

At Level 5, multi-agent testing borrows from distributed systems testing methodology. Contract testing verifies that agents communicating over a shared protocol honor their interface contracts under partial failure conditions. Chaos engineering, where individual agents are randomly delayed or silenced, tests whether the broader system degrades gracefully rather than deadlocking or producing inconsistent results. Toxiproxy, originally built for testing microservices under adverse network conditions, applies directly here, and the conceptual toolkit of chaos engineering (fault injection, partition simulation, Byzantine behavior testing) maps onto multi-agent systems with minimal translation.
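A chaos-style fault injection test for a multi-agent pipeline might look like the following. The agents, quorum rule, and pipeline are illustrative stand-ins, not a real orchestration framework:

```python
import random

def worker(name: str, query: str) -> str:
    # Stand-in for dispatching to an actual agent.
    return f"{name}:{query}"

def run_pipeline(query, agents, silenced=frozenset()):
    """Run all agents not currently silenced; require a two-agent quorum."""
    results = [worker(a, query) for a in agents if a not in silenced]
    if len(results) < 2:
        raise RuntimeError("quorum lost")
    return results

agents = ["planner", "retriever", "critic"]
rng = random.Random(0)  # seeded so the chaos run is reproducible

# Chaos loop: silence one random agent per iteration and assert the
# system still returns a result rather than raising or hanging.
for _ in range(10):
    down = {rng.choice(agents)}
    results = run_pipeline("audit Q3 invoices", agents, silenced=down)
    assert len(results) == 2
```

Seeding the fault injector matters: a chaos test that fails only on some random draws is nearly as hard to debug as the production incident it was meant to prevent.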

The Gap Between Tooling and What Teams Need

Evaluation tooling for Level 1 and Level 2 systems is reasonably mature. For Level 3 and above, the gaps are substantial. Most production agent systems are tested primarily through manual end-to-end runs and observation, with automated tests covering unit-level components like individual tool implementations and parsing logic, but not multi-step trajectory correctness.

Part of the reason is that realistic Level 3 test scenarios are expensive to run repeatedly. They require live tool calls, network access, and sometimes stateful external systems. Simulated environments that are cheap to run tend to diverge from production behavior in ways that make test results misleading. The field hasn’t converged on a standard approach to this tradeoff, and the evaluation research that exists (AgentBench, GAIA, WebArena) tends to target research benchmarks rather than the internal evaluation workflows that engineering teams need in CI.

Anthropic’s guidance in Building Effective Agents recommends building evaluation infrastructure before scaling agent complexity: invest heavily in evals before investing heavily in capabilities. That sequencing is correct and widely ignored in practice. The consequence is teams that discover their agents are unreliable through production incidents, then have to reconstruct multi-step execution traces from logs that weren’t designed for that purpose.

What This Implies About the Framework

Eledath’s taxonomy has a corollary that the original post doesn’t state explicitly: each level transition requires not just new capability infrastructure but new evaluation infrastructure. Moving from Level 2 to Level 3 without updating your testing methodology means extending trust to multi-step execution on the basis of evidence that only validates single-step behavior.

The right framing is to treat evaluation maturity as a prerequisite for operating at a given level, not a follow-up task. Before adding the capability, build the tests that tell you when that capability is failing. The capability and its corresponding eval belong together. Deploying one without the other is how teams end up surprised by production behavior they could have caught before it reached users.
