
Why Verification Subagents Need Independent Context to Be Useful

Source: simonwillison

Simon Willison’s agentic engineering patterns guide introduces verification subagents as one of three core task profiles: a separate agent that receives a description of what should be true and uses read and execution tools to confirm it, without writing anything. The guide mentions this as a check against the principal-agent problem: you cannot fully trust that a subagent did exactly what you asked, so you have something else confirm it.

But there is a more specific reason why this works, and it has to do with what each subagent receives at invocation time.

The Problem with Self-Verification

The straightforward alternative to a verification subagent is to ask the implementer to check its own work. At the end of the implementation task, you prompt the agent: “Review what you just produced and confirm it meets requirements.” This seems efficient; you save an API call, avoid context construction overhead, and the same model is already familiar with the task.

The problem is structural. An implementer that verifies its own output is reviewing the product of its own reasoning chain. If the implementer made a false assumption early in its work, that assumption persists in its context. When it reviews the output, it applies the same reasoning that produced the errors. It is likely to consider its output correct precisely because the output is consistent with the assumptions it made.

This is not an LLM-specific failure mode. Human engineers reviewing their own code struggle with the same bias. You expect the code to work because you understand the reasoning that produced it; your mental model of what the code does is the same reasoning you used to write it. An independent reviewer approaches the code without that prior, which is why code review catches different errors than authorial review.

For subagents, the atom model, in which each subagent call is a self-contained invocation, solves this problem structurally rather than through prompting.

What the Verifier Actually Receives

In an atom-model multi-agent system, the verification subagent does not receive the implementer’s context. It receives the output of the implementer’s work: the modified files, the test results, the diff. This is not a prompt engineering choice; it is the default behavior of a system where each subagent call is self-contained.

The verifier starts from scratch. It has no knowledge of what the implementer tried before settling on the current approach, no knowledge of which design choices were considered and rejected, no knowledge of the reasoning chain that produced the output. It has the output itself and the requirements it is supposed to meet. That independence is the source of its value.

Here is what this looks like with the Anthropic SDK. The orchestrator calls two subagents sequentially: one to implement, one to verify.

import anthropic
import json

client = anthropic.Anthropic()

def run_implementation_subagent(task: str, files: dict) -> dict:
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=8192,
        system="You implement the described task. Return a JSON object with keys: files_modified (list of paths), diff (unified diff string), tests_command (string to run tests).",
        messages=[{
            "role": "user",
            "content": f"Task: {task}\n\nFiles:\n{json.dumps(files, indent=2)}"
        }],
        tools=[{
            "name": "submit_implementation",
            "input_schema": {
                "type": "object",
                "properties": {
                    "files_modified": {"type": "array", "items": {"type": "string"}},
                    "diff": {"type": "string"},
                    "tests_command": {"type": "string"}
                },
                "required": ["files_modified", "diff", "tests_command"]
            }
        }],
        tool_choice={"type": "tool", "name": "submit_implementation"}
    )
    return response.content[0].input

def run_verification_subagent(requirements: str, implementation: dict, test_output: str) -> dict:
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # cheaper model, read-only task
        max_tokens=2048,
        system="You verify whether an implementation meets stated requirements. You have no knowledge of how the implementation was produced. Evaluate only what is in front of you.",
        messages=[{
            "role": "user",
            "content": (
                f"Requirements:\n{requirements}\n\n"
                f"Diff:\n{implementation['diff']}\n\n"
                f"Test output:\n{test_output}"
            )
        }],
        tools=[{
            "name": "submit_verdict",
            "input_schema": {
                "type": "object",
                "properties": {
                    "passes": {"type": "boolean"},
                    "issues": {"type": "array", "items": {"type": "string"}},
                    "confidence": {"type": "number", "minimum": 0, "maximum": 1}
                },
                "required": ["passes", "issues", "confidence"]
            }
        }],
        tool_choice={"type": "tool", "name": "submit_verdict"}
    )
    return response.content[0].input

The verifier’s system prompt explicitly states it has no knowledge of how the implementation was produced. This is trivially true by design, not by instruction: the verifier has never seen the implementer’s context. The instruction reinforces the intended behavior but is not load-bearing in the same way it would be if both operations were in the same context.

The Model Size Asymmetry

Notice that the implementation subagent uses claude-opus-4-6 and the verification subagent uses claude-haiku-4-5-20251001. This is not arbitrary. Implementation requires reasoning about the design space, making judgment calls about approach, and producing code that is both correct and consistent with surrounding conventions. That benefits from a larger model’s capacity.

Verification requires something different: checking that a defined output satisfies defined criteria. Given a diff and a set of requirements, the verifier needs to read carefully and identify discrepancies, not design anything. Haiku handles this well at a fraction of the cost.

This asymmetry is only visible when the tasks are separated. If you ask a single agent to implement and then self-verify in the same context, you cannot route different parts of the work to different models. The atom model exposes the task structure in a way that makes model selection a real architectural lever.

What the Verifier’s Tools Reveal

The minimal footprint principle holds that you should give each subagent only the tools it needs. For a verification subagent, the right tool set is read-only: file reads, grep, command execution for running tests. No write access to files, no access to external APIs, no ability to create PRs or send notifications.
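A minimal sketch of what this allow-list looks like as a tool dispatcher. The tool names (read_file, grep, run_tests) and the dispatch function are illustrative, not from the guide; the point is that write tools are simply never registered, so the restriction is enforced in code rather than by prompt instruction.

```python
import subprocess
from pathlib import Path

# Read-only tool set for the verification subagent. No write_file,
# no create_pr, no notifications: those tools do not exist here.
VERIFIER_TOOLS = {"read_file", "grep", "run_tests"}

def dispatch_tool(name: str, args: dict) -> str:
    """Execute a verifier tool call; anything outside the allow-list is rejected."""
    if name not in VERIFIER_TOOLS:
        raise PermissionError(f"verifier may not call {name!r}")
    if name == "read_file":
        return Path(args["path"]).read_text()
    if name == "grep":
        result = subprocess.run(
            ["grep", "-rn", args["pattern"], args.get("path", ".")],
            capture_output=True, text=True,
        )
        return result.stdout
    # name == "run_tests": execute the test command and return combined output
    result = subprocess.run(
        args["command"].split(), capture_output=True, text=True
    )
    return result.stdout + result.stderr
```

A verifier wired to this dispatcher cannot silently fix what it finds; its only path back to the orchestrator is the verdict it reports.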

This restriction matters beyond security. A verifier with write access might “fix” issues it finds rather than reporting them. This produces a system where failures are silently corrected rather than surfaced to the orchestrator. The orchestrator loses visibility into what actually happened.

More subtly, a verifier that can modify the implementation is no longer providing independent verification. It has become a second implementer. The result may be correct, but you have lost the epistemic benefit of independent review: the orchestrator cannot distinguish between “the implementation was correct” and “the verifier fixed problems the implementer introduced.”

Tool restrictions enforce the intended role in a way that prompt instructions alone do not.

Tiered Verification

For work where correctness matters significantly, the verifier pattern extends to multiple independent passes at different confidence levels.

A first pass at the haiku tier catches obvious mismatches: the requirements say the function should return a boolean; the diff shows it returns a string. Cheap to run, catches gross errors.

A second pass at the sonnet tier reviews the diff more carefully for logical correctness, edge cases, and requirement coverage. More expensive, catches subtler issues.

A third pass, when warranted, involves actually running the code and examining test output against expected behavior. This is deterministic where the model judgment calls are probabilistic.
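The three passes can be sketched as a short-circuiting loop, assuming each tier is wrapped as a callable that returns a verdict dict shaped like the submit_verdict schema above. The tier names and the tiered_verify helper are illustrative, not part of any SDK.

```python
from typing import Callable

def tiered_verify(tiers: list[tuple[str, Callable[[], dict]]]) -> dict:
    """Run verification tiers in cost order; stop at the first failure.

    Each tier callable is invoked with no arguments: it sees only the
    implementation output and the requirements it was constructed with,
    never the conclusions of the other tiers.
    """
    for tier_name, run_tier in tiers:
        verdict = run_tier()
        if not verdict["passes"]:
            return {"passes": False, "failed_tier": tier_name,
                    "issues": verdict["issues"]}
    return {"passes": True, "failed_tier": None, "issues": []}

# Stub tiers standing in for haiku-, sonnet-, and test-execution passes.
verdict = tiered_verify([
    ("haiku", lambda: {"passes": True, "issues": []}),
    ("sonnet", lambda: {"passes": False, "issues": ["missing edge case"]}),
    ("tests", lambda: {"passes": True, "issues": []}),  # never reached
])
```

Because the loop stops at the first failure, the expensive tiers only run on work that has already survived the cheap ones.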

Each pass is independent; each verifier sees only the implementation output and the requirements, not the other verifiers’ conclusions. If you fed the first verifier’s output to the second verifier, you would partially reintroduce the dependency problem: the second verifier might anchor on the first verifier’s conclusions rather than forming independent judgment.

This is the same principle that motivates keeping subagent contexts separate in the first place. Independent judgment requires independent context.

What This Means for Orchestrator Design

The orchestrator in this pattern has a specific responsibility: it decides what counts as requirements, what counts as the implementation output to verify, and what to do with a failure verdict.

On a failure verdict, the orchestrator can retry the implementation subagent with the verifier’s issue list included in the task context. This is the key feedback loop. The implementer does not see the verifier’s judgment in the same context; it receives it as structured input on a fresh invocation. The implementer’s next attempt benefits from knowing what the verifier found wrong, but it still reasons from scratch about how to fix it, rather than patching the previous implementation while carrying all the same assumptions.

The feedback loop is: implement, verify, if failed then implement again with structured issue list, verify again. This terminates when the verifier passes or when the orchestrator decides to escalate.
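That loop can be sketched as an orchestrator function, with implement and verify standing in for the two subagent calls defined earlier. The helper and its parameters are illustrative; the load-bearing detail is that the verifier's issue list crosses iterations as structured input, never as shared context.

```python
from typing import Callable

def orchestrate(implement: Callable[[list[str]], dict],
                verify: Callable[[dict], dict],
                max_attempts: int = 3) -> dict:
    """Implement, verify, and retry with the verifier's issues until pass or escalate."""
    issues: list[str] = []
    for attempt in range(1, max_attempts + 1):
        implementation = implement(issues)  # fresh invocation every attempt
        verdict = verify(implementation)    # fresh context; sees output only
        if verdict["passes"]:
            return {"status": "passed", "attempts": attempt,
                    "implementation": implementation}
        issues = verdict["issues"]          # explicit, structured transfer
    return {"status": "escalate", "attempts": max_attempts, "issues": issues}
```

In a real system, implement and verify would wrap run_implementation_subagent and run_verification_subagent from above; here they are plain callables so the loop's termination logic stands on its own.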

Feedback loops like this are a core pattern in Simon Willison’s broader agentic engineering guide. The atom model is what makes the loop semantically clean: each iteration is a fresh context on both sides of the loop, and the information transfer between iterations is explicit and structured rather than implicit and accumulated.

The Trust Calculation

The underlying principle here is that you cannot fully trust any single subagent’s self-report about whether it succeeded. This is the principal-agent problem: the agent’s interests and incentives may not perfectly align with the principal’s, and even when they do, the agent’s judgment about its own work is not independent.

For LLMs, this manifests as a consistent tendency to produce outputs that appear to satisfy requirements, including in cases where careful reading reveals they do not. The model generates text that is plausible given its understanding of the requirements, but plausible is not the same as correct.

An independent verifier with its own context applies different attention to the output. It has no memory of why certain choices were made; it just sees whether the output does what it claims to do. In practice, this catches a meaningful portion of the errors that self-verification misses, particularly the subtle ones where the implementation is internally consistent but misses a requirement that was not central to the implementer’s working model.
