
LLMs as Soft Oracles: What the Agentic Manual Testing Pattern Actually Solves

Source: simonwillison

Simon Willison published a guide on agentic engineering patterns in early March 2026, and one section has been sitting in the back of my head since I first read it: agentic manual testing. The pattern is simple to describe but has implications that run deeper than the surface reading suggests.

The premise is that instead of writing automated tests or hiring QA engineers to click through your application, you give an LLM agent access to a running instance of your software and ask it to explore, report what it finds, and flag anything that looks wrong. The agent uses tools, browser automation, or computer use to interact with the UI or API the same way a human tester would.

That summary makes it sound like a cost-cutting measure, but the more interesting framing is about the oracle problem in software testing.

The Oracle Problem

Testing theory has a concept called the test oracle: the mechanism by which you determine whether a test passed or failed. For unit tests, the oracle is explicit, an assertion like assertEqual(result, 42). For property-based testing tools like Hypothesis in Python or QuickCheck in Haskell, the oracle is a property you define: “this function should never return a negative number” or “serializing and deserializing should be the identity function.”
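To make the two kinds of oracle concrete, here is a minimal sketch using only the standard library. The add function and the random_record helper are hypothetical names made up for illustration, and the random loop is a hand-rolled stand-in for what Hypothesis or QuickCheck would do with far better input generation and shrinking.

```python
import json
import random
import string


def add(a, b):
    return a + b


# Explicit oracle: the expected value is written down in advance.
assert add(40, 2) == 42


def random_record(rng):
    """Build a small JSON-compatible dict of random string keys and integers."""
    def key():
        return "".join(rng.choice(string.ascii_lowercase) for _ in range(5))
    return {key(): rng.randint(-1000, 1000) for _ in range(rng.randint(0, 5))}


def check_round_trip(trials=200, seed=0):
    """Property oracle: decoding an encoded value gives back the original."""
    rng = random.Random(seed)
    for _ in range(trials):
        record = random_record(rng)
        assert json.loads(json.dumps(record)) == record
    return trials


check_round_trip()
```

In both cases someone had to decide ahead of time what "correct" means, which is exactly what becomes hard for the complaints described next.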

The oracle problem is this: many software defects are hard to express as explicit oracles. When a user complains that “the dashboard feels broken,” or “something looks off on mobile,” or “the error message doesn’t make sense,” there is no clean assertion to write. The oracle in those cases is a human who has internalized what the application is supposed to do and can recognize a violation of that implicit contract.

LLMs are surprisingly good at this. They carry a rough model of how applications are supposed to behave, what error messages should look like, what a confusing UI flow feels like, and when a response seems semantically wrong even if it passes type checks. This is what Willison’s pattern leverages. The LLM is not writing assertions; it is acting as a soft oracle.

How It Works in Practice

The implementation involves giving an agent a set of browser automation tools and a task like “test the checkout flow” or “explore the settings panel and report anything that seems wrong.” The agent navigates the application, takes actions using tools (click, type, navigate, read DOM), observes results, decides whether anything is anomalous, and reports findings in natural language.

The tool definitions and system prompt for a minimal version, using Playwright as the browser automation layer and a tool-calling LLM, might look like this:

# Tool definitions exposed to the model. Only names and descriptions are
# shown; a real tool-calling API would also need a parameter schema per tool.
tools = [
    {"name": "navigate", "description": "Navigate to a URL"},
    {"name": "click", "description": "Click an element by CSS selector"},
    {"name": "type", "description": "Type text into an input"},
    {"name": "read_page", "description": "Return the page title, visible text, and any console errors"},
    {"name": "screenshot", "description": "Take a screenshot and return it as base64"},
    {"name": "report_issue", "description": "Record a potential bug with description and severity"},
]

system_prompt = """
You are a QA engineer testing a web application.
Navigate the application, interact with it as a user would,
and use report_issue() to record anything that looks broken,
confusing, or incorrect. Be specific about what you expected
vs what you observed.
"""

The agent runs in a loop until it decides it has explored enough or hits a turn limit. The output is a list of natural language bug reports.
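The loop itself can be sketched without committing to any particular model API or browser library. Everything below is an assumption for illustration: run_agent, scripted_model, the shape of a tool call (a dict with "name" and optional "args"), and the stub handlers are all hypothetical. In a real setup the model callable would wrap an LLM API and the handlers would wrap Playwright.

```python
def run_agent(model, handlers, max_turns=20):
    """Drive the tool loop until the model stops or the turn limit is hit."""
    issues = []
    history = []
    for _ in range(max_turns):
        call = model(history)  # hypothetical: a tool-call dict, or None when done
        if call is None:
            break
        name, args = call["name"], call.get("args", {})
        if name == "report_issue":
            issues.append(args)
            result = "recorded"
        elif name in handlers:
            result = handlers[name](**args)
        else:
            result = f"unknown tool: {name}"
        # The model sees its own calls and their results on the next turn.
        history.append({"call": call, "result": result})
    return issues


def scripted_model(script):
    """Stand-in for a real LLM: replays a fixed list of tool calls."""
    steps = iter(script)

    def model(history):
        return next(steps, None)

    return model


# Stub handlers that fake the browser; real ones would drive an actual page.
handlers = {
    "navigate": lambda url: f"loaded {url}",
    "click": lambda selector: f"clicked {selector}",
    "read_page": lambda: "title: Checkout; text: (no confirmation shown)",
}

script = [
    {"name": "navigate", "args": {"url": "https://staging.example.test/checkout"}},
    {"name": "click", "args": {"selector": "#submit"}},
    {"name": "read_page"},
    {"name": "report_issue", "args": {
        "description": "Form submits but shows no confirmation",
        "severity": "medium",
    }},
]

issues = run_agent(scripted_model(script), handlers)
```

Keeping the model behind a plain callable makes the loop testable with a scripted stand-in, which helps when the thing under test is the harness rather than the application.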

What makes this useful is that the agent can notice things that are hard to encode as assertions: a form that submits but shows no confirmation, an error message that references an internal exception class, a button that appears grayed out but is still clickable, a loading spinner that never disappears. These are real bugs, and none of them require a test oracle you could have written in advance.

Where This Sits in the Testing Pyramid

The traditional testing pyramid puts unit tests at the base (many, fast, cheap), integration tests in the middle, and end-to-end tests at the top (few, slow, expensive). Agentic manual testing does not replace any of those layers; it sits above the pyramid.

The pyramid layers test what you specified. Unit tests check that your functions do what you said they should. Integration tests check that components interact correctly. E2E tests check that the user flows you scripted work as intended. None of them test whether the application makes sense to a human who has not read the spec.

Agentic testing fills that gap. In spirit, it is closer to exploratory testing, a practice from the QA community where testers are given time to roam through an application without a test script, using their judgment to find issues that scripted tests miss. The insight in Willison’s pattern is that LLMs can approximate that exploratory judgment at a fraction of the cost.

That approximation is imperfect. An LLM agent is unlikely to catch race conditions reliably, will miss subtle visual regressions unless you give it good screenshot diffing, and has no memory between sessions unless you explicitly manage context. It will also generate false positives, flagging behavior as buggy when it is intentional.

The Flakiness Problem

Deterministic tests fail for a reason. Agentic tests fail for reasons that might include: the model chose a different path through the UI, the model interpreted something ambiguously, the model’s context filled up mid-session, or the model’s behavior shifted between API versions.

This makes agentic tests difficult to integrate into CI pipelines the same way you would integrate pytest or Jest. A failing agentic test run is a signal worth investigating, but it is not the same signal as a failing assertion. The right mental model is probably closer to a linter with a high tolerance threshold: you run it, you look at what it found, and you triage the results rather than treating every finding as a hard blocker.
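One way to wire that linter-style mental model into a pipeline is sketched below. The finding shape (a dict with a "severity" field) and the function names triage and exit_code are assumptions made for this example, not part of any tool. By default the step is advisory and never fails the build.

```python
SEVERITY_ORDER = {"high": 0, "medium": 1, "low": 2}


def triage(findings):
    """Sort findings for human review; unknown severities sort last."""
    return sorted(
        findings,
        key=lambda f: SEVERITY_ORDER.get(f.get("severity"), len(SEVERITY_ORDER)),
    )


def exit_code(findings, block_on=None):
    """Return 0 (advisory) unless a finding's severity is in block_on."""
    block_on = block_on or set()
    return 1 if any(f.get("severity") in block_on for f in findings) else 0


findings = [
    {"severity": "low", "description": "Tooltip text is truncated"},
    {"severity": "high", "description": "Checkout button does nothing"},
    {"severity": "medium", "description": "Error message exposes exception class"},
]

for finding in triage(findings):
    print(finding["severity"], "-", finding["description"])
```

Passing block_on={"high"} turns the most severe findings into blockers, which is a policy choice a team might make once it trusts the agent's severity ratings.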

Some teams work around flakiness by running multiple agent sessions and looking for consistent findings across runs. A single session flagging something is weak evidence. Three sessions independently flagging the same issue is much stronger. This is expensive but tractable for release validation specifically.
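The cross-session comparison can be sketched as a simple count. This assumes each finding has already been reduced to a comparable key (the "page:label" strings below are invented for illustration); in practice, matching free-text bug reports across runs is the hard part and is not solved here.

```python
from collections import Counter


def consistent_findings(sessions, min_sessions=2):
    """Return finding keys reported by at least min_sessions separate sessions."""
    counts = Counter()
    for findings in sessions:
        # Count each key once per session, even if the agent reported it twice.
        counts.update(set(findings))
    return {key: n for key, n in counts.items() if n >= min_sessions}


sessions = [
    ["checkout:no-confirmation", "settings:spinner-stuck"],
    ["checkout:no-confirmation", "checkout:no-confirmation"],
    ["checkout:no-confirmation", "login:typo"],
]

strong = consistent_findings(sessions, min_sessions=3)
```

Findings below the threshold are not necessarily false positives; they are candidates for a closer look rather than something to discard.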

There is also a more fundamental issue: when an agentic test run reports no problems, what does that tell you? With a deterministic test suite, a green run has well-defined coverage semantics. With an agentic run, the agent may simply not have explored the code path where the bug lives. The absence of findings is weak evidence of absence. This is the same limitation human exploratory testers face, but it is worth keeping in mind when you are deciding how much weight to put on a clean agentic test result.

The Cost Dimension

Running an LLM agent through a non-trivial UI test session costs real money. A thorough session with a capable model might consume tens of thousands of tokens, especially if you include screenshots with multimodal models. At current API pricing this is cheap enough to be practical for pre-release checks but too expensive to run on every commit.
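A back-of-the-envelope estimate makes the trade-off easier to reason about. The token counts, prices, and run frequencies below are placeholders chosen only to show the arithmetic, not real pricing for any provider; substitute your own numbers.

```python
def session_cost(input_tokens, output_tokens,
                 input_price_per_mtok, output_price_per_mtok):
    """Estimate one session's cost in whatever currency the prices use."""
    total = (input_tokens * input_price_per_mtok
             + output_tokens * output_price_per_mtok)
    return total / 1_000_000


# Placeholder figures for illustration only.
per_session = session_cost(
    input_tokens=50_000,
    output_tokens=5_000,
    input_price_per_mtok=3.0,
    output_price_per_mtok=15.0,
)

# Three sessions before a release versus one session on each of 40 commits.
per_release = 3 * per_session
per_day_of_commits = 40 * per_session
```

The per-session figure matters less than the multiplier: the same session run on every commit costs an order of magnitude more than a handful of runs before a release.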

The economics favor using agentic testing at inflection points: before a major release, after significant UI changes, when you have merged work from multiple branches and want a sanity check. It is a complement to your fast, cheap automated tests, not a replacement for them.

Smaller models can reduce cost significantly. A model that handles browser navigation well but has weaker judgment costs less per session and can still catch obvious breakage. The sensible approach involves tiered testing: a cheaper model for broad coverage, a more capable model for high-priority flows or anything the first pass flagged as suspicious.
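A tiered run might be orchestrated roughly like this. The function tiered_run and the two agent callables are hypothetical; each callable is assumed to take a flow name and return a list of findings, with the stubs standing in for real agent sessions.

```python
def tiered_run(flows, cheap_agent, capable_agent, high_priority=()):
    """Run every flow with the cheap agent; escalate flagged or priority flows."""
    report = {}
    for flow in flows:
        first_pass = cheap_agent(flow)
        escalate = bool(first_pass) or flow in high_priority
        # Only pay for the capable model where the first pass or policy asks for it.
        second_pass = capable_agent(flow) if escalate else None
        report[flow] = {"first_pass": first_pass, "second_pass": second_pass}
    return report


# Stub agents standing in for real sessions with two different models.
def cheap_agent(flow):
    return ["spinner never disappears"] if flow == "settings" else []


def capable_agent(flow):
    return [f"detailed review of {flow}"]


report = tiered_run(
    ["login", "settings", "checkout"],
    cheap_agent,
    capable_agent,
    high_priority={"checkout"},
)
```

Here login gets only the cheap pass, settings is escalated because the first pass flagged something, and checkout is escalated because it is marked high priority.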

What This Means for How We Write Software

One underappreciated consequence of agentic testing is that it creates pressure to write software that is legible to an LLM agent. Applications with clear, descriptive labels, meaningful error messages, and consistent behavior are easier to test this way, just as they are easier for human testers. An agent will struggle to tell what a button does if its label is “Submit” in seventeen different contexts.

This aligns with accessibility guidelines, semantic HTML, and good UX practice in general. A side effect of adopting agentic testing may be that it surfaces legibility problems that matter for real users too, since the same ambiguities that confuse the agent also confuse users who are not familiar with the application.

Willison’s guide frames agentic manual testing as one pattern among several in a broader agentic engineering approach, but it is the one that requires the least infrastructure to try. You do not need a full agentic development workflow to experiment with pointing an LLM at a staging environment and asking it what looks wrong. The barrier to a first experiment is low, and the feedback it generates, even from a brief session, tends to surface exactly the kind of contextual, emergent issues that automated test suites miss by design.
