Code Gives Agents Something General-Purpose Agents Rarely Have: Ground Truth
Source: simonwillison
The feedback problem that general-purpose agents cannot solve
Most discussions of coding agents focus on the loop: send messages to a model, execute the tool calls it returns, append results, repeat. Simon Willison’s guide to agentic engineering patterns describes this well, and the mechanics are genuinely simple. What those mechanics do not explain is why coding agents are measurably more reliable than agents working on open-ended tasks like research synthesis or document drafting.
The answer has less to do with the model and more to do with what code provides that prose does not: ground truth.
When a coding agent edits a file and runs the test suite, it gets back a deterministic signal. Tests pass or they fail. The compiler accepts or rejects. The linter counts violations. These signals are precise, they are generated by the environment rather than estimated by the model, and they give the agent a basis for deciding whether to continue or stop. An agent working on a prose task has no equivalent. It can ask the model to evaluate its own output, but that evaluation is itself a model output, which means errors compound rather than cancel.
What ground truth changes about the loop
Consider what happens when a coding agent makes a mistake. It edits a file incorrectly. The next step in most coding agent configurations is to run a verification step: compile, lint, or test. The result comes back as structured tool output with an exit code, stdout, and stderr. The model reads the error, identifies the problem, and generates a correction. This loop, edit-verify-correct, converges on working code without any human intervention because each iteration has an objective stopping condition.
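The verification step can be sketched in a few lines (a minimal illustration, not any particular agent's API; `verify` and `VerifyResult` are hypothetical names):

```python
# Sketch of the verification step in an edit-verify-correct loop: run a
# compile/lint/test command and capture objective, environment-generated
# feedback (exit code, stdout, stderr) rather than a model-generated guess.
import subprocess
from dataclasses import dataclass

@dataclass
class VerifyResult:
    ok: bool
    stdout: str
    stderr: str

def verify(cmd: list[str]) -> VerifyResult:
    """Run a check command; the exit code is the objective stopping condition."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return VerifyResult(ok=proc.returncode == 0,
                        stdout=proc.stdout, stderr=proc.stderr)

# In the loop, a failure result becomes the next prompt to the model:
#   result = verify(["pytest", "-x"])
#   if not result.ok:
#       next_prompt = f"Tests failed:\n{result.stderr}\nFix the code."
```

The structured result is what makes the loop converge: the agent stops when `ok` is true, and otherwise has concrete error text to reason against.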
This is not how general-purpose agents work. A research agent summarizing sources cannot run its summary through a verifier. An email-drafting agent cannot compile its output. Even a relatively well-structured prose task, like following a style guide, lacks the hard failure signal that turns agent behavior from exploration into convergence.
The edit-verify-correct loop in code was not invented by LLM agents. It is the normal software development cycle, formalized and tightened. What coding agents add is the ability to drive that cycle without waiting for a human to interpret each result.
How tests serve as convergence signals
Well-tested codebases are far better substrates for coding agents than poorly tested ones. A codebase with comprehensive tests lets the agent treat passing tests as a proxy for correctness. It does not need to reason from first principles about whether its changes are right; it can observe whether the environment accepts them.
Aider exploits this with a --test-cmd flag. When specified, Aider runs the test command after each edit and automatically retries if it fails. The agent loops until tests pass or the iteration budget is exhausted. This behavior is only possible because tests produce a binary signal. Aider does not need to evaluate whether the output is good; it reads the exit code.
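The retry behavior amounts to a loop like the following (a simplified sketch, not Aider's actual implementation; `apply_edit` stands in for the model call that proposes and applies a fix):

```python
# Aider-style retry loop: apply a model-proposed edit, run the user-supplied
# test command, and retry on failure until tests pass or the budget runs out.
import subprocess

def edit_until_tests_pass(apply_edit, test_cmd: list[str],
                          max_iters: int = 5) -> bool:
    """apply_edit(error_output) asks the model for a fix given the last
    failure; test_cmd is the equivalent of the --test-cmd value."""
    error_output = ""
    for _ in range(max_iters):
        apply_edit(error_output)        # model proposes and applies an edit
        proc = subprocess.run(test_cmd, capture_output=True, text=True)
        if proc.returncode == 0:
            return True                 # binary signal: tests pass, stop
        error_output = proc.stdout + proc.stderr  # feed failure back
    return False                        # iteration budget exhausted
```

Note that nothing here evaluates output quality; the exit code alone drives the control flow.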
Claude Code takes the same approach more interactively. When asked to fix a bug, it typically reads the relevant files, makes a change, runs the affected tests, and decides whether to continue based on the result. The model’s judgment still matters, but it is applied on top of objective feedback rather than instead of it.
The practical implication for anyone maintaining a codebase that agents will touch: investment in test coverage is investment in agent reliability. Untested code behaves differently under agentic editing not because the model acts differently, but because the agent has no stopping condition and cannot distinguish a good change from a bad one through runtime feedback alone.
The role of file system state
Beyond test results, the file system itself is a form of ground truth. Files have content that can be read back after editing. An agent can write a change, read the file again, and verify that what it wrote is what it intended. This sounds trivial, but it rules out a class of failure common in language model outputs: silent inconsistency, where the model produces output that looks correct but contains small deviations from intent that are only visible on close inspection.
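The write-then-read-back check is mechanical (a minimal sketch; `verified_write` is a hypothetical helper):

```python
# Write a change, then read the file back to confirm that what landed on
# disk is exactly what was intended -- a structural check that catches
# silent inconsistency without any model judgment.
from pathlib import Path

def verified_write(path: Path, content: str) -> None:
    path.write_text(content)
    observed = path.read_text()
    if observed != content:
        raise IOError(f"write to {path} did not round-trip")
```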
The str_replace edit tool design is built around this property. Rather than asking the model to reproduce an entire file with changes, str_replace takes the old content and the new content for a specific region. If the old content is not found in the file, the tool returns an error. This means the model cannot hallucinate a successful edit; the file system enforces consistency between the model’s mental state and the actual file.
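A minimal sketch of the pattern (not any production tool's implementation; the uniqueness check mirrors common str_replace designs and is an assumption here):

```python
# str_replace-style edit tool: the edit only succeeds if `old` appears
# exactly once in the file, so a model working from stale or hallucinated
# content gets an error instead of a false success.
from pathlib import Path

def str_replace(path: Path, old: str, new: str) -> str:
    text = path.read_text()
    count = text.count(old)
    if count == 0:
        return "Error: old content not found in file"
    if count > 1:
        return "Error: old content is not unique; provide more context"
    path.write_text(text.replace(old, new, 1))
    return "OK"
```

The error strings go back to the model as tool output, so a failed match becomes feedback rather than a silent divergence between the model's mental state and the file.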
This is a design pattern worth naming: use the environment as a validator rather than asking the model to validate its own output. The environment is cheaper, faster, and more reliable for structural checks. Model judgment should be reserved for decisions that are genuinely semantic, like whether a change is architecturally correct.
Where domain constraints tighten the loop
Coding as a domain has several structural properties that make agentic systems tractable:
First, the state is explicit. All relevant state for most coding tasks lives in files that can be read. There is no hidden state to infer from context cues. The agent can always read the file and know exactly what is there.
Second, changes are local and reversible. Editing a function affects that function. Reverting is a single git command. The blast radius of most coding operations is bounded and undoable, which makes it safe to attempt an operation and observe the result rather than planning exhaustively before acting.
Third, the success condition is often formalizable. “Make the tests pass” is a complete specification for a large class of bug-fix tasks. “Implement this interface” can be checked by compiling the code against the interface definition. Many coding tasks can be reduced to a form where success is verifiable by machine.
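The "implement this interface" case reduces to a machine check even in a dynamic language. A Python sketch, using an abstract base class as the interface definition (`Storage` and `implements_interface` are hypothetical names for illustration):

```python
# Machine-checkable success condition: instantiation fails with TypeError
# if any abstract method of the interface is left unimplemented.
from abc import ABC, abstractmethod

class Storage(ABC):                       # stands in for "this interface"
    @abstractmethod
    def get(self, key: str) -> str: ...

    @abstractmethod
    def put(self, key: str, value: str) -> None: ...

def implements_interface(cls) -> bool:
    """True iff cls is a Storage subclass with all methods implemented."""
    try:
        cls()                             # raises TypeError if incomplete
        return issubclass(cls, Storage)
    except TypeError:
        return False
```

In a compiled language the compiler performs this check for free; the point is the same either way: success is verifiable by machine, not by judgment.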
General-purpose tasks lack these properties: their state is implicit, distributed across sources the agent cannot read; their effects may be irreversible; their success conditions are often undefined or subject to human judgment. These differences do not make general-purpose agents impossible, but they do make the edit-verify-correct loop unavailable, which removes the primary mechanism that makes coding agents reliable.
The limits of the structural advantage
This structural argument has edges. Not all coding tasks are well-specified. Refactoring for readability has no test. Architecture decisions are not compile-checkable. Tasks that cross file boundaries and require maintaining conceptual consistency across a large codebase exceed what test coverage can verify locally.
For these tasks, coding agents behave more like general-purpose agents: they produce output that looks plausible and may contain subtle errors that only manifest in context the agent did not check. Claude Code’s subagent mechanism is partly a response to this. Spawning a fresh context for a bounded subtask keeps each agent instance working on a problem small enough that the structural advantages hold, even if the parent task is too large for any single agent to verify end-to-end.
GitHub Copilot’s agent mode adds semantic retrieval via vector indexing, which helps with cross-file consistency by surfacing related code automatically. But embedding similarity is approximate, so this is a probabilistic improvement rather than a structural one. It reduces the chance the agent misses relevant context; it does not eliminate the possibility.
The honest summary is that coding agents are reliable for tasks where ground truth feedback is available, and substantially less reliable for tasks where it is not. The loop mechanics are the same. What differs is whether the environment can tell the agent when it is wrong.