The tool loop at the heart of every coding agent, the one Simon Willison describes in his guide on agentic engineering patterns, is not a new idea. It’s the same pattern that animated AutoGPT when it went briefly viral in early 2023, the same architecture behind BabyAGI, the same fundamental structure that a dozen “AI agent” demos used to impressive and then immediately disappointing effect.
Most of those agents failed to produce reliably useful results. Coding agents have not. Understanding why requires looking at what is special about code as a domain, not at what is special about the loop.
Code Has Executable Ground Truth
The most important property of code as a domain is that it has executable verification. When an agent writes a function, you can run it. When it fixes a bug, you can run the test suite. When it refactors a module, the type checker either passes or it doesn’t. These are binary, grounded signals that don’t require human judgment to interpret.
Compare this to other domains where agents have been applied. Research agents that browse the web and synthesize information produce outputs where verification requires the human to already know the answer. Customer service agents that draft responses produce outputs where quality is subjective and slow to evaluate. Medical agents produce outputs where verification might require clinical tests and days of waiting. In none of these cases does the agent receive the kind of immediate, objective feedback that a compiler provides.
This is not a small advantage. It’s the reason the ReAct loop from Yao et al. (2022), the Reason-Act-Observe cycle that most current agents follow, works well for coding. The “Observe” step has substance. When the agent runs pytest and receives a stack trace, it has real, structured information. When it compiles Rust code and gets type errors, it knows exactly what went wrong and where. The feedback loop closes tightly, in seconds, with no ambiguity about what the observations mean.
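The cycle can be made concrete with a minimal sketch. Here `propose` stands in for the model (stubbed with canned attempts) while `observe` is real: it executes the candidate code and returns either success or the actual stack trace. The names and the two-attempt stub are illustrative, not any particular agent's implementation.

```python
import subprocess
import sys
import tempfile

def observe(source: str) -> str:
    """Run the candidate code in a subprocess and return grounded
    feedback: 'ok', or the actual stderr (stack trace, SyntaxError)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
        path = f.name
    result = subprocess.run([sys.executable, path],
                            capture_output=True, text=True)
    return "ok" if result.returncode == 0 else result.stderr

def react_loop(propose, task: str, max_turns: int = 5) -> str:
    """Reason-Act-Observe: `propose` plays the model; observe()
    supplies the executable ground truth that closes the loop."""
    feedback = ""
    for _ in range(max_turns):
        source = propose(task, feedback)  # Reason + Act
        feedback = observe(source)        # Observe
        if feedback == "ok":
            return source
    raise RuntimeError("no passing solution within budget")

# Stub model: the first attempt has a bug; the second "fixes" it
# after seeing the NameError in the feedback.
attempts = iter([
    "assert add(1, 2) == 3",  # NameError: add is not defined
    "def add(a, b):\n    return a + b\nassert add(1, 2) == 3",
])
solution = react_loop(lambda task, fb: next(attempts), "implement add")
print("solved" if "def add" in solution else "failed")
```

The point is the asymmetry: everything interesting happens in `observe`, which is why the same loop is only as good as the domain's verification signal.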
General-purpose agents typically observe something like “the search returned these results” or “the API call completed.” These observations have low information density relative to the questions the agent is trying to answer. The model is left reasoning about whether its actions moved it closer to the goal with limited feedback. Coding agents observe test outcomes, compiler output, and lint warnings: direct, structured responses to the specific changes the agent made.
The Closed-World Assumption
Code is a closed-world system in a way that most agent tasks are not. A Python file is a complete formal specification of its own behavior. There is no implicit context, no social inference required, no background knowledge that might change what a line of code means. The meaning of:
```python
def authenticate(user_id: int, token: str) -> bool:
    return db.verify_token(user_id, token)
```
is fully determined by the function body, the types, and the behavior of db.verify_token, which is itself a closed piece of code. Nothing outside the repository affects what this does.
This makes code uniquely tractable for an agent that has a bounded context window. Reading the relevant files actually gives you all the information you need. There is no unwritten context, no meeting-room history, no institutional knowledge that lives only in people’s heads. A sufficiently thorough read of a repository can, in principle, give you a complete picture of what a system does.
This is why context management strategies like Aider’s repo map, which uses tree-sitter to generate a compact structural index of the entire codebase, actually work. The repo map gives the agent a real, complete summary of what exists. If you tried to build a “repo map” for an organization’s decision-making process or a company’s competitive landscape, you would not get the same completeness guarantee. The closed-world assumption that makes the repo map useful is specific to code.
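A simplified version of the idea can be sketched with the standard library alone. This is not Aider's implementation, which uses tree-sitter across many languages and ranks symbols by relevance; it only illustrates why a complete structural index is even possible: the files themselves contain everything there is to know.

```python
import ast
from pathlib import Path

def repo_map(root: str) -> str:
    """Toy repo map: list every Python file's top-level classes and
    function signatures. A stdlib-only sketch of the structural index
    that the closed-world property makes complete."""
    lines = []
    for path in sorted(Path(root).rglob("*.py")):
        tree = ast.parse(path.read_text())
        lines.append(f"{path.relative_to(root)}:")
        for node in tree.body:
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                args = ", ".join(a.arg for a in node.args.args)
                lines.append(f"  def {node.name}({args})")
            elif isinstance(node, ast.ClassDef):
                lines.append(f"  class {node.name}")
    return "\n".join(lines)
```

For a repository of code, this index is exhaustive by construction; for an organization's decision-making process, no analogous traversal exists.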
Reversibility and Risk
Another property that makes coding tractable is that most coding operations are reversible. Git exists specifically to enable this. Writing a file, changing a function, deleting a module: all of these are operations that can be undone cheaply and reliably. Aider uses git commits as its primary safety mechanism, committing before every change so there is always an undo path. Claude Code runs in a working directory under version control where any change can be inspected and reversed with git diff and git restore.
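The commit-before-change pattern is simple enough to sketch. The function below is a hypothetical wrapper, not Aider's actual code; it assumes it runs inside a git working tree and snapshots the current state before touching a file, so a one-command undo path always exists.

```python
import subprocess

def safe_write(path: str, new_content: str) -> None:
    """Snapshot the working tree with a git commit before editing,
    so the edit can always be reverted with `git restore` or
    `git revert`. (Sketch; assumes a git working tree.)"""
    subprocess.run(["git", "add", "-A"], check=True)
    # --allow-empty keeps the snapshot even when nothing changed
    # since the last one, so every edit has a pre-state commit.
    subprocess.run(
        ["git", "commit", "--allow-empty", "-m",
         f"pre-edit snapshot: {path}"],
        check=True,
    )
    with open(path, "w") as f:
        f.write(new_content)
```

The safety property lives entirely in the storage layer: the agent's write can be arbitrarily wrong and the recovery cost stays constant.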
Compare this to agents that send emails, place orders, modify databases without transactions, or call external APIs with side effects. Those actions have real-world consequences that cannot be undone with a single command. The asymmetry of risk changes how you build the agent. You either need to confirm every action with the human, which eliminates the value of automation, or you need to be very confident the agent won’t make mistakes, which isn’t yet achievable at the level of precision that consequential actions require.
Code, by default, sits in a cheap-to-revert storage layer. The worst case of an agent mistake is “there’s a bad change that needs to be reverted,” which is both detectable, through tests, and correctable, through version control. This is why coding agents can afford to be more autonomous than most agents should be. The blast radius of any given mistake is structurally bounded by the tooling before you’ve even written a single line of agent code.
Why AutoGPT Failed
It’s worth being specific about what went wrong with first-generation general-purpose agents to understand what coding agents got right. The failure modes were consistent across nearly all of them:
First, no ground truth. Tasks like “research my competitors” or “plan my vacation” have no executable verification step. The agent could not tell whether it was making progress or spinning in place. Every turn required the model to judge the quality of its own previous work, which models in 2023 were not reliable at.
Second, unbounded scope. Without a well-defined task boundary, agents would continue taking actions indefinitely, accumulating costs and often drifting far from the original goal. AutoGPT was notorious for deciding mid-task that it first needed to understand philosophy in order to accomplish a practical request.
Third, no reversibility. Tool calls that sent emails, created calendar events, or posted to external services could not be undone. This forced a binary choice between constant human confirmation prompts and occasional catastrophic irreversible mistakes.
Fourth, high noise in observations. Web search results and scraped page content are variable in format, relevance, and reliability. Extracting signal required more reasoning per turn than the models available at the time could perform consistently.
Coding agents addressed each of these structurally. The task boundary is usually “fix this bug” or “implement this feature,” which has a clear completion condition in the form of passing tests. The operations are reversible through version control. The observations are structured and high-information. The context, the codebase, is closed and legible.
Where Else This Logic Applies
Understanding why coding works is also a lens for evaluating which other domains will support effective agent operation.
SQL database administration shares several of coding’s useful properties: operations are transactional and reversible, there is ground truth (queries return correct or incorrect results), and the domain is closed in the sense that the schema is a complete specification of the data model. Agents extended to database work should be more tractable than agents extended to content moderation or market research.
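Transactional reversibility is easy to see in miniature. The example below uses sqlite3 as a stand-in for any SQL database with transactions: a migration that fails partway rolls back atomically, leaving the data exactly as it was.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (name) VALUES ('ada')")
conn.commit()

try:
    with conn:  # opens a transaction; rolls back on exception
        conn.execute("INSERT INTO users (name) VALUES ('bob')")
        conn.execute("INSERT INTO users (name) VALUES (NULL)")  # violates NOT NULL
except sqlite3.IntegrityError:
    pass

# The whole transaction was undone: 'bob' never made it in.
count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(count)  # 1
```

An agent operating inside transactions gets the same bounded blast radius that version control gives a coding agent.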
Infrastructure as code, writing Terraform or Ansible configurations, shares similar properties. The system state is declarative, verifiable through plan output, and revertable through version control. Agents for this domain should work reasonably well for the same reasons coding agents do.
Research synthesis and knowledge work, generating reports, drafting strategy documents, answering analytical questions, remain genuinely hard because none of these properties hold cleanly. Verification is subjective, scope is open-ended, and the closed-world assumption breaks down completely. The agent has no mechanism to know whether it’s done or whether it’s producing useful output.
The Practical Implication for Codebases
If you’re building systems that use coding agents as components, the practical implication follows directly from the analysis above: the quality of your verification layer determines the effective capability of the agent.
A coding agent working in a codebase with comprehensive, fast tests has more information per turn than the same agent working in a poorly tested codebase. Tests are not just safety nets for humans reviewing code; they are the feedback mechanism that makes the agent’s Observe step meaningful. An agent without runnable tests is an agent with significantly degraded observability into its own behavior, relying on its own reasoning about what it changed rather than on grounded, executable evidence.
The same logic extends to type systems, linters, and static analysis tools. Every tool that produces structured, machine-readable feedback about the state of the code is additional signal the agent can use to verify its own work. Investing in these tools was already valuable for human developers. In an agentic workflow, they become load-bearing infrastructure.
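The verification layer can be thought of as a stack of cheap, structured checks. In this sketch, `ast.parse` stands in for a compiler or type checker and a subprocess run stands in for a test suite; a real setup would add mypy, a linter, and the project's tests through the same pattern. The function name and signal format are illustrative.

```python
import ast
import subprocess
import sys

def verify(source: str) -> list[str]:
    """Layered verification: each check returns structured signal an
    agent can act on. Empty list means every layer passed."""
    try:
        ast.parse(source)  # static layer: does it even parse?
    except SyntaxError as e:
        return [f"syntax: line {e.lineno}: {e.msg}"]
    result = subprocess.run([sys.executable, "-c", source],
                            capture_output=True, text=True)
    if result.returncode != 0:  # dynamic layer: do assertions pass?
        last = result.stderr.strip().splitlines()[-1]
        return [f"runtime: {last}"]
    return []

print(verify("def f(:"))            # static layer catches it
print(verify("assert 1 + 1 == 3"))  # dynamic layer catches it
print(verify("assert 1 + 1 == 2"))  # []
```

Each additional layer raises the information per turn the agent receives, which is exactly the quantity the argument above says determines its effective capability.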
The loop Simon Willison describes is simple and well-understood at this point. The interesting work is in understanding which domains make the loop useful and which make it unreliable. Code has the right properties. That is not an accident; it is a structural feature of the domain that explains why coding was one of the first practical applications of agentic AI, and why it remains one of the most reliable.