Simon Willison’s guide to agentic engineering patterns covers the fundamentals of how coding agents operate. What it gestures at, without dwelling on it, is a deeper question: why do coding agents work reliably when general-purpose LLM agents have historically been fragile? The answer is architectural rather than a matter of model capability.
The Basic Loop
Every coding agent, whether it is Claude Code, Cursor’s agent mode, Aider, or Devin, runs the same fundamental loop. The model receives a context window containing a system prompt, conversation history, and accumulated tool results. It produces a response. If that response contains a tool call, the scaffolding executes it and appends the result to the conversation. The model is called again. This repeats until the model produces a response with no tool calls.
In pseudocode it looks like this:
```python
def run_agent(system_prompt, user_message):
    history = [system_prompt, user_message]
    while True:
        response = llm(history)
        if not response.tool_calls:
            return response.text
        history.append(response)  # the assistant turn that requested the tools
        for call in response.tool_calls:
            result = execute_tool(call.name, call.input)
            history.append(tool_result(call, result))
```
This pattern, sometimes called the ReAct loop after the 2022 paper by Yao et al., is not novel. What matters is what tools get plugged into it and what the execution environment looks like.
For a coding agent, the tool surface is intentionally narrow. Claude Code exposes bash (a persistent shell), read_file, write_file, edit (targeted string replacement), glob, grep, and a handful of utilities. Cursor’s agent mode provides read_file, edit_file, run_terminal_command, search_files, and list_dir. Aider works differently, generating diffs that get applied to files, but the underlying pattern is the same: the model reads state, produces a change, observes the outcome.
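The dispatch behind a tool surface like this can be very small. The sketch below is a minimal `execute_tool` in the spirit of the loop above; the tool names mirror the ones described in the text, but the argument names (`path`, `content`, `command`) are illustrative assumptions, not any product’s actual schema.

```python
import subprocess
from pathlib import Path

# Minimal tool dispatch for an agent loop. Errors and unknown tools are
# returned as strings rather than raised, so the model observes them as
# tool results and can recover.
def execute_tool(name: str, args: dict) -> str:
    if name == "read_file":
        return Path(args["path"]).read_text()
    if name == "write_file":
        Path(args["path"]).write_text(args["content"])
        return f"wrote {len(args['content'])} characters to {args['path']}"
    if name == "bash":
        proc = subprocess.run(
            args["command"], shell=True, capture_output=True, text=True, timeout=60
        )
        return proc.stdout + proc.stderr
    return f"unknown tool: {name}"
```

Returning failures as observations instead of exceptions matters: the loop keeps running, and the model gets a chance to read the error and try a different action.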
Why the Tool Schema Is the Real Design Surface
The description field in a tool schema is not documentation for humans. It is prompt engineering embedded in a type system. The model reads tool descriptions to decide which tool to call and how to use it, so a badly described tool leads to misuse rather than non-use.
Consider the difference between two versions of a file editing tool:
```json
{
  "name": "edit_file",
  "description": "Edit a file",
  "input_schema": {
    "type": "object",
    "properties": {
      "path": {"type": "string"},
      "content": {"type": "string"}
    }
  }
}
```
versus:
```json
{
  "name": "edit",
  "description": "Replace an exact string in a file. Prefer this over write_file for targeted changes. The old_string must appear exactly once in the file.",
  "input_schema": {
    "type": "object",
    "properties": {
      "file_path": {"type": "string", "description": "Absolute path to the file"},
      "old_string": {"type": "string", "description": "The exact text to replace"},
      "new_string": {"type": "string", "description": "The replacement text"}
    },
    "required": ["file_path", "old_string", "new_string"]
  }
}
```
The second version does something important: it constrains the model’s behavior through the schema itself. By requiring old_string and new_string, it forces the model to produce a targeted edit rather than regenerating an entire file. This reduces hallucination surface area significantly. When a model writes an entire file from scratch, it may silently drop code that was not in its attention. When it performs a string replacement, the scope of potential error is bounded.
This is why Aider’s SEARCH/REPLACE block format works well. It is a tool schema enforced via prompt convention rather than JSON schema, but the underlying principle is identical: narrow the model’s action space to reduce error modes.
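The constraint the second schema encodes is easy to enforce in the tool implementation itself. Here is a sketch, with names following the schema above; the error strings are returned rather than raised so the agent loop can feed them back to the model as observations.

```python
from pathlib import Path

# Sketch of the edit tool the second schema describes: the replacement
# is refused unless old_string matches exactly once, which bounds the
# scope of a bad edit to a single known location.
def edit(file_path: str, old_string: str, new_string: str) -> str:
    text = Path(file_path).read_text()
    count = text.count(old_string)
    if count == 0:
        return "error: old_string was not found in the file"
    if count > 1:
        return f"error: old_string appears {count} times; it must be unique"
    Path(file_path).write_text(text.replace(old_string, new_string, 1))
    return f"replaced 1 occurrence in {file_path}"
```

When the model supplies an ambiguous `old_string`, the error message itself tells it what to do next: include more surrounding context to make the match unique.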
The Filesystem as External Memory
General-purpose LLM agents, the kind popularized by AutoGPT and early LangChain experiments, struggle because they have no natural external memory. They need vector databases, explicit memory tools, and careful orchestration to remember what they did across many steps. When these systems fail, it is typically because they have lost track of their own state.
Coding agents have a solution built into the environment: the filesystem. An agent that writes a file has externalized that state. On the next iteration, it can read that file back. It does not need to remember writing it; it can verify the current state of the codebase directly.
This changes the failure mode substantially. A general-purpose agent that loses track of what it did in step three of a twenty-step process will produce incoherent output. A coding agent that loses track can run grep or read_file and recover. The filesystem is always consistent; the model’s memory is not.
The shell reinforces this. A coding agent running in a persistent bash session accumulates state: installed packages, defined environment variables, the current working directory. The shell is a second layer of external memory. When the model runs pip install requests and then calls import requests in a later step, the dependency is real. The model does not need to trust its own recollection.
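The persistent-shell idea can be sketched in a few lines: one long-lived bash process shared across tool calls, with a sentinel line marking the end of each command’s output. This is an illustration of the concept, not any product’s implementation; error-stream handling and timeouts are deliberately omitted.

```python
import subprocess
import uuid

# One long-lived bash process. Environment variables and the working
# directory persist between run() calls, so shell state acts as a
# second layer of external memory for the agent.
class PersistentShell:
    def __init__(self):
        self.proc = subprocess.Popen(
            ["bash"], stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True
        )

    def run(self, command: str) -> str:
        marker = f"__DONE_{uuid.uuid4().hex}__"
        self.proc.stdin.write(f"{command}\necho {marker}\n")
        self.proc.stdin.flush()
        lines = []
        for line in self.proc.stdout:
            if line.strip() == marker:
                break
            lines.append(line)
        return "".join(lines)
```

An exported variable set in one call is visible in the next, which is exactly the `pip install` example above: the dependency is real state in the environment, not a memory the model has to trust.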
Context Management Is the Hard Problem
The loop and the tool surface are straightforward. Context management is where most of the interesting engineering happens.
A typical codebase has far more content than fits in any context window, even the 200k-token windows available in current Claude models. Different agents address this differently.
Aider uses a repository map: a compact representation of function and class signatures (originally extracted with ctags, later with tree-sitter), with files ranked by relevance to the current task using a graph algorithm similar to PageRank. The model sees a dense summary of the whole codebase and reads specific files only when needed. The --map-tokens parameter controls how many tokens the map consumes.
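The “dense summary” half of the repo-map idea is simple to illustrate. This toy sketch uses Python’s ast module to reduce a source file to its top-level signatures; Aider’s real map uses tree-sitter across many languages plus the graph ranking, neither of which is shown here.

```python
import ast

# Reduce a Python file to its top-level function and class signatures,
# so many files fit in a small token budget.
def file_signatures(source: str) -> list[str]:
    sigs = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            sigs.append(f"def {node.name}({args})")
        elif isinstance(node, ast.ClassDef):
            sigs.append(f"class {node.name}")
            for item in node.body:
                if isinstance(item, ast.FunctionDef):
                    args = ", ".join(a.arg for a in item.args.args)
                    sigs.append(f"    def {item.name}({args})")
    return sigs
```

A few lines of signatures per file is often enough for the model to decide which files deserve a full read.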
Cursor takes a retrieval approach, embedding the codebase at index time and performing similarity search at query time to pull relevant chunks into context. This works well for large unfamiliar codebases but can miss non-obvious relationships between files.
Claude Code uses explicit inclusion: the model decides which files to read via read_file calls, and those files accumulate in the conversation history. This is simple and transparent, but it requires the model to make good decisions about what to read upfront. The offset and limit parameters on read_file let the model read large files in chunks rather than loading them whole.
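Chunked reading of this kind is straightforward to sketch. The parameter names below follow the description above, but the exact semantics (line units, a 1-based header) are assumptions for illustration.

```python
from pathlib import Path

# Read a window of a file rather than the whole thing. The header line
# tells the model where it is, so it can issue a follow-up read with a
# larger offset to continue.
def read_file(path: str, offset: int = 0, limit: int = 2000) -> str:
    lines = Path(path).read_text().splitlines()
    chunk = lines[offset : offset + limit]
    header = f"[lines {offset + 1}-{offset + len(chunk)} of {len(lines)}]"
    return "\n".join([header, *chunk])
```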
Anthropic’s prompt caching adds another dimension. When the system prompt and stable file contents appear early in the conversation, the inference server can cache that prefix and reuse it across calls. A long agentic run with many tool iterations benefits considerably from this: the model pays full price for the first call and reduced cost for subsequent calls where the prefix is unchanged.
The Verification Loop
The property that makes coding agents more reliable than any other category of LLM agent is the availability of automatic verification. Code either compiles or it does not. Tests either pass or they do not. A linter either reports errors or it does not.
This means the agent loop has a natural feedback mechanism: make a change, run the relevant checks, observe the output, try again. The model does not need to reason about whether its edit was correct in the abstract; it can execute the tests and read the results.
This is fundamentally different from a writing agent, a research agent, or a planning agent, where correctness is subjective and verification requires human judgment. Coding agents can close the loop autonomously in a way those categories cannot.
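One turn of that closed loop can be sketched directly. The commands shown are examples; any test runner, compiler, or linter with a meaningful exit code fits the same shape.

```python
import subprocess
import sys

# Run a check, capture its output, and hand both the pass/fail signal
# and the raw text back to the model as a tool result.
def verify(command: list[str]) -> tuple[bool, str]:
    proc = subprocess.run(command, capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

# In an agent loop this might look like:
#   ok, output = verify([sys.executable, "-m", "pytest", "-x", "-q"])
# and the loop keeps editing until ok is True or a budget runs out.
```

The exit code gives the scaffolding an objective stopping condition, and the captured text gives the model the concrete failure to reason about.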
Cognition AI’s Devin makes this explicit in its architecture: it runs a full VM with a persistent shell, browser, and editor, and it can run test suites, observe CI output, and iterate over hours on a single task. The length of the iteration horizon depends directly on how reliable the verification step is. When tests are comprehensive, the agent can be trusted to iterate further without human checkpoints.
Prompt Injection Is the Unresolved Problem
Willison has written about prompt injection as the fundamental unresolved security problem for agents that process external content. For coding agents, this is acute. An agent that reads a repository might encounter a file containing instructions designed to hijack its behavior. An agent browsing documentation might hit a page with injected instructions in a comment or metadata field.
Current mitigations are mostly procedural: system prompts that tell the model to ignore instructions in tool results, user confirmation steps before destructive actions, sandboxed execution environments. None of these are robust. The model’s architecture does not distinguish between instructions from the operator and instructions embedded in data; it processes them all as tokens in the context window.
This matters more as coding agents gain more autonomous operation. An agent that can push commits, open pull requests, and deploy to staging has a large blast radius if its behavior is hijacked. The minimal-privilege principle applies: coding agents should be granted only the permissions they need for the specific task, and destructive operations should require explicit confirmation.
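A minimal-privilege gate of the kind described above can be sketched as a wrapper around command execution. The pattern list here is illustrative, not any product’s policy, and the substring matching is deliberately crude; real scaffolding would use structured permissions rather than string inspection.

```python
import subprocess

# Commands matching a destructive pattern are held for explicit
# confirmation before the scaffolding runs them.
DESTRUCTIVE_PATTERNS = ("rm ", "git push", "git reset --hard", "drop table")

def requires_confirmation(command: str) -> bool:
    lowered = command.lower()
    return any(pattern in lowered for pattern in DESTRUCTIVE_PATTERNS)

def gated_run(command: str, confirm) -> str:
    # confirm is any callable (e.g. a terminal prompt); it is injected
    # here so the gate can be exercised without interaction.
    if requires_confirmation(command) and not confirm(command):
        return "blocked: confirmation declined"
    proc = subprocess.run(command, shell=True, capture_output=True, text=True)
    return proc.stdout + proc.stderr
```

This does not solve prompt injection, since the model still processes untrusted tokens, but it caps the blast radius: a hijacked agent cannot complete a destructive action without a human in the loop.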
What the Scaffolding Actually Does
Most of what gets marketed as a “coding agent” is scaffolding around the basic tool loop. The scaffolding handles permission prompting, context truncation, tool execution, error display, and user interaction. The model itself is doing the reasoning; the scaffolding is doing the plumbing.
This is worth understanding because it means the quality of a coding agent is mostly determined by three things: the quality of the underlying model, the design of the tool schemas, and the strategy for managing context. A more capable model in a well-designed agent loop will outperform a less capable model significantly. Upgrading the underlying model without changing the scaffolding usually produces immediate improvement.
It also means that building a minimal coding agent is not especially difficult. The loop is a few dozen lines of code. The interesting work is in tool design, context management, and the edge cases that emerge when the model makes unexpected decisions. Willison’s own llm CLI tool demonstrates this: a lightweight Python tool with plugin support that can act as an effective coding assistant without framework overhead.
The pattern is not going to get more complicated. The fundamental architecture of read context, produce action, observe result, repeat will remain the basis of coding agents for the foreseeable future. What will change is the quality of the models doing the reasoning, the richness of the tool surfaces they operate on, and the sophistication of the scaffolding that manages context across long tasks.