Simon Willison recently published a thorough guide called How Coding Agents Work as part of his agentic engineering patterns series. It’s worth reading in full. But rather than summarizing it, I want to zoom into the mechanics that most write-ups gloss over: how agents actually find code, edit files without corrupting them, and manage a context window that will eventually run out.
The Loop
Every coding agent, whether it’s Claude Code, Cursor, Aider, or something custom, is built on the same skeleton. A model receives a message. It responds with either a final text reply or a tool call. If it calls a tool, the result gets appended to the conversation, and the model responds again. This continues until the model stops calling tools.
That’s it. The sophistication lives in the tools, the system prompt, and how context is managed, not in some novel architecture.
The OpenAI function-calling format, now widely adopted, encodes this cleanly. A tool call in the conversation looks roughly like:
{
  "role": "assistant",
  "content": null,
  "tool_calls": [
    {
      "id": "call_abc123",
      "type": "function",
      "function": {
        "name": "read_file",
        "arguments": "{\"path\": \"src/auth/session.ts\"}"
      }
    }
  ]
}
The tool result gets appended as a separate message with role tool, keyed by that same ID. The model sees the result and decides what to do next. There is no magic here, no hidden reasoning layer, just a conversation that grows with each iteration of the loop.
Anthropic’s documentation on tool use is explicit about this structure. The model is not executing the tools. The host application is. The model only outputs the intent.
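The loop and the host-side execution described above fit in a few dozen lines. This is a minimal sketch, not any product's actual implementation: `fake_model` stands in for a real LLM API call, and `read_file`, `TOOLS`, `execute`, and `run_agent` are hypothetical names invented for illustration. The message shapes loosely follow the function-calling format shown earlier.

```python
import json


def read_file(path: str) -> str:
    # Hypothetical tool body; a real one would read from disk.
    return f"(contents of {path})"


TOOLS = {"read_file": read_file}


def fake_model(messages: list[dict]) -> dict:
    """Stand-in for an LLM call: request one tool, then answer in text."""
    if not any(m["role"] == "tool" for m in messages):
        return {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_1",
                "type": "function",
                "function": {
                    "name": "read_file",
                    "arguments": json.dumps({"path": "src/auth/session.ts"}),
                },
            }],
        }
    return {"role": "assistant", "content": "Done reading the file."}


def execute(call: dict) -> dict:
    """Host side: run one requested tool and wrap the result as a tool message."""
    try:
        fn = TOOLS[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])
        content = fn(**args)
    except Exception as exc:
        # Report failures back to the model instead of crashing the loop.
        content = f"ERROR: {type(exc).__name__}: {exc}"
    return {"role": "tool", "tool_call_id": call["id"], "content": content}


def run_agent(model, user_message: str, max_steps: int = 10) -> list[dict]:
    """Call the model repeatedly until it replies without tool calls."""
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        reply = model(messages)
        messages.append(reply)
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:
            break  # plain text reply: the loop ends
        messages.extend(execute(call) for call in tool_calls)
    return messages


history = run_agent(fake_model, "What does session.ts do?")
unknown = execute({"id": "call_2", "function": {"name": "nope", "arguments": "{}"}})
```

The `max_steps` cap is a design choice of this sketch: without some bound, a model that keeps calling tools would loop forever.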
What Tools a Coding Agent Actually Needs
The minimal viable toolset for a coding agent is smaller than you might expect:
- Read file: given a path, return the contents, usually with line numbers
- Write or edit file: modify file contents
- Run a shell command: execute arbitrary commands and return stdout/stderr
- Search: find files by name pattern or search file contents by regex
Everything else is optimization or convenience. Tools like list_directory, get_file_info, or dedicated grep wrappers exist because they’re cheaper than shelling out and because they return results in a format the model handles well, with no ANSI codes to strip and no large binary outputs to deal with.
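To make the search tool concrete, here is a minimal sketch of a dedicated content search written in pure Python. The function name `search_contents`, its parameters, and the `path:line: text` output format are assumptions for illustration, not the interface of any real agent.

```python
import re
import tempfile
from pathlib import Path


def search_contents(root: str, pattern: str, glob: str = "**/*",
                    max_matches: int = 50) -> list[str]:
    """Return 'relative/path:line: text' for each line matching the regex."""
    regex = re.compile(pattern)
    matches: list[str] = []
    for path in sorted(Path(root).glob(glob)):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for lineno, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                rel = path.relative_to(root).as_posix()
                matches.append(f"{rel}:{lineno}: {line.strip()}")
                if len(matches) >= max_matches:
                    return matches  # cap the output to protect the context window
    return matches


# Small demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "auth.py").write_text(
        "def authenticate(user):\n    return True\n", encoding="utf-8")
    (Path(tmp) / "notes.txt").write_text("nothing here\n", encoding="utf-8")
    results = search_contents(tmp, r"def \w+\(")
```

Capping matches and skipping binary files are the kinds of choices that make a dedicated tool cheaper for the model than raw shell output.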
Claude Code, for instance, exposes dedicated Glob and Grep tools precisely because find and grep via bash work but produce noisier output and cost more tokens to parse. Aider, by contrast, keeps the toolset minimal and relies heavily on bash for most operations.
The shell execution tool is the most powerful and most dangerous. With it, an agent can install packages, run tests, make network requests, and delete files. Most production tools either sandbox this (running in a container or VM) or require user confirmation before commands run. Claude Code, for example, asks for approval before running a shell command and offers to stop asking for matching commands, which is a pragmatic tradeoff.
How Agents Navigate a Codebase
A 200,000-token context window sounds large until you consider that a mid-sized codebase might have hundreds of files totaling several megabytes. You cannot read everything. Agents navigate this the same way an experienced developer would: start with the structure, then drill into relevant files.
The common pattern:
- Read README, package.json, pyproject.toml, or whatever the project manifest is
- Use glob to get a high-level picture of the file tree
- Search for relevant symbols, function names, or error messages using grep
- Read specific files identified as relevant
- Follow imports and references to adjacent files
This is essentially how a senior engineer approaches an unfamiliar codebase. The agent is doing it with tools rather than a file browser, but the heuristics are the same.
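The "follow imports" step can be mechanized for a single language. The sketch below assumes Python source files and uses the standard `ast` module; `imported_modules` is a hypothetical helper, and a real agent would usually get the same information from grep or a language server.

```python
import ast


def imported_modules(source: str) -> list[str]:
    """List the modules a Python source file imports, sorted by name."""
    found: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # Prefix relative imports with their leading dots.
            found.add("." * node.level + node.module)
    return sorted(found)


example = "import os\nfrom auth.session import Session\n"
modules = imported_modules(example)
```

Each returned module name is a candidate for the next file to read.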
Some tools go further by integrating with Language Server Protocol, which gives the agent semantic navigation: go-to-definition, find-references, type information. Cursor uses this to great effect. LSP integration means the agent can ask “where is authenticate defined” and get a precise answer without grepping through hundreds of files hoping for a match. The tradeoff is setup complexity and the need for a running language server, which is not always feasible in headless environments.
Aider takes a different approach with its repository map: it uses tree-sitter to parse the entire codebase, extract function and class signatures, and build a condensed map of the repo that fits in the context window. The agent sees signatures and file locations without seeing full file contents, then reads specific files on demand. This is smart engineering. It trades the fidelity of full file contents for breadth of coverage.
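The idea behind a repository map can be illustrated without tree-sitter. The sketch below swaps in Python's built-in `ast` module, so it only handles Python files and is not how Aider implements its map; `outline` and its output format are assumptions made for this example.

```python
import ast


def _signature(node) -> str:
    # Positional argument names only; enough for a condensed map.
    args = ", ".join(a.arg for a in node.args.args)
    return f"def {node.name}({args})"


def outline(path: str, source: str) -> list[str]:
    """Summarize one file as its top-level classes, methods, and functions."""
    funcs = (ast.FunctionDef, ast.AsyncFunctionDef)
    lines = [f"{path}:"]
    for node in ast.parse(source).body:
        if isinstance(node, funcs):
            lines.append("  " + _signature(node))
        elif isinstance(node, ast.ClassDef):
            lines.append(f"  class {node.name}")
            for item in node.body:
                if isinstance(item, funcs):
                    lines.append("    " + _signature(item))
    return lines


source = '''
class Session:
    def refresh(self, token):
        return token

def authenticate(user, password):
    return True
'''
repo_map = "\n".join(outline("src/auth.py", source))
```

Function bodies never reach the map, which is where the breadth-for-fidelity tradeoff comes from.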
The Hard Part: Editing Files Without Breaking Them
Reading files is easy. Editing them correctly is where most of the interesting engineering lives.
The naive approach is to have the model output the entire new file contents, then write that. This works but is expensive in tokens, slow for large files, and fragile: if the model hallucinates a section it did not read, you lose code.
The smarter approach is a diff-based or search-and-replace edit format. Instead of outputting the whole file, the model outputs only what changed. Aider pioneered this with its SEARCH/REPLACE block format:
<<<<<<< SEARCH
def authenticate(user, password):
    return check_password(user, password)
=======
def authenticate(user, password, mfa_token=None):
    if not check_password(user, password):
        return False
    if mfa_token:
        return verify_mfa(user, mfa_token)
    return True
>>>>>>> REPLACE
The host application finds the SEARCH block in the file, verifies it matches, and replaces it with the REPLACE block. If the SEARCH block does not match, the edit fails and the model is told why. This is robust to large files because the model only needs to output the changed section.
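A minimal sketch of the host side of this format follows. It assumes exact string matching and one block per edit; Aider's real implementation is more forgiving, and `apply_search_replace` is a hypothetical name.

```python
import re

BLOCK_RE = re.compile(
    r"<<<<<<< SEARCH\n(.*?)\n=======\n(.*?)\n>>>>>>> REPLACE",
    re.DOTALL,
)


def apply_search_replace(original: str, block: str) -> str:
    """Apply one SEARCH/REPLACE block to file text, or raise with a reason."""
    match = BLOCK_RE.search(block)
    if match is None:
        raise ValueError("no SEARCH/REPLACE block found in model output")
    search, replace = match.group(1), match.group(2)
    if search not in original:
        raise ValueError("SEARCH text not found in file; re-read the file and retry")
    # Replace only the first occurrence so one block means one edit.
    return original.replace(search, replace, 1)


file_text = (
    "def authenticate(user, password):\n"
    "    return check_password(user, password)\n"
)
block = """<<<<<<< SEARCH
    return check_password(user, password)
=======
    return check_password(user, password) and user.active
>>>>>>> REPLACE"""
updated = apply_search_replace(file_text, block)
```

The error message is written for the model, since it is the model that has to act on it in the next turn.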
Claude Code’s Edit tool works on a similar principle, taking old_string and new_string parameters and requiring the old string to be unique in the file. If it is not unique, the tool errors and asks the model to provide more context. This strictness prevents silent wrong edits.
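The uniqueness rule is easy to sketch. The function below borrows the `old_string` and `new_string` parameter names from the description above, but the function itself and its error wording are assumptions, not Claude Code's source.

```python
def edit_file_text(text: str, old_string: str, new_string: str) -> str:
    """Replace old_string with new_string only if it occurs exactly once."""
    if not old_string:
        raise ValueError("old_string must not be empty")
    count = text.count(old_string)
    if count == 0:
        raise ValueError("old_string not found in file")
    if count > 1:
        raise ValueError(
            f"old_string matches {count} locations; include more surrounding context"
        )
    return text.replace(old_string, new_string)


text = "x = 1\ny = 1\n"
edited = edit_file_text(text, "x = 1", "x = 2")

# An ambiguous edit is rejected rather than applied to the wrong place.
try:
    edit_file_text(text, "= 1", "= 2")
except ValueError as exc:
    ambiguous_error = str(exc)
```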
GitHub Copilot’s workspace agent uses a unified diff format, closer to what git diff produces. This is familiar to models that have seen enormous amounts of diff output in training data, but it requires the model to produce correctly formatted patches with line number offsets, which is more error-prone than the search-replace approach.
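For comparison, this is what a unified diff looks like for a one-line change, generated here with Python's standard `difflib` module. The file names are made up for the example.

```python
import difflib

before = ["def greet(name):\n", "    print('hi', name)\n"]
after = ["def greet(name):\n", "    print('hello', name)\n"]

# unified_diff yields header lines, hunk headers, and +/- lines.
patch = "".join(
    difflib.unified_diff(before, after, fromfile="a/greet.py", tofile="b/greet.py")
)
print(patch)
```

The `@@ -1,2 +1,2 @@` hunk header carries the line offsets and counts. A model emitting this format must get those numbers right, whereas a search-and-replace edit has no numbers to get wrong.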
There is a newer pattern worth noting: the “apply model” architecture. The reasoning model produces a high-level description of the change, and a separate smaller, faster model actually generates the file edit. Cursor uses this with their Instant Apply feature. The primary model does not need to output the full diff; it just describes what should change, and the apply model handles the mechanics. This keeps the primary model’s output focused on reasoning rather than code generation.
Context Management Is the Real Engineering Challenge
The agent loop accumulates context. Every file read, every tool result, every model response gets appended. Eventually you hit the context limit, and something has to give.
The simplest strategy is to just fail and ask the user to start a new session. This is what most early implementations did. It is not great.
More sophisticated approaches include:
Compacting: summarize old conversation turns into a condensed representation, dropping the raw tool outputs but keeping the conclusions. Claude Code does this automatically when the context grows large, inserting a summary in place of the older messages. You lose fidelity but gain headroom.
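Compaction can be sketched as a pure function over the message list. This is an illustration under simplifying assumptions: the first message is the system prompt, `naive_summary` is a placeholder for a model-written summary, and the function names are hypothetical.

```python
def naive_summary(messages: list[dict]) -> str:
    """Placeholder summarizer; a real agent would ask the model to write this."""
    return f"Summary of {len(messages)} earlier messages."


def compact(messages: list[dict], keep_recent: int = 4,
            summarize=naive_summary) -> list[dict]:
    """Replace all but the most recent turns with a single summary message."""
    if len(messages) <= keep_recent + 1:
        return messages  # nothing worth compacting yet
    system, rest = messages[0], messages[1:]
    old, recent = rest[:-keep_recent], rest[-keep_recent:]
    # Note: a real implementation must not split a tool call from its result.
    summary = {
        "role": "user",
        "content": "[Earlier conversation, summarized] " + summarize(old),
    }
    return [system, summary, *recent]


messages = [{"role": "system", "content": "You are a coding agent."}]
messages += [{"role": "user", "content": f"turn {i}"} for i in range(10)]
compacted = compact(messages, keep_recent=4)
```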
Prompt caching: Anthropic’s API supports prompt caching, which lets you mark a prefix of the conversation as cacheable. The system prompt and any stable context get cached and reused across turns, significantly reducing both latency and cost. For long coding sessions where the system prompt is large and fixed, this is a meaningful optimization.
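As a rough illustration, a request body that marks the system prompt as cacheable might be built like this. The `cache_control` field reflects my reading of Anthropic's prompt caching documentation; the `build_request` helper, the `max_tokens` value, and the placeholder model name are assumptions, and the current documentation should be checked for exact fields and minimum cacheable lengths. Nothing is sent over the network here.

```python
def build_request(system_prompt: str, messages: list[dict], model: str) -> dict:
    """Build a Messages API style payload with a cacheable system prompt."""
    return {
        "model": model,
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # Marks the prefix up to and including this block as cacheable.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": messages,
    }


request = build_request(
    system_prompt="You are a coding agent. (long, stable instructions here)",
    messages=[{"role": "user", "content": "Fix the failing test."}],
    model="MODEL_NAME",  # placeholder, not a real model identifier
)
```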
Task decomposition: rather than running one giant agent session, break the task into subtasks and run separate sessions for each, passing summaries between them. This is how more ambitious multi-step workflows stay tractable. The context per session stays manageable; the coordination happens at a higher level.
Selective tool output truncation: read file tools can return just a line range instead of the full file. Search tools can return just filenames instead of matching lines. This reduces token consumption at the cost of sometimes needing a follow-up read. Claude Code’s Grep and Read tools both support this via parameters.
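A line-range read is simple to sketch. The `offset` and `limit` parameter names and the numbered output format below are assumptions for this example rather than a documented interface.

```python
def read_lines(text: str, offset: int = 1, limit: int = 200) -> str:
    """Return up to `limit` numbered lines starting at 1-based line `offset`."""
    lines = text.splitlines()
    selected = lines[offset - 1 : offset - 1 + limit]
    numbered = [f"{offset + i:>6}\t{line}" for i, line in enumerate(selected)]
    remaining = len(lines) - (offset - 1 + len(selected))
    if remaining > 0:
        # Tell the model that more exists so it can ask for a follow-up read.
        numbered.append(f"... {remaining} more lines not shown")
    return "\n".join(numbered)


text = "\n".join(f"line {n}" for n in range(1, 11))
window = read_lines(text, offset=3, limit=2)
```

The trailing note matters: silent truncation leaves the model believing it has seen the whole file.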
The System Prompt Does More Work Than You Think
The system prompt is where coding agents establish their identity, constraints, and capabilities. It typically includes the tool definitions (though these are often passed separately in the API), instructions on when to ask for confirmation versus proceeding autonomously, coding style preferences, and safety constraints.
Many tools support project-level configuration that gets injected into the system prompt. Claude Code reads CLAUDE.md files from the project root and parent directories. Aider reads .aider.conf.yml. This lets teams encode conventions, architecture decisions, and context that the model should know without the user having to repeat it in every session.
The quality of this project context file matters more than most people realize. A well-written CLAUDE.md that explains the project’s module structure, testing conventions, and common gotchas dramatically reduces the number of wrong tool calls the agent makes early in a session, because it does not need to explore to discover things that are already stated.
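Injecting project context can be sketched as a walk up the directory tree. This is an illustration only: the outermost-first ordering, the section header, and the `stop_at` parameter are assumptions of mine, not documented Claude Code behavior, and both function names are hypothetical.

```python
import tempfile
from pathlib import Path


def collect_context_files(start: Path, stop_at: Path,
                          name: str = "CLAUDE.md") -> list[Path]:
    """Find context files from `start` up to `stop_at`, outermost first."""
    found = []
    current, stop_at = start.resolve(), stop_at.resolve()
    while True:
        candidate = current / name
        if candidate.is_file():
            found.append(candidate)
        if current == stop_at or current.parent == current:
            break
        current = current.parent
    return list(reversed(found))


def build_system_prompt(base: str, files: list[Path]) -> str:
    """Append each context file to the base prompt under a short header."""
    sections = [base]
    for f in files:
        body = f.read_text(encoding="utf-8")
        sections.append(f"# Project context from {f.name}\n{body}")
    return "\n\n".join(sections)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    project = root / "service"
    project.mkdir()
    (root / "CLAUDE.md").write_text("Use pytest.", encoding="utf-8")
    (project / "CLAUDE.md").write_text("Auth lives in src/auth.", encoding="utf-8")
    files = collect_context_files(project, stop_at=root)
    prompt = build_system_prompt("You are a coding agent.", files)
```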
What This Means for Building With Agents
If you are integrating coding agent capabilities into your own tooling, the practical takeaways are:
Design your tools to fail informatively. When an edit cannot be applied because the search string does not match, say exactly why and include the first few characters that differed. The model uses that feedback to correct itself on the next attempt.
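One way to produce that kind of feedback is to find the closest line in the file and report where it diverges. The sketch below uses `difflib.get_close_matches` from the standard library; `explain_mismatch`, the 0.6 similarity cutoff, and the message wording are assumptions for illustration.

```python
import difflib


def explain_mismatch(file_text: str, search: str) -> str:
    """Describe why the first line of `search` does not match the file."""
    file_lines = file_text.splitlines()
    search_lines = search.splitlines()
    if not search_lines:
        return "SEARCH text is empty."
    first = search_lines[0]
    close = difflib.get_close_matches(first, file_lines, n=1, cutoff=0.6)
    if not close:
        return f"No line resembling {first!r} exists in the file."
    candidate = close[0]
    lineno = file_lines.index(candidate) + 1
    # Index of the first differing character, or the shorter length if one is a prefix.
    diff_at = next(
        (i for i, (a, b) in enumerate(zip(first, candidate)) if a != b),
        min(len(first), len(candidate)),
    )
    return (
        f"Closest match is line {lineno}: {candidate!r}. "
        f"It first differs from the SEARCH text at column {diff_at + 1}: "
        f"expected {first[diff_at:diff_at + 10]!r}, "
        f"found {candidate[diff_at:diff_at + 10]!r}."
    )


file_text = (
    "def authenticate(user, password):\n"
    "    return check_password(user, password)\n"
)
message = explain_mismatch(file_text, "def authenticate(user, passwd):")
```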
Keep tool outputs dense but structured. Raw terminal output with ANSI escape codes, progress bars, and timing lines is noise. Strip it before returning it to the model. Return the information the model needs to decide what to do next, not everything the command produced.
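A small cleaning pass along these lines might look like the following. The regular expression covers common ANSI CSI sequences such as colors and cursor movement, not every terminal escape, and `clean_output` with its `max_chars` limit is a hypothetical helper.

```python
import re

# Matches CSI escape sequences such as "\x1b[32m" (color) or "\x1b[2K" (erase line).
ANSI_RE = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]")


def clean_output(raw: str, max_chars: int = 4000) -> str:
    """Strip escape codes and progress-bar redraws, then cap the length."""
    text = ANSI_RE.sub("", raw).replace("\r\n", "\n")
    # A carriage return redraws the line; keep only its final state.
    lines = [line.split("\r")[-1].rstrip() for line in text.split("\n")]
    text = "\n".join(lines)
    if len(text) > max_chars:
        omitted = len(text) - max_chars
        text = text[:max_chars] + f"\n... truncated {omitted} characters"
    return text


raw = "\x1b[32mPASS\x1b[0m tests/test_auth.py\n10%\r50%\r100%\n"
cleaned = clean_output(raw)
```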
Be explicit about what requires confirmation. The model will make mistakes. The question is whether those mistakes are reversible. File edits can be rolled back with git. Dropped database tables cannot. Structure your tool permissions around reversibility, not just scope.
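A permission policy built around reversibility can start as a lookup table. The tool names and their classifications below are hypothetical, and treating file edits as reversible assumes the working tree is under version control.

```python
from enum import Enum


class Risk(Enum):
    REVERSIBLE = "reversible"
    IRREVERSIBLE = "irreversible"


# Hypothetical classification of a hypothetical toolset.
TOOL_RISK = {
    "read_file": Risk.REVERSIBLE,
    "edit_file": Risk.REVERSIBLE,    # assumes git can roll the edit back
    "run_shell": Risk.IRREVERSIBLE,  # arbitrary commands can do anything
    "run_sql": Risk.IRREVERSIBLE,    # a dropped table stays dropped
}


def needs_confirmation(tool_name: str) -> bool:
    """Require confirmation for irreversible tools and for unknown ones."""
    return TOOL_RISK.get(tool_name, Risk.IRREVERSIBLE) is Risk.IRREVERSIBLE


decisions = {name: needs_confirmation(name) for name in ("edit_file", "run_sql", "deploy")}
```

Defaulting unknown tools to the irreversible bucket means a newly added tool is gated until someone classifies it.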
The agent loop is simple. The engineering is in the details around it.