· 7 min read ·

How Coding Agents Find the Code They Need to Change

Source: simonwillison

Every coding agent is built on the same primitive: a while-loop that calls a language model until it signals it is done. The loop accumulates a growing message history, dispatches tool calls, appends results, and calls the model again. Simon Willison’s agentic engineering patterns guide explains this core mechanic clearly. But the loop itself is not where the interesting architectural decisions live. The decisions that actually separate one tool from another are in how they solve three hard problems: finding the relevant code, applying edits reliably, and fitting enough context into a fixed token budget.

The Loop in Concrete Terms

To ground the discussion: when Claude Code reads a file, it is not “reading” in any special sense. The model emits a JSON object like this:

{
  "type": "tool_use",
  "id": "toolu_01XFDUDYJgAACTJDEkjTWPjq",
  "name": "read_file",
  "input": { "path": "src/auth/service.py" }
}

The host process reads the file and appends its contents as a tool_result block. The model is called again with the extended history. This repeats until the model emits a response with no tool calls, at which point the loop exits. There is no persistent state, no background process, no special memory. Just a growing list of messages and a model that generates the next step from whatever fits in context.

The sophistication is entirely in what tools are available and what goes into the context window.

Navigating a codebase is where the most architectural diversity exists, because there is no obviously correct solution. A large repo might have millions of tokens of code. No current context window can hold all of it. The question is what to load and when.

Grep and bash is Claude Code’s primary approach. The model generates shell commands:

rg -n "class AuthService" --type ts
find . -name "*.config.js" -not -path "*/node_modules/*"

The output lands in the context as a tool result. This is fast, requires no indexing, and works on any language. It fails for semantic questions: if you want to find every call site of a function that has been aliased or dynamically dispatched, grep will not help. The model must reason about what to look for before it looks.

The repository map is Aider’s most distinctive contribution. Rather than letting the model grep blindly, Aider uses tree-sitter to parse every file in the repo into an AST, extracts all function and class definitions along with all cross-file references, builds a graph, and runs a PageRank-style algorithm to identify the symbols most relevant to whatever files the user has added to the chat. The result is a compact text summary injected at the start of every prompt:

src/auth/service.py:
  class AuthService:
    def login(self, username: str, password: str) -> Token
    def logout(self, token: Token) -> None

src/auth/models.py:
  class Token:
    access: str
    refresh: str
    expires_at: datetime

This gives the model a table of contents for the codebase without requiring it to read individual files first. The map size is dynamically budget-constrained via the --map-tokens flag (default around 1024 tokens), and it is regenerated each turn as the conversation evolves. The key insight behind the repomap is that structural navigation, symbol definitions, and cross-file references can be compressed into a small token representation without embeddings or an index server.

LSP integration is Cursor’s primary advantage for precision. Cursor runs language servers in the background (tsserver for TypeScript, rust-analyzer for Rust, pylsp for Python) and uses them to answer questions like “where is this function defined” or “what are all the call sites of this method.” LSP responses are exact: they give file paths and line numbers, not probable matches. This makes Cursor significantly more reliable for large-scale refactors where a grep might miss dynamic references or where the same symbol name appears in multiple scopes.

Embedding-based semantic search is used by both Cursor and GitHub Copilot. Files are chunked, embedded, and stored in a local vector index. The user’s current query or the model’s current focus is embedded and matched against the index, returning semantically related chunks. Cursor’s @codebase command triggers this explicitly. The tradeoff is index freshness and the cost of maintaining the index as the codebase changes.

None of these approaches is strictly superior. Grep is always accurate for exact strings. The repomap handles cross-file structure without a running server. LSP gives precise semantic answers but requires the language server to be functional and initialized. Embeddings handle semantic drift but can return irrelevant results when code and query vocabulary diverge. Most production agents are moving toward combining multiple strategies.

Editing: Why Format Matters More Than It Should

Once an agent knows what to change, it has to apply the change without corrupting the file. This is less trivial than it sounds.

Aider uses SEARCH/REPLACE fenced blocks:

path/to/file.py
<<<<<<< SEARCH
def old_function(x):
    return x * 2
=======
def old_function(x, multiplier=2):
    return x * multiplier
>>>>>>> REPLACE

The host process finds the exact string between the SEARCH markers and replaces it. If the string is not found verbatim, it falls back to fuzzy matching via Python’s difflib.SequenceMatcher. Aider also supports unified diff format and whole-file rewrite, and it tracks which formats are succeeding; if one format is producing invalid edits, it switches automatically.

Claude Code uses a structured JSON tool call:

{
  "name": "edit_file",
  "input": {
    "path": "src/utils.py",
    "old_string": "def parse_date(s):\n    return datetime.strptime(s, '%Y-%m-%d')",
    "new_string": "def parse_date(s: str) -> datetime:\n    return datetime.fromisoformat(s)"
  }
}

The semantics are identical to Aider’s SEARCH/REPLACE, but the format is JSON rather than markdown convention. When the search string is not found, the tool returns an error to the model, which then reads the current file state and retries. This fail-observe-retry loop is essential: files change during a session as earlier edits land, and the context the model used to generate an edit may no longer match the current file.

Whole-file rewrite is the most robust format. There are no search strings that can fail to match, and no patch-application edge cases. The cost is proportional to the file size in both input and output tokens. For files under a few hundred lines, this is often the right default.

The SWE-agent paper from Princeton (2024) demonstrated that the design of the editing interface matters enormously, arguably more than which model you use. They built a custom file viewer with line numbers and a specialized edit command, and found that this “agent-computer interface” dramatically outperformed using the same model with naive file manipulation. The interface shapes how reliably the model can describe its intended change, and reliability compounds across a sequence of edits.

Context: The Primary Constraint

Context window limits are the resource constraint that drives most other architectural decisions. Claude 3.7 Sonnet has a 200K token context. A medium-sized Python project might have 500K tokens of source code. A large one might have 5 million.

Claude Code is reactive about context: the model reads what it decides to read, and the accumulated history of reads and edits fills the window over the course of a session. When the window fills, Claude Code runs a compaction step where the model summarizes the conversation history into a compressed form, drops the original messages, and continues with the summary.

Aider is more explicit. Files must be manually /added to the chat before the model can edit them. This keeps the context small and predictable. The repomap provides structural awareness without loading file contents. The tradeoff is that the user must anticipate which files are relevant, though Aider will suggest files when it detects they are likely needed.

Copilot Workspace takes the most constrained approach: before writing any code, it generates a structured plan of which files will be changed and how. Only the plan-relevant files are loaded. The user can review and edit the plan before execution begins. This plan-then-execute architecture uses context budget very efficiently but loses the flexibility of an observe-act loop that can discover unexpected file dependencies mid-task.

Autonomy as a Spectrum

The dimension that matters most to users is how much the agent does without asking. Claude Code’s bash tool is unrestricted: the model can run arbitrary shell commands, delete files, make network requests, and push to git. It will do so autonomously until the task is complete or it gets stuck, pausing only for operations it has been configured to flag (like git push or file deletion). Aider sits nearby on this spectrum, with autonomous test running and auto-commit after each accepted edit.

Cursor’s Composer mode and Copilot Workspace are more conservative: Cursor requires the user to run tests and paste results back, while Copilot Workspace gates execution behind plan review. The fully autonomous end is more capable on benchmark tasks like SWE-bench, where Claude 3.7 Sonnet resolves around 49% of real GitHub issues end-to-end. The less autonomous tools trade raw capability for predictability and user control.

What the SWE-agent research made clear is that the loop, the tools, and the interface design collectively determine capability more than raw model intelligence does. The model is the same; the scaffolding is what differs. Building a coding agent is primarily a software engineering problem, not a prompting problem.

Was this interesting?