· 9 min read ·

Under the Hood: How Coding Agents Actually Navigate and Execute

Source: simonwillison

Simon Willison’s guide on agentic engineering patterns is a good map of the territory. It covers the loop, the tools, the rough shape of how these systems operate. What it does not dwell on is the mechanical layer underneath: why the loop works the way it does at the API level, how agents solve the navigation problem without blowing their context budgets, and where the failure modes actually live. That is what this post is about.

The Loop at the Wire Level

Every coding agent runs on some variation of the same cycle: send a prompt, get a response, execute something, feed the result back, repeat. The interesting part is how the API enforces the structure of that cycle.

With the Anthropic Messages API, tools are declared upfront as JSON Schema objects:

{
  "name": "read_file",
  "description": "Read the contents of a file at an absolute path.",
  "input_schema": {
    "type": "object",
    "properties": {
      "file_path": { "type": "string" },
      "offset": { "type": "number" },
      "limit": { "type": "number" }
    },
    "required": ["file_path"]
  }
}

When the model wants to call a tool, it returns a response with stop_reason: "tool_use" and a content block of type tool_use. The scaffolding parses this, dispatches to the actual implementation, then appends a tool_result block back into the conversation with the output. The model never sees raw stdout directly; it sees a structured message in the conversation history that it can reason about.

This matters because the conversation history is the agent’s entire working memory. Everything the agent has read, every command it has run, every error it has encountered, all of it lives as messages in that history. The model does not have a separate scratchpad or persistent state. Its “mental model” of the codebase is whatever it can reconstruct from the sequence of tool results sitting in the context window.

The Navigation Problem

A typical production codebase has tens of thousands of lines across hundreds of files. A 200,000-token context window sounds generous until you do the arithmetic: a single large source file with comments and imports can run 2,000 to 5,000 tokens. Read ten of them and you have consumed a quarter of your budget before writing a single line. Agents cannot read their way through a codebase; they have to navigate it.

There are four strategies in use across production coding agents, and they make different tradeoffs.

Glob and Grep: Cheap and Reliable

Glob finds files by path pattern. Grep searches content. Together they are the cheapest possible navigation primitive: one or two tool calls, minimal tokens, no infrastructure required.

Claude Code’s default strategy is iterative grep-and-glob. The agent starts from whatever anchor the task provides, a function name, a file path, an error message, and traces outward. grep -r "authenticate_user" . produces a list of file paths and matching lines. The agent picks the most relevant hit, reads that file, follows imports, and repeats. Each cycle adds more signal.

The cost is latency in tool calls, not tokens. Finding the relevant code in a moderately complex codebase might take 15 to 20 grep and glob calls before the agent has enough context to act with confidence. For tasks that touch a narrow surface area, this is fine. For tasks requiring broad changes across many modules, it compounds.

Grep also has a precision problem. Search for a common identifier like config and you get hundreds of hits. The agent has to apply judgment about which results are actually relevant, and that judgment can be wrong, especially when identifiers are reused across unrelated subsystems.

LSP: Precise but Fragile

A language server gives exact answers. Go-to-definition returns a single location. Find-all-references returns every call site. No false positives from common strings, no ambiguity about which config is the right one.

Claude Code exposes an LSP tool for typed languages. For TypeScript and Go codebases, it is substantially more accurate than grep on shared identifiers. The agent calls the LSP tool with a file path and position, and gets back exact navigation results.

The fragility is that LSP requires a working language server. In practice, that means the project has to be configured correctly, dependencies installed, and the relevant language server available in the environment. On well-maintained TypeScript monorepos, this works reliably. On mixed-language repos or projects with non-standard setups, it often does not, and the agent falls back to grep anyway.

Aider’s repository mapping takes a different approach. At session start, it parses the entire codebase using tree-sitter to extract function signatures, class names, method names, and import relationships. No code execution, just syntax tree analysis. The result is a compact structural map of the entire codebase, typically 1,000 to 8,000 tokens, that gets included in every prompt.

The model sees codebase structure before any search. When asked to fix a bug in the authentication flow, it already knows that authenticate_user lives in auth/session.py and is called from api/endpoints.py. Navigation becomes targeted rather than exploratory.

Aider trims this map dynamically as conversation history fills the context. Files touched recently get higher priority; rarely accessed files get dropped. The assumption is that as a task progresses, the relevant surface area narrows. This assumption holds for linear tasks; it breaks for tasks that unexpectedly require breadth late in the session.

Embedding Search: Semantic but Noisy

Cursor and GitHub Copilot maintain vector indexes of chunked source files. At query time, the agent retrieves the top-K chunks by embedding similarity to the current task description. This finds conceptually related code even when the naming convention differs from the query.

The infrastructure requirement is the obvious barrier: a synchronized vector index and an embedding model. Beyond that, embedding retrieval introduces false positives. Code that shares domain vocabulary with the query appears in results regardless of whether it is actually relevant. A search for “user session handling” might surface code from a logging module that mentions user sessions in a comment. The agent still has to evaluate relevance, it just gets more candidates to sort through.

For very large codebases where no single prompt-based strategy can build comprehensive coverage, embedding search is often the only option. The noise is a cost of scale.

Read Before Edit, Always

One of the most important behavioral properties of a coding agent is whether it reads a file before attempting to edit it. This is not a stylistic preference; it is the difference between accurate edits and hallucinated ones.

String replacement, which Claude Code uses as its primary edit mechanism, requires the old_string to match the file content exactly. A model that reconstructs the file from memory rather than re-reading it before editing will produce slightly wrong old_string values: close enough to look plausible, wrong enough to fail the match. The scaffolding returns an explicit error, the model tries again with a different guess, and the retry loop burns tokens without making progress.

Agents designed to skip re-reads for efficiency pay for it in edit failure rates. The token cost of re-reading a file before editing it is almost always worth it. The alternative is a retry cycle that costs more and is less reliable.

The same discipline applies to large files. Claude Code’s read tool accepts offset and limit parameters for a reason. Reading a 3,000-line file to edit line 2,847 is wasteful; reading lines 2,830 to 2,870 with surrounding context is precise. Agents that read entire large files by default fill their context with irrelevant content.

Context Window Arithmetic

The constraint that shapes every other decision is the context window. 200,000 tokens is the budget for the system prompt, the full conversation history including all tool calls and results, and everything the model needs to produce its next response.

System prompts for production coding agents are not small. Tool definitions, behavioral instructions, and scaffolding metadata can consume 5,000 to 10,000 tokens before the task starts. Conversation history accumulates with every tool call. A verbose bash command that dumps a dependency tree might produce 20,000 tokens of output in a single tool result.

This is why bash output truncation is a real engineering concern, not a cosmetic one. If a test suite run produces 50,000 tokens of output, the scaffolding cannot include all of it in the tool result. It has to truncate, and where it truncates determines whether the model sees the actual failure. Most agents truncate from the end, which is usually wrong for test runners: the summary and failure details typically appear at the end of the output, not the beginning. Agents that truncate from the beginning, or that extract failure lines specifically, get better signal from test runs.

As context fills, older content gets pushed toward the beginning of the window. Transformer attention is not uniformly distributed across position. Content from many turns ago receives less attention than recent content, which means files the agent read early in the session may effectively become inaccessible even though they are technically still in the context. Agents that need to reference early observations often re-read files to get clean signal, which is the right behavior even if it looks redundant.

Where Agents Actually Fail

The failure modes in practice are consistent across different agent implementations.

Context fills before the task is complete. The agent has spent its budget navigating and reading, and now has no room left to make the actual edits. This happens most often on tasks that require broad changes across many files, where navigation is expensive and the relevant surface area is large.

Grep returns too many results. The agent searches for a common identifier, gets 200 matches, and has to guess which ones are relevant. It guesses wrong, reads the wrong files, and builds its understanding on the wrong foundation. Later edits look plausible but touch the wrong code path.

Over-eager file reads. The agent reads large files speculatively, hoping they contain relevant context. Many do not. The tokens spent on irrelevant content are gone; if the context fills, the agent cannot recover them.

Edit failure loops. The agent attempts a string replacement, fails because its old_string does not match exactly, adjusts slightly, fails again, and cycles. This usually resolves within two or three retries if the underlying issue is a minor misquotation. If the model has genuinely hallucinated a code block that does not exist in the file, the loop continues until the agent gives up or the context budget is exhausted.

Late-session navigation failure. The repository map or recently-read file list has been trimmed from context to make room for new content. The agent can no longer recall the structural overview it had at the start of the session and has to re-navigate ground it already covered.

What the Tool-Calling Format Actually Enforces

The JSON Schema tool declaration is not just documentation. It enforces structure at the API level: the model cannot call a tool with arguments that do not match the declared schema. This eliminates a whole category of parsing errors that plagued earlier agent architectures that extracted tool calls from free-form text.

The discipline that the format does not enforce is semantic correctness. The model can call read_file with a syntactically valid path that does not exist, or call bash with a syntactically valid command that does the wrong thing. The scaffolding has to handle these cases gracefully, return useful error messages, and give the model enough information to self-correct.

Tool description quality matters more than most people expect. A tool described as “read a file” produces different behavior than one described as “read a file at an absolute path; use offset and limit to read large files in sections; always re-read a file before editing it.” The description is part of the system prompt the model reasons from. Good descriptions encode best practices directly into the tool’s affordance.

That is the real design space in coding agents. The model is a fixed component. The tools around it, how they are defined, what they return, how they handle errors, what they truncate, are where most of the interesting engineering happens. Agents that feel different to use are usually different at the tool layer, not the model layer.

Was this interesting?