· 7 min read ·

Inside a Coding Agent: The Loop, the Edit Format, and the Verification Step

Source: simonwillison

The core of every coding agent is a loop. The LLM receives a task description and a set of tool definitions, produces a response containing a tool call, the runtime executes that tool, the result gets appended to the conversation, and the loop continues until the model emits a response with no tool call. Simon Willison’s guide on how coding agents work covers this architecture well, but the interesting engineering lives in the specific choices made about what tools exist, how they behave, and how the loop recovers when something goes wrong.

The ReAct Skeleton

The agent loop has a name: ReAct, from the 2022 paper by Yao et al. at Google Brain and Princeton. The key insight of that paper was that interleaving reasoning traces with concrete actions outperforms either approach alone. A model that only reasons hallucinates. A model that only acts cannot plan. Combining Thought/Action/Observation turns in the same generation produces agents that are both grounded and interpretable.

In practice, modern coding agents implement this through the tool-use APIs that every major LLM provider now offers. The model reasons in its response text, emits a tool call, the scaffold executes it, and the result comes back as the next user turn. The Thought/Action/Observation structure from the paper maps directly to response text, function call, and tool result in the API protocol.

What varies across agents is not this skeleton but the tool inventory and what each tool actually does.

The File Editing Problem

Of all the tool design decisions in a coding agent, the file editing format has the most impact on task success rates.

Three approaches are in common use, each with a different tradeoff between token cost, reliability, and failure behavior.

Full file replacement is the simplest: read the file, modify it in the model’s context, write the entire contents back. The model is working from the complete file, so context integrity is guaranteed. The cost is in tokens. A 500-line file requires 500 lines of output on every edit, even for a one-character change. At scale, this burns through budget quickly and puts pressure on the context window.

Unified diff is more compact. The model emits standard --- / +++ diff hunks, which a patch command applies. The problem is that LLMs are unreliable at producing correct line numbers. A model under context pressure will hallucinate that a function is on line 47 when it is on line 52, producing a diff that fails to apply. Aider’s benchmarking of edit formats found that search/replace blocks consistently outperform unified diffs by several percentage points on real coding tasks across both GPT-4 and Claude models.

Search-and-replace blocks eliminate line numbers entirely. Aider pioneered this format:

<<<<<<< SEARCH
def calculate_total(items):
    return sum(items)
=======
def calculate_total(items, tax_rate=0.0):
    subtotal = sum(items)
    return subtotal * (1 + tax_rate)
>>>>>>> REPLACE

Claude Code uses a str_replace_editor tool that takes old_string and new_string parameters. The contract requires that old_string appear exactly once in the file, which forces the model to include enough surrounding context to be unambiguous. Failure is loud: if the search string is not found, the tool returns an error rather than corrupting the file.

The tradeoff is that the model must reproduce existing code exactly. Trailing whitespace invisible in rendered context, inconsistent indentation, or a single character difference will cause a match failure. This explains why agents re-read a file immediately before editing it, even if they read it several turns earlier. Stale context risks a mismatch.

Codebase Navigation

Before anything can be edited, the agent needs to find the right file. On large repositories, this is where most of the context window budget goes.

The naive approach is to list all files and let the model pick. This fails at any meaningful scale. A repository with 10,000 files produces a listing that consumes significant tokens and still does not tell the model where the relevant logic lives.

Production agents layer multiple strategies. Text search with grep or ripgrep handles the majority of navigation tasks. Finding all call sites of a function, locating where a class is defined, or identifying which files import a particular module are all answerable with a pattern match against raw text. Most agents expose this as a dedicated search tool rather than raw shell access, so they can cap output length and structure the results before appending them to the conversation.

For more precise symbol resolution, some agents integrate with Tree-sitter, which parses source files into syntax trees using grammars for over 100 languages. The difference from grep matters when you need all definitions of a function rather than every string occurrence of its name, or when you want to enumerate class methods without parsing indentation manually.

The Language Server Protocol takes this further. LSP was designed for editor tooling and provides go-to-definition, find-all-references, type inference, and live diagnostics over a standardized interface. An agent with LSP access can navigate a codebase the way an IDE does, rather than treating it as a flat collection of text files. SWE-agent from Princeton explored LSP-based navigation and found it offers meaningfully better precision than text search alone, particularly for dynamically-typed languages where function names are not globally unique.

Semantic search over embeddings is available in some systems for initial orientation. “Find where authentication is handled” is easier to answer with embeddings than with grep. The practical limitation is that embedding search requires a prebuilt index and degrades on specific technical queries where exact symbol names matter more than semantic similarity. Most agents use grep and glob for the majority of navigation work and fall back to semantic search only for coarse-grained orientation tasks.

The Verification Loop

The capability that distinguishes a useful coding agent from an elaborate autocomplete is the ability to run the code and observe what happens. The shell execution tool closes this loop. After making an edit, the agent runs the test suite, reads the failure output, determines what went wrong, and makes another edit.

SWE-bench is the standard benchmark for this end-to-end capability. It presents real GitHub issues from open-source Python repositories along with the codebase, and measures whether the agent’s patch causes the associated tests to pass. Early baselines with GPT-4 and simple scaffolding scored around 2-4%. Current state-of-the-art systems reach 50-70% on the verified subset, a number that has risen dramatically as both models and scaffolding have improved. Many of the largest score jumps came from improved tool design and more robust edit formats rather than from new model releases alone.

Several things break the verification loop in practice. Flaky tests produce nondeterministic output that confuses the model’s reasoning about what changed. Long test suites limit how many verify-and-fix iterations the agent can run before context pressure becomes a constraint. Build systems requiring undocumented environment setup are a reliable failure point for agents running in clean environments.

The loop quality also depends on how informative the failing test output is. A test that asserts assertEqual(result, expected) and prints nothing on failure gives the agent less to work with than one that shows a diff between actual and expected values. This is a useful property to design for in any codebase that expects to be navigated by agents.

Context Window Management

Every tool result appended to the conversation shortens the remaining context available for reasoning. Long-running tasks will hit this limit, and the recovery strategy determines whether the agent degrades gracefully or loses track of what it was doing.

Truncation drops old messages from the beginning. This is cheap but loses the record of decisions made early in the task. An agent that forgets it already tried a particular fix will try it again.

Summarization calls the model to compress earlier turns into a shorter form. This preserves intent but loses specifics. A summary that records “the agent modified the authentication module” does not capture the exact lines changed, which matters if a subsequent edit needs to be consistent with the first.

Selective inclusion avoids the problem by being careful about what enters the conversation in the first place. Rather than appending the full contents of every file read, some systems track file state separately and include only the most recent version of files that are still actively relevant. Claude Code uses a periodic compaction step that summarizes conversation history while preserving the current task context and recent tool results.

For anyone building on agent frameworks, long-running tasks will hit context limits regardless of the strategy. The question is what happens at that boundary: whether the agent loses coherence entirely, repeats itself, or produces a graceful continuation.

Where the Engineering Lives

The loop structure is not the differentiator. Every production coding agent runs some version of ReAct with a tool inventory. The questions that separate capable agents from less capable ones are: which edit format fails gracefully rather than silently, how does the agent find relevant code without reading the entire codebase, what does the agent do when tests produce unexpected output, and how does the system behave when context fills. Those are tool design questions, not model questions, and the benchmark trajectory on SWE-bench over the past two years reflects it.

Was this interesting?