The standard explanation of how coding agents work goes like this: the model sees your request, thinks about what to do, calls some tools, and eventually your code is changed. That description is accurate, but it glosses over the part that matters most to reliability: how the model puts text into files without corrupting them, duplicating content, or missing the intended location entirely.
Simon Willison’s guide to agentic engineering patterns covers the agent loop clearly. The observe-orient-decide-act cycle underlying every coding agent from Claude Code to Aider to Cursor is not complicated: the LLM produces a response, the scaffolding parses any tool calls, executes them, appends the results to the conversation, and sends everything back to the model. The loop repeats until the task is complete or the context window fills up.
What the loop description leaves out is the editing step itself. Getting the model to identify which file needs to change, find the right location, and produce a correct modification without the scaffolding breaking; that is where most of the engineering lives.
The Three Approaches to File Editing
Coding agents have tried roughly three strategies for modifying files.
Full-file rewrite is the simplest: read the entire file, have the model produce the new version in full, then write it back. This works well for small files and avoids ambiguity about where edits land. The problem is cost and context pressure. A 500-line TypeScript file runs to roughly 4,000 to 6,000 tokens. If the model has to read and rewrite the whole thing to change three lines, a significant portion of context budget goes to content that isn’t changing. A task touching five or six files amplifies this further, and the context window becomes the binding constraint before the work is done.
Unified diffs were an early attempt at efficiency. The model produces a diff in standard unified format, and the scaffolding applies it with something like patch. This maps naturally to how developers think about code changes and is far more token-efficient than full rewrites. Aider’s early architecture used this approach and documented the failure modes in detail: models hallucinate line numbers, get context lines slightly wrong, and produce diffs that don’t apply cleanly. The patch utility has no tolerance for these errors. One wrong character in a hunk header and the whole operation fails. Getting LLMs to produce valid unified diffs reliably at scale turned out to be harder than expected.
String replacement is where most serious coding agents have landed. The model specifies an exact string to find and the string to replace it with. The scaffolding does a literal string search in the file and swaps it. Claude Code uses this pattern. Cursor uses a variant of it. The tool call structure looks roughly like this:
{
"tool": "str_replace_editor",
"old_string": "function calculateTotal(items) {\n return items.reduce((sum, item) => sum + item.price, 0);\n}",
"new_string": "function calculateTotal(items, taxRate = 0) {\n const subtotal = items.reduce((sum, item) => sum + item.price, 0);\n return subtotal * (1 + taxRate);\n}"
}
The model emits only the changed section, not the full file. The scaffolding handles the mechanical work of finding and replacing. Errors are easy to diagnose: either the old string is found and replaced, or it isn’t and the tool call fails with a clear message. The model can retry with more surrounding context if the initial match fails.
Why String Replacement Works Better Than It Should
String replacement seems fragile on the surface. Duplicate string matches and slight whitespace differences are real problems that come up regularly. But the failure modes are simple and recoverable, which makes the approach more robust in practice than unified diffs.
When the old string is not unique, a well-designed scaffolding returns an error asking the model to provide more surrounding context. The model adds a few more lines to make the match unique and retries. This loop is straightforward to implement and models handle it reliably. With unified diffs, the equivalent failure, a hunk that doesn’t apply, is much harder to diagnose because line numbers may have shifted since the file was read.
String replacement externalizes the targeting problem. The model doesn’t need to count lines or track offsets. It reads a section of the file, decides what needs to change, and quotes back the exact text it wants to replace. The scaffolding does the matching. This division of labor maps to what LLMs are good at (understanding and transforming text) versus what they’re unreliable at (maintaining precise positional state across a long context).
Aider’s edit format benchmarks show measurable differences in task completion rates across formats, and the winning format varies by model. What works best for Claude does not necessarily work best for GPT-4o or Gemini 2.5. The variance is large enough that the edit format is a first-class engineering decision, not an implementation detail.
The Read-Then-Edit Pattern
The editing mechanism depends on a specific sequence: search to locate relevant files, read to load their content into context, then edit. Skipping the read step is a common failure mode in naive agent implementations. The model tries to edit a file it hasn’t seen and either fabricates the old string or produces a generic replacement that doesn’t match the actual code.
This is why coding agents have dedicated search tools that return file paths and line numbers without pulling full file content into context. A grep over a large repository might identify twenty candidate files in under a second while consuming relatively few tokens. The agent then reads just the two or three files that are actually relevant, loading them into the conversation window before editing.
The context window is the central constraint the entire tool design is optimizing around. Every tool in a coding agent’s toolkit can be read as an answer to one question: how do we give the model the information it needs without filling the context with things it doesn’t need? Glob for broad file discovery, grep for content search, read for specific file loading, and string-replace for targeted editing each occupy a different position on the cost-versus-specificity tradeoff.
LSP and Semantic Navigation
File-level grep is fast but semantically blind. Searching for calculateTotal finds every occurrence of that string, including comments, test fixtures, and variable names that happen to share the prefix. A Language Server Protocol integration goes further: find all references to this specific symbol, jump to its definition, list all callers with their types.
Some coding environments are starting to expose LSP data directly as agent tool calls. Instead of grep "calculateTotal" --include="*.ts", the agent can call find_references("calculateTotal", file="src/billing.ts", line=42) and get back only the places where that specific definition is used. This reduces noise in search results and often means the agent reads fewer files to complete a task, saving both context and latency.
The tradeoff is infrastructure complexity. LSP servers need to be running and indexed before the agent session starts, which adds startup time and memory overhead. For large codebases with many cross-file dependencies, this is worth it; for quick single-file edits, it is unnecessary machinery. Most terminal-based agents skip it, while IDE-integrated tools like Cursor can lean on the LSP that’s already running for the editor.
What the Scaffolding Actually Does
The word “scaffolding” gets used loosely in discussions of coding agents. To be precise: the scaffolding is everything that is not the model. It is the code that formats tool calls into the conversation, executes them, handles errors, manages the context window as it fills up, and decides when to stop.
The model itself has no memory between turns beyond what is in the current context window. The scaffolding maintains state. When Claude Code tracks which files it has already read, or when Aider maintains a list of files added to the session, that is scaffolding state, not model state. The model knows only what was placed in its context for the current request.
This matters because the quality of a coding agent is substantially a function of its scaffolding decisions: which tools are exposed, how errors are presented to the model, when earlier conversation turns get truncated or summarized, how the system prompt frames the task. Two agents running the same underlying model can behave very differently depending on how the scaffolding is designed. The model is a commodity in a way that the scaffolding is not.
Simon Willison’s framing of this work as “agentic engineering patterns” is apt. There is a layer of software engineering sitting between raw model capability and a working tool, and that layer is where most of the practical differences between coding agents show up.
The Remaining Open Problems
The edit mechanism is reliable for single-file changes but becomes harder to manage across coordinated multi-file edits. When a refactor touches an interface definition, its implementations, the tests, and the documentation in the same pass, the agent needs to track dependencies across edits it hasn’t made yet. Getting this sequencing right without reading and re-reading files to confirm prior edits took effect is an active area of improvement.
The deeper open question is whether specialized fine-tuning for code editing will matter more than format choice and prompt engineering. Models trained specifically on read-then-edit sequences, learning from thousands of examples of correct string replacements in real codebases, might handle edge cases better than a general model following a format description in its system prompt. Some evidence points in that direction, but the picture is not yet clear.
For now, the string replacement loop, read-before-edit discipline, and search-first navigation remain the reliable foundation that most serious coding agents are built on. The loop itself is simple; the engineering is in the details of how the model is given exactly the information it needs and no more.