The core mechanic of every coding agent is simpler than the marketing suggests, and the edge cases are where all the interesting engineering lives. Simon Willison’s guide on agentic engineering patterns lays out the foundations clearly. What I want to do here is go a level deeper on the parts that trip people up when they first try to build or understand these systems.
The Loop
A coding agent is, at its core, a while loop. You send a message to an LLM. The LLM either responds with text, meaning it is done, or with a tool call, meaning it is not. If it returns a tool call, you run the tool, append the result to the conversation, and call the LLM again. You keep doing this until the model stops requesting tools.
In Claude’s API, tool calls come back as tool_use content blocks in the response:
{
"type": "tool_use",
"id": "toolu_abc123",
"name": "read_file",
"input": { "path": "src/auth.py" }
}
You execute read_file("src/auth.py"), then append a tool_result message with the file contents back into the conversation, and send the whole thing to the model again. OpenAI uses function_call in the assistant message and a tool role message for the result. The structure differs slightly, but the logic is identical.
This pattern has a name: ReAct, from a 2022 paper by Yao et al. that described interleaving reasoning traces with actions. The coding agent version is a direct application of that idea, except the reasoning is the model’s chain-of-thought and the actions are file operations and shell commands.
What Tools Actually Exist
The tool set matters a lot. A minimal coding agent needs at least:
read_file(path): returns file contentswrite_file(path, content): creates or overwrites a filebash(command): runs a shell command and returns stdout and stderrlist_dir(path)andglob(pattern): navigation
The more interesting design question is whether to include targeted edit tools like str_replace(path, old_string, new_string). Claude Code uses this. Aider has a similar mechanism via structured diff formats. The reason matters: if you are modifying a 500-line file but only changing three lines, you do not want the model to output the entire file again. That doubles your token usage and introduces risk, since the model might silently alter something else while rewriting.
Targeted edits also force precision. If old_string does not exist in the file, the edit fails, and the model has to reconsider. This catches a class of bugs where the model’s mental model of the file has drifted from reality, which happens more often than you would expect in long sessions.
The Context Window Problem
The context window is the total amount of text the model can hold at once: the system prompt, every message in the conversation, every tool call, every tool result. Claude has a 200k token window. GPT-4o has 128k. Gemini 1.5 Pro extended this to 1 million, though throughput and latency tradeoffs make that ceiling less useful than it sounds.
In practice, a long coding session generates tokens fast. Every file you read goes into the context. Every bash command output goes in. Every exchange stacks up. A 2000-line file is roughly 15,000 to 20,000 tokens. Read five of those and you have consumed most of a 128k context before the model has written a line.
The solutions break into two categories: limiting what goes in, and compressing what is already there.
Limiting input means the agent has to be deliberate about what it reads. Good coding agents do not read directories speculatively. They navigate: list a directory, identify relevant files, read only those. Aider handles the discovery problem with a repo map: a condensed representation of the entire codebase showing file names plus all top-level function and class signatures. The model sees the shape of the whole repo and requests specific files on demand, rather than reading everything up front.
Compressing existing context is harder. Claude Code has a /compact command that summarizes the current conversation, replacing the full history with a condensed version. This is a destructive operation: you lose detail in exchange for headroom. The tricky part is that the summary must be good enough that the model does not lose track of decisions it already made. Badly compressed summaries cause the model to repeat work or contradict earlier choices.
Some systems use sliding windows, dropping old messages from the beginning of the conversation. This is simpler but dangerous: you might discard the original task specification, or discard the content of a file the model still needs to reference.
Planning and Task Tracking
Before writing code, a well-structured agent makes a plan: it creates a list of steps, marks each one as it completes it, and checks back against the list. This is visible in Claude Code’s todo management, which the model can use to track its own progress.
This serves several purposes. It forces decomposition before execution, which surfaces missing steps early. It provides a persistent state that survives context compression, since a short task list is cheap to carry forward. And it gives the human a legible trace of what the agent is doing, which matters for knowing when to intervene.
The intervention question is underappreciated in most coverage of coding agents. Long-running tasks go wrong in subtle ways. The model might interpret a requirement differently from what you meant. It might pursue a refactoring path that is technically correct but not what you wanted. The right mental model for a coding agent is a very fast developer who needs occasional steering, not a system you fire and forget.
Error Recovery
When a bash command returns a non-zero exit code, that error goes back into the context as a tool result. The model reads it and can adjust. This recovery loop is what makes agents resilient: they do not crash on errors, they treat errors as information and try again.
But this also means a broken environment can push the model into a retry spiral. If a dependency is missing and every install command fails with the same error, the model might run variations of the same command five times before concluding it cannot proceed. Production agents benefit from explicit retry limits or from detecting repeated failure patterns and escalating to the user.
Prompt Injection
Because tool results go directly into the context as text, anything that looks like an instruction can influence subsequent model behavior. A malicious file, a crafted shell output, a web page the agent fetches: any of these could contain text that the model treats as a directive. This is prompt injection, and it is a genuine risk for agents that read files from untrusted sources or browse the web.
The mitigations are imperfect. Separating tool result content from instruction content with markers helps somewhat. Sandboxed execution environments reduce the blast radius of a successful injection. But the fundamental problem, that the model cannot fully distinguish between data and instructions when both arrive as text, remains an open research question.
A Minimal Implementation
If you want to understand this from first principles, the smallest possible coding agent is about 50 lines of Python. The Anthropic tool use documentation walks through this pattern clearly.
tools = [read_file_tool, write_file_tool, bash_tool]
messages = [{"role": "user", "content": task}]
while True:
response = client.messages.create(
model="claude-opus-4-6",
tools=tools,
messages=messages
)
if response.stop_reason == "end_turn":
break
messages = handle_tool_calls(response, messages)
Everything layered on top of this, context management, task tracking, permission systems, user interface, is scaffolding that makes this loop more reliable, more observable, and more controllable. The scaffolding matters enormously in practice. But understanding that the loop is the core makes the rest easier to reason about, and makes it easier to diagnose when something goes wrong.
What the better coding agents have figured out is that the hard part is not the model itself. The model is already capable enough for most tasks. The hard part is structuring the context so the model always has what it needs and nothing it does not, handling failures gracefully, and giving the human enough visibility to catch mistakes before they compound.