The Agent Loop Is Trivial; The Tool Interface Is Where Coding Agents Actually Differ

Sebastian Raschka published a thorough breakdown of coding agent components that has been making the rounds. It covers the canonical anatomy: the agent loop, tool use, context management, and verification. It is worth reading. What I want to do here is go deeper on the parts that actually separate good agents from mediocre ones, because the loop itself is almost embarrassingly simple.

The core of every coding agent looks like this:

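task_complete = False
while not task_complete:
    action = llm.call(system_prompt, history, tools)
    if action.name == "finish":
        # The model signals completion; there is no tool to execute.
        task_complete = True
    else:
        result = execute_tool(action)
        history.append((action, result))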

That is the whole thing. A while loop, an LLM call, tool execution, and history accumulation. Every agent from SWE-agent to Aider to Claude Code runs this same basic structure. The variance in capability does not come from the loop design; it comes from what the tools look like, how the agent navigates code, how it edits files, and whether it verifies its own work.

The ACI: Tool Design as the Real Lever

The Princeton SWE-agent paper introduced the term “ACI” (Agent-Computer Interface) to describe the tool layer that sits between the LLM and the filesystem. The insight is that tool design accounts for as much performance variance as model choice. Swapping from a naive file-dump tool to a windowed pager improved SWE-bench scores more than switching between several model tiers.

Consider the difference between two ways to expose file access. A naive version:

{
  "name": "read_file",
  "parameters": {
    "path": "src/main.py"
  }
}

This returns the entire file. For a 2,000-line module, that burns a large chunk of context on content the agent may not need.

SWE-agent’s windowed approach instead provides:

open src/main.py
[File: src/main.py (2000 lines total)]
(Showing lines 1-100 of 2000)
1: import os
2: import sys
...

The agent then navigates with scroll_up and scroll_down, reading only what it needs and preserving context for the actual work. This sounds minor. In practice, on long files, it changes what the agent can accomplish within a single context window.
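
To make the windowing idea concrete, here is a minimal sketch of a paged file viewer. It is not SWE-agent's implementation; the FileWindow class, the open_file helper, and the 100-line default are assumptions chosen to mirror the output shown above.

```python
from dataclasses import dataclass


@dataclass
class FileWindow:
    """Hypothetical windowed viewer over a file's lines (illustrative only)."""
    path: str
    lines: list[str]
    top: int = 0      # zero-based index of the first visible line
    size: int = 100   # number of lines shown per window

    def render(self) -> str:
        end = min(self.top + self.size, len(self.lines))
        header = (
            f"[File: {self.path} ({len(self.lines)} lines total)]\n"
            f"(Showing lines {self.top + 1}-{end} of {len(self.lines)})"
        )
        body = "\n".join(
            f"{n}: {text}"
            for n, text in enumerate(self.lines[self.top:end], start=self.top + 1)
        )
        return f"{header}\n{body}"

    def scroll_down(self) -> str:
        # Clamp so the window never starts past the last full page.
        self.top = min(self.top + self.size, max(len(self.lines) - self.size, 0))
        return self.render()

    def scroll_up(self) -> str:
        self.top = max(self.top - self.size, 0)
        return self.render()


def open_file(path: str, size: int = 100) -> FileWindow:
    """Read a file from disk and return a viewer positioned at the top."""
    with open(path, encoding="utf-8") as f:
        return FileWindow(path, f.read().splitlines(), size=size)
```

The header repeats the total line count on every render so the model always knows how much of the file it has not seen.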

The same principle applies to command output. A run_bash tool that dumps unlimited stderr will fill the context window with a stack trace from a failing test suite. A well-designed tool truncates at a sensible limit and signals that truncation happened, so the agent knows to ask for more if needed.
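
A rough sketch of that truncation behavior follows. The function name, the character-based limit, and the marker text are all assumptions for illustration; a real tool might count tokens or lines instead.

```python
def truncate_output(text: str, max_chars: int = 4000) -> str:
    """Keep the head and tail of long output and say how much was dropped."""
    if len(text) <= max_chars:
        return text
    half = max(max_chars // 2, 1)  # guard against a zero-length slice
    head = text[:half]
    tail = text[-half:]
    dropped = len(text) - len(head) - len(tail)
    # The explicit marker tells the agent that output is missing and how much.
    return f"{head}\n[... {dropped} characters truncated ...]\n{tail}"
```

Keeping both ends is a deliberate choice here: the start of a test run usually shows what was invoked, and the end usually holds the summary and the final stack frame.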

File Editing: Why Search/Replace Won

Three strategies exist for file editing in agents, and they have meaningfully different tradeoffs.

Full rewrite has the agent output the entire new file content. This is reliable: there is no ambiguity about what the result should be. But for a 500-line file with a three-line change, you are spending 500 lines of output tokens to express three lines of diff. At scale, this is prohibitively expensive.

Search/replace blocks are now the dominant paradigm. Aider popularized this format:

<<<<<<< SEARCH
def process_items(items):
    for item in items:
        print(item)
=======
def process_items(items, verbose=False):
    for item in items:
        if verbose:
            print(item)
>>>>>>> REPLACE

The scaffolding applies the replacement by finding the exact SEARCH block in the file and substituting the REPLACE block. Output token cost scales with the change size, not the file size. The failure mode is exact-match errors: if the file was modified since it was read, or if the LLM hallucinated whitespace differences, the match fails. Good agents handle this by catching the error, re-reading the file, and retrying.
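
The apply step can be sketched in a few lines. This is an illustration of the idea rather than Aider's code; EditError and apply_search_replace are hypothetical names, and rejecting ambiguous matches is a design choice of this sketch.

```python
class EditError(Exception):
    """Raised when a SEARCH block cannot be applied unambiguously."""


def apply_search_replace(source: str, search: str, replace: str) -> str:
    """Replace exactly one occurrence of `search` in `source`."""
    count = source.count(search)
    if count == 0:
        # Stale read or hallucinated whitespace: the agent should re-read and retry.
        raise EditError("SEARCH block not found; re-read the file and retry")
    if count > 1:
        raise EditError(f"SEARCH block matches {count} locations; add more context")
    return source.replace(search, replace, 1)
```

The error messages are written for the model, not for a human: they are fed back into the history and should say what to do next.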

Unified diff format leverages the fact that LLMs trained on GitHub data have seen millions of diffs:

--- a/src/main.py
+++ b/src/main.py
@@ -42,3 +42,4 @@
-def process_items(items):
+def process_items(items, verbose=False):
     for item in items:
-        print(item)
+        if verbose:
+            print(item)

This requires fuzzy matching to apply, since line numbers shift as edits accumulate. Aider supports this format and includes a fuzzy applicator for exactly this reason. The advantage is familiarity: models that have absorbed enormous amounts of open-source code history are fluent in this format in a way they may not be in a custom SEARCH/REPLACE syntax.
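
One way to locate a hunk when line numbers have drifted is to slide a window over the file and score each position by similarity. The sketch below uses Python's difflib for the scoring; it is a stand-in for the idea, not Aider's applicator, and find_best_match and the 0.8 threshold are assumptions.

```python
import difflib


def find_best_match(file_lines, block_lines, threshold=0.8):
    """Return (start_index, score) of the window most similar to block_lines.

    Leading and trailing whitespace is ignored on every line. Returns None
    when no window reaches the threshold.
    """
    n = len(block_lines)
    if n == 0 or n > len(file_lines):
        return None
    target = "\n".join(line.strip() for line in block_lines)
    best = None
    for start in range(len(file_lines) - n + 1):
        window = "\n".join(line.strip() for line in file_lines[start:start + n])
        score = difflib.SequenceMatcher(None, window, target).ratio()
        if best is None or score > best[1]:
            best = (start, score)
    if best is not None and best[1] >= threshold:
        return best
    return None
```

The threshold matters: set it too low and the edit lands on the wrong function, which is worse than a failed match the agent can recover from.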

Codebase Navigation: The Agent Needs a Map

Before an agent can edit anything, it needs to understand where things are. A fresh coding agent faces a codebase it has never seen, with no map.

The naive approach is recursive listing combined with selective reading: ls -R, then read the files that look relevant. This works for small projects and fails for large ones, where the listing alone is several thousand tokens and guessing which files matter is unreliable.

Aider’s repo-map is a more sophisticated solution. It extracts all symbols (functions, classes, methods) from the codebase, originally with ctags and in current versions with tree-sitter, and builds a compressed index of symbol names, their file locations, and their signatures. A condensed version of this map lives in the system prompt. The model can see that process_items is defined in src/pipeline/core.py at line 42 without reading the file. When it needs the full implementation, it reads only that file.

For a reasonably large Python codebase, this map might cost 2,000 to 4,000 tokens but enable the agent to navigate confidently without reading dozens of files speculatively.
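
To show the shape of such a map, here is a small sketch that uses Python's built-in ast module as a stand-in for a real symbol extractor. It handles Python source only, lists positional parameter names without defaults, and is not how Aider builds its map; map_source and the output format are assumptions.

```python
import ast


def _signature(node) -> str:
    """Render a function node as 'def name(arg, ...)' using positional args only."""
    args = ", ".join(a.arg for a in node.args.args)
    return f"def {node.name}({args})"


def map_source(path: str, source: str) -> list[str]:
    """List top-level functions, classes, and their methods with line numbers."""
    funcs = (ast.FunctionDef, ast.AsyncFunctionDef)
    entries = []
    for node in ast.parse(source).body:
        if isinstance(node, funcs):
            entries.append(f"{path}:{node.lineno} {_signature(node)}")
        elif isinstance(node, ast.ClassDef):
            entries.append(f"{path}:{node.lineno} class {node.name}")
            for item in node.body:
                if isinstance(item, funcs):
                    # Indent methods so the class structure stays visible.
                    entries.append(f"{path}:{item.lineno}   {_signature(item)}")
    return entries
```

Each entry costs a handful of tokens, which is the whole point: the model gets names and locations without paying for bodies.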

Embedding-based retrieval is an alternative that trades latency for broader coverage: embed all code chunks, retrieve the top-k most semantically relevant chunks per query. This scales better to very large codebases but adds retrieval errors and requires an embedding pipeline. Some agents combine both: the repo-map for structural navigation, embedding retrieval for semantic search when the structure-based approach comes up empty.
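
The retrieval step itself is just a ranking by vector similarity. In the sketch below, a toy bag-of-words counter stands in for a real embedding model, which is a large simplification; embed, cosine, and top_k are hypothetical names, and only the ranking logic carries over to a real pipeline.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: counts of lowercase word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]
```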

Context Is the Agent’s Only State

This is the constraint that shapes every other decision. The context window is the entirety of what the agent knows at any moment. There is no side channel, no persistent working memory, no hidden state. Every file the agent has read, every tool result it has received, every decision it has made is either in the context window or effectively forgotten.

This makes context management a first-class engineering concern. Claude Code benefits from a 200k token window, which defers the truncation problem significantly. Agents running on 8k or 32k windows need aggressive strategies: summarizing old tool results, dropping intermediate steps once their conclusions have been incorporated, limiting how many files get fully read.
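
One of those strategies, dropping old tool results once the window gets tight, might look like the sketch below. The four-characters-per-token estimate, the stub text, and the function names are assumptions; a real agent would use the model's tokenizer and would probably summarize rather than simply elide.

```python
def approx_tokens(text: str) -> int:
    # Crude assumption: roughly four characters per token.
    return max(1, len(text) // 4)


def compact_history(history, budget, keep_recent=4, stub="[tool result elided]"):
    """Elide the oldest tool results until the history fits the token budget.

    history is a list of (action, result) string pairs, oldest first.
    The most recent keep_recent entries are never touched.
    """
    compacted = list(history)  # copy so the caller's list is not mutated
    total = sum(approx_tokens(a) + approx_tokens(r) for a, r in compacted)
    stub_cost = approx_tokens(stub)
    for i in range(max(len(compacted) - keep_recent, 0)):
        if total <= budget:
            break
        action, result = compacted[i]
        saved = approx_tokens(result) - stub_cost
        if saved > 0:  # only elide when it actually frees tokens
            compacted[i] = (action, stub)
            total -= saved
    return compacted
```

The action is kept even when its result is elided, so the model can still see that it read a file and can choose to read it again.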

The conversation history structure matters too. Whether tool results are injected as assistant messages, system context, or user turns affects how the model weighs them. The specifics are model-dependent, but naive concatenation of tool results often underperforms structured injection.

Verification: The Loop Must Close

An agent that edits code without verifying it is generating plausible text, not writing software. Every production coding agent includes a verification step inside the loop:

# After applying edits
result = run_tool("bash", "python -m pytest tests/ -x -q 2>&1 | tail -20")
if "failed" in result or "error" in result:
    # Feed failure back into context, plan fix
    history.append(("test_result", result))
    # Loop continues
else:
    task_complete = True

The verification does not have to be tests. Linting with ruff or flake8, type checking with mypy or pyright, build verification with cargo build or tsc, all serve the same function: they give the agent falsifiable feedback about whether its changes work. SWE-bench uses test passage as the sole success criterion, which is why agents optimized for that benchmark invest heavily in running tests inside the loop.
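
Since every one of these tools reports success through its exit status, a single helper can wrap them all. The sketch below is one way to do it with the standard library; run_check and its defaults are assumptions. Checking the exit status avoids guessing from words like "failed" in the output.

```python
import subprocess
import sys


def run_check(cmd: list[str], tail_lines: int = 20, timeout: int = 300):
    """Run a check command; return (passed, last lines of combined output).

    Raises subprocess.TimeoutExpired if the command runs past the timeout.
    """
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave stderr with stdout
        text=True,
        timeout=timeout,
    )
    tail = "\n".join(proc.stdout.splitlines()[-tail_lines:])
    return proc.returncode == 0, tail


if __name__ == "__main__":
    # Example: the same pytest invocation, judged by exit status.
    passed, output = run_check([sys.executable, "-m", "pytest", "tests/", "-x", "-q"])
    print("PASS" if passed else "FAIL")
    print(output)
```

The tail is taken in Python rather than with a shell pipe because piping through tail would replace the checker's exit status with tail's.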

The failure mode here is infinite loops on hard problems. A broken test with a subtle logic error can trap an agent in a repair cycle indefinitely. Production systems add step budgets and surface the intermediate state to a human when the budget is exceeded.
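
A step budget is simple to bolt on. In this sketch, step is a hypothetical callable that runs one iteration of the loop and returns True when the task is done; the default of 25 steps is arbitrary.

```python
def run_with_budget(step, max_steps=25):
    """Call step() until it returns True or the budget is exhausted.

    Returns (done, steps_used). When done is False, the caller should
    surface the intermediate state to a human instead of continuing.
    """
    for used in range(1, max_steps + 1):
        if step():
            return True, used
    return False, max_steps
```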

The Two-Model Pattern

Aider’s architect/editor mode separates planning from execution into two distinct model calls. The architect receives the task and the repo-map, decides which files to change and what changes to make, and outputs a natural language plan. The editor receives the plan and the specific files, and outputs the edit blocks.

This works because planning and code generation have different requirements. Planning benefits from broad reasoning and understanding of the full task context. Code generation benefits from precision and fluency in the target edit format. A large, expensive model for architecture and a smaller, cheaper model for editing can outperform a single mid-tier model at lower total cost.
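
Structurally, the pattern is two calls with different inputs. The sketch below shows only that wiring; the Model type, the prompt wording, and architect_then_edit are assumptions for illustration and do not reflect Aider's actual prompts.

```python
from typing import Callable

# Hypothetical model interface: prompt text in, completion text out.
Model = Callable[[str], str]


def architect_then_edit(
    task: str,
    repo_map: str,
    files: dict[str, str],
    architect: Model,
    editor: Model,
) -> str:
    """Plan with one model, then produce edit blocks with another."""
    # The architect sees the task and the map, but no full file contents.
    plan = architect(
        f"Task:\n{task}\n\nRepo map:\n{repo_map}\n\nDescribe the changes to make."
    )
    # The editor sees the plan and only the files it needs to change.
    file_text = "\n\n".join(f"### {path}\n{body}" for path, body in files.items())
    return editor(
        f"Plan:\n{plan}\n\nFiles:\n{file_text}\n\nOutput SEARCH/REPLACE blocks only."
    )
```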

The tradeoff is latency: two model calls per iteration instead of one. For interactive use this matters. For batch processing of a task list it often does not.

What SWE-Bench Teaches

SWE-bench Verified (500 human-validated GitHub issues, a subset of the 2,294-task SWE-bench, each requiring code changes that make the project’s tests pass) has become the standard measure. The scores from mid-2025 tell an interesting story: well-scaffolded open-source agents using Claude 3.5 Sonnet reached 50-55% on SWE-bench Verified. Claude Code, with its carefully engineered ACI and 200k context window, reached 72% at launch according to Anthropic’s reported figures.

The delta is not explained by the model alone; Claude 3.5 Sonnet is available to other agents too. The scaffolding, the tool design, the context management, and the verification loop are doing real work. That is the argument Raschka’s article is making implicitly, and it is borne out by the data.

Building Ralph, my Discord bot, I work at a much smaller scale than a full coding agent, but the same principle applies: the tool interface your bot exposes to the LLM is where capability lives, not in the prompt text. Getting the tool schemas right, making errors legible, and ensuring the loop has a clean exit condition are the unglamorous work that determines whether the thing actually works.

The agent loop is almost beside the point. What matters is what you put inside it.