The Scaffolding Is the Agent: What Actually Determines Coding Agent Performance

Sebastian Raschka’s breakdown of coding agent components lands at a moment when most developer conversations about coding agents are still fixated on which model powers them. The architecture underneath gets far less attention, and that is where the interesting engineering actually lives.

The core claim worth examining: a coding agent is a system with distinct components, and the model is only one of them. The loop, the tool schemas, the file editing format, the context management strategy, and the verification step are all first-class engineering decisions. Change any one of them significantly and agent behavior changes significantly, independent of which LLM sits at the center.

The Loop Is Not the Model

Every coding agent runs on some version of an observe-think-act loop. The agent receives input, reasons about what to do, emits a tool call or a response, receives the result back, and repeats. What is easy to miss is that this loop is implemented by the scaffolding around the model, not by the model itself. The model sees a context window that happens to contain tool results and produces text that happens to look like tool calls. The orchestration layer handles the rest.

This distinction matters because it means you can hold the model constant and dramatically change agent behavior by rewriting the loop. You can add approval gates before destructive operations, parallelize independent tool calls, implement retry logic when edits fail syntax checks, or insert summarization steps when the context approaches its limit. None of those require a better model. They require better scaffolding.
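
To make the point concrete, here is a minimal sketch of such a loop in Python, with an approval gate added purely in the scaffolding. Everything named here (ToolCall, DESTRUCTIVE, run_agent, the approve callback) is hypothetical and for illustration only; the model is assumed to be any callable that maps the transcript to either a tool call or a final answer string.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    args: dict


# Hypothetical set of tool names that require human approval.
DESTRUCTIVE = {"delete_file", "run_shell"}


def run_agent(model, tools, task, approve, max_steps=10):
    """Drive the observe-think-act loop around an arbitrary model callable."""
    transcript = [("user", task)]
    for _ in range(max_steps):
        action = model(transcript)
        if isinstance(action, str):  # the model produced a final answer
            return action, transcript
        if action.name in DESTRUCTIVE and not approve(action):
            result = "denied: user rejected this operation"
        else:
            try:
                result = tools[action.name](**action.args)
            except Exception as exc:  # failures become observations, not crashes
                result = f"error: {exc!r}"
        transcript.append(("tool_call", action))
        transcript.append(("tool_result", result))
    return "stopped: step limit reached", transcript
```

The gate, the step limit, and the error handling all live outside the model. Swapping the approve callback changes what the agent is allowed to do without touching the model at all.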

Claude Code, for instance, runs a loop that reads user intent, optionally externalizes a plan via a todo list, executes tool calls, observes results, and iterates until resolution or a human checkpoint. Cursor’s Composer mode runs a two-pass architecture: a planner that decomposes the task into steps and an executor that implements each step with tool access. Both are loop designs wrapped around models, and both differ from each other in ways that have nothing to do with their underlying LLMs.

The ACI: The Interface Nobody Designed

The most useful framing to come out of recent coding agent research is the Agent-Computer Interface (ACI), introduced in the SWE-agent paper from Princeton (arXiv: 2405.15793). The analogy to HCI is direct: decades of work went into designing interfaces for humans (keyboards, mice, GUIs, CLIs), with careful attention to what makes those interfaces learnable, efficient, and error-resistant. Almost none of that work was done for agents.

Tools designed for humans are being repurposed for agents. A terminal was not built to produce structured, parseable output for a language model. A file editor was not designed to emit linting feedback on every write. A grep command was not designed to return results calibrated to fit inside a context window without overwhelming it. These tools work, but they work poorly by ACI standards.

The SWE-agent paper made this concrete. The authors built a custom ACI for their coding agent: a file viewer with line-range support rather than full-file dumps, a stateful editor that ran a syntax check after every write and returned the error inline, and a search tool that returned a concise, capped list of matches rather than raw command output. They then compared the same base model (GPT-4) using the custom ACI against the same model using standard bash tools directly. The custom ACI won by a large margin on the SWE-bench benchmark. Same model, different interface, substantially better outcomes.
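
The check-on-write idea can be sketched in a few lines. This is not SWE-agent’s code; it is a simplified illustration that assumes the target is a Python file and uses the standard library’s ast.parse as the syntax check. The function name write_with_syntax_check is hypothetical.

```python
import ast
from pathlib import Path


def write_with_syntax_check(path, new_source):
    """Write Python source only if it parses; otherwise return the error inline."""
    try:
        ast.parse(new_source, filename=str(path))
    except SyntaxError as exc:
        # Reject the edit and tell the model exactly what is wrong and where.
        return f"edit rejected: {exc.msg} at line {exc.lineno}; file left unchanged"
    Path(path).write_text(new_source, encoding="utf-8")
    return f"edit applied: {len(new_source.splitlines())} lines written"
```

The design choice worth noticing is that a broken edit never reaches disk, so the agent’s next step starts from a file that still parses instead of from its own mistake.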

This is the result that should reorient how you think about the space. Tool design is not a detail. It is a primary lever.

What Good Tools Look Like

A few concrete properties distinguish well-designed agent tools from poorly designed ones.

Error messages need to be actionable. When a file edit fails because the search string was not found in the target file, returning a bare null or a generic error forces the model to guess what went wrong. Returning the actual file content around the expected location, with a message like “search string not found; nearest match at line 47,” gives the model something to reason from. The model cannot fix what it cannot diagnose.
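
One way to produce that kind of message is to fall back to fuzzy matching when the exact search fails. The sketch below uses difflib from the Python standard library to score each line against the first line of the search string; the function name locate_or_explain and the two-line context window are assumptions made for illustration.

```python
import difflib


def locate_or_explain(file_text, search):
    """Return (index, None) on an exact hit, else (None, diagnostic message)."""
    idx = file_text.find(search)
    if idx != -1:
        return idx, None
    lines = file_text.splitlines()
    if not lines:
        return None, "search string not found; file is empty"
    # Compare only the first line of the search text against each file line.
    first = (search.strip().splitlines() or [""])[0]
    scored = [
        (difflib.SequenceMatcher(None, first, line.strip()).ratio(), n)
        for n, line in enumerate(lines, start=1)
    ]
    ratio, lineno = max(scored, key=lambda item: item[0])
    lo, hi = max(1, lineno - 2), min(len(lines), lineno + 2)
    snippet = "\n".join(f"{n}: {lines[n - 1]}" for n in range(lo, hi + 1))
    return None, (
        f"search string not found; nearest match at line {lineno} "
        f"(similarity {ratio:.2f}):\n{snippet}"
    )
```

With the snippet in hand, the model can usually see that it got whitespace or an operator slightly wrong and retry with the exact text.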

Output verbosity needs to be calibrated. A bash tool that returns the full output of a command with no truncation will fill the context window with noise on any non-trivial command. A bash tool that truncates at a fixed character limit may cut off the error that explains the failure. Good tools truncate intelligently, keeping the head and tail of the output (errors typically appear at the tail) and summarizing the middle.
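
A minimal sketch of tail-biased truncation follows. It makes two simplifying assumptions: it counts characters rather than tokens, and it replaces the middle with a marker instead of summarizing it. The function name and the default 70/30 split between tail and head are hypothetical.

```python
def truncate_output(text, max_chars=2000, tail_fraction=0.7):
    """Keep a short head and a longer tail; replace the middle with a marker."""
    if len(text) <= max_chars:
        return text
    tail_len = int(max_chars * tail_fraction)
    head_len = max_chars - tail_len
    omitted = len(text) - head_len - tail_len
    marker = f"\n[... {omitted} characters omitted ...]\n"
    # Slice from an explicit start index so tail_len == 0 yields an empty tail.
    return text[:head_len] + marker + text[len(text) - tail_len:]
```

The marker matters as much as the slicing: it tells the model that output was cut and how much, so it can rerun the command with a filter if the missing part might be relevant.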

File editing tools need format stability. The history of edit formats in coding agents is a progression from fragile to robust. Whole-file rewrites are simple but expensive in tokens and dangerous for large files. Unified diff format is compact, but models produce malformed diffs at a meaningful error rate, partly because correct diffs require accurate line number tracking across edits. The search-and-replace format, where the model specifies an exact string to find and the replacement text, is more robust because it decouples the edit from line numbers entirely. Claude Code and Aider both converged on variants of this format, and their edit reliability is better for it.
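
The core of a search-and-replace edit is small enough to show in full. The sketch below is an illustration of the general format, not the implementation used by any particular tool; the rule that the search text must match exactly once is an assumption chosen here to keep edits unambiguous.

```python
def apply_search_replace(source, search, replace):
    """Apply one exact-match edit; refuse if the target is missing or ambiguous."""
    if not search:
        raise ValueError("search text must not be empty")
    count = source.count(search)
    if count == 0:
        raise ValueError("search text not found")
    if count > 1:
        raise ValueError(f"search text matches {count} locations; add more context")
    return source.replace(search, replace, 1)
```

Note that nothing in the function refers to a line number. An earlier edit that shifts the file by ten lines has no effect on whether the next edit applies.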

Context Is a Budget, Not a Bucket

Context management is the other major systems problem in coding agents. A real codebase has thousands of files; even with a 128K or 200K token context window, you cannot load all of it. You have to choose what goes in.

The naive approach is to load everything the agent might need. The problem is that irrelevant content in context actively degrades performance rather than merely wasting space. Models attend to everything in their window, and noisy context introduces incorrect assumptions that compound across multi-step reasoning. Loading 50 files when 5 are relevant is not neutral; it makes the agent worse.

The emerging approach treats context as a budget to be allocated deliberately. A high-level project summary stays in context permanently. File content is loaded on demand, via explicit read tool calls. When context approaches the limit, old tool results that are no longer relevant are summarized or dropped. Tools like Aider use a repo map: a compressed, tree-sitter-generated representation of the codebase’s symbol graph that fits in a few thousand tokens and tells the agent where things live without loading the things themselves.
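
The budgeting idea can be sketched as a pruning pass over the message history. The version below makes loud simplifications: it estimates tokens as roughly four characters each, it uses recency as a stand-in for relevance, and it drops old entries rather than summarizing them. The names estimate_tokens and fit_to_budget are hypothetical.

```python
def estimate_tokens(text):
    # Crude assumption: roughly four characters per token.
    return max(1, len(text) // 4)


def fit_to_budget(pinned, history, budget):
    """Keep pinned messages, then as many of the newest history entries as fit."""
    used = sum(estimate_tokens(m) for m in pinned)
    kept = []
    for message in reversed(history):  # walk from newest to oldest
        cost = estimate_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    dropped = len(history) - len(kept)
    # Leave a note so the model knows earlier results existed (not counted in budget).
    note = [f"[{dropped} older tool results dropped]"] if dropped else []
    return pinned + note + kept[::-1]
```

The pinned list plays the role of the permanent project summary: it is charged against the budget first and never evicted.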

Navigation quality is upstream of context quality. An agent that finds the right three files immediately uses its budget well. An agent that reads fifteen files to find those three wastes the budget and introduces noise. Grep-based exact search, glob-based file tree exploration, and LSP-backed semantic navigation (go-to-definition, find-all-references) serve different navigation needs. Most agents start with broad, cheap searches to orient themselves and escalate to more expensive retrieval only when needed.
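
A compressed symbol map is one cheap way to orient an agent before it reads any file. The sketch below is a deliberately simplified stand-in for the tree-sitter-based repo map idea: it handles only Python, uses the standard library’s ast module instead of tree-sitter, and lists top-level symbols without any ranking. The function name python_symbol_map and the dict-of-sources input are assumptions for illustration.

```python
import ast


def python_symbol_map(files):
    """Map each file name to its top-level function and class names."""
    func_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    lines = []
    for name in sorted(files):
        try:
            tree = ast.parse(files[name])
        except SyntaxError:
            lines.append(f"{name}: (unparseable)")
            continue
        lines.append(f"{name}:")
        for node in tree.body:
            if isinstance(node, func_types):
                params = ", ".join(a.arg for a in node.args.args)
                lines.append(f"  def {node.name}({params})")
            elif isinstance(node, ast.ClassDef):
                methods = [n.name for n in node.body if isinstance(n, func_types)]
                lines.append(f"  class {node.name}: {', '.join(methods)}")
    return "\n".join(lines)
```

A map like this costs a line or two per symbol, which is why it can stay in context while the file bodies are loaded only on demand.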

Verification Closes the Loop

An agent that writes code but cannot verify it is operating with one hand tied behind its back. The ability to run tests, linters, and build commands within the loop is what turns a code generator into a coding agent. The agent writes a change, runs the test suite, reads the failure output, reasons about the cause, and attempts a fix. That cycle can run multiple times per task without human intervention.

This is not a new idea conceptually, but implementing it well is non-trivial. The agent needs to know which test command to run, how to interpret test output, when to stop retrying versus escalating to the human, and how to avoid test-fixing loops where it passes the tests by making them less strict. These are loop design and prompt engineering problems, not model capability problems.
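
Two of those decisions, when to stop retrying and how to catch test tampering, can be expressed directly in the loop. The sketch below is one possible shape under stated assumptions: run_tests is a hypothetical callable returning (passed, output), attempt_fix is a hypothetical callable that applies a fix and returns the paths it modified, and any path under tests/ is treated as protected. Reverting a rejected fix is left out for brevity.

```python
def verify_loop(run_tests, attempt_fix, max_attempts=3, protected=("tests/",)):
    """Retry until tests pass, the attempt limit is hit, or a fix touches tests."""
    passed, output = run_tests()
    for attempt in range(1, max_attempts + 1):
        if passed:
            return "passed", attempt - 1
        changed = attempt_fix(output)  # paths modified by this fix attempt
        # str.startswith accepts a tuple, so every protected prefix is checked.
        if any(path.startswith(protected) for path in changed):
            return "escalate: fix modified protected test files", attempt
        passed, output = run_tests()
    status = "passed" if passed else "escalate: attempt limit reached"
    return status, max_attempts
```

The return value pairs a status with the number of fix attempts used, which gives the outer loop enough information to decide whether to hand control back to the human.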

SWE-bench scores for top coding agents were in the 50-65% range on the verified subset as of early 2026. The gap between top performers is not primarily explained by model capability differences; it is explained by differences in tool design, context strategy, and verification loop quality.

Where This Points

The pattern that emerges from Raschka’s analysis and from the broader research is consistent: the frontier in coding agent development is systems engineering, not model scaling. The ACI is underinvested relative to the attention paid to model benchmarks. Better file editing formats, smarter context allocation, tighter verification loops, and more informative tool error messages all compound with each other.

For anyone building on top of these systems, the implication is that customizing the scaffolding layer matters more than swapping models. A well-crafted CLAUDE.md or .cursorrules file that gives the agent accurate project context, a clear set of conventions, and guidance on which test commands to run is doing ACI work. It is shaping the interface between the model and the codebase, which is the variable that the research says matters most.