The Scaffolding Is the Product: Tool Design in Coding Agents

Sebastian Raschka published a detailed breakdown of coding agent components that’s worth reading if you’re building or evaluating these systems. It covers the agent loop, tool categories, context management, and verification steps with enough specificity to be practically useful. But the piece made me want to go deeper on one particular question: why do coding agents with the same underlying model perform so differently across different scaffolding implementations?

The short answer is that the scaffolding is not plumbing. It is the product. The tool schemas, edit formats, navigation strategies, and error contracts that sit between the model and the filesystem collectively determine what the agent can accomplish, often more than parameter count or training data.

The Loop Itself Is Not the Interesting Part

Every coding agent runs some variation of the same loop: read context, call model, execute tool, feed result back, repeat until done or stuck. The loop structure is not controversial. What varies enormously is what happens at each step.

The perception step, for instance, is not just “read files.” Coding agents have to decide which files to read, in what order, at what granularity, and at what cost to the context window. A naive implementation loads everything relevant upfront and burns tokens before the model writes a single line. A better one stages exploration: first a glob pass to understand structure, then targeted reads of function signatures, then full body reads only where changes are needed. The same model behaves very differently when given a 200-token architecture summary versus 8,000 tokens of raw file contents that are only tangentially related to the task.

The most common tool in any coding agent is grep. It is fast, universal, and easy to implement. It is also frequently wrong in ways that matter.

Consider finding all callers of a function named process. A regex search returns every occurrence of the string process in the codebase, including comments, variable names, string literals, and method names on unrelated types. The model then has to filter that noise while consuming tokens for every false positive. In a large codebase, this is not a minor inefficiency; it actively degrades reasoning quality because the model is spending attention on irrelevant results.

Language Server Protocol operations solve this at the semantic layer. A goToDefinition call resolves the symbol under the cursor to its actual declaration, regardless of naming collisions. findReferences returns only true call sites, not string matches. documentSymbol maps a file’s structure without reading the full body. These operations are available via LSP clients like pylsp, rust-analyzer, and typescript-language-server, and they give an agent navigational precision that grep cannot match.

The cost is setup complexity. LSP servers require a running process, a project configuration, and indexed dependencies. For a general-purpose agent that works across arbitrary codebases, this is a real constraint. For a specialized agent or a developer tool with known environment assumptions, it is a worthwhile investment. The agents that benchmark best on repository-level coding tasks generally use some form of semantic navigation rather than pure text search.

Edit Formats and the Reliability Cliff

How an agent specifies file changes is one of the most consequential design decisions in the entire system. Get it wrong and you get silent corruption; get it right and you get reliable, reviewable edits.

Free-text replacement is the most brittle approach. The model emits a before/after pair, the scaffolding finds the before string in the file, and replaces it. This fails whenever the model’s recollection of the existing code differs slightly from reality, which happens frequently because models summarize rather than memorize. A missing space, a different variable name from a prior edit, a line that changed since the context was loaded: any of these cause the replacement to fail silently or, worse, to match the wrong location.

Line-number-anchored edits are more robust. Specifying a start line, end line, and replacement content reduces ambiguity significantly. The scaffolding can validate that the replaced range contains roughly what the model said it would, reject the edit if the lines have changed, and surface an error that the model can recover from.

Diff-based formats go further. A unified diff encodes both what is expected to be present and what replaces it, and standard patch tools have well-defined semantics for fuzzy matching and rejection. The downside is that generating valid unified diffs is harder for models than generating ad-hoc before/after blocks, and malformed diffs fail noisily. Some implementations use a structured object format instead:

{
  "file": "src/server.ts",
  "edits": [
    {
      "start_line": 42,
      "end_line": 47,
      "new_content": "  const result = await db.query(sql, params);\n  return result.rows;"
    }
  ]
}

This gives the scaffolding maximum latitude to validate, reject, and report failures precisely. It is also the format that produces the most useful diffs for human review, since the line ranges map directly to what changed.

The aider project has documented this extensively in its edit format benchmarks, showing that different formats produce meaningfully different pass rates on coding tasks even when the underlying model is identical. The whole-file format (emit the entire file with edits applied) scores highest on reliability but is prohibitively expensive for large files. Search-and-replace formats score lower. The structured line-based format sits in the middle of both cost and reliability.

Context Budget Management

Context windows have grown large enough that naive agents sometimes just stuff everything in and let the model sort it out. This works until it doesn’t: long contexts increase latency, increase cost, and can degrade attention quality on the specific region that matters.

The better approach treats context as a budget to be spent deliberately. The agent maintains an accounting of what has been loaded and at what granularity:

Directory structure and file list: cheap, always loaded
File headers and function signatures: moderate cost, loaded when a file is relevant
Full function bodies: expensive, loaded only when a change is being made
Test files: loaded early, because tests describe expected behavior more densely than implementations
Recent git history for affected files: loaded when the task involves understanding prior decisions

The cascade matters. An agent that reads the full body of every potentially relevant file before deciding which ones it actually needs to change has already wasted its budget before the first edit. An agent that stages reads based on what each read reveals can handle substantially larger codebases within the same context limit.

Some scaffolding implementations track a “loaded files” state and avoid re-reading files that have not changed. Others compress prior tool outputs when they are no longer directly relevant to the current reasoning step. Both are reasonable approaches to the same problem: context is finite and the task is not.

Verification as a First-Class Step

One thing that separates reliable coding agents from fragile ones is whether verification happens inside the loop or outside it. An agent that makes edits and then exits, leaving the human to run tests and report back, is a very different system from one that runs the test suite, observes the output, and revises its approach based on failures.

The verification tools available in the loop determine what the agent can self-correct. At minimum, a useful coding agent should be able to:

Run a type checker or linter and parse its output
Execute the relevant subset of tests (not the full suite, which is often too slow)
Build the project and capture compiler errors
Detect when a file has been left in a syntactically invalid state

Each of these creates a feedback signal that closes the loop. A type error pinpoints the file and line that needs attention. A failing test names the behavior that regressed. A build error in a compiled language often carries more diagnostic information than a runtime failure.

The tricky part is that not all verification is cheap. Running a full test suite on every iteration is often impractical. Targeted test runs, based on which files changed and which tests cover them, are more tractable. Coverage maps help, but they add setup complexity. Some agents use heuristics: run the test file that lives closest to the changed file, run tests whose names contain the function name that changed. These are imperfect but substantially better than no verification.

The Error Contract

Tool failures are unavoidable, and how the scaffolding represents them to the model is a design decision with real consequences. A tool that returns an empty result on failure is much harder to recover from than one that returns a structured error with context:

{
  "error": "file_not_found",
  "path": "/src/handlers/user.ts",
  "suggestion": "Did you mean /src/handlers/users.ts? (1 similar file found)"
}

The model needs to know what went wrong, why it went wrong, and ideally what the next reasonable action is. Sparse error messages produce confused agents that retry the same failing operation. Rich error messages let the model adjust its approach without burning multiple loop iterations on recovery.

This is not free. Richer error contracts require more implementation work in the scaffolding and more thought about what information is actually useful in each failure case. But the ROI is high because tool failures are frequent in real codebases: files move, imports break, test commands differ by project, build systems require initialization steps the agent has not performed.

What This Means in Practice

If you are evaluating coding agents, the benchmark numbers matter less than understanding the scaffolding decisions behind them. An agent with a weaker model and better-designed tools will often outperform the reverse. The questions worth asking are: what edit format does it use, and how does it recover from edit failures? What navigation tools does it have, and are they semantic or textual? Does it verify its own changes within the loop, and how targeted is that verification?

If you are building one, the implementation order that tends to work is: get the edit format right first, because it is foundational and hard to change later; build staged context loading second, because it is what allows the system to scale to real codebases; add semantic navigation as resources allow; and design error contracts for every tool before you consider them done.

The model will improve on its own. The scaffolding only improves if you build it to.