Training vs. Scaffolding: Where Coding Agent Capability Actually Comes From

The separation between “model capability” and “scaffolding quality” is less clean than most coding agent discussions make it sound. Sebastian Raschka’s breakdown of coding agent components covers the standard architecture clearly: the loop, the tools, the navigation strategies, the file editing formats. But underlying all of it is a question most system design discussions skip: how does the model know what to do with these tools in the first place, and what can training give you that scaffolding alone cannot?

The answer matters practically. When a coding agent fails on a complex refactoring task, whether you should improve your tool descriptions and context management or switch to a bigger model depends on correctly attributing the failure. Getting this wrong is expensive.

The Function Calling Primitive

Modern LLMs acquire tool-use capability through training, not prompt engineering. The foundation is the function-calling interface that providers expose: tool definitions are serialized as JSON Schema alongside the conversation, and the model learns during fine-tuning to emit structured tool calls in contexts where tool use is appropriate.
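As a rough illustration of what "tool definitions as JSON schema" means in practice, the sketch below declares one tool and checks a model-emitted call against it. The read_file tool, the name/arguments call shape, and the validate_tool_call helper are all hypothetical; real providers each use their own wire format, and a real harness would likely run a full JSON Schema validator rather than this name-and-required-fields check.

```python
import json

# Hypothetical tool definition; the name and fields are illustrative only
# and do not reproduce any specific provider's format.
READ_FILE_TOOL = {
    "name": "read_file",
    "description": "Return the contents of a file in the repository.",
    "parameters": {
        "type": "object",
        "properties": {
            "path": {
                "type": "string",
                "description": "Repository-relative file path.",
            },
        },
        "required": ["path"],
    },
}


def validate_tool_call(raw_call: str, tools: list[dict]) -> dict:
    """Parse a model-emitted tool call and check it against the declared tools.

    Only the tool name and the required argument names are checked.
    """
    call = json.loads(raw_call)
    by_name = {tool["name"]: tool for tool in tools}
    if call.get("name") not in by_name:
        raise ValueError(f"unknown tool: {call.get('name')!r}")
    schema = by_name[call["name"]]["parameters"]
    arguments = call.get("arguments", {})
    missing = [key for key in schema.get("required", []) if key not in arguments]
    if missing:
        raise ValueError(f"missing required arguments: {missing}")
    return call
```

The schema constrains the shape of a call, but nothing in it says when calling read_file is the right move. That decision is the part the model brings from training.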

A base language model has no inherent concept of tool use. You could describe tools in a system prompt and ask the model to output JSON, and a sufficiently capable model would do something reasonable. But the models that power coding agents have been specifically fine-tuned on examples of correct tool-use behavior: when to call a tool versus when to reason further, how to formulate parameters, how to interpret results, and how to chain multiple calls toward a goal.

OpenAI documented the general mechanism in their InstructGPT paper: reinforcement learning from human feedback shapes model behavior beyond what supervised fine-tuning alone achieves. Applied to tool use, feedback on multi-turn trajectories teaches models to prefer calling a search tool when they need to locate code over hallucinating a file path, or to re-read a file before editing when uncertain, even when no system prompt instruction says to.

Anthropic’s Constitutional AI approach likewise shapes behavior through training rather than prompting. The same kind of tuning can instill agentic safety behaviors, such as asking for clarification before irreversible actions and preferring reversible operations, that surface in coding agent contexts without any scaffolding enforcement.

What Training Teaches That Prompts Cannot

Several capabilities emerge reliably from training but are unreachable through scaffold engineering.

Multi-step planning from incomplete context. A well-trained coding model, given a bug report and a repository map, will correctly identify that the fix probably requires changing both the implementation and the tests, without being told to look at the tests. This comes from exposure to thousands of similar debugging sessions in training data. Scaffolding cannot inject this pattern recognition because it does not have the bug context until the model surfaces it.

Calibrated uncertainty about what it knows. Models trained on code-adjacent data learn to distinguish between “I know where this function is defined” and “I should search for where this function is defined.” A model expressing appropriate uncertainty about an unexplored codebase is not following an instruction; it has internalized a prior about what it can and cannot know from incomplete information.

Code correctness priors. When generating code, trained models apply statistical knowledge about which patterns are correct for a given context: which APIs take callbacks versus return promises, whether a particular function signature implies null safety, what argument order a library function expects. These priors explain why larger models fail less on editing tasks even with identical scaffolding. The training corpus contains more correct examples per context type.

Robustness to malformed tool results. Scaffolding can truncate tool output and format it cleanly, but a tool returning partial results, an unexpected error format, or an empty result when results were expected requires the model to reason about what went wrong. Models with more agentic training recover from these states more gracefully because they have seen analogous recovery patterns in training. No schema description teaches this.

What Scaffolding Contributes That Training Cannot Fix

The converse is equally important. Training generalizes across contexts but cannot account for the specific affordances of a specific tool inventory. A model trained on millions of coding sessions has never seen your particular str_replace_editor tool with its specific old_string/new_string contract and its exact error messages. It learns those in context, from the tool description.

This is why tool schema descriptions function as behavioral instructions rather than documentation. When Claude Code’s Edit tool description says “Always read a file before editing it if you haven’t in this session,” that instruction is doing real work. It compensates for the model’s default prior, which would sometimes skip the re-read when context seems sufficient. The scaffold cannot enforce this programmatically without expensive and brittle pre-call validation. The description text is the practical enforcement mechanism.

Output truncation, context compaction, and file state tracking are engineering problems that training cannot solve in advance, because they depend on the specific token budget of the deployment environment and the specific task being run. A model trained at Anthropic has no knowledge of how many tokens your multi-step refactoring session will accumulate before hitting your context ceiling.
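Output truncation is the simplest of these to sketch. The function below is a minimal illustration, not any particular agent's implementation: truncate_tool_output and its max_chars default are hypothetical, and character count stands in for a real token count, which would depend on the deployment's tokenizer and budget.

```python
def truncate_tool_output(text: str, max_chars: int = 4000) -> str:
    """Keep the head and tail of long tool output and note what was dropped.

    Characters are used as a crude proxy for tokens (an assumption made
    for this sketch).
    """
    if max_chars < 2:
        raise ValueError("max_chars must be at least 2")
    if len(text) <= max_chars:
        return text
    half = max_chars // 2
    omitted = len(text) - 2 * half
    # Keep both ends: command output often puts the error summary at the tail.
    return f"{text[:half]}\n[... {omitted} characters omitted ...]\n{text[-half:]}"
```

The explicit omission marker matters as much as the cut itself: it tells the model that the output is partial, so it can decide to re-run the tool with a narrower query instead of reasoning from a silently clipped result.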

There is also the failure-mode visibility problem. Claude Code’s string-replacement approach returns a hard error when old_string does not exist in the file. That error message is scaffolding behavior, not something the model produces. It forces the model to diagnose and recover in the next turn. Without it, the model might proceed as if the edit succeeded and produce subsequent edits that assume a change that never happened. The scaffolding can be more reliable than the model about certain facts, like whether a string literal appears exactly once in a file.
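A minimal sketch of that contract, assuming an in-memory string in place of a real file: the EditError class, the str_replace function, and the exact error wording are hypothetical and do not reproduce Claude Code's implementation. The uniqueness check is an added assumption, modeled on the "appears exactly once" condition mentioned above.

```python
class EditError(Exception):
    """Raised when an edit cannot be applied unambiguously."""


def str_replace(content: str, old_string: str, new_string: str) -> str:
    """Replace old_string with new_string, failing loudly on any ambiguity."""
    if not old_string:
        raise EditError("old_string must not be empty")
    count = content.count(old_string)
    if count == 0:
        # Hard error: the model must re-read the file and retry next turn.
        raise EditError("old_string not found in file; re-read the file and retry")
    if count > 1:
        # Assumed rule for this sketch: refuse to guess which match was meant.
        raise EditError(
            f"old_string matches {count} locations; add surrounding context"
        )
    return content.replace(old_string, new_string, 1)
```

Because the check is an exact string count, the scaffold's answer is deterministic, whereas the model's belief about the file contents may be stale after several edits.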

The SWE-bench Numbers Make This Concrete

The SWE-bench Verified leaderboard makes the interaction measurable. When the same model runs under different scaffolding configurations, scores swing by 15 to 30 percentage points. That is the scaffolding contribution. The gap between models with equivalent scaffolding, such as Claude 3 Haiku and Claude 3.5 Sonnet running on identical harnesses, is consistently large. That is the training contribution.

The two variables appear roughly independent. A well-scaffolded weaker model can exceed a poorly scaffolded stronger model, so scaffolding can close part of a model gap. A well-scaffolded stronger model still outperforms a well-scaffolded weaker one, so it cannot close all of it. The two interventions stack rather than fully substitute for each other.

The SWE-agent paper from Princeton documented this explicitly. Their team spent significant time on tool design and achieved several percentage point improvements before hitting a ceiling that required model-level improvements to push through. They framed this as the Agent-Computer Interface problem: the tooling shapes what the model can express, but the model’s capability determines whether it can use the interface well.

Aider’s benchmarking of edit formats across GPT-4 and Claude models shows similar patterns: the same model on a better edit format performs several points higher, but a stronger model on any format outperforms a weaker model on the best format once the quality gap is large enough.

The Training Data Feedback Loop

A less-discussed constraint is that models learn coding agent behavior from training data about coding agents. As these systems become more common in public repositories, issue trackers, and documentation, the training corpora for future models will contain more examples of agent interaction logs, scaffold error recovery patterns, and multi-step debugging sessions. Models trained on these corpora will start with stronger agentic priors, reducing the gap that scaffolding must compensate for.

This is a feedback loop with ecosystem implications. Well-documented, open systems like Aider, whose edit format and agent loop are extensively described in its public documentation, contribute to future training data in ways that make future models better at using similar patterns out of the box. The tools that accumulate public documentation and discussion get baked into model priors earliest.

The same effect applies to agent frameworks like LangGraph and smolagents. As more developers write code using these frameworks, and as that code ends up in training corpora, models become better at using them without extensive scaffolding support.

Attributing Failures Correctly

If you are building on coding agents, model selection and scaffolding design address different failure modes. When agents fail on tasks requiring broad codebase understanding, multi-step reasoning about dependencies, or recovery from ambiguous errors, that is usually a model capability problem. Improving the system prompt will not fix it; the model needs the underlying pattern recognition that comes from training.

When agents fail because they read the wrong files, exceed context limits, or produce edit blocks that do not match file contents, that is usually a scaffolding problem. The model may be fully capable of the task but constrained by its environment.
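One way to make this attribution routine is to label each failed run and tally the labels by layer. The sketch below is only an illustration: the label names, the Layer enum, and the triage function are hypothetical, and the mapping simply encodes the categories described above rather than any measured taxonomy.

```python
from enum import Enum


class Layer(Enum):
    MODEL = "model"
    SCAFFOLDING = "scaffolding"


# Hypothetical failure labels, taken from the categories described above.
FAILURE_LAYER = {
    "broad_codebase_understanding": Layer.MODEL,
    "multi_step_dependency_reasoning": Layer.MODEL,
    "ambiguous_error_recovery": Layer.MODEL,
    "read_wrong_files": Layer.SCAFFOLDING,
    "context_limit_exceeded": Layer.SCAFFOLDING,
    "edit_block_mismatch": Layer.SCAFFOLDING,
}


def triage(failures: list[str]) -> dict[Layer, int]:
    """Count labeled failures per layer; unknown labels raise KeyError."""
    counts = {layer: 0 for layer in Layer}
    for label in failures:
        counts[FAILURE_LAYER[label]] += 1
    return counts
```

If most failures in a batch land on the scaffolding side, work on tool descriptions and context management is likely to pay off before a model upgrade does, and the reverse holds when the model side dominates.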

The architecture Raschka describes in his article is the right foundation. The loop, the tools, the navigation strategies, and the editing formats are all real components with real engineering decisions behind them. Knowing which layer is responsible for a given failure is what makes systematic improvement tractable, and it requires keeping both the training and scaffolding dimensions in view.