Simon Willison published a guide on how coding agents work as part of his agentic engineering patterns series. The core architecture is genuinely simple: the model produces text, the host application executes any tool calls in that text, appends the results to the message history, and calls the model again. This continues until the model produces a response with no tool calls. Understanding that loop means understanding the fundamental architecture, and it also means understanding the least interesting part of building a real coding agent.
The Loop, Concretely
```python
while True:
    response = llm.chat(messages)
    if not response.tool_calls:
        print(response.text)
        break
    messages.append(response)
    for call in response.tool_calls:
        result = dispatch(call.name, call.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": result,
        })
```
The model does nothing except produce text; all side effects come from tools. Intelligence and action are cleanly separated. You can swap models without changing tool implementations, and you can add tools without retraining anything. The model learns what tools are available from a description in the API’s tool schema and generalizes from that.
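Concretely, a tool description is just a JSON-schema blob in the API request. A minimal sketch in the OpenAI chat-API layout (Anthropic's differs slightly, e.g. `input_schema` instead of `parameters`); the description strings here are illustrative:

```python
# One entry in the "tools" array sent with each chat request.
# The model never sees the implementation, only this description.
read_file_tool = {
    "name": "read_file",
    "description": "Load the contents of a file into context.",
    "parameters": {
        "type": "object",
        "properties": {
            "path": {
                "type": "string",
                "description": "Path to the file, relative to the repo root.",
            }
        },
        "required": ["path"],
    },
}
```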
Most coding agents converge on a similar core set of tools:
| Tool | Purpose |
|---|---|
| read_file(path) | Load file contents into context |
| write_file(path, content) | Write or overwrite a file |
| bash(command) | Run arbitrary shell commands |
| list_directory(path) | Explore the directory tree |
| search_files(pattern, glob) | Ripgrep-style content search |
| web_fetch(url) | Read documentation or external references |
In principle the bash tool subsumes all the others, since you can implement read_file as cat. Typed tools are still preferable for auditability: a read_file call in a transcript is unambiguous, while a bash call wrapping cat is harder for an automated review pipeline to interpret. Claude Code exposes both: dedicated structured tools for common operations plus a general bash tool for everything else. Aider takes the opposite approach and works primarily through shell commands, relying on the model to produce unified diffs that it applies with patch.
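The dispatch function referenced in the loop might look like the sketch below, routing names from the table above to implementations. This is illustrative only; a real agent adds sandboxing, path validation, and output truncation:

```python
import subprocess
from pathlib import Path

def dispatch(name: str, arguments: dict) -> str:
    """Route a tool call to its implementation (a sketch)."""
    if name == "read_file":
        return Path(arguments["path"]).read_text()
    if name == "write_file":
        Path(arguments["path"]).write_text(arguments["content"])
        return "ok"
    if name == "bash":
        proc = subprocess.run(
            arguments["command"], shell=True,
            capture_output=True, text=True, timeout=60,
        )
        return proc.stdout + proc.stderr
    return f"unknown tool: {name}"
```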
Most of the Engineering Is Not the Model
The non-LLM code is where most of the real engineering happens. The loop, the tool implementations, the retry logic, the context budget management, the error handling, the output parsing: this is where you spend most of your time. The model is one component of a larger system that requires careful engineering to be useful.
Every major coding tool converges on static knowledge injection as a foundational mechanism. Claude Code loads CLAUDE.md automatically at session start, layered from global to repo-level to per-directory. Cursor introduced per-directory glob-scoped rules in v0.43 via .cursor/rules/. GitHub Copilot added .github/copilot-instructions.md in 2024. All of these are markdown files telling the model what the project is and what conventions to follow, and maintaining them well is a form of software engineering.
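The layering itself is simple engineering. A sketch of global-then-local collection, using Claude Code's CLAUDE.md file name; the lookup logic here is illustrative, not the tool's actual implementation:

```python
from pathlib import Path

def load_instructions(cwd: Path, home: Path) -> str:
    """Collect layered instruction files, global first and most
    specific last, so more local guidance can refine earlier rules."""
    candidates = [home / ".claude" / "CLAUDE.md"]  # global layer
    # Walk from the filesystem root down to cwd to pick up
    # repo-level and per-directory layers in order.
    parts = cwd.resolve().parts
    for i in range(1, len(parts) + 1):
        candidates.append(Path(*parts[:i]) / "CLAUDE.md")
    sections = [p.read_text() for p in candidates if p.is_file()]
    return "\n\n".join(sections)
```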
Testing agentic systems means testing the scaffolding specifically. Unit tests that mock the LLM and verify that specific tool outputs produce specific next actions are more useful for diagnosing most bugs than end-to-end runs. The model’s behavior is not under your control; your scaffolding is.
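One way to make that concrete: factor a single loop iteration out of the while-True shell and drive it with a fake model. All names here are illustrative, not any tool's API:

```python
from types import SimpleNamespace as NS

def run_one_turn(llm, messages, dispatch):
    """One iteration of the agent loop, factored out so the
    scaffolding can be unit tested without a real model."""
    response = llm.chat(messages)
    if not response.tool_calls:
        return response.text  # final answer; the loop would stop here
    messages.append(response)
    for call in response.tool_calls:
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": dispatch(call.name, call.arguments),
        })
    return None  # the loop would continue

# A scaffolding test: the fake model emits one tool call, and we
# assert the dispatched result lands in the message history.
class FakeLLM:
    def chat(self, messages):
        return NS(text=None, tool_calls=[
            NS(id="c1", name="read_file", arguments={"path": "a.py"}),
        ])

def fake_dispatch(name, arguments):
    return f"{name}:{arguments['path']}"

messages = []
assert run_one_turn(FakeLLM(), messages, fake_dispatch) is None
assert messages[-1]["content"] == "read_file:a.py"
```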
Reasoning Before Acting
The ReAct pattern (Yao et al., 2022) addresses a concrete failure mode: agents that commit to incorrect tool calls because they have not reasoned about what they expect to find. The pattern interleaves explicit reasoning traces with tool invocations:
```
Thought: I need to find which test file covers the payment module.
Action: find . -name "*.py" -path "*/tests/*" | xargs grep -l "payment"
Observation: tests/test_payment_gateway.py
Thought: Let me run those tests and see what is failing.
Action: python -m pytest tests/test_payment_gateway.py -v
Observation: FAILED test_retry_logic - AssertionError: Expected 3 retries, got 1
```
The reasoning step forces the model to articulate what it is trying to accomplish before committing to a command. It also gives the model a reference frame for interpreting observations: if the thought established an expected outcome, a deviation is easier to recognize. Without the reasoning step, models often proceed past wrong results because nothing in the accumulated context explicitly flags them as unexpected.
ReAct is not enforced by any API. You instruct it in the system prompt or rely on fine-tuning. Base models often skip straight to action generation, which is faster but loses the self-correction benefit. For long tasks with many sequential decisions, that difference matters.
Context Management Is the Genuinely Hard Problem
A mid-sized production codebase can have 500,000 lines of code. Even Claude’s 200k token context window cannot hold all of it, and filling context with irrelevant code degrades performance. The model attends to everything in its context, and noise hurts precision. Every coding agent has to decide which parts of the codebase to load, and the three dominant tools have made architecturally distinct choices.
Aider’s repo map uses tree-sitter to parse every file and extract function signatures, class definitions, and relationships, without including function bodies. The entire codebase fits in a few thousand tokens as a structural index. The model gets a bird’s-eye view and requests full file contents only for the files it needs. You lose implementation details but preserve the public interface and dependency relationships of every module.
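The signatures-without-bodies idea can be sketched with the stdlib ast module for Python files only; Aider itself uses tree-sitter and covers many languages, so treat this as an illustration of the shape of the output, not Aider's implementation:

```python
import ast

def map_python_file(path: str, source: str) -> list[str]:
    """Extract function and class signatures without bodies,
    producing a compact structural index of one file."""
    tree = ast.parse(source)
    lines = [f"{path}:"]
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            args = ", ".join(a.arg for a in node.args.args)
            lines.append(f"  def {node.name}({args})")
        elif isinstance(node, ast.ClassDef):
            lines.append(f"  class {node.name}")
    return lines
```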
Cursor’s embedding index retrieves by semantic similarity using a code-aware embedding model, with chunks indexed at build time. Semantic retrieval handles queries like “find where we handle OAuth tokens” well. It handles structural queries like “find all callers of this function” poorly, because two functions can share a domain without being contextually related for a specific task. Cursor supplements the embedding index with traditional code search for structural queries. The index can also go stale between indexing runs, which matters for codebases under active development.
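The retrieval step itself is just nearest-neighbor search over precomputed chunk vectors. A sketch with the embedding model left abstract; the vectors in the test are toy values, and real systems use approximate-nearest-neighbor indexes rather than a full sort:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two (nonzero) vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_chunks(query_vec, index, k=2):
    """Rank pre-embedded code chunks by similarity to the query
    embedding and return the paths of the top k."""
    ranked = sorted(index, key=lambda c: cosine(query_vec, c["vec"]),
                    reverse=True)
    return [c["path"] for c in ranked[:k]]
```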
Claude Code’s selective loading maintains no persistent index. The model navigates via tools: list directory, read file, search for patterns, follow imports. It reads files as they exist on disk, so it is always current. The cost is more round trips at task start. On a well-organized codebase with a clear README, it converges quickly. On a sprawling legacy codebase, the exploration phase can consume a large fraction of the context budget before any meaningful work begins.
Placement within the context also matters. Research from Stanford and UC Berkeley demonstrated that LLMs perform measurably worse on information placed in the middle of long contexts. Critical constraints and prohibited patterns belong at the top of any static context file.
Error Budgets Shrink Fast
At 95% per-step reliability, a 10-step task succeeds with probability 0.95^10, which is roughly 0.60. A 20-step task: 0.36. A 50-step task: 0.08. A real bug fix might take forty steps; a refactor might take a hundred. The error budget shrinks as each step introduces a new opportunity for a silent failure, independent of how capable the model is.
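The compounding math above, assuming independent per-step failures:

```python
def success_probability(per_step: float, steps: int) -> float:
    """Probability that every step of a multi-step task succeeds,
    assuming each step fails independently."""
    return per_step ** steps

# round(success_probability(0.95, 10), 2) -> 0.6
# round(success_probability(0.95, 20), 2) -> 0.36
# round(success_probability(0.95, 50), 2) -> 0.08
```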
The METR study from early 2026 found that a substantial portion of AI-generated patches that pass SWE-bench would be rejected in code review. Tests passing is not the same as code being correct, well-placed, or mergeable. SWE-bench scores climbed from around 3% in 2023 to over 70% for frontier systems by 2025, and those numbers measure a narrower thing than they appear to.
Error recovery design matters significantly for multi-step tasks. Structured error responses produce more predictable recovery behavior than raw exception tracebacks:
```json
{
  "ok": false,
  "error_type": "format_error",
  "tool_name": "events_insert",
  "details": {
    "received": "2026-01-15 09:30:00",
    "expected_format": "RFC3339 (e.g. 2026-01-15T09:30:00-05:00)"
  }
}
```
Raw tracebacks cause models to retry with slightly different arguments in loops that burn tokens without progress. A structured error narrows the search space for recovery. Research on agentic tool use found that over half of tool-use failures come from malformed arguments or incorrect sequencing, not from selecting the wrong tool. Agents knew which tool to call; they failed at forming valid arguments or ordering dependent calls. That failure mode is addressed by better error feedback and better tool schema design.
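A thin wrapper around tool execution is enough to produce errors in that shape. A sketch, with field names mirroring the example above; the `tools` dict mapping names to callables is an assumption of this sketch:

```python
import json

def safe_dispatch(name: str, arguments: dict, tools: dict) -> str:
    """Execute a tool and return a JSON string; on failure, return a
    structured error instead of letting a traceback reach the model."""
    if name not in tools:
        return json.dumps({"ok": False, "error_type": "unknown_tool",
                           "tool_name": name})
    try:
        return json.dumps({"ok": True, "result": tools[name](**arguments)})
    except TypeError as exc:  # malformed or missing arguments
        return json.dumps({"ok": False, "error_type": "format_error",
                           "tool_name": name, "details": str(exc)})
    except Exception as exc:  # the tool itself raised
        return json.dumps({"ok": False, "error_type": "execution_error",
                           "tool_name": name, "details": str(exc)})
```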
Parallel Execution
The standard loop is sequential: one tool call per turn, wait for the result, next turn. For tasks with independent subtasks, that is unnecessary overhead. Both the Anthropic and OpenAI APIs support multiple tool calls in a single turn. The model can emit read-file-A and read-file-B simultaneously, receive both results before its next turn, and cut I/O latency on exploration-heavy tasks. Getting models to use this consistently requires prompt engineering. Claude Code supports parallel tool calls, and for tasks that read many independent files, the effect on wall-clock time is measurable.
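Executing a turn's tool calls concurrently is a small change on the host side. A sketch using a thread pool; a real agent must also decide which tools are safe to run in parallel (two writes to the same file are not), and the call shape here is illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def dispatch_parallel(calls: list[dict], dispatch) -> list[str]:
    """Run all tool calls from one model turn concurrently, returning
    results in call order so the transcript stays deterministic."""
    with ThreadPoolExecutor(max_workers=max(len(calls), 1)) as pool:
        futures = [pool.submit(dispatch, c["name"], c["arguments"])
                   for c in calls]
        return [f.result() for f in futures]
```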
Where This Leaves the Field
The tool-use loop is a direct consequence of how language models work. They produce text; they do not execute code. The loop is the minimal architecture that gives a text-producing model the ability to affect state outside itself, and that structure is stable.
What will improve is the scaffolding around it: better context selection strategies, smarter stopping criteria, richer error feedback, and models that are more calibrated about their own uncertainty. The agents that are most reliable today are the ones that fail gracefully, communicate when they are stuck, and do not silently produce plausible-looking wrong outputs. Engineering those properties is a scaffolding problem, and it is where most of the real work in this space is still happening.