
The Error Budget Every Coding Agent Has to Spend

Source: simonwillison

Every coding agent, from Claude Code to Aider to Devin, runs the same fundamental loop. The model receives a growing message history, produces a response, the host executes whatever tools were requested, and the results go back as the next user turn. Simon Willison’s guide on agentic engineering patterns documents this loop with precision, and it is useful as a reference for anyone building in this space. What the architecture diagrams don’t surface directly is a consequence that shapes every scaffolding decision: the compounding error budget.

Take a conservative per-step success rate of 95%. A single bug fix might require 10 steps. A feature addition might require 40. A multi-file refactor might require 100. The compound probabilities:

  • 10 steps: 0.95^10 ≈ 0.60
  • 40 steps: 0.95^40 ≈ 0.13
  • 100 steps: 0.95^100 ≈ 0.006
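
The arithmetic behind those numbers is a one-liner to reproduce:

```python
# Compound success probability for an n-step agent task, assuming
# independent steps with a fixed per-step success rate.
def compound_success(per_step: float, steps: int) -> float:
    return per_step ** steps

for steps in (10, 40, 100):
    print(f"{steps:>3} steps: {compound_success(0.95, steps):.3f}")
```

The independence assumption is generous to the agent: in practice a bad step often poisons the context for every step after it.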

A task that looks tractable at step 10 becomes nearly impossible at step 100. A per-step rate that delivers roughly 60% success over a 10-step task drops below a coin flip before step 15. This is not a theoretical concern. SWE-bench, which measures whether AI-generated patches resolve real GitHub issues as verified by each repository’s own test suite, showed frontier systems scoring over 70% by 2025 on a benchmark where median issue resolution takes fewer than 20 agent steps. A METR study from early 2026 then found that a substantial fraction of those passing patches would be rejected in actual code review. Tests passing is a narrower thing than code being correct, and SWE-bench scores measure the narrower thing.

The implication is that per-step reliability determines the practical task complexity ceiling, and that ceiling is lower than it looks.

Tool Schemas Are Behavioral Specifications

The obvious lever is model quality. Better models make fewer per-step mistakes. But framing it that way positions scaffolding engineers as passive consumers of model improvements and misses where the actual leverage is.

A tool definition sent to the API consists of a name, a description, and an input_schema. The model has no built-in understanding of any tool; it infers correct usage entirely from the description field on every single API call. This makes tool descriptions behavioral instructions delivered on every request, not documentation written once for developers:

{
  "name": "write_file",
  "description": "Write or overwrite the file at the given path with new content. Use this only when you are ready to persist a change. Do not call this to draft or preview edits, only when the content is finalized.",
  "input_schema": {
    "type": "object",
    "properties": {
      "path": {
        "type": "string",
        "description": "Absolute path to the file. Must begin with /."
      },
      "content": {
        "type": "string",
        "description": "The full new content of the file"
      }
    },
    "required": ["path", "content"]
  }
}

The path description requiring an absolute path starting with / eliminates an entire failure class where the model constructs relative paths that break when the agent’s assumed working directory is wrong. That one-sentence change removes a failure mode without touching model weights or fine-tuning. The description on write_file itself, specifying that it should only be called once content is finalized, reduces the rate at which the model calls the tool prematurely and then has to overwrite its own output two steps later.
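
The host side can enforce the same contract the description states. A minimal sketch, assuming a hypothetical write_file implementation (the error string format is illustrative, not Claude Code’s actual behavior):

```python
import os

def write_file(path: str, content: str) -> str:
    # Mirror the schema's contract in code: reject relative paths with a
    # correction hint rather than resolving them against an assumed cwd.
    if not path.startswith("/"):
        return (f"InvalidPathError: {path!r} is relative. "
                f"Provide an absolute path, e.g. "
                f"{os.path.join(os.getcwd(), path)!r}.")
    with open(path, "w") as f:
        f.write(content)
    return f"Wrote {len(content)} bytes to {path}"
```

Belt-and-suspenders: the description discourages the mistake, and the implementation turns any remaining occurrence into a one-step recovery.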

The same logic applies to the Edit tool that Claude Code exposes. Rather than a general write, Edit takes an old_string/new_string pair:

{
  "name": "Edit",
  "description": "Replace a specific string in a file with new content. Fails explicitly if old_string is not found or is not unique in the file.",
  "input_schema": {
    "type": "object",
    "properties": {
      "file_path": { "type": "string" },
      "old_string": { "type": "string", "description": "The exact text to replace. Must be unique in the file." },
      "new_string": { "type": "string", "description": "The replacement text" }
    },
    "required": ["file_path", "old_string", "new_string"]
  }
}

The tool fails explicitly if old_string is not found. An explicit error is more recoverable than a silent partial edit: the model receives a clear tool result, narrows its recovery options to finding the correct target string, and proceeds. A line-number-based alternative fails silently if the model miscounts lines or if the file changed between the read step and the write step, producing a harder-to-diagnose regression.
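
The uniqueness check is a few lines of host-side code. A sketch of the semantics described above, not Claude Code’s actual implementation:

```python
from pathlib import Path

def edit(file_path: str, old_string: str, new_string: str) -> str:
    text = Path(file_path).read_text()
    count = text.count(old_string)
    # Fail loudly rather than guessing: zero matches means a stale target,
    # multiple matches means the edit would be ambiguous.
    if count == 0:
        return f"Error: old_string not found in {file_path}"
    if count > 1:
        return (f"Error: old_string occurs {count} times in {file_path}; "
                f"add surrounding context to make it unique")
    Path(file_path).write_text(text.replace(old_string, new_string))
    return f"Edited {file_path}"
```

Note that both failure branches return a message that names the specific problem, which narrows the model’s next move to one of two corrections.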

Error Responses Set the Width of the Recovery Path

When a tool call fails, the model receives the error as a tool result and decides what to do next. A raw exception message leaves the recovery search space wide open:

Error: file not found

A structured error narrows it:

FileNotFoundError: /src/auth/middleware.py not found.
Nearby files: /src/auth/middlewares.py, /src/auth/middleware_v2.py

With the structured response, the model has a one-step recovery path. Without it, the model might try to list the directory, read the parent directory, or check git history, adding two or three unnecessary steps that each carry their own failure probability and consume context. Research from HuggingFace on tool use failures found that over half came from malformed arguments or incorrect sequencing rather than selecting the wrong tool. The model knew which tool to call; it failed at forming valid arguments. Structured error responses with correction hints address exactly this failure class.
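
Generating the nearby-files hint takes one call to Python’s difflib. A sketch, assuming a hypothetical read_file tool backed by a real directory tree:

```python
import difflib
import os

def read_file(path: str) -> str:
    if os.path.exists(path):
        return open(path).read()
    # Suggest close matches from the same directory so the model can
    # recover in one step instead of exploring.
    parent = os.path.dirname(path) or "."
    candidates = os.listdir(parent) if os.path.isdir(parent) else []
    nearby = difflib.get_close_matches(os.path.basename(path), candidates, n=3)
    hint = f" Nearby files: {', '.join(nearby)}" if nearby else ""
    return f"FileNotFoundError: {path} not found.{hint}"
```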

The same applies to date format errors, type mismatches, and permission violations. A bare ValueError tells the model only that something went wrong. A response of “expected RFC3339 format (e.g. 2026-01-15T09:30:00-05:00), received 2026-01-15 09:30:00” hands the model the correction on a silver platter.
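
The same pattern sketched for the date case, with a deliberately strict format check (the regex and error string are illustrative, not any particular API’s behavior):

```python
import re

# Deliberately strict: require the T separator and an explicit offset,
# since lenient parsers accept inputs the downstream API would reject.
RFC3339 = re.compile(
    r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})$"
)

def parse_timestamp(value: str) -> str:
    if not RFC3339.match(value):
        return (f"Error: expected RFC3339 format "
                f"(e.g. 2026-01-15T09:30:00-05:00), received {value!r}")
    return f"ok: {value}"
```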

Context Accumulation Degrades Effective Attention

Every tool result gets appended to the message history. A 2,000-line file read, a test suite run with 500 lines of output, and a few exploration steps can consume tens of thousands of tokens before the model has written a single line of code. Context window sizes have grown: Claude 3.7 Sonnet supports 200K tokens, Gemini 1.5 Pro supports 2M. But a mid-sized production codebase can have 500,000 lines, and filling even a 200K window with irrelevant code degrades performance.

Research from Stanford and UC Berkeley (arXiv:2307.03172) demonstrated measurably worse performance on information placed in the middle of long contexts. Critical constraints belong at the top of any static context file, not buried after architecture notes. Claude Code reads CLAUDE.md at session start; Cursor uses .cursor/rules/; GitHub Copilot reads .github/copilot-instructions.md. All three serve the same purpose: front-load project constraints and conventions so they sit at high-attention positions in the context.

Different agents take different approaches to managing context growth during exploration. Aider’s repo-map uses tree-sitter to extract function signatures and call relationships from every file, producing a structural index of the whole codebase in 2,000-5,000 tokens without including implementation bodies. The model sees a bird’s-eye view and requests full file contents only when needed. Cursor maintains a vector index and retrieves semantically similar chunks automatically. Claude Code’s approach is purely tool-driven: list directories, read files, search for patterns, follow imports. No persistent index, always current, more round trips at session start.

Each trades something. Aider’s repo-map loses implementation detail but preserves public interface and call relationships. Cursor’s vector index goes stale between indexing runs and handles structural queries poorly; grep is more reliable for call-graph traversal. Claude Code’s tool-driven approach costs more context budget during exploration on sprawling legacy codebases.
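
A toy version of the repo-map idea, using Python’s ast module to pull signatures without bodies (Aider’s real implementation uses tree-sitter and ranks symbols by reference count; this sketch only illustrates the signature-extraction step):

```python
import ast

def map_file(path: str) -> str:
    """Return a signature-only outline of a Python file: names and
    parameters survive, implementation bodies are dropped."""
    tree = ast.parse(open(path).read())
    lines = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            lines.append(f"def {node.name}({args}): ...")
        elif isinstance(node, ast.ClassDef):
            lines.append(f"class {node.name}:")
    return "\n".join(lines)
```

Run over a whole repository, an outline like this is what lets a few thousand tokens stand in for hundreds of thousands of lines of code.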

Why Code Is an Unusually Forgiving Domain

The error budget math is sobering, but coding agents succeed at rates that would be impossible for general-purpose agentic tasks. Four structural properties of the code domain explain this.

Executable ground truth exists. When the agent writes a function, you can run it. The ReAct loop formalizes the Reason-Act-Observe cycle, and the Observe step has genuine informational density in code: compiler errors name the file and line, test failures include stack traces, lint warnings describe the exact problem. A general agent observing web search results gets variable, ambiguous signal; a coding agent observing a pytest failure gets AssertionError: Expected 3 retries, got 1 at a specific file and line number.

Git makes mistakes cheap. Most coding operations can be undone with git restore. Aider commits before every change by default. The blast radius of any agent step is bounded structurally by version control, before a single line of scaffolding code is written. Compare this with the first-generation general agents of 2023, which sent emails, created calendar events, and posted to external APIs: irreversible actions where errors compounded without recovery paths.

Code is a closed world. The meaning of a Python file is fully determined by its contents and imports. Reading the relevant files gives you all the information you need. There is no implicit social context, no organizational history living only in someone’s memory. This completeness guarantee is why a structural index like Aider’s repo-map works at all. An equivalent map for a competitor analysis task or an organizational decision process would not carry the same guarantee.

Feedback is fast and structured. Compilation and testing complete in seconds. The feedback loop between agent action and ground-truth signal is tight enough that many tasks produce a useful Observe step before context accumulation becomes a problem.

Where the Leverage Actually Is

The minimal scaffolding for a working coding agent is 100-200 lines of code: an LLM client, a handful of tool implementations, and the loop. The interesting engineering starts after this baseline.
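
That baseline fits in a few lines once the model call and tool implementations are injected. A skeletal sketch using a made-up stub protocol rather than any real provider’s wire format:

```python
def agent_loop(call_model, tools: dict, user_message: str, max_steps: int = 50):
    """Minimal agent loop: the model responds, the host executes any
    requested tool, and the result goes back as the next user turn.

    call_model(messages) returns a dict in a stub protocol (not a real
    API's format): either {"tool": name, "args": {...}} or {"answer": x}.
    """
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        response = call_model(messages)
        messages.append({"role": "assistant", "content": response})
        if "answer" in response:  # model signalled it is done
            return response["answer"]
        result = tools[response["tool"]](**response["args"])
        messages.append({"role": "user", "content": {"tool_result": result}})
    return None  # step budget exhausted: a real host would surface this
```

Everything discussed above, including schemas, error responses, and context management, is refinement layered onto this loop.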

Claude Code exposes structured tools (Read, Write, Edit, Glob, Grep) alongside a general Bash tool. The structured tools exist for auditability, not capability. Both Read and bash cat read a file, but a Read call in a conversation transcript is unambiguous to a review pipeline while bash cat requires parsing. The system prompt steers the model toward structured tools while keeping Bash available for everything that lacks a dedicated tool.

For production environments that expose a Bash tool, sandboxes such as E2B run commands inside isolated microVMs. The motivation is straightforward: an agent with unrestricted shell access on a developer’s local machine can touch anything the local user can touch.

Parallel tool calls, supported in Claude 3.5 Sonnet onward, compound positively rather than negatively. Three independent file reads that previously required three sequential round trips, each with independent failure probability, now execute in one round trip. Wall-clock time drops to that of the slowest single read, and the number of model-generation steps, each carrying its own error probability, drops from three to one. Correct implementation requires matching each result to its tool call ID; conflating results from simultaneous reads produces hard-to-diagnose downstream errors, but the mechanism is straightforward.
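
Matching results to calls by ID is the one detail that must be right. A sketch, assuming tool_use blocks that each carry an id (the field names mirror Anthropic’s tool_result shape, but the dispatch logic is generic):

```python
from concurrent.futures import ThreadPoolExecutor

def run_parallel(tool_calls: list, tools: dict) -> list:
    """Execute independent tool calls concurrently, keyed by id."""
    def run_one(call):
        result = tools[call["name"]](**call["input"])
        # Pair each result with the id of the call that produced it, so
        # outputs from simultaneous reads can never be conflated.
        return {"type": "tool_result",
                "tool_use_id": call["id"],
                "content": result}
    with ThreadPoolExecutor() as pool:
        return list(pool.map(run_one, tool_calls))
```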

Getting from SWE-bench scores of 3% in 2023 to 70%+ in 2025 involved model improvements, but also years of scaffolding refinement: better tool schemas, better context management, better error feedback, better stopping conditions. The scaffolding surrounding the model carries as much weight as the model itself on any task long enough for the error budget math to bite.
