The tool loop at the heart of every coding agent is well-understood at this point: a language model receives a prompt and a set of tool schemas, calls tools, gets results appended to its conversation, and iterates until it produces a final response. Simon Willison’s guide to how coding agents work covers this architecture clearly. What is harder to find in most explanations is the compounding reliability problem that emerges as tasks get longer, and what that means for how you should actually use these tools.
The Step-Level Error Rate
Every action a coding agent takes carries some probability of being wrong: misidentifying a function’s purpose, making an incorrect assumption about how two modules interact, producing an edit with a subtle semantic error. For well-scoped tasks with clear context, frontier models make these mistakes infrequently. But they are never at zero.
Consider a model that executes each step correctly 95% of the time. For a task requiring 10 steps, the probability that all steps succeed is 0.95^10, roughly 0.60. For a 20-step task, that falls to about 0.36. For a 50-step task, which is not unusual for a non-trivial refactor, the probability of zero errors across all steps is approximately 0.08.
This is not a criticism of any particular model. It is arithmetic. Even a hypothetical 99% per-step reliability gives you around 60% success on a 50-step task. The compounding is unavoidable once you have any non-zero error rate.
The 95% figure is not calibrated for any specific real system, and individual steps are not independent. But the directional implication holds regardless of where you set the per-step rate: success probability falls faster than most people intuitively expect as task length grows.
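The arithmetic can be sketched in a few lines, under the same simplifying assumptions the paragraph above flags: a fixed per-step rate and independent steps.

```python
# Sketch: probability that an n-step task completes with zero errors,
# assuming a fixed, independent per-step success rate. Both assumptions
# are simplifications; the point is the shape of the curve, not the values.
def task_success_probability(per_step: float, steps: int) -> float:
    return per_step ** steps

# task_success_probability(0.95, 10) is roughly 0.60
# task_success_probability(0.95, 20) is roughly 0.36
# task_success_probability(0.95, 50) is roughly 0.08
# task_success_probability(0.99, 50) is roughly 0.61
```

Moving the per-step rate from 95% to 99% helps, but it only delays the decay; it does not change the exponential shape.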
What SWE-bench Actually Measures
SWE-bench is the dominant benchmark for evaluating coding agents, introduced in the original Jimenez et al. paper. It presents real GitHub issues from open-source repositories and measures whether the agent’s changes cause the associated test suite to pass. The benchmark has become a useful barometer of progress, and the trajectory is striking: scores climbed from roughly 3% in late 2023 to well over 50% for frontier systems by 2025.
But SWE-bench has a structural bias worth understanding. Its tasks are drawn from GitHub issues that were eventually fixed by human contributors, meaning they are generally tractable, reasonably well-scoped, and have unambiguous success criteria in the form of passing tests. The distribution is not representative of the open-ended, ambiguously defined, cross-cutting concerns that fill most real engineering backlogs.
More importantly, passing the test suite is not the same as writing good code. A METR study from early 2026 found that a substantial portion of agent-generated patches that pass SWE-bench’s test suites would not be accepted in actual code review. Tests checked for correctness in one narrow dimension, while reviewers evaluated architectural fit, naming, maintainability, and consistency with the surrounding codebase. A test suite written to verify a bug fix does not exhaustively encode all of those criteria.
The practical implication is that SWE-bench scores, while useful for tracking relative progress, probably overstate real-world performance. The benchmark selects for well-defined, verifiable problems; actual engineering backlogs do not.
Where the Errors Actually Live
HuggingFace research on tool-use failures found that most errors did not come from selecting the wrong tool. The model usually knew which tool to call. The failures came from malformed arguments, incorrect sequencing of dependent operations, and silent failure modes where a tool returned an empty or ambiguous result that the model misinterpreted as success.
This suggests that improving per-step reliability requires attention to tool design, not just model capability. Precise argument schemas reduce malformed calls. Structured error responses that distinguish failure types give the model something actionable to reason about. A write_file tool that silently succeeds even when the target directory doesn’t exist gives the model no signal that something went wrong.
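As an illustration, here is what a write_file tool that validates its arguments and fails loudly might look like. The `{ok, kind, detail}` response shape is an invented convention for this sketch, not any specific framework's API; the point is that each failure mode returns a distinct, machine-readable kind the model can act on.

```python
# Hedged sketch: a write_file tool that distinguishes failure types
# instead of silently succeeding. The response shape is illustrative.
import os

def write_file(path: str, content: str) -> dict:
    if not os.path.isabs(path):
        return {"ok": False, "kind": "invalid_argument",
                "detail": f"path must be absolute, got {path!r}"}
    directory = os.path.dirname(path)
    if not os.path.isdir(directory):
        return {"ok": False, "kind": "missing_directory",
                "detail": f"directory does not exist: {directory!r}"}
    with open(path, "w") as f:
        f.write(content)
    return {"ok": True, "bytes_written": len(content.encode())}
```

A model that receives `missing_directory` can create the directory and retry; a model that receives a bare "error" string, or worse, a silent success, cannot.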
A well-designed tool description narrows the model’s decision space before the call is even made. Compare two descriptions for a file-writing tool:
Weak: "Write to a file."

Strong: "Write or overwrite the file at the given path with new content. The path must be absolute and must point to an existing directory. Returns an error if the directory does not exist. Do not use this to draft or preview edits; only call this when the content is finalized and ready to persist."
The strong description changes observable behavior across hundreds of tool calls by constraining when the model invokes the tool and what it assumes about the outcome. This is what the Anthropic tool use documentation means when it notes that the description field should explain not just what a tool does but when to use it.
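For concreteness, the strong description could be packaged as a tool definition in the name / description / input_schema shape the Anthropic Messages API uses. The property names and their sub-descriptions here are illustrative choices, not part of the article's example.

```python
# Sketch: the strong description above as a tool definition. The schema
# shape (name / description / input_schema) follows the Anthropic Messages
# API convention; the property details are illustrative.
write_file_tool = {
    "name": "write_file",
    "description": (
        "Write or overwrite the file at the given path with new content. "
        "The path must be absolute and must point to an existing directory. "
        "Returns an error if the directory does not exist. Do not use this "
        "to draft or preview edits; only call this when the content is "
        "finalized and ready to persist."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "path": {"type": "string",
                     "description": "Absolute path; parent directory must exist."},
            "content": {"type": "string",
                        "description": "Finalized file content to persist."},
        },
        "required": ["path", "content"],
    },
}
```

Note how the constraints appear twice: in the description, where they shape when the model calls the tool, and in the schema, where they shape how it fills the arguments.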
The append-only nature of the conversation history compounds all of this. When a tool call fails silently and the model proceeds on a false assumption, that assumption is now permanently in context. Subsequent tool calls are made against a contaminated conversation, and there is no rollback mechanism. The model can correct itself, but only by appending a correction on top of wrong reasoning that is still there influencing attention.
What the Reliability Math Implies Practically
There are several ways to work with this constraint rather than against it.
Short, focused tasks consistently outperform long, open-ended ones. A task that takes 5 steps and has a clear success criterion will complete reliably. A task that requires 40 steps of exploration, hypothesis, and incremental correction is fighting the compounding error rate the whole way. This is not an argument against ambitious tasks; it is an argument for decomposing them into bounded subtasks with verifiable outputs at each stage.
Tests are the most useful feedback signal in the loop. Aider’s benchmark data shows a consistent gap between agent runs that can execute tests after each edit and runs that cannot. Test output is structured, specific to what the agent changed, and actionable. It gives the model immediate ground truth rather than requiring it to reason abstractly about whether an edit was correct. The “observe” step in the ReAct loop is only as good as the information it contains, and a failing test with a stack trace is much higher information than an empty response from a write operation.
Human checkpoints after planning but before execution help disproportionately. GitHub Copilot Workspace generates a written plan before making any edits, allowing a human to review the approach first. This does not reduce the step count in the editing phase; it places a human checkpoint at the highest-leverage point, where a wrong initial direction can be caught before it propagates across a long task.
Front-loaded project knowledge reduces the exploration phase. Claude Code reads CLAUDE.md at session start, injecting architectural notes and conventions before any tool calls happen. Cursor uses .cursor/rules/ files with per-directory scoping. A well-maintained project knowledge file can eliminate several navigation tool calls at task start. Those are steps that no longer need to succeed, and each step removed from the chain improves the overall probability.
The retry cost accumulates faster than expected. At roughly $0.50 to $2.00 in API costs for a complex refactoring task using current Claude Sonnet pricing, a single failed run followed by a retry is still cheaper than a human hour. But when a task requires three or four full retries due to mid-task drift or a wrong initial assumption, the economic case weakens considerably. The argument for coding agents is strongest for bounded, well-scoped tasks where the success probability per run is high enough that retry frequency stays low.
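The retry economics reduce to a geometric-distribution calculation. The numbers below are the illustrative figures from this section; real retries are not independent, so treat this as a lower bound on expected cost.

```python
# Back-of-envelope: expected cost of retrying a task until it succeeds,
# assuming independent runs (geometric distribution). Illustrative only.
def expected_cost(cost_per_run: float, p_success: float) -> float:
    # Expected number of runs until first success is 1 / p_success.
    return cost_per_run / p_success

# At $2.00/run: 80% per-run success costs $2.50 in expectation,
# 50% costs $4.00, and 25% costs $8.00 -- the case weakens quickly.
```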
The Implication for Tool Builders
The compounding reliability problem has a specific architectural implication for anyone building coding agents on the Anthropic Messages API or equivalents: per-step reliability is a first-class engineering concern.
Structured error responses that give the model actionable information outperform generic failure messages. Explicit sequencing constraints in tool descriptions reduce dependent-call failures. Tools that fail loudly outperform tools that fail silently. Maximum iteration caps prevent unbounded loops that burn context and tokens without progress. These are scaffolding decisions, not model decisions, and they move the success curve without any change to the underlying model.
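One of those scaffolding decisions, the iteration cap, can be sketched directly in the tool loop. Here `call_model` and `execute_tool` are hypothetical placeholders for the model API call and tool dispatch; the cap and the loudly-raised failure are the parts being illustrated.

```python
# Sketch: a tool loop with a hard iteration cap, so a stuck agent fails
# loudly instead of burning context and tokens without progress.
# call_model and execute_tool are hypothetical placeholders.
MAX_ITERATIONS = 25

def run_agent(messages: list, call_model, execute_tool) -> dict:
    for _ in range(MAX_ITERATIONS):
        response = call_model(messages)
        messages.append({"role": "assistant", "content": response})
        if not response.get("tool_calls"):
            return response  # no more tool calls: this is the final answer
        for call in response["tool_calls"]:
            result = execute_tool(call)  # structured success/error dict
            messages.append({"role": "tool", "content": result})
    raise RuntimeError("iteration cap reached without a final response")
```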
The model is not the only lever. The reliability budget is shared between the model and the scaffolding that surrounds it. Improving either side of that equation shifts where the success curve lands on a given task length. Understanding the math makes it clear which side of that equation is cheaper to improve for your specific use case.