A Coding Agent Is a Compounding Reliability Problem

The loop inside a coding agent is short enough to fit on a napkin. Read the task. Ask the model what to do. Execute the tool call. Feed the result back. Repeat until done. Sebastian Raschka’s breakdown of coding agent components maps this architecture clearly for anyone who wants to build or evaluate these systems. What the survey format doesn’t capture is why the components are designed the way they are, and why the same model can produce dramatically different outcomes depending on how the scaffold is built.

The organizing principle behind every component decision is compound reliability. At 95% per-step reliability, a 20-step task succeeds about 36% of the time. At 99%, the same task succeeds 82% of the time. Real SWE-bench instances, drawn from 2,294 actual GitHub issues across well-maintained Python repositories, often run 40 to 100 steps. The math gets unforgiving fast: 0.95^50 is 8%; 0.99^50 is 60%. A four-percentage-point improvement per step yields roughly an 8x improvement in success rate on longer tasks. Every component in the agent architecture is a mechanism for closing that gap.

How tool schemas encode behavioral rules

A tool schema looks like an API specification: name the tool, describe it, list parameters with types. In practice it functions as a behavioral contract that the model reads on every inference call throughout a session. The description field is not for human documentation; it is an instruction that runs continuously.

Compare two versions of a read_file schema:

// Minimal
{
  "name": "read_file",
  "description": "Read the contents of a file",
  "input_schema": {
    "type": "object",
    "properties": { "path": { "type": "string" } }
  }
}

// Behavioral
{
  "name": "Read",
  "description": "Read a file. For large files, use start_line and end_line to limit output. Always call this before editing a file you have not read in this session.",
  "input_schema": {
    "type": "object",
    "properties": {
      "path": { "type": "string", "description": "Absolute file path" },
      "start_line": { "type": "integer", "description": "First line (1-indexed). Omit for beginning." },
      "end_line": { "type": "integer", "description": "Last line, inclusive. Omit for end." }
    }
  }
}

The second schema embeds a constraint: always read before editing. This sentence is not extra documentation; it is a per-step reliability intervention that runs at every turn for the entire session.

The 2024 Princeton SWE-agent paper coined the term Agent-Computer Interface by analogy with Human-Computer Interface, and its central empirical finding was that changes to tool descriptions and output formatting produced larger swings in SWE-bench performance than changes to the underlying model. Tool schema design is not configuration; it is the primary layer at which agent reliability gets engineered.

File editing is where reliability diverges most sharply

The single most consequential component decision in a coding agent is how it applies edits to files. Every approach is a different bet on where failures will occur and how recoverable they will be.

Unified diffs (the ---/+++ format) are compact and models trained on GitHub data have seen them extensively. The failure mode is that language models hallucinate line numbers. A wrong line number in a hunk header causes silent misapplication or a failed patch with an opaque error message. Both outcomes require recovery steps that consume context and add new failure opportunities.

Aider’s search/replace blocks sidestep line numbers entirely:

<<<<<<< SEARCH
  const expiry = new Date(Date.now() + 3600);
=======
  const expiry = new Date(Date.now() + 3600 * 1000);
>>>>>>> REPLACE

The match is content-based. No line number can be wrong. Aider’s edit format benchmarks show this format outperforms unified diffs consistently across GPT-4 and Claude models on actual editing tasks. The residual failure mode is fuzzy matching: when the exact string isn’t found, difflib.SequenceMatcher finds the closest match, which can silently land in the wrong location if the pattern appears in similar form elsewhere in the file.

Claude Code’s str_replace_editor uses an old_string/new_string JSON approach with a hard uniqueness requirement. If old_string appears more than once, the call is rejected with an explicit error. If it isn’t found at all, the error includes the current content of the surrounding lines so the model can self-correct in the same turn. There is no fuzzy fallback path. A single-character mismatch from a stale memory of the file produces a clear, correctable error rather than a silent wrong edit. The trade-off is that Aider’s fuzzy fallback recovers from some classes of model imprecision that Claude Code’s approach rejects outright.

The read-before-write constraint connects these formats. Files change during a session as earlier edits land. An old_string built from a file read five turns ago may not match the current state. The behavioral instruction embedded in the tool schema enforces the read as a precondition, not as a suggestion.

A fourth approach, Cursor’s Instant Apply, separates concerns entirely: the primary reasoning model describes the intended change at a high level, and a separate smaller model generates the actual file edit. This keeps the reasoning model focused on logic rather than code mechanics, at the cost of coordination overhead and the latency of two sequential model calls per edit.

Context window economics constrain everything

A coding agent’s working memory is the message history, and that history fills faster than the token limits suggest. A production system prompt runs 5,000 to 10,000 tokens. Tool definitions add more. A single 1,000-line Python file read as a tool result is 4,000 to 6,000 tokens. Ten files brings the total to 40,000 to 60,000 tokens before the first edit. Hard SWE-bench instances routinely approach 200k-token limits.

This matters for reliability because of what the 2023 Stanford/UC Berkeley lost-in-the-middle study documented: models perform measurably worse on information positioned in the middle of long contexts compared to information at the beginning or end. Constraints introduced early in a session are more reliably followed than constraints introduced mid-session and then buried under turns of tool call results. This is why CLAUDE.md and system prompts load at session start, and why hard constraints that need to survive long sessions belong in scaffolding hooks rather than in natural language instructions that get diluted as context grows.

Context management strategies each carry reliability trade-offs. Simple truncation drops old messages; it’s cheap, but the model retries approaches it has already failed and abandoned. Summarization (Claude Code’s compaction) restarts the session with an LLM-generated summary, preserving the gist of prior work but not the exact wording of mid-session instructions. Plan-first approaches, like Copilot Workspace, generate a structured plan before executing, loading only plan-relevant files and avoiding context bloat from exploration, at the cost of the flexibility that an observe-act loop provides.

Error responses are reliability engineering

An agent that receives a vague error message must either retry blindly or spend additional tool calls diagnosing what went wrong. Both paths add steps, and each step is another opportunity for failure. Structured error responses that identify the specific problem and show the current state of the relevant content allow single-turn self-correction:

{
  "ok": false,
  "error_type": "not_found",
  "details": {
    "searched_for": "const expiry = new Date(Date.now() + 3600);",
    "current_lines_40_45": "  const expiry = new Date(Date.now() + 3600000);\n  return { userId, expiry };"
  }
}

Research on agentic tool use found that over half of tool-use failures stem from malformed arguments or incorrect sequencing, not from calling the wrong tool. The model knew what action to take; it failed at forming a valid argument or ordering dependent calls correctly. An error message that surfaces the current file state allows the model to restate old_string correctly without issuing an additional read call. The error response itself is a reliability mechanism, not just a diagnostic.

What the SWE-bench numbers actually show

Early GPT-4 baselines with minimal scaffolding resolved 2 to 4 percent of SWE-bench instances. The first SWE-agent implementation brought this to roughly 12 to 14 percent. Current state-of-the-art with carefully engineered scaffolding sits at 50 to 70 percent on the Verified subset. The underlying models improved substantially over this period, but the gap between naive scaffolding and engineered scaffolding on the same model exceeds 30 percentage points.

That gap is the compound reliability effect made visible. It doesn’t come from any single component. It accumulates across tool schema precision, edit format failure modes, context management strategy, error response quality, and dozens of smaller decisions about how tool output is truncated, formatted, and fed back. Each component that moves per-step reliability by one or two percentage points looks modest in isolation. Multiplied across 50 steps, the difference is between an agent that finishes tasks reliably and one that mostly doesn’t.

The survey of components is useful. Understanding why each component is shaped the way it is requires thinking about what it is trying to prevent, and how often those preventable failures would otherwise occur.