Most conversations about improving coding agents focus on the model: better reasoning, longer context windows, more training data on code. When an agent produces a wrong edit or spirals into a retry loop, the natural reaction is to blame the model’s capabilities. In most cases the actual cause is simpler. The model had the wrong information, or the right information was no longer in an effective attention range when it was needed.
Simon Willison’s guide to agentic engineering patterns describes the basic architecture that makes this dynamic visible: the tool loop, where every tool call and result accumulates in the context window, and the model reasons over that accumulated state on each turn. The engineering discipline that makes agents reliable is almost entirely about ensuring the model has accurate, relevant information at the moment it needs to act.
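In miniature, that loop can be sketched as follows. The dict-based message format and the `call_model` and `tools` stand-ins are illustrative, not any particular vendor's API; the point is that everything the model does and observes is appended to one growing context:

```python
def tool_loop(call_model, tools, task, max_turns=20):
    """The basic agentic loop: every tool call and its result are
    appended to the context, and the model reasons over the whole
    accumulated state on each turn."""
    context = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        action = call_model(context)  # a dict: {"tool": ..., "args": ..., "content": ...}
        context.append({"role": "assistant", "content": action["content"]})
        if action.get("tool") is None:
            return action["content"], context  # no tool requested: final answer
        result = tools[action["tool"]](**action.get("args", {}))
        context.append({"role": "tool", "content": result})
    return None, context
```

Everything that follows in this piece is about what happens to the information inside that accumulating `context` list as it grows.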
What Actually Fails
The edit failure mode that appears most frequently in production coding agents follows a specific pattern. The model generates an old_string from its training-time representation of what a file looks like, rather than from a fresh read of what the file currently contains. The string does not match. The edit fails. The model retries with another string drawn from memory. That fails too. The loop continues until the context fills with failed attempts or the model gives up.
The solution is not a smarter model; it is a better error message combined with a convention that the model reads a file before editing it. Claude Code’s Edit tool, when it cannot find a matching string, returns the actual current contents of the relevant file section:
```
Error: old_string not found in src/auth/session.ts.
Current lines 40-45:
const expiry = new Date(Date.now() + 3600000);
return { userId, expiry };
```
This gives the model accurate current state, which is all it needed. The model could write a correct edit once it had that information; the context it was working from was simply stale. The same pattern appears at larger scales: an agent working on a multi-file refactor often fails at step seven because a file was already modified at step four, and the model’s representation of it reflects the state at the beginning of the session.
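A minimal sketch of this error-message pattern in Python. The function name, the whitespace-token-overlap heuristic for locating the nearest region, and the message format are all illustrative assumptions; Claude Code's actual matching logic is not public:

```python
from pathlib import Path

def apply_edit(path: str, old_string: str, new_string: str,
               context_lines: int = 3) -> str:
    """Apply an exact-match edit, or echo back the current contents of
    the nearest region so the model can retry from fresh state."""
    text = Path(path).read_text()
    if old_string in text:
        Path(path).write_text(text.replace(old_string, new_string, 1))
        return f"OK: edit applied to {path}"
    # No exact match: find the line that best resembles the stale edit
    # target, by whitespace-token overlap, and return the surrounding
    # current lines instead of a bare failure.
    lines = text.splitlines()
    target_words = set(old_string.splitlines()[0].split())
    best = max(range(len(lines)),
               key=lambda i: len(set(lines[i].split()) & target_words))
    lo = max(0, best - context_lines)
    hi = min(len(lines), best + context_lines + 1)
    snippet = "\n".join(f"{i + 1}: {lines[i]}" for i in range(lo, hi))
    return (f"Error: old_string not found in {path}.\n"
            f"Current lines {lo + 1}-{hi}:\n{snippet}")
```

The retry loop breaks because the failure response itself delivers the fresh read the model should have performed in the first place.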
The “Lost in the Middle” Problem Is Structural
The Liu et al. paper from 2023 documented a consistent finding across transformer models: recall for content positioned in the middle of long contexts degrades substantially, even when the content fits within the nominal window limit. Information at the beginning and end of the context is recalled reliably; information inserted in the middle is not.
For coding agents, this creates a concrete problem. An architecture constraint mentioned in message three of a forty-message session, or a file read in the sixth tool call of a twenty-tool-call run, sits in the worst possible attention position by the time the model needs to apply it. The model has the information; retrieving it from mid-context positions is unreliable.
Agents re-read files they retrieved several turns earlier because the earlier read has drifted below effective attention range. This looks like inefficiency, but it reveals the underlying structural problem: context accumulates, attention degrades, and information placed in the middle of a long session becomes increasingly unreliable as the session grows.
What Aider’s Repo Map Gets Right
Aider’s repository map is a direct engineering response to the position problem. Before the model sees the user’s task, Aider builds a symbol-level index of the entire codebase using tree-sitter: every function, class, method, and variable, with file paths and small surrounding context. This map is injected before the task, at the beginning of the context where attention is highest.
A representative slice looks something like:
```
src/auth/session.ts:
  class SessionManager:
    constructor(config: SessionConfig)
    createSession(userId: string): Promise<Session>
    expireSession(sessionId: string): void
src/middleware/auth.ts:
  function authMiddleware(req, res, next)
  function validateToken(token: string): boolean
```
The model forms a plan informed by actual codebase structure. When it needs to edit a specific file, it reads that file fresh. The repo map costs 5,000 to 15,000 tokens for a medium-sized project. The alternative, discovering file structure reactively through tool calls scattered across the session, pays similar token costs while placing structural information in lower-attention positions.
The bet Aider makes is that paying upfront for position is worth it. Codebase structure injected at position zero, where the model attends reliably, is more effective than the same structure discovered piecemeal through the middle of the context. Aider’s architect mode takes this further by splitting reasoning and editing into separate models: a strong model handles planning with full structural context, while a cheaper model handles the mechanical edit application.
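A stripped-down version of the idea can be sketched with Python's stdlib `ast` module. Aider itself uses tree-sitter to cover many languages and ranks which symbols to include under a token budget; this simplified sketch handles only Python files and includes everything:

```python
import ast
from pathlib import Path

def repo_map(root: str) -> str:
    """Build a compact symbol index: each file path with its top-level
    classes, methods, and functions, for injection at position zero."""
    out = []
    for path in sorted(Path(root).rglob("*.py")):
        tree = ast.parse(path.read_text())
        entry = [f"{path.relative_to(root)}:"]
        for node in tree.body:
            if isinstance(node, ast.ClassDef):
                entry.append(f"  class {node.name}:")
                for item in node.body:
                    if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                        args = ", ".join(a.arg for a in item.args.args)
                        entry.append(f"    {item.name}({args})")
            elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                args = ", ".join(a.arg for a in node.args.args)
                entry.append(f"  def {node.name}({args})")
        if len(entry) > 1:  # skip files with no top-level symbols
            out.append("\n".join(entry))
    return "\n".join(out)
```

Signatures without bodies are the key economy: the model learns where everything lives for a fraction of the tokens that full file contents would cost.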
CLAUDE.md and the Position-Zero Principle
The convergence of CLAUDE.md, .cursorrules, and GitHub Copilot Workspace instructions on the same pattern is architecturally significant. All three inject stable project-level information at the start of the context, before any task content. They arrived at this independently, but the architecture drives them there: the opening of the context is the most reliably attended position. Stable constraints belong there because they need to survive across dozens of tool calls.
The context anchoring pattern described by Rahul Garg generalizes this into a practice: externalize decisions that need to persist into a document that is re-injected at position zero each session. The document travels alongside the conversation, gets updated as new decisions are made, and is re-read by the model at the point of highest attention.
The practical implication for CLAUDE.md design is that it should contain information that needs to survive a long session: build commands, architectural invariants, coding conventions that apply everywhere. Temporary or feature-specific information does not belong there, because including it adds token cost without providing the stability benefit that position zero gives to universal constraints.
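One way to make the position-zero principle concrete is a context assembler that always lays down the stable document and codebase structure first, then the task, and trims old mid-session history before anything else. The character-count budget here is a crude stand-in for real token counting, and all names are illustrative:

```python
def assemble_context(claude_md: str, structure: str, task: str,
                     history: list[str], budget: int = 100_000) -> str:
    """Stable constraints and structure at position zero, then the task,
    then as much recent history as the budget allows. Older history is
    dropped first: it is the least reliably attended and the most stale."""
    head = f"{claude_md}\n\n{structure}\n\nTask: {task}\n"
    remaining = budget - len(head)
    kept: list[str] = []
    for turn in reversed(history):  # keep the most recent turns first
        if remaining - len(turn) < 0:
            break
        kept.append(turn)
        remaining -= len(turn)
    return head + "\n".join(reversed(kept))
```

The ordering is the point: whatever gets cut, the universal constraints at the head of the context survive every trim.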
Subagent Isolation as Information Hygiene
Willison’s atom-everything pattern extends the position principle to multi-agent systems. Each subagent invocation receives everything it needs at call time and terminates cleanly. No accumulated context crosses the boundary; the orchestrator passes exactly the relevant information, not the full accumulated history.
This is not primarily about parallelism, though that is a benefit. A subagent receiving a clean, scoped context has all of its relevant information in the highest-attention positions, because its context starts fresh. The degradation that affects information mid-session does not apply to a subagent starting from scratch with precisely what it needs.
This imposes a discipline that doubles as a scoping exercise: the orchestrator must decide exactly what information a subagent needs and pass it explicitly. An orchestrator needing to pass fifteen pieces of context to a subagent is probably delegating something too large. A subagent that needs a diff, a security specification, and two related files is well-scoped. The information requirements make the task boundaries concrete.
Claude Code’s Task tool implements this at the product level: a spawned subagent runs in a completely separate context window and returns only its final output to the parent. The parent’s context accumulates compact results, not the full execution transcript of each subtask.
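The orchestrator side of this pattern can be sketched as an explicit, scoped payload. Here `call_model` is a stand-in for whatever model API is in use, and the payload shape is illustrative:

```python
from dataclasses import dataclass

@dataclass
class SubagentTask:
    """Everything a subagent receives: an explicit, scoped payload
    rather than the orchestrator's accumulated history."""
    instruction: str
    files: dict[str, str]  # path -> current contents, read fresh at call time

def run_subagent(task: SubagentTask, call_model) -> str:
    """Build a fresh context from the payload, run it, and return only
    the final output. The transcript never reaches the parent."""
    context = [task.instruction]
    for path, contents in task.files.items():
        context.append(f"--- {path} ---\n{contents}")
    return call_model("\n\n".join(context))
```

Because the subagent's context starts at the payload, every piece of information it receives sits in a high-attention position by construction.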
What the SWE-bench Data Shows
SWE-bench measures coding agents solving real GitHub issues against existing test suites. The original SWE-agent from Princeton NLP solved 12.5% of issues. Current top entries on SWE-bench Verified exceed 45%. The improvements correlate with scaffolding improvements, particularly test execution capability, rather than with model size or architecture changes alone.
The agents with the highest scores can observe the actual state of the codebase after each edit: run the tests, read the failure output, form an accurate picture of current state. Agents that modify files without observational feedback are working from a context that may not match reality by the time they finish. Test execution is, at its core, a mechanism for refreshing information about the actual current codebase state.
This is why adding test execution capability improves scores substantially across different model families. The underlying model capability is similar across these comparisons; what changes is the quality and currency of information available when the model makes each decision.
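The feedback mechanism reduces to a small edit-run-observe cycle. In this sketch `call_model`, `apply_edit`, and `run_tests` are all stand-ins (in practice `run_tests` would shell out to the project's actual test runner); the point is that each round's prompt contains the real, current failure output rather than the model's memory of it:

```python
def edit_with_feedback(call_model, apply_edit, run_tests,
                       task: str, max_rounds: int = 5) -> bool:
    """Edit-run-observe loop: after each edit, run the tests and feed
    the actual output back, so every decision is made against the
    current state of the codebase."""
    observation = ""
    for _ in range(max_rounds):
        edit = call_model(f"Task: {task}\n\nLatest test output:\n{observation}")
        apply_edit(edit)
        passed, output = run_tests()
        if passed:
            return True          # current state verified, not assumed
        observation = output     # refresh the model's picture, don't guess
    return False
```

Each iteration replaces a stale belief about the codebase with an observation of it, which is exactly the substitution the SWE-bench leaders have engineered into their scaffolds.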
Building for Information Availability
When a coding agent task fails, the first question worth asking is whether the model had accurate, current information at the relevant decision point. Edit failures, retry loops, and incorrect multi-file edits trace back almost universally to the model acting on stale or missing information rather than to reasoning failures that a stronger model would avoid.
The design choices that address this all target different aspects of the same underlying problem: error messages that return current file state, upfront repo maps, position-zero injection of stable constraints, and subagent isolation that starts each delegation fresh. Choosing between them is mostly a matter of where the information gap is most likely to appear in a given workflow.
A better model helps at the margins. A model working from current, well-positioned information outperforms a better model working from stale context. The scaffolding around the loop is where reliability is built.