A coding agent is, at its core, a loop: a language model at the center, a set of tools it can call around it, and a conversation history that grows with every action and observation. The model reads the current state of the conversation, decides what to do next, calls a tool, gets the result back, and continues. The loop is easy to understand; the hard part is managing what goes into each iteration and what persists across turns.
Simon Willison’s guide to how coding agents work covers this loop well. Getting the loop structure right is the starting point; what dominates the engineering effort from there is managing context.
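Stripped to its shape, the loop fits in a few lines. Everything here is a placeholder: `call_model`, `fake_model`, and the `read_file` lambda stand in for a real LLM API and real tools, not any particular vendor's interface.

```python
def run_agent(task, call_model, tools, max_turns=20):
    """Drive the model-tool loop until the model stops calling tools."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = call_model(history)               # model reads the whole history
        history.append({"role": "assistant", "content": reply})
        if reply.get("tool") is None:             # no tool call: the model is done
            return reply["text"], history
        result = tools[reply["tool"]](**reply["args"])  # execute on its behalf
        history.append({"role": "tool_result", "content": result})
    return None, history

# Tiny fake model: asks to read one file, then answers.
def fake_model(history):
    if not any(m["role"] == "tool_result" for m in history):
        return {"tool": "read_file", "args": {"path": "notes.txt"}}
    return {"tool": None, "text": "done"}

answer, transcript = run_agent(
    "Summarize notes.txt",
    fake_model,
    {"read_file": lambda path: f"<contents of {path}>"},
)
```

Everything the rest of this piece discusses happens inside that `history` list: what gets appended, what gets trimmed, and when the loop pauses to ask a human.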
Tool Definitions
At the API level, tools are defined as JSON Schema objects. When you give a model a read_file capability, you send something like this alongside your API request:
{
  "name": "read_file",
  "description": "Read the contents of a file at the given path",
  "input_schema": {
    "type": "object",
    "properties": {
      "path": {
        "type": "string",
        "description": "Absolute or relative path to the file"
      }
    },
    "required": ["path"]
  }
}
The model returns a tool use block containing the arguments it chose. Your code executes the tool and sends the result back as a tool_result block in the next turn. The conversation history accumulates these tool calls and results alongside assistant text, growing turn by turn.
This is how both Anthropic’s tool use API and OpenAI’s function calling work. The model has no direct access to your filesystem; it asks your code to do things on its behalf. That boundary matters for understanding where execution risk lives and what you can safely expose without sandboxing.
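A single round trip can be sketched as follows. The block shapes loosely follow Anthropic's `tool_use` / `tool_result` structure, but the id and the `execute` helper are illustrative, not the actual client library.

```python
# What the model returns when it decides to call a tool.
tool_use = {
    "type": "tool_use",
    "id": "toolu_01",                 # made-up id, for illustration
    "name": "read_file",
    "input": {"path": "src/app.py"},
}

def execute(block, tools):
    """Run the requested tool and wrap the output as a tool_result block."""
    output = tools[block["name"]](**block["input"])
    return {
        "type": "tool_result",
        "tool_use_id": block["id"],   # ties the result back to the call
        "content": output,
    }

result = execute(tool_use, {"read_file": lambda path: f"# contents of {path}"})
```

The `tool_use_id` linkage is what lets the model match each result to the call it made, which matters once several calls are in flight in the same turn.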
A minimal coding agent needs three classes of tool: file reading, file writing, and shell execution. Everything else is optimization. The question is how well the model manages its own context as it uses them across a multi-step task.
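A sketch of that minimal toolset, with none of the sandboxing, path restrictions, or output limits a production agent would add:

```python
import subprocess
from pathlib import Path

def read_file(path: str) -> str:
    return Path(path).read_text()

def write_file(path: str, content: str) -> str:
    Path(path).write_text(content)
    return f"wrote {len(content)} bytes to {path}"

def run_shell(command: str) -> str:
    # Raw text output; a structured variant is sketched later.
    proc = subprocess.run(command, shell=True, capture_output=True, text=True)
    return proc.stdout + proc.stderr

TOOLS = {"read_file": read_file, "write_file": write_file, "run_shell": run_shell}
```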
The Context Window Is the Real Constraint
Every tool call and its result gets appended to the conversation. Read twenty files while tracing a bug, and those twenty file contents are now consuming tokens in context. Add bash output from the test suite, the initial error message, the task description, and a detailed system prompt, and you are burning through context budget quickly.
Current frontier models have generous limits. Claude 3.7 Sonnet offers a 200,000-token context window, enough to hold hundreds of files' worth of content. But filling it creates distinct problems. Long contexts increase inference cost and latency, and there is evidence that attention mechanisms become less precise as context grows, particularly for information buried in the middle of a long conversation. This “lost in the middle” problem has been documented in long-context retrieval research, and it is a practical concern for any agent that accumulates a lot of tool output over many turns.
Keeping tool outputs minimal matters for reasons beyond cost. Lean context keeps the model’s attention on what is relevant. A tool result that is three lines instead of three hundred is not just cheaper to process; it is more useful.
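One common mitigation is to clip tool output before it enters the conversation, and to say so explicitly so the model knows its view is partial. A minimal sketch, with arbitrary limits:

```python
def clip_output(text: str, max_lines: int = 50, max_chars: int = 4000) -> str:
    """Truncate a tool result before appending it to the conversation,
    noting how much was cut so the model knows the view is partial."""
    lines = text.splitlines()
    if len(lines) > max_lines:
        omitted = len(lines) - max_lines
        lines = lines[:max_lines] + [f"... [{omitted} more lines omitted]"]
    clipped = "\n".join(lines)
    if len(clipped) > max_chars:
        clipped = clipped[:max_chars] + "\n... [truncated]"
    return clipped
```

The explicit marker matters: a silently truncated result invites the model to treat a partial view as the whole.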
Strategies Agents Use
This is where implementations diverge, but several patterns appear consistently across well-designed coding agents.
Selective loading. Search before reading. Use a grep tool to find the relevant symbol or function, then read only the file and surrounding lines that match. The ReAct pattern, from Yao et al. (2022), formalizes this sequence: reason about what information you need, act to retrieve it, observe the result, reason again. Good coding agents encourage this implicitly through their tool design and system prompts.
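A grep-then-read helper might look like the following sketch; `grep_then_read` and its defaults are illustrative, not any agent's actual search tool:

```python
import re
from pathlib import Path

def grep_then_read(root: str, pattern: str, context: int = 2):
    """Search first, then load only matching lines plus a little context,
    instead of reading whole files into the conversation."""
    hits = []
    for path in Path(root).rglob("*.py"):
        lines = path.read_text().splitlines()
        for i, line in enumerate(lines):
            if re.search(pattern, line):
                lo, hi = max(0, i - context), min(len(lines), i + context + 1)
                hits.append((str(path), i + 1, "\n".join(lines[lo:hi])))
    return hits
```

A hit here costs a handful of lines of context rather than an entire file, which is exactly the trade selective loading is making.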
Summarization. Some agents compress earlier conversation turns once that content has been processed, replacing raw tool output with a short summary. This keeps the working context small at the cost of some fidelity. For longer tasks, where early context is no longer relevant to the current step, it is usually worth the trade-off.
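Compaction can be sketched like this; in a real agent the `summarize` callback would itself be a model call, and the thresholds here are arbitrary:

```python
def compact_history(history, keep_recent=4, summarize=None):
    """Replace old, bulky tool_result turns with short summaries,
    keeping the most recent turns verbatim."""
    summarize = summarize or (lambda text: text[:80] + "...")  # stand-in
    compacted = []
    for i, turn in enumerate(history):
        is_old = i < len(history) - keep_recent
        if is_old and turn["role"] == "tool_result" and len(turn["content"]) > 80:
            compacted.append({"role": "tool_result",
                              "content": "[summary] " + summarize(turn["content"])})
        else:
            compacted.append(turn)
    return compacted
```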
Front-loaded project knowledge. Claude Code reads a CLAUDE.md file at the start of every session, injecting architectural notes, conventions, and constraints into the system prompt before any tool calls happen. This reduces the need to re-discover the same project facts through tool calls in every session. It offloads context discovery into a document the human maintains, rather than a process the agent repeats on each invocation.
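The mechanism is simple to sketch. `build_system_prompt` and the header format below are illustrative, not Claude Code's actual injection format:

```python
from pathlib import Path

def build_system_prompt(base_prompt: str, project_root: str,
                        notes_file: str = "CLAUDE.md") -> str:
    """Prepend human-maintained project notes to the system prompt
    before the session's first tool call."""
    notes_path = Path(project_root) / notes_file
    if not notes_path.exists():
        return base_prompt
    return f"{base_prompt}\n\n# Project notes ({notes_file})\n{notes_path.read_text()}"
```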
Parallel tool calls. Claude 3 and later models support calling multiple tools in a single turn. A coding agent can read five relevant files simultaneously rather than sequentially, cutting latency and turn count. On large codebases where the agent needs to gather context from several locations before making any changes, this is a meaningful efficiency gain.
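On the agent side, handling a multi-tool turn means executing every `tool_use` block and returning the results in the order the model issued them. A sketch using a thread pool (the block shapes are the same illustrative ones as earlier):

```python
from concurrent.futures import ThreadPoolExecutor

def run_tool_calls(blocks, tools):
    """Execute every tool_use block from one model turn concurrently,
    preserving the order the model issued them."""
    def one(block):
        output = tools[block["name"]](**block["input"])
        return {"type": "tool_result",
                "tool_use_id": block["id"],
                "content": output}
    with ThreadPoolExecutor() as pool:
        return list(pool.map(one, blocks))     # map preserves input order
```

Threads suit this because tool calls are mostly I/O: file reads and subprocesses release the interpreter while they wait.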
When the Agent Pauses
Autonomous operation and human oversight are in tension, and where you draw the line depends on the risk profile of each step.
Reading files carries low risk. The agent can proceed freely. Writing to a file that has been stable for months, or running a shell command that touches configuration, carries more. The model might have the right general direction but a specific wrong assumption that only surfaces two steps later, after more changes have been built on top of it.
Well-designed agents surface this distinction structurally. Shell commands prompt for confirmation by default. Writes to certain paths require explicit approval. The goal is informed oversight, keeping the human aware enough to catch misunderstandings before they compound.
The calibration problem is that too many interruptions turn a labor-saving tool into a click-through wizard. Too few, and the model makes individually plausible decisions that are collectively wrong. Getting this right requires knowing your codebase’s risk surface well enough to configure the agent’s interrupt conditions precisely.
How the Loop Fails
Coding agents fail in predictable ways. The most common: the model misidentifies the root cause of a problem and proceeds through a sequence of plausible actions from that wrong premise. Each step looks locally reasonable, but the direction is wrong. This is the hardest failure to catch because nothing appears broken at any individual step.
The second: context saturation. As the conversation grows long, the model loses track of earlier constraints. It might reintroduce a bug it fixed three tool calls ago, or forget a style convention established in the system prompt. This is a direct consequence of attention degradation at long context lengths.
A third mode is confabulation about tool behavior. The model might run a shell command that would work in a specific environment but fails silently in the current one, then interpret the empty output as success and proceed on a false assumption. Structured tool outputs with explicit success and error fields are more reliable than tools that return raw text the model has to interpret.
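A shell tool with explicit status fields might look like this sketch; the field names are illustrative:

```python
import subprocess

def run_shell_structured(command: str, timeout: int = 30) -> dict:
    """Return an explicit status instead of raw text, so empty output
    cannot be mistaken for success."""
    try:
        proc = subprocess.run(command, shell=True, capture_output=True,
                              text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return {"ok": False, "error": f"timed out after {timeout}s"}
    return {
        "ok": proc.returncode == 0,
        "exit_code": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }
```

With this shape, a command that produces no output but exits nonzero still arrives as `"ok": False`, and the model has something concrete to reason about instead of an empty string.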
The Differentiation Problem
The core loop is standardized across the major coding agents. Claude Code, Cursor, GitHub Copilot Workspace, and Devin all run the same fundamental pattern. The differentiation is in context management, interrupt design, and tool richness.
The trend is toward more specialized tools: AST-aware code search, integrated test runners with structured output, dependency management, and direct hooks into version control workflows. Each new tool extends what the loop can accomplish while adding more potential output to manage in context. The two pressures are in permanent tension.
Understanding the mechanics makes these tools easier to use well. When an agent drifts in the wrong direction, the cause is almost always in its context: something missing, something incorrect, or too much noise competing with the relevant signal. The fix is usually a restart with a cleaner, more constrained context rather than repeating the same instruction. The loop is a reliable foundation; what determines the quality of the output is the quality of the context going into it.