
The Context Window Is the Architecture: How Coding Agents Actually Work

Source: simonwillison

Simon Willison published a thorough guide to agentic engineering patterns that is worth reading in full. But after going through it and looking at the implementation decisions across Claude Code, SWE-agent, OpenHands, and the OpenAI Agents SDK, one thing stands out as the load-bearing idea behind all the others: the context window is not just a technical constraint, it is the architecture. Everything else is a response to it.

The Basic Loop

A coding agent runs a variant of the ReAct pattern (Yao et al., 2022). The model receives a task, a system prompt, and a list of available tool definitions. It either calls a tool or produces a final response. If it calls a tool, the runtime executes that tool and appends the result to the conversation. The model then receives the updated conversation and continues. Repeat until the model outputs a final answer or a token budget is exceeded.

A stripped-down implementation using the Anthropic SDK looks roughly like this:

import anthropic

client = anthropic.Anthropic()
messages = [{"role": "user", "content": user_request}]

while True:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=8192,
        tools=tools,
        messages=messages
    )
    messages.append({"role": "assistant", "content": response.content})
    # The model signals completion by stopping without a tool call
    if response.stop_reason == "end_turn":
        break
    # Execute each requested tool and feed the observations back
    tool_results = []
    for block in response.content:
        if block.type == "tool_use":
            result = dispatch_tool(block.name, block.input)
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": str(result)
            })
    messages.append({"role": "user", "content": tool_results})

The loop is serial by default because each step depends on the previous observation. Wall time is dominated by tool execution latency and network round-trips, not token generation. That said, models like Claude support parallel tool calls in a single turn, which lets an agent read three files in one step rather than three sequential steps, reducing both latency and the number of turns separating related information in the context.
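When the model does emit several tool calls in one turn, the runtime can execute them concurrently instead of looping over them serially. A minimal sketch, reusing the `dispatch_tool` stand-in from the loop above:

```python
from concurrent.futures import ThreadPoolExecutor

def run_tool_calls(blocks, dispatch_tool):
    """Execute all tool_use blocks from a single assistant turn concurrently.

    `blocks` is the assistant message content; `dispatch_tool` is the
    runtime's tool dispatcher (a placeholder name, as in the loop above).
    """
    calls = [b for b in blocks if b.type == "tool_use"]
    with ThreadPoolExecutor(max_workers=len(calls) or 1) as pool:
        # Threads are enough here: tool calls are I/O-bound (disk, network)
        results = list(pool.map(lambda b: dispatch_tool(b.name, b.input), calls))
    return [
        {"type": "tool_result", "tool_use_id": b.id, "content": str(r)}
        for b, r in zip(calls, results)
    ]
```

The results still return to the model as one batch of tool_result blocks, so the conversation shape is unchanged; only the wall time shrinks.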

How the Context Fills Up

Consider a modest bug-fixing session. The agent reads two TypeScript files, runs a grep, applies an edit, and runs the test suite:

[system prompt: ~3,000 tokens]
[user: "Fix the bug in the auth module"]
[assistant: calls Read on auth/session.ts]
[tool result: 400 lines TypeScript, ~4,000 tokens]
[assistant: calls Read on auth/middleware.ts]
[tool result: 200 lines, ~2,000 tokens]
[assistant: calls Grep for 'expireSession']
[tool result: 12 matches across 5 files, ~150 tokens]
[assistant: calls Edit on auth/session.ts]
[tool result: edit confirmation, ~50 tokens]
[assistant: calls Bash to run tests]
[tool result: test output, ~500 tokens]
Total: roughly 10,000 tokens for a focused fix

A complex task touching eight files with multiple test cycles runs 25,000 to 40,000 tokens. Reading 20 files averaging 300 lines each consumes around 60,000 tokens in file content alone, before counting the system prompt, tool calls, shell output, and the model’s own reasoning. Claude 3.7 Sonnet supports a 200k token context (GPT-4o tops out at 128k), but even 200k is not a free pass: it is still finite, and the “lost in the middle” research from Stanford and UC Berkeley (Liu et al., 2023) shows that LLM recall degrades significantly for content positioned in the middle of long contexts. In practice, agents re-fetch files they already read several turns ago.
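Because of this, a runtime needs at least a crude running estimate of how full the window is, so it can compact or summarize before hitting the limit. A minimal sketch, assuming the common ~4-characters-per-token heuristic (real counts come from the provider's tokenizer):

```python
def estimate_tokens(messages, chars_per_token=4):
    """Rough context-size estimate using a ~4 chars/token heuristic."""
    total_chars = 0
    for m in messages:
        content = m["content"]
        if isinstance(content, str):
            total_chars += len(content)
        else:
            # List of content blocks (text, tool_use, tool_result, ...)
            for block in content:
                total_chars += len(str(block))
    return total_chars // chars_per_token

def over_budget(messages, limit=200_000, headroom=0.8):
    """True when the conversation is close enough to the window limit
    that the agent should compact or summarize before continuing."""
    return estimate_tokens(messages) > limit * headroom
```

The `headroom` margin exists because the next tool result can be large and arrives before the runtime gets another chance to intervene.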

File Editing Is a Context Problem

Three approaches exist for file modification in coding agents, and the trade-offs are entirely about context efficiency.

Full file replacement is the simplest implementation. The agent reads a file, reasons about changes, outputs the entire new file. It works fine for small files and is trivial to implement. For anything over a few hundred lines, the token cost is prohibitive.

Search-and-replace, with an old_string and new_string, is what Claude Code uses. The tool takes a unique string from the current file and the replacement. It fails loudly when the match is ambiguous, which forces the agent to include enough surrounding context to identify the location uniquely. This is a useful forcing function. It is compact and predictable, and when it fails it fails in a diagnosable way.
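The contract is small enough to sketch in a few lines. This is an illustration of the fail-loudly behavior described above, not Claude Code's actual implementation:

```python
def edit_file(path, old_string, new_string):
    """Search-and-replace edit: the match must be unique, otherwise fail
    with an error the model can act on by adding surrounding context."""
    with open(path) as f:
        text = f.read()
    count = text.count(old_string)
    if count == 0:
        raise ValueError(f"old_string not found in {path}")
    if count > 1:
        raise ValueError(
            f"old_string matches {count} locations in {path}; "
            "include more surrounding lines to make it unique"
        )
    with open(path, "w") as f:
        f.write(text.replace(old_string, new_string, 1))
```

Both error messages are written for the model, not a human: they state exactly what to change about the next attempt.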

Unified diff application is the most compact format, but LLMs generate syntactically incorrect diffs with some regularity: off-by-one line numbers, whitespace mismatches, context lines that no longer match after earlier edits in the same session. Agents using this approach need retry logic or AST-aware validation to be reliable.
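The validation such agents need amounts to checking, before touching the file, that a hunk's expected lines still match reality. A sketch of that check, with hypothetical argument names:

```python
def apply_hunk(lines, start, old_lines, new_lines):
    """Apply one diff hunk only if its context still matches.

    The off-by-one and stale-context failures described above surface
    here as a clean rejection instead of a silently corrupted file.
    """
    if lines[start:start + len(old_lines)] != old_lines:
        raise ValueError(f"hunk context mismatch at line {start + 1}")
    return lines[:start] + new_lines + lines[start + len(old_lines):]
```

On a mismatch the agent can re-read the file and regenerate the diff, which is the retry loop the paragraph above refers to.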

The search-and-replace approach dominates in production agents because it balances compactness with failure modes that are both detectable and recoverable.

Shell Access Changes the Problem

The most important capability decision for a coding agent is whether it has shell access. SWE-agent from Princeton NLP made shell access central to its design: the agent clones a repository, reads tracebacks, runs failing tests, modifies source, re-runs tests. SWE-bench results consistently show that agents with test execution capability outperform file-modification-only agents by a significant margin. The edit-run-observe-fix loop requires execution.

The risk depends entirely on the execution environment. An agent running in an isolated container with a cloned repo and no credentials is a different risk category from an agent running in a production environment with network access. Serious deployments run agents in sandboxed VMs. E2B is a common choice for microVM-based code execution. OpenHands lets operators configure sandbox policies explicitly.

From the model’s perspective, having a Bash tool also changes how it reasons about problems. An agent without shell access approaches a codebase as a text manipulation problem. The same agent with shell access starts thinking about running linters, type checkers, and tests as part of its verification loop. The tool set shapes the problem-solving approach.
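At the tool-dispatch level, two cheap guardrails are worth having regardless of the outer sandbox: a wall-clock timeout, and output truncation so one verbose test run cannot flood the context window. A sketch (real isolation belongs at the VM or container layer, as above, not here):

```python
import subprocess

def bash_tool(command, workdir, timeout=60, max_output=10_000):
    """Run a shell command for the agent with a timeout and output cap."""
    try:
        proc = subprocess.run(
            command, shell=True, cwd=workdir, timeout=timeout,
            capture_output=True, text=True,
        )
    except subprocess.TimeoutExpired:
        return f"command timed out after {timeout}s"
    out = proc.stdout + proc.stderr
    if len(out) > max_output:
        # Keep the head; tell the model how much it is not seeing
        out = out[:max_output] + f"\n[truncated {len(out) - max_output} chars]"
    return f"exit code {proc.returncode}\n{out}"
```

Returning the exit code explicitly matters: it is the signal the model uses to decide whether the edit-run-observe loop should continue.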

Tool Descriptions as Interface Contracts

The tool description string is the interface contract between the agent and the tool. A weak description produces weak usage. The difference between these two descriptions for a security review tool is not cosmetic:

Weak:
"Review code for security issues."

Strong:
"Review a provided code block or relative file path for security
vulnerabilities, focusing on OWASP Top 10: SQL injection, XSS, SSRF,
insecure authentication, and broken access control. Returns structured
findings list with severity (critical/high/medium/low), vulnerability
type, and line number. Call this after implementing any feature that
accepts user input, handles authentication, or accesses persistent storage."

The strong description tells the model when to call the tool, what to pass to it, and what to expect back. The weak description leaves all of that inference to the model, which may infer incorrectly or inconsistently. Anthropic’s Model Context Protocol is an open standard for tool definitions that lets tool servers be written once and consumed across different agent runtimes, which makes getting descriptions right a more tractable problem.
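In the Anthropic SDK's tool format, the strong description becomes part of a definition passed in the `tools` list. The `security_review` name and `target` parameter below are illustrative, not from any real tool:

```python
security_review_tool = {
    "name": "security_review",
    "description": (
        "Review a provided code block or relative file path for security "
        "vulnerabilities, focusing on OWASP Top 10: SQL injection, XSS, SSRF, "
        "insecure authentication, and broken access control. Returns structured "
        "findings with severity (critical/high/medium/low), vulnerability type, "
        "and line number. Call this after implementing any feature that accepts "
        "user input, handles authentication, or accesses persistent storage."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "target": {
                "type": "string",
                "description": "Code block or relative file path to review",
            },
        },
        "required": ["target"],
    },
}
```

Note that the parameter description carries the same weight as the tool description: it is the only documentation the model sees for what `target` accepts.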

Subagents and Context Isolation

When a task grows large enough to risk context exhaustion, the architectural response is to spawn a subagent. Claude Code implements this through a Task tool: the parent agent passes a task description, a new agent instance runs in a completely separate context window, completes its work, and returns only its final output to the parent as a tool result. The subagent’s intermediate steps (file reads, test runs, and edits) never touch the parent’s context.
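The isolation property is visible in code even without an API call. In this sketch, `run_loop` stands in for the agent loop shown earlier, and the `tool_use_id` is a hypothetical placeholder:

```python
def run_subagent(task_description, run_loop):
    """Spawn a subagent with a fresh context: the child receives only the
    task description, accumulates its own message history, and only its
    final text survives. The intermediate history is discarded here."""
    child_messages = [{"role": "user", "content": task_description}]
    return run_loop(child_messages)

def task_tool(parent_messages, task_description, run_loop):
    """What the parent appends: one tool result, not the child's transcript."""
    result = run_subagent(task_description, run_loop)
    parent_messages.append({"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "task-1", "content": result}
    ]})
    return parent_messages
```

However many tokens the child burned, the parent pays for exactly one tool result.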

The OpenAI Agents SDK, released in March 2025, formalizes this pattern: Agent objects can be converted to callable tools via as_tool(), which lets orchestrators treat subagents as black boxes with defined input and output contracts.

The economics of error compounding matter here. If each agent step in a pipeline has a 90% success rate, a five-step pipeline has roughly a 59% end-to-end success rate. Adding subagent layers does not reduce individual step error rates. This means multi-agent patterns should be applied to tasks that are large enough to genuinely require them, with subtasks that have clear success criteria that can be verified programmatically.
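The 59% figure is just independent step reliabilities multiplied out:

```python
def pipeline_success(step_rate, steps):
    """End-to-end success when every step must succeed independently."""
    return step_rate ** steps

# 0.9 ** 5 ≈ 0.59: five 90%-reliable steps leave roughly 59% end to end
```

The same formula shows why verification helps: raising each step to 99% brings five steps back to about 95% end to end.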

There is also a security dimension. In a multi-agent system, an orchestrator implicitly delegates trust to subagents. A crafted comment in a source file read by a subagent can propagate instructions upward through the agent tree. The deeper the agent hierarchy, the more nodes are reachable from a single injected payload. The practical response is the minimal footprint principle: each agent should have only the permissions required for its specific task, not a copy of the orchestrator’s full working state. This is the same principle as OS process privilege separation; it is just less consistently enforced in current agent frameworks.

Where This Leaves You

Building a coding agent that works reliably is a context management problem. The tool loop itself is straightforward. The hard parts are: keeping the context from filling with information the model no longer needs, choosing file editing tools that fail loudly rather than silently, scoping shell access to the actual execution environment risk, writing tool descriptions that produce consistent behavior, and deciding when to isolate work into a subagent versus keeping it in the main context.

The agents that perform well on SWE-bench are not the ones with the most sophisticated reasoning. They are the ones that manage the edit-run-observe cycle with the fewest unnecessary tokens, the clearest tool contracts, and the most explicit handling of failure cases. The context window sets the budget; everything else is about spending it well.
