· 8 min read ·

Context Is the Only State: The Real Engineering Behind Coding Agents

Source: martinfowler

A few weeks after Rahul Garg’s context engineering piece landed on Martin Fowler’s site in early February, the tooling landscape it described has already shifted again. That pace is worth noting, because the piece was itself a snapshot of a fast-moving target. What it captured correctly, and what has since become clearer, is that the bottleneck in coding agents is not model capability. It is the architecture of what the model can see at any given moment.

The phrase “context engineering” is a deliberate reframe from “prompt engineering,” which accumulated connotations of incantations and magic phrases. Context engineering is more structural. You are not choosing better words. You are deciding what information occupies the context window at inference time, where within that window it sits, and how it remains accurate as the session evolves. These are engineering decisions with measurable consequences.

The Constraint Everything Else Follows From

A coding agent is a loop. At its simplest:

while not done:
    response = llm.complete(messages=history, tools=tool_definitions)
    if response.stop_reason == "tool_use":
        result = execute_tool(response.tool_call)
        history.append(response.tool_call)
        history.append({"role": "tool", "content": result})
    else:
        done = True

The ReAct paper formalized this pattern in 2022. What it makes explicit is that there is no hidden state. No persistent memory between invocations, no background database the model can query. Everything the agent knows at any moment exists in the token stream passed to the next inference call. This is not a limitation that better models will eventually eliminate; it is a property of how transformer inference works.

The practical consequence is that every token in the context competes for the model’s attention. A system prompt for a production agent consumes somewhere between 5,000 and 10,000 tokens before any task-specific content appears. Tool definitions add more. Each tool call adds at least two entries to the history. A session that runs 40 or 50 turns in a complex codebase can accumulate 80,000 to 120,000 tokens, and the model’s attention is not uniform across all of them.

The Liu et al. “Lost in the Middle” paper (Stanford/UC Berkeley, 2023) established that transformer models attend reliably to content at the beginning and end of long contexts, while information placed in the middle degrades substantially in recall. This means a constraint you stated conversationally at turn five is sitting in an attention trough by turn forty. The model did not forget it in any catastrophic sense. It simply weighted it less.

The File Every Coding Tool Independently Invented

Given this constraint, every major coding assistant converged on the same structural solution: a project-level context file that gets injected at session start, where attention is highest. None of these were coordinated.

ToolContext File
Claude CodeCLAUDE.md
OpenAI CodexAGENTS.md
Cursor.cursorrules / .cursor/rules/
GitHub Copilot.github/copilot-instructions.md
Windsurf.windsurfrules

Independent convergence on the same pattern across competing products is reasonable evidence that the problem is real and the solution space is constrained. When Anthropic, OpenAI, Microsoft, and Anysphere all ship the same architectural feature without coordination, the architecture is probably correct.

What belongs in these files is less obvious than it first appears. The instinct is to describe the project: tech stack, directory structure, testing framework. That is useful, but it is not where the leverage is. The higher-value content is constraints that diverge from the model’s training defaults, with the rationale attached.

A model trained on the open internet has strong priors about how code is typically structured. If your project follows common conventions, the model will often infer the right thing without explicit instruction. Where it will go wrong is on deliberate divergences: the team that chose a non-standard ORM, the service that uses an unusual error handling pattern, the codebase where a seemingly obvious refactor was tried and reverted for reasons that are not visible in the code.

## Decisions (do not revisit without discussion)
- [2025-01-15] Chose Drizzle over Prisma: better TypeScript inference,
  no Rust binary dependency, easier local setup for contractors
- [2025-02-03] REST only, no GraphQL: team is small, GraphQL added
  schema maintenance overhead that outweighed flexibility benefits

## Constraints
- Never write directly to the database from route handlers.
  All database access goes through the repository layer in src/db/.
- Do not use console.log. The logger in src/lib/logger.ts routes to
  structured output that the log aggregation pipeline expects.

The rationale matters because the model generalizes from it. “Use the internal logger” tells the model what to do in the specific case. “Do not use console.log because it bypasses structured logging and breaks log aggregation” gives the model the principle, which it can then apply to adjacent cases you did not anticipate.

Keep these files short. Under 500 tokens is a reasonable target. Past that, you are either over-specifying things the model can infer, or you have accumulated cruft that should be pruned.

Advisory vs. Enforced Constraints

CLAUDE.md instructions are advisory. The model follows them most of the time, especially early in a session. Over long sessions, as the instruction drifts toward the middle of context, adherence degrades. For constraints where “most of the time” is not sufficient, you need enforcement at a different layer.

Claude Code’s hook system is what that looks like in practice. A PreToolUse hook runs before tool execution and can block it; a non-zero exit is enforced regardless of the model’s reasoning:

#!/bin/bash
FILE=$(python3 -c "import sys,json; print(json.load(sys.stdin).get('file_path',''))")
if echo "$FILE" | grep -q '/migrations/'; then
  echo 'Blocked: migrations directory requires explicit confirmation'
  exit 1
fi

This is a meaningful architectural distinction. A model that has been told “never modify the migrations directory” in a system prompt will follow that instruction with high probability in most sessions. A PreToolUse hook that blocks migration writes is a guarantee. For autonomous CI contexts, or for constraints where a single violation has serious downstream consequences, the distinction is not academic.

No other major coding assistant currently has an equivalent. This is one of the areas where the tools genuinely differ, rather than converging.

What Survives Context Compaction

Claude Code compacts context automatically when the session approaches the window limit: the LLM summarizes the conversation, the session restarts with that summary, and CLAUDE.md is re-injected at the start of the new context. PreToolUse hooks persist across compaction. Mid-session natural language constraints introduced conversationally do not.

Anthropics’s API also supports prompt caching, where a stable prefix of the conversation can be marked cacheable and reused across turns. For long sessions with large fixed system prompts, this substantially reduces both latency and cost.

The compaction behavior has a concrete implication: if you want a constraint to hold across the full session, it needs to be in CLAUDE.md or a hook, not in conversational instructions. A constraint you state at the start of a session feels durable. After compaction, it is gone unless it was encoded somewhere that survives the restart.

Codebase Navigation: Where the Tools Actually Differ

Context files prime the model before work begins. How the agent navigates a real codebase during a session is a separate problem, and this is where the tools diverge most significantly.

Claude Code’s primary approach is grep and shell commands: fast, language-agnostic, and low token overhead per query. A typical session on a moderately complex codebase requires 15 to 20 tool calls before the model has enough context to make a well-targeted edit. The cost per call is low; the compounding cost over a long session is not.

Aider’s repository map takes a different approach: parse the entire codebase with tree-sitter, extract function signatures and cross-file references, run a PageRank-style algorithm to weight recently touched files, and inject a compact structural overview at session start. The default overhead is around 1,000 tokens, configurable up to 8,000 via --map-tokens. This shifts cost from runtime queries to a one-time upfront investment, and gives the model a structural picture of the codebase before any task-specific navigation happens.

Cursor’s semantic search (@codebase) and Language Server Protocol integration represent a third approach: embed files into a local vector index for semantic retrieval, and use the running language server for precise go-to-definition and find-references queries. The LSP path gives exact answers with no false positives; it fails on mixed-language repositories, non-standard build systems, and missing dependencies. The embedding path handles semantic generalization but introduces index freshness issues and false positives from vocabulary overlap.

The Princeton SWE-agent paper coined the term “Agent-Computer Interface” for the layer of tool design and output formatting that mediates between the model and the environment. Its finding was that changes to tool descriptions and output formatting produced several percentage point swings in SWE-bench scores, independent of the underlying model. Tool design is not incidental to agent performance; it is a primary determinant.

On SWE-bench Verified, which benchmarks against 2,294 real GitHub issues, the gap is substantial. Well-engineered scaffolding around the same underlying models reaches 49 to 60 percent resolve rates in early 2026. Simpler frameworks using the same models land at 18 to 22 percent. The scaffolding explains most of the spread.

The Session-Level Problem CLAUDE.md Does Not Solve

CLAUDE.md is a cross-session artifact. It represents stable knowledge about the project, maintained between conversations. What it does not handle is the decisions made within a single session: the architectural choice you committed to at message five that the implementation at message fifty-five needs to remain consistent with.

Rahul Garg’s context anchoring approach addresses this directly. The idea is a session-scoped document, maintained actively during the conversation, that records decisions and their rationale as they are made:

# Session Context: Payment Flow Refactor

## Decisions Made
### Architecture
- Repository pattern (not service locator)
  - Rationale: service locator obscures dependencies,
    makes PaymentService harder to test in isolation

## Open Questions
- Whether to wrap repository errors in domain exceptions (deferred)

The Open Questions section is worth emphasizing. Unanswered questions left implicit will be answered by the model silently, as implementation details. Making them explicit gives you the chance to decide intentionally. The rationale fields prevent a subtler failure: a model that knows the conclusion but not the reason will generalize incorrectly when it encounters an adjacent case the original decision did not anticipate.

Re-referencing this document explicitly at significant decision points works with transformer attention mechanics rather than against them, pulling the relevant context back toward the high-attention beginning of the effective window.

This is Michael Nygard’s Architecture Decision Records applied in real time, during a session, before decisions harden into code. ADRs address decisions made months apart; context anchoring addresses decisions made minutes apart. The structural insight is the same: decisions undocumented at the moment they are made become invisible constraints that later contributors, human or AI, have to reverse-engineer from the code.

Where This Is Going

The convergence across tools on project-level context files, the emerging discipline around session-level context management, and the architectural work around hooks and compaction all point toward the same conclusion. Context engineering is not a temporary workaround for models that will eventually handle everything automatically. It is the design surface for coding agents, the layer where the behavioral contract between the developer and the tool gets specified.

The model is perhaps 20 percent of the code in a serious coding agent implementation. The scaffolding, the context management, the tool design, the edit format precision are the rest. Getting that layer right is the work.

Was this interesting?