
Externalizing State Has Always Been How We Solve the Shared Context Problem

Source: martinfowler

The Problem Is Not New, Just Faster

Every organization that has ever grown past a handful of people has hit the same structural problem: decisions get made, reasoning gets distributed across conversations and memories, and by the time someone needs that reasoning it has decayed or vanished entirely. The solutions we built for that problem form a clear lineage: onboarding documents, architecture wikis, decision logs, Architecture Decision Records. Each one is an attempt to externalize shared context so it survives beyond the working memory of the people who originally held it.

AI coding assistants introduce a version of the same problem that is much faster and more severe. Rahul Garg’s context anchoring pattern, part of Martin Fowler’s series on reducing friction in AI-assisted development, is the most direct formal statement I have seen of how to apply that lineage to AI sessions. The solution maps cleanly onto what software engineers already know about stateless systems and external state management.

Why the Context Window Loses Track

To understand why context anchoring works, you need to understand what the context window actually is at the implementation level.

Transformer models (Vaswani et al., 2017) compute attention over all tokens in the context on every forward pass. Each token attends to every other token, and the resulting attention weights determine how much any given earlier token influences the current output. The attention mechanism does not maintain a separate persistent state between turns; everything the model knows about the conversation is encoded in the token sequence it receives, recomputed from scratch at each step.

This has a specific consequence for long conversations. As you add more tokens, earlier tokens do not disappear, but their relative weight in attention computations shrinks. Liu et al.’s “Lost in the Middle” paper (2023) measured this empirically: models perform significantly worse on information positioned in the middle of a long context window compared to information at the beginning or end. The primacy and recency effects are structural, not incidental. They follow from how softmax attention works over positional sequences.
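A toy calculation can make the dilution concrete. This is not a model of any real transformer implementation; it just applies a softmax over synthetic scores with a mild recency bonus (all numbers assumed for illustration) and watches how little attention mass is left on the earliest token as the context grows:

```python
# Toy illustration: the softmax weight a single early token receives as the
# context grows. Scores are synthetic, with a small recency bonus so later
# tokens score slightly higher, mimicking the recency effect.
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def early_token_weight(context_len, recency_bonus=0.05):
    # Token 0 stands in for "a decision stated at message three"; every later
    # token gets a small additive bonus that grows toward the end.
    scores = [i * recency_bonus for i in range(context_len)]
    return softmax(scores)[0]  # attention mass left on the earliest token

for n in (10, 100, 1000):
    print(f"context of {n:4d} tokens -> weight on token 0: {early_token_weight(n):.3e}")
```

The absolute numbers mean nothing; the monotonic collapse of the early token's weight as the sequence lengthens is the structural point.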

Context windows have grown enormously: GPT-4 Turbo supports 128k tokens, Claude 3.5 supports 200k, Gemini 1.5 Pro reaches 1 million. But a larger window does not solve the attention dilution problem; it shifts the threshold at which dilution becomes a practical problem. A decision stated at message three of a two-hundred-message session is occupying a position that is effectively invisible compared to the last twenty messages, regardless of how large the nominal window is.

The Same Problem in Other Domains

The distributed systems analogy is direct. An HTTP server has no persistent state between requests. The request handling code executes, the response goes out, and the process forgets. Keeping per-user state in process memory between requests was tried; it fails under load, under restarts, and under horizontal scaling. The solution is to externalize session state to a database or cache, re-hydrate it at the start of each request, and commit updates before the response goes out. Stateless processing plus explicit external state management is robust; stateful processing that implicitly accumulates context is fragile.
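The request-handling pattern can be sketched in a few lines. The store here is a plain dict standing in for a database or cache, and the function names are illustrative, not any real framework's API:

```python
# Minimal sketch of stateless processing plus external state: per-user state
# is re-hydrated at the start of each request and committed before the
# response goes out. The process itself retains nothing between requests.

STORE = {}  # stand-in for Redis or a sessions table

def load_session(user_id):
    # Re-hydrate: copy so in-flight mutations never leak into the store early
    return dict(STORE.get(user_id, {"visits": 0}))

def save_session(user_id, session):
    STORE[user_id] = session

def handle_request(user_id, path):
    session = load_session(user_id)   # re-hydrate at request start
    session["visits"] += 1            # mutate during request handling
    response = f"{path}: visit #{session['visits']}"
    save_session(user_id, session)    # commit before responding
    return response

print(handle_request("alice", "/home"))  # /home: visit #1
print(handle_request("alice", "/home"))  # /home: visit #2
```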

A transformer’s context window is the same kind of temporary computation space. Each session is a request. The conversation state exists for the duration of that session and is gone when it ends. The attention dilution problem within a long session is the in-session equivalent of the same fragility: state that matters is stored in a medium that does not reliably preserve access to it as the session grows.

Human organizations hit a milder version of the same thing. A team’s collective memory is distributed across individuals, each with their own bounded recall and their own biases toward recent information. Onboarding documentation exists because new team members cannot reconstruct months of prior decisions from observation alone. Architecture wikis exist because institutional knowledge leaves when people do. ADRs, which Michael Nygard formalized in 2011, exist because even teams with full attendance and good documentation struggle to reconstruct the reasoning behind old decisions when the code no longer reflects the original constraints.

All of these are implementations of the same principle: bounded working memory, whether in a transformer or a human brain or an organization, requires explicit external state management to remain reliable across time and scale.

What the Anchor Document Actually Does

Context anchoring externalizes the decision state that would otherwise be distributed across the conversation history. The document captures current constraints, active decisions, what is out of scope, and what has been completed. It gets re-injected at the start of new sessions and at major transition points within a session, ensuring the model’s high-attention zone always contains the decisions that matter most.

A concrete anchor document for a typical backend refactoring task looks like this:

```markdown
# Project Context

## Current Focus
Refactoring the authentication module to use JWT tokens.
Decision: using RS256, not HS256, because we need public key distribution for microservices.

## Active Constraints
- Do not modify the user table schema — migration freeze until Q2
- All new endpoints must follow REST conventions in /docs/api-standards.md
- The frontend team is on an older API version; maintain backwards compatibility

## Recent Decisions
- Decided against Redis for session storage (too much infra overhead for our scale)
- Using refresh token rotation with 15-minute access tokens
```

The format is intentionally sparse. The model does not need the deliberation that led to a decision; it needs the outcome and the reason at a level that constrains subsequent code generation. The RS256 vs. HS256 note is there not for completeness but because a model without that context will happily use HS256 when generating the verification code, and correcting it mid-session costs more than including two lines upfront.

The document is also a token budget allocation. Every token in the anchor is a token spent at high attention, near the start of the context. Keeping it compact is not just good discipline; it is a direct trade-off against how much other information the model can attend to reliably within the same session.
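A crude budget check makes the trade-off operational. The four-characters-per-token figure below is a common rule of thumb for English text, not an exact count; swap in a real tokenizer for precise numbers:

```python
# Rough token-budget check for an anchor document, using the common
# ~4-characters-per-token heuristic for English text (an approximation,
# not a tokenizer).

def estimated_tokens(text):
    return max(1, len(text) // 4)

ANCHOR = """\
# Project Context
## Current Focus
Refactoring the authentication module to use JWT tokens.
Decision: using RS256, not HS256 (public key distribution for microservices).
"""

budget = 1_000  # tokens you are willing to spend at the top of every session
used = estimated_tokens(ANCHOR)
print(f"anchor: ~{used} tokens, {budget - used} left in the anchor budget")
assert used <= budget, "anchor has outgrown its high-attention budget"
```

Running a check like this in CI, or on save, keeps the anchor from slowly accreting until it crowds out the context it was meant to protect.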

Context Anchoring vs. RAG

Retrieval-augmented generation (RAG) is often mentioned alongside context anchoring as a solution to the same problem. The mechanisms are related but operate at different scales and make different assumptions.

RAG retrieves relevant context on demand, typically by embedding a query and fetching the nearest chunks from a vector index. It is designed for situations where the relevant context is too large to fit in the context window, or where the relevant subset varies enough per query that upfront loading is wasteful. The key characteristic is that retrieval happens automatically, based on semantic similarity between the current query and the indexed content.

Context anchoring provides context upfront, manually curated, in a fixed position at the start of the context window. It is not a retrieval system. It does not select what to surface based on the current query. It surfaces what the developer has identified as always-relevant for the current session, guaranteed to be in the high-attention zone from the start.

The practical distinction matters for agentic workflows. RAG retrieval is indexed against the question at the time of retrieval. In a long agent run where the model takes many sequential actions, each retrieval query reflects only the immediate subtask, not the cumulative set of constraints established across the session. An anchor document is not limited by what the current step’s embedding suggests is relevant; it carries the full set of active constraints regardless of what step the agent is on. For constraint enforcement across a long agentic run, manual anchoring is more reliable than retrieval-based injection.
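The contrast can be sketched with a deliberately crude retriever. Keyword overlap stands in for embedding similarity here; the point is only that retrieval is scored against the current step's query, while the anchor carries every active constraint unconditionally:

```python
# Contrast sketch: per-step retrieval vs. an always-present anchor. The
# keyword-overlap ranker is a stand-in for embedding similarity; constraints
# are taken from the anchor document example above.

KNOWLEDGE = [
    "Use RS256, not HS256, for JWT signing.",
    "Migration freeze: do not modify the user table schema until Q2.",
    "Refresh token rotation with 15-minute access tokens.",
]

ANCHOR = "\n".join(KNOWLEDGE)  # the anchor carries all constraints, always

def retrieve(query, k=1):
    def score(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(KNOWLEDGE, key=score, reverse=True)[:k]

# Step 35 of an agent run: the immediate subtask mentions only the schema,
# so retrieval surfaces the schema constraint and nothing about JWT signing.
step_query = "add a column to the user table for last_login"
print("retrieved:", retrieve(step_query))
```

Retrieval behaves exactly as designed; it is the design assumption, that the current query is a sufficient index into what matters, that fails for cumulative constraints.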

Tool-Level Implementations

Several AI coding tools have converged on the same architectural conclusion independently: stable, project-level constraints should be injected at position zero in every session, before any user message. Position zero is the highest-attention position in the context window, and injecting known constraints there is the highest-reliability way to ensure the model respects them.

Claude Code uses CLAUDE.md, a markdown file at the repository root that is read at the start of every session. Cursor originally used .cursorrules and has since moved to the .cursor/rules directory for workspace-level injected context. Cline, the VS Code extension, supports the same idea through .clinerules. Each represents the same design decision: the developer maintains a document that persists outside any individual session and gets loaded before the model has a chance to form any assumptions.

CLAUDE.md covers the stable layer: conventions that will not change within a project, build and test commands, dependency constraints, architectural invariants. The living anchor document covers the dynamic layer: what is being worked on right now, decisions made in this session, what is out of scope today. The two layers serve different functions and should not be collapsed into one.
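The two-layer assembly can be sketched as follows. The file name follows the CLAUDE.md convention, but the message structure is illustrative, not any specific tool's API:

```python
# Sketch of two-layer position-zero injection: a stable layer (project
# conventions read from a CLAUDE.md-style file) plus a dynamic anchor
# (current focus), both placed before any user message.
from pathlib import Path

def build_context(repo_root, anchor_text, user_message):
    stable_path = Path(repo_root) / "CLAUDE.md"
    stable = stable_path.read_text() if stable_path.exists() else ""
    return [
        {"role": "system", "content": stable},       # stable layer
        {"role": "system", "content": anchor_text},  # dynamic anchor
        {"role": "user", "content": user_message},   # conversation starts here
    ]

msgs = build_context(".", "## Current Focus\nJWT refactor, RS256 only.",
                     "Add token verification.")
print([m["role"] for m in msgs])  # the anchor always precedes the user turn
```

Keeping the two layers as separate messages also makes it easy to swap the dynamic anchor between sessions without touching the stable conventions.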

The Agentic Amplification

The longer the agent runs, the more the context window fills, and the more any given early constraint competes against a growing body of recent context. A five-turn interactive session has mild attention dilution. A fifty-step agentic task that reads files, executes tests, modifies code, and loops based on output is accumulating context at a rate that makes early constraints genuinely unreliable without active management.

The compounding effect is worth being specific about. In a long agentic run, the model may generate code that violates a constraint from step 4 at step 35, not because the constraint was unclear but because it is now tens of thousands of tokens back in the context. The generated code then becomes part of the context, reinforcing the wrong pattern. The next iteration of the loop generates code consistent with what was just produced, not with the constraint that was stated earlier. The error compounds without any single obvious moment of failure.

Context anchoring at transition points within a long run, not just at the start, addresses this directly. Re-injecting the anchor document when the agent finishes a major subtask and begins the next keeps the active constraints in the high-attention zone throughout the run rather than only at initialization.
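Structurally, re-anchoring is a small change to the agent loop. The model call and subtask detection below are stubs; the pattern is only that the anchor is appended to the transcript at every subtask boundary, so it re-enters the high-attention zone instead of receding as the run grows:

```python
# Sketch of re-anchoring at transition points in a long agent run. Subtasks
# and the step function are stubs; the anchor text reuses the active
# constraints from the example above.

ANCHOR = "## Active Constraints\n- RS256 only\n- No user table schema changes"

def run_agent(subtasks, step_fn):
    transcript = []
    for subtask in subtasks:
        transcript.append(("anchor", ANCHOR))  # re-inject at each boundary
        for step in subtask:
            transcript.append(("step", step_fn(step)))
    return transcript

log = run_agent([["read files", "edit auth.py"], ["run tests"]],
                step_fn=lambda s: f"did: {s}")
anchors = [i for i, (kind, _) in enumerate(log) if kind == "anchor"]
print("anchor positions in transcript:", anchors)
```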

What Makes the Pattern Durable

Context anchoring is not a prompt engineering trick. It is a correct application of external state management principles to a system with bounded, ephemeral, lossy working memory. The fact that it looks like a simple markdown file should not obscure the architectural reasoning behind it.

The pattern works because it addresses the actual mechanism of failure: attention weight over distance in a transformer’s context window. It works because software engineering has decades of practice with the stateless-plus-external-state pattern in distributed systems. And it works because human organizations have been solving the same basic problem of bounded institutional memory for as long as teams have needed to coordinate across time.

The tooling will continue to improve. Models will get longer context windows with better long-range attention. Automated memory systems will get more reliable. But the underlying principle, that consequential state needs to be explicit and managed deliberately rather than left implicit in a medium that does not preserve it reliably, is not going away as models scale. It is the reason onboarding documents exist even in organizations with comprehensive internal wikis. Implicit context is always the weakest link in any system where reliability matters.
