
Three Ways to Solve Code Retrieval, and Why Each One Fails Differently

Source: martinfowler

Back in February 2026, Martin Fowler’s team published a taxonomy of what they called context engineering for coding agents. The taxonomy covers three layers: static project memory (CLAUDE.md and its equivalents), dynamic retrieval (how agents gather relevant code), and tool results via MCP. The first and third layers have attracted most of the developer attention, partly because they’re concrete and configurable. The middle layer, dynamic retrieval, is where the real architectural divergence between tools lives, and where the gaps between them are most consequential on real codebases.

The problem every agent faces is the same: a production repository might have 300,000 lines of code across thousands of files, but the model can reason effectively over maybe 50,000 tokens of it at a time, covering perhaps 1,500 to 2,000 lines depending on verbosity. So the system has to select. And the selection mechanism determines the failure mode.

Aider’s Repo Map: Deterministic and Always-On

Aider’s solution is the repo map, built by parsing the codebase with tree-sitter to extract file names, class names, function signatures, and cross-file import relationships, while omitting function bodies. For a codebase of a few hundred files, this map might occupy 5,000 to 10,000 tokens. It goes into every prompt, regardless of the current task.
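The mechanism is easy to picture. Aider's actual implementation uses tree-sitter so it works across many languages; as a rough, single-language illustration of the same idea, here is a sketch that builds one map entry for a Python file using only the standard library's ast module (the `repo_map_entry` name and output format are invented for this example, not Aider's):

```python
import ast

def repo_map_entry(path: str, source: str) -> str:
    """Summarize one file: imports and signatures, no function bodies."""
    tree = ast.parse(source)
    lines = [path + ":"]
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            lines += [f"  import {alias.name}" for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = ", ".join(alias.name for alias in node.names)
            lines.append(f"  from {node.module} import {names}")
        elif isinstance(node, ast.ClassDef):
            lines.append(f"  class {node.name}")
        elif isinstance(node, ast.FunctionDef):
            args = ", ".join(a.arg for a in node.args.args)
            lines.append(f"  def {node.name}({args})")
    return "\n".join(lines)
```

Run over every file and concatenated, a summary like this is what gives the model the structural skeleton: enough to see where `handleSessionExpiry` lives and what imports it, without paying for any body text.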

This is a deliberate design choice. Instead of predicting which parts of the codebase are relevant and potentially guessing wrong, Aider bets that the structural skeleton is cheap enough to always include and valuable enough that the fixed token cost is worth paying. With function signatures and import graphs present, the model can reason about where the relevant code lives before reading any of it, then use /add and /drop commands to bring specific files into context explicitly.

For queries that involve precise identifiers, this approach consistently outperforms semantic search. When the task is “fix the bug in handleSessionExpiry”, the repo map tells the model exactly where that function is, what it calls, and what calls it. No embedding needs to fire. The degradation happens at scale: very large codebases produce repo maps that consume a significant fraction of the context budget before any actual code has been read. And for tasks where the relevant logic is deeply nested within function bodies rather than visible from signatures, the map provides structural pointers but not the content needed to reason about the problem.

Cursor’s Embedding Search: Probabilistic and Scalable

Cursor builds a vector index of the codebase and retrieves chunks at query time using embedding similarity. When you describe a task or reference a feature, the system embeds your query and pulls the chunks that score highest in cosine similarity space. You can also force explicit retrieval with @filename references or @Codebase to trigger a full search. The .cursor/rules/ directory (introduced in Cursor v0.43) provides the static layer, with glob-scoped rules for different parts of the codebase.
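Stripped to its core, retrieval-by-similarity looks like the sketch below. A production system uses a learned embedding model and an approximate-nearest-neighbor index; here a bag-of-words Counter stands in for the embedding vector so the example stays self-contained, and all names are illustrative:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a learned embedding: term counts over whitespace tokens.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Embed the query once, score every chunk, return the top-k.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]
```

Even in this toy form the failure mode described below is visible: a chunk that merely calls the relevant function, sharing no vocabulary with the query, scores zero.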

The appeal is obvious: this scales to codebases where a repo map would be unwieldy, and it can retrieve relevant context without the developer knowing which file contains the relevant code. The failure mode is more subtle. Code retrieval has a different structure than document retrieval. When you search for “the function that handles session expiry,” semantic similarity works well. When you search for “all the places that call handleSessionExpiry,” or “the code that implements the interface UserSession extends,” pure semantic similarity breaks down. The relevant code might share no vocabulary with the query at all; it might simply import the right module.

This is why hybrid retrieval has become the standard for serious implementations. Continue.dev combines BM25, which scores on exact term frequency rather than semantic proximity, with dense vector embeddings, then runs a reranker pass over the merged candidates. The reranker is typically a small cross-encoder that re-scores each candidate using the actual query text rather than precomputed embeddings, filtering to the subset most contextually relevant to the specific task. BM25 handles precise identifier lookups; semantic embeddings handle conceptual queries; the reranker adjudicates between them. The combination handles both use cases meaningfully better than either approach alone, though it adds latency and infrastructure complexity.
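A minimal sketch of the lexical half and the merge step, assuming dense similarity scores in [0, 1] have been computed elsewhere and eliding the reranker pass (Continue.dev's real pipeline differs; `bm25_scores` and `hybrid_rank` are invented names for this example):

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Classic Okapi BM25 over whitespace tokens: exact term matching."""
    tokenized = [d.lower().split() for d in docs]
    N = len(docs)
    avgdl = sum(len(toks) for toks in tokenized) / N
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(toks) / avgdl))
            score += idf * norm
        scores.append(score)
    return scores

def hybrid_rank(query: str, docs: list[str],
                dense_scores: list[float], alpha: float = 0.5) -> list[int]:
    """Merge normalized BM25 with dense scores; a cross-encoder reranker
    would then re-score this shortlist against the raw query text."""
    bm = bm25_scores(query, docs)
    top = max(bm) or 1.0  # avoid dividing by zero when no lexical match
    merged = [alpha * (s / top) + (1 - alpha) * d
              for s, d in zip(bm, dense_scores)]
    return sorted(range(len(docs)), key=lambda i: merged[i], reverse=True)
```

An identifier query like "handleSessionExpiry" gets a nonzero BM25 score only in documents containing that exact token, which is precisely the signal cosine similarity over embeddings can miss.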

Claude Code’s Agentic Retrieval: Adaptive and Unpredictable

Claude Code makes a fundamentally different bet. Rather than pre-computing an index or a structural map, it gives the model tools: read a file, list a directory, run a grep, execute a shell command. The model decides what to retrieve, fetches it, decides what else it needs based on what it found, and repeats until it has enough context to proceed. The context at any point is the accumulated result of the model’s own retrieval decisions.
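Stripped of the model itself, the control flow is a plain loop: ask the model for its next retrieval action, execute it, append the result, repeat. A schematic sketch, with the model's decision function passed in as a parameter (the names here are illustrative, not Claude Code internals):

```python
def agent_retrieval_loop(decide, tools, max_steps=20):
    """Run tool calls chosen by `decide` until it signals it is done.

    decide(context) returns (tool_name, arg), or None when the model
    judges it has enough context; `tools` maps names to callables.
    """
    context = []
    for _ in range(max_steps):
        step = decide(context)
        if step is None:  # enough context gathered; stop retrieving
            break
        tool, arg = step
        context.append((tool, arg, tools[tool](arg)))
    return context
```

In the real system `decide` is the model reasoning over the accumulated results, which is exactly why the trace is adaptive on unfamiliar codebases and unpredictable in its token consumption.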

You can observe this directly with the verbose flag. For a task involving a bug in an authentication handler, a capable model will read the handler file, notice an import from a session management module, read that module, check the interface definition it extends, look at the test file to understand expected behavior, then run a grep to find all call sites. The sequence resembles how a senior developer would track down a bug: following the dependency graph, building a complete picture incrementally before making any changes.

The strength of this approach is adaptability. If the session module has an unexpected dependency on a configuration system, the model can read that too, without any index needing to have anticipated that relationship. There is no stale index to miss an edge case introduced by a recent refactor. The weakness is predictability: you cannot know in advance exactly what the model will read, which makes token consumption harder to reason about and guarantees about context harder to make. The retrieval quality is also directly bounded by the model’s capacity to reason about its own information needs, which varies with model capability and task complexity.

The Token Math That Makes Long Tasks Expensive

All three approaches have to contend with the same accumulation problem in agentic loops. In a ReAct-style execution loop where each step includes the full conversation history, token usage grows quadratically. At step n, the context includes all n-1 previous steps plus the current one, so total cost across m steps is proportional to 1 + 2 + ... + m = m(m+1)/2. At ten steps, that is 55 times the cost of the first step. At twenty steps, it is 210 times. This is why long agentic tasks become expensive faster than the raw context window size suggests, and why /compact in Claude Code exists as a first-class feature rather than an afterthought.
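The arithmetic is easy to check, assuming each step contributes a constant number of tokens and the full history is replayed at every step (the simplest ReAct accounting, ignoring per-step variation):

```python
def cumulative_cost(m: int, step_tokens: int = 1) -> int:
    """Total tokens processed across m steps when step n replays
    the n-1 prior steps plus its own contribution."""
    return sum(n * step_tokens for n in range(1, m + 1))  # = step_tokens * m(m+1)/2
```

With step_tokens=1, cumulative_cost(10) gives 55 and cumulative_cost(20) gives 210, matching the multiples above.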

Tavily’s research on their deep research agent found a concrete mitigation: distill tool outputs into compressed reflections at each step, then discard the raw outputs from active context. Instead of accumulating the full text of every retrieved document, the agent produces a structured reflection summarizing what it learned, and only the reflection persists into subsequent steps. Raw data is re-introduced only during final synthesis. The result is linear growth instead of quadratic, and Tavily reported roughly a 66% reduction in token consumption compared to architectures that retain full tool outputs throughout the loop, as documented in their Tavily Deep Research analysis.
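The effect of distillation on the accounting is visible in a few lines, assuming each raw tool output is one fixed size and each reflection a smaller fixed size (the numbers below are illustrative, not Tavily's):

```python
def loop_cost(steps: int, raw_tokens: int, reflection_tokens=None) -> int:
    """Tokens processed over an agent loop. If reflection_tokens is set,
    only the distilled reflection persists into later steps; otherwise
    the full raw output is carried forward."""
    total, carried = 0, 0
    for _ in range(steps):
        total += carried + raw_tokens  # re-read carried context + new raw output
        carried += raw_tokens if reflection_tokens is None else reflection_tokens
    return total
```

For ten steps of 1,000-token outputs, retaining everything processes 55,000 tokens, while distilling each output to a 100-token reflection processes 14,500: carried context now grows linearly in the reflection size rather than in the raw output size, a reduction in the same ballpark as the figure Tavily reports.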

The same principle applies to conversation history. Earlier turns become less relevant as a task progresses; compressing them into summaries frees space for new tool results without losing the thread of what was decided. Context budget management is part of context engineering, not a separate concern.

What “Lost in the Middle” Means for Static Context

For the parts of context developers directly control, placement has a larger effect than most realize. Research from Stanford and UC Berkeley established that LLMs perform measurably worse on information placed in the middle of a long context compared to information at the beginning or end. The effect is consistent across model families and context lengths: relevant information buried mid-context produces worse task outcomes than the same information positioned early, even with identical models and context sizes.

For CLAUDE.md files and their equivalents, this has direct implications. The constraints that matter most, the things the agent must never do, the non-obvious architectural rules, belong at the top of the file. A CLAUDE.md that opens with “do not modify files under src/generated/” will reliably enforce that constraint. The same instruction on line 150, after several paragraphs of build instructions, will not. Concision follows from the same finding: a shorter file is one where nothing is buried, and every instruction sits close to either the beginning or the end. The goal is not comprehensiveness but signal density at the positions where the model attends most reliably.

The Knowledge Gap Retrieval Cannot Close

A finding from METR’s March 2026 analysis of SWE-bench-passing patches adds a complicating dimension. Many patches that passed automated test suites would not have been accepted by the actual project maintainers, because tests capture what code must do functionally, but not what kind of code the project wants to be: its idioms, its tolerance for certain dependencies, its architectural principles, the reasoning behind decisions made years ago. This is institutional knowledge that lives in contributor guidelines, in historical code reviews, in the mental models of maintainers, and that resists encoding in a CLAUDE.md because it is not a list of rules but a coherent philosophy about how the codebase should evolve.

No retrieval architecture addresses this gap. Semantic search can find stylistically similar code. A repo map can show structural patterns. Neither can reconstruct the judgment behind architectural decisions. This is where context engineering runs up against a limit that is not technical: context windows can hold information, but not the reasoning that produced it.

The Fowler article frames context engineering as becoming a necessity, and that is accurate. But the retrieval problem at its center is not solved uniformly across tools; it is approached with different architectures that have different tradeoffs. Aider’s deterministic repo map is predictable and fast but hits scale limits. Cursor’s embedding search handles large codebases but struggles with structural queries. Claude Code’s agentic retrieval is adaptive but opaque. The choice is a bet on your codebase’s size, your team’s willingness to manage explicit context, and how much you trust the model to make its own retrieval decisions correctly.
