Before the Context Fills: Why Design Alignment Has to Come First in AI Development

Source: martinfowler

The argument Rahul Garg makes in his design-first collaboration piece on Martin Fowler’s site is not new in spirit. Senior engineers have said “whiteboard it first” for decades. But LLM-assisted development has a specific set of mechanical properties that make front-loaded design more consequential than it was when working with human colleagues, and those mechanics reward attention.

The conventional wisdom version: AI generates plausible-looking code fast, so developers skip the planning phase to capture that speed, then lose time reconciling what was built with what was needed. The mechanical version is stranger and more specific.

Context Position Is Architecture

Research from Stanford and UC Berkeley established what practitioners call the “lost in the middle” effect. Liu et al. tested retrieval accuracy across model families and found a U-shaped performance curve: relevant information at the beginning or end of a long context is recalled significantly more reliably than the same information buried in the center. The degradation holds across model sizes and architectures.

For design context, the implication is direct. A CLAUDE.md or .cursorrules file injected at session start occupies structurally privileged territory. It sits at the front of the attention window before any task content, any retrieved files, any tool call results. As the session progresses and the context fills, every design instruction from session start slides deeper toward the attention trough. By step twenty of an agentic task in a mid-size TypeScript monorepo, retrieved file contents alone can consume 50,000-80,000 tokens, and the design constraints that were injected first have been pushed to exactly where the model’s attention is least reliable.

The correction loop that feels natural in practice, where you notice the AI going in the wrong direction and add clarifying instructions, faces a structural disadvantage. Mid-session corrections live in the conversation body. They do not share the architectural privilege of the system prompt or instruction file. When Claude Code’s /compact command summarizes conversation history to free context space, it does so with lossy compression. Instructions captured in CLAUDE.md survive compaction intact, re-injected at the front of the refreshed context. Corrections that appeared only as chat messages may survive as a vague summary with reduced specificity. Architectural constraints must be established in instruction files before the session starts, not issued as chat messages afterward.

Some teams have adopted XML-tagged sections in system instructions to give the model named anchors that are less dependent on position:

<constraints>
  <constraint id="no-orm">All queries are raw SQL via pgx. No ORM.</constraint>
  <constraint id="db-boundary">Database access only through /packages/db. Not the pg package directly.</constraint>
</constraints>

The model can reference these by label rather than relying purely on proximity, which partially mitigates positional decay over long sessions.

How the Tool Ecosystem Handles This

The major AI coding tools have converged on three distinct approaches to pre-loading design context, each with different tradeoffs.

Static injection (CLAUDE.md, .cursorrules, .github/copilot-instructions.md) occupies the front of the context. The content is under developer control, the position is privileged, but the token budget is finite and shared with task content. Claude Code’s layered memory system supports an @ import syntax that allows a root CLAUDE.md to compose in additional documents: @docs/architecture.md, @docs/prohibited-patterns.md. Teams can organize design context by domain without packing everything into one file.
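A root file using that composition might look like the following sketch; the paths and conventions are hypothetical, and only the @ import syntax is the documented mechanism:

```markdown
# Project memory

## Architecture
@docs/architecture.md

## Prohibited patterns
@docs/prohibited-patterns.md

## Conventions
- All queries are raw SQL via pgx; no ORM.
- Database access only through /packages/db.
```

Each imported document is resolved and injected alongside the root file, so domain-specific design context stays in separately maintainable files.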

Structural indexing is the approach Aider takes with its repo map. Rather than developer-authored prose, the repo map is a compressed structural skeleton generated from tree-sitter and ctags, containing file paths, class names, and function signatures without bodies, included in every prompt automatically. It scales with the codebase without requiring manual curation. The limitation is that it encodes topology, not intent. The repo map shows what the codebase looks like, not why it looks that way or what conventions must be followed throughout.
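An illustrative sketch of what such a structural skeleton contains, not literal Aider output, with invented file and symbol names:

```text
packages/db/src/client.ts:
│export class DbClient {
│  query(sql: string, params: unknown[]): Promise<Row[]>
⋮
services/user/src/index.ts:
│export function createUser(input: CreateUserInput): Promise<User>
⋮
```

Signatures without bodies: enough for the model to navigate and call into the codebase, at a fraction of the token cost of full file contents.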

Cursor’s rules system (v0.43+) adds glob-scoped rules in a .cursor/rules/ directory alongside embedding-based retrieval. The retrieval layer is convenient but opaque; on any given query you cannot always determine precisely what context the model is working from, which matters when your architectural constraints are not visible throughout the codebase but must apply everywhere.
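A scoped rule file under that directory might look like this sketch; the filename, frontmatter values, and rule content are assumptions, not a verified template:

```markdown
---
description: Database access conventions
globs: packages/**/*.ts
alwaysApply: false
---

- All queries are raw SQL via pgx; no ORM.
- Import the database client only from /packages/db.
```

The glob scoping attaches the rule to matching files, so the constraint travels with the code it governs rather than depending on retrieval.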

Dynamic retrieval via Model Context Protocol (released by Anthropic in November 2024) standardizes how agents retrieve context at task time. This is valuable for large codebases where static injection cannot cover everything, but it shifts the burden to retrieval quality. RAG on code is harder than RAG on prose: code relevance is structural rather than semantic, and cosine similarity on dense embeddings struggles because identifier names carry the most signal. Hybrid retrieval combining BM25 with semantic embeddings and a reranker pass represents current best practice. SWE-bench analysis shows a 40-60 percentage point difference in task completion rates between agents with strong contextual coverage and those with minimal context.
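One common way to combine a lexical and a semantic ranking without calibrating their scores against each other is reciprocal rank fusion; a minimal TypeScript sketch, with hypothetical file names, assuming the two ranked lists already exist:

```typescript
// Reciprocal rank fusion: each document scores 1/(k + rank) per list,
// summed across lists. k = 60 is the conventional smoothing constant.
function fuseRankings(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((doc, rank) => {
      scores.set(doc, (scores.get(doc) ?? 0) + 1 / (k + rank + 1));
    });
  }
  // Highest fused score first.
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([doc]) => doc);
}

// Hypothetical results: BM25 ranking vs. embedding-based ranking.
const bm25 = ["db/client.ts", "services/user.ts", "lib/sql.ts"];
const semantic = ["lib/sql.ts", "db/client.ts", "api/routes.ts"];
const fused = fuseRankings([bm25, semantic]);
// db/client.ts ranks high in both lists, so it fuses to the top.
console.log(fused);
```

A reranker pass would then re-score only the fused top-N, which keeps its cost bounded regardless of corpus size.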

Machine-Checkable Constraints Close the Feedback Loop

The most reliable form of design encoding is one the agent can verify for itself. Prose constraints in instruction files depend on the model’s attention and fidelity across a long session. Machine-checkable constraints produce observable feedback the agent can use to self-correct.

dependency-cruiser enforces module boundary rules as part of the build pipeline. A rule like “packages in /lib cannot import from /services” becomes a linting violation the agent sees when it runs the check. Strict TypeScript configurations with noImplicitAny, exactOptionalPropertyTypes, and path aliases constrain what code the model can write and still get past the compiler. These tools transform soft conventions into hard constraints with immediate feedback signals.
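The boundary rule above could be encoded roughly like this; a sketch assuming top-level lib/ and services/ directories, with a hypothetical rule name:

```json
{
  "forbidden": [
    {
      "name": "lib-not-to-services",
      "severity": "error",
      "comment": "packages in /lib cannot import from /services",
      "from": { "path": "^lib" },
      "to": { "path": "^services" }
    }
  ]
}
```

Run as part of lint, a violation surfaces in the same tool output the agent already reads, which is what makes the constraint rediscoverable mid-session.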

This matters particularly for longer agentic sessions where the model has processed many tool calls and the design instructions from session start are far from the front of the context. Typed interfaces, enforced module boundaries, and strict linters act as recoverable anchors, constraints the model can rediscover by running the toolchain regardless of what it still holds from the initial instructions.

The Whiteboard Analogy Has a Limit

Garg’s analogy, treating the AI like a new engineer and walking it through goal alignment, then approach, then interface, then implementation, is useful but not fully precise. A new human engineer retains in the afternoon what you told them in the morning. A conversation with an AI has no persistence between sessions, and within a session the reliability of early context decays as the session grows longer.

The more precise framing: design alignment is not just about output quality, it is about managing the finite and positionally sensitive resource that is the context window. Doing design alignment in the first exchanges, before retrieving large files or running extensive tool chains, preserves the attention gradient for the constraints that matter most. Doing it at the end, when you realize the direction is wrong, puts corrections in exactly the part of the context where attention is least reliable.

METR’s analysis of SWE-bench solutions documents the gap between patches that pass automated tests and patches that would survive real code review. Tests do not capture architectural placement, convention adherence, or scope discipline. A model working without design context produces test-passing code that sits in the wrong layer, reaches for a deprecated internal pattern, or modifies adjacent code it had no reason to touch. These failure modes trace to alignment rather than hallucination: the model made locally plausible decisions without the context to know they were globally wrong.

Front-loading the design conversation addresses this at the source. The mechanics of attention and context decay make early resolution substantially cheaper than late resolution, which is the familiar workflow intuition restated in terms of its actual mechanism.

For teams already using Architecture Decision Records, the discipline maps well. ADRs structured to separate current guidance from historical context give the model durable, parseable design intent rather than a chronological log it has to reason backward through. The content is the same kind of content that belongs in a priming file: not what the codebase looks like, but why it looks that way and what that implies for the next decision.
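A record structured that way might look like the following sketch; the ADR numbers and content are invented for illustration:

```markdown
# ADR-012: Raw SQL via pgx, no ORM

Status: Accepted (supersedes ADR-007)

## Current guidance
- All database access goes through /packages/db using raw SQL via pgx.
- Do not add an ORM dependency to any package.

## Historical context
ADR-007 adopted an ORM; it was reversed after query performance and
migration friction problems. See that record for details.
```

Because the current guidance is separated from the narrative, the section can be imported into a priming file directly, without asking the model to infer present-tense rules from a decision log.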
