
When Million-Token Context Is Table Stakes, Attention Quality Is the Differentiator

Source: simonwillison

As of March 13, 2026, 1M token context is generally available for both Claude Opus 4.6 and Claude Sonnet 4.6. Simon Willison’s coverage of the announcement marks the milestone cleanly: Anthropic’s flagship models now sit alongside the 1M-token context that Google introduced with Gemini 1.5 Pro in February 2024 and that OpenAI shipped with GPT-5.4 earlier this month. A million tokens is now the baseline among frontier models, not a differentiating feature.

That framing matters because it shifts the interesting question. When context ceiling is no longer the constraint, what is? For developers who have been waiting for a Claude model that can swallow a medium-sized codebase whole, the ceiling has finally moved. But the ceiling was never the most important constraint for most real-world use cases.

What 1M Tokens Actually Fits

The number is worth concretizing. One million tokens is roughly 750,000 words. A medium-sized codebase plus its documentation and tests occupies that range comfortably. For legal and compliance work, entire contract histories across hundreds of documents become ingestible in a single prompt. For research synthesis, substantial portions of a paper corpus can sit in context simultaneously. And the use case most developers reach for first, tracing a bug across a long call chain without flipping between files, becomes genuinely feasible without selective retrieval.

Claude at 200K was already more context than most developers needed for most tasks. What 1M changes is the tail of cases: projects where 200K was a genuine constraint, where retrieval-augmented generation felt like a workaround rather than a choice, where the cognitive overhead of deciding what to include in context was the actual bottleneck. Removing that overhead is a real simplification for a specific class of workloads.

The Technical Wall That Had to Move

Getting a transformer to 1M tokens is not a straightforward extension of getting it to 200K. Attention computation is O(n²) in sequence length. At 128K tokens, naive attention requires roughly 1TB of memory for attention scores on a single device. At 1M tokens, the quadratic term grows by another factor of roughly sixty, and no hardware configuration makes it feasible without architectural changes.
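The arithmetic is easy to check. The sketch below assumes fp16 scores and 32 attention heads, purely illustrative figures rather than any model's actual configuration:

```python
# Back-of-envelope memory for naively materialized attention scores.
# Assumes fp16 (2 bytes) and 32 heads in flight; these are illustrative
# parameters, not Anthropic's actual architecture.

def naive_attention_bytes(seq_len: int, num_heads: int = 32, bytes_per_elem: int = 2) -> int:
    # One n x n score matrix per head.
    return seq_len * seq_len * num_heads * bytes_per_elem

for n in (10_000, 128_000, 1_000_000):
    gib = naive_attention_bytes(n) / 2**30
    print(f"{n:>9} tokens: {gib:,.0f} GiB")
```

Under these assumptions, 128K tokens lands near the 1TB mark, and 1M tokens lands in the tens of terabytes. That gap is what the techniques below exist to close.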

The engineering solutions have matured over the past two years. FlashAttention 2 and 3 avoid materializing the full attention matrix through tiling, reducing memory to O(n). Sequence parallelism techniques like DeepSpeed Ulysses distribute attention across GPUs, cutting per-device compute cost proportionally, with benchmarks showing roughly 12x longer sequences in the same hardware envelope compared to naive approaches. For KV cache management at inference, PagedAttention introduced virtual-memory-style block allocation that reduces KV cache waste from 20-40% to under 4%. Chunked prefill, which splits long prompt ingestion across many iterations rather than blocking all compute on a single request, is essential for serving 1M-token prompts without starving concurrent decode requests.
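The core idea behind PagedAttention's waste reduction is that KV cache is allocated in fixed-size blocks on demand, so unused reservation is bounded by one partial block per sequence. A toy sketch of the bookkeeping, assuming a block size of 16 tokens (real implementations manage GPU memory, not Python lists):

```python
# Toy sketch of PagedAttention-style KV cache bookkeeping.
# Block size of 16 tokens is an assumption for illustration.

BLOCK_SIZE = 16

class PagedKVCache:
    def __init__(self, total_blocks: int):
        self.free_blocks = list(range(total_blocks))
        self.block_tables: dict[str, list[int]] = {}  # request id -> physical block ids

    def append_token(self, request_id: str, num_tokens_so_far: int) -> None:
        # Allocate a new physical block only when the logical position
        # crosses a block boundary; waste is bounded by one partial block.
        table = self.block_tables.setdefault(request_id, [])
        if num_tokens_so_far % BLOCK_SIZE == 0:
            table.append(self.free_blocks.pop())

    def release(self, request_id: str) -> None:
        # Finished requests return their blocks to the shared pool.
        self.free_blocks.extend(self.block_tables.pop(request_id, []))

cache = PagedKVCache(total_blocks=1024)
for i in range(40):                      # a 40-token sequence
    cache.append_token("req-1", i)
print(len(cache.block_tables["req-1"]))  # ceil(40 / 16) = 3 blocks
```

Because blocks return to a shared pool on release, concurrent requests can interleave allocation without pre-reserving worst-case space, which is where the waste reduction comes from.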

Anthropic’s specific implementation details for Opus 4.6 and Sonnet 4.6 are not publicly documented, but the infrastructure requirements for this scale are well-understood across the field. Each of these components had to be in place before GA was viable.

The Constraint the Number Does Not Address

Here is what the context ceiling number does not tell you: position within the context window affects model attention in measurable, reproducible ways.

In 2023, Liu et al. published “Lost in the Middle: How Language Models Use Long Contexts”, an experiment tracking retrieval accuracy as the position of relevant information varied across multi-document contexts. The result was a U-shaped performance curve: information at the beginning or end of a long context is retrieved more reliably than information in the middle. The effect holds across model families, across context lengths, and as models scale in capability. Longer context windows stretch the curve rather than flatten it.
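The experimental setup is simple enough to sketch. The harness below embeds one needle fact at varying depths in filler text and records retrieval accuracy per depth, in the style of Liu et al.; `query_model` is a placeholder for an actual API call, not a real client:

```python
# Position-sweep harness in the style of "Lost in the Middle".
# `query_model` is a hypothetical callable, not a real API client.

FILLER = "The quarterly report noted stable conditions. "
NEEDLE = "The vault access code is 7319."

def build_context(depth: float, total_sentences: int = 2000) -> str:
    # depth = 0.0 puts the needle first, 1.0 puts it last.
    pos = int(depth * total_sentences)
    sentences = [FILLER] * total_sentences
    sentences.insert(pos, NEEDLE + " ")
    return "".join(sentences)

def sweep(query_model, depths=(0.0, 0.25, 0.5, 0.75, 1.0)) -> dict[float, bool]:
    results = {}
    for d in depths:
        prompt = build_context(d) + "\nWhat is the vault access code?"
        results[d] = "7319" in query_model(prompt)
    return results
```

Plotting accuracy against depth is what produces the U-shaped curve: high at 0.0 and 1.0, depressed in the middle.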

This is not a limitation of any specific model. It is a structural property of how transformer attention distributes across long sequences. What it means practically is that a model given 1M tokens of context does not treat all 1M tokens uniformly. The first and last segments of the window are privileged. Everything in the middle competes for attention in a way that degrades predictably with distance from the boundaries.

For developers building on Claude at 1M context, this has direct design implications. Putting everything into one giant prompt and assuming the model will locate relevant information anywhere in the window is the failure mode. Critical constraints, key instructions, the most important facts belong early in the prompt. Less critical background can follow. Anthropic’s own guidance for complex system prompts recommends explicit XML-tagged sections:

<constraints>
  <constraint id="no-orm">All queries are raw SQL. No ORM.</constraint>
  <constraint id="test-first">New functions require unit tests before merge.</constraint>
</constraints>
<relevant_files>
  <file path="src/db/queries.go">...</file>
</relevant_files>
<task>Refactor the user lookup to use the new index.</task>

The labeling gives the model named anchors it can reference semantically rather than by position alone. This does not eliminate the position gradient, but it reduces dependence on positional memory when the context has grown large.
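In practice this ordering discipline is easy to enforce mechanically. A minimal prompt builder following the layout above, with constraints first, bulk files after, and the task last (the tag names mirror the example; nothing here is an official Anthropic API):

```python
# Minimal prompt builder enforcing the ordering above: constraints
# early, bulk context in the middle, task last. Tag names follow the
# example in the text; this is an illustrative sketch, not a library.

def build_prompt(constraints: list[tuple[str, str]], files: dict[str, str], task: str) -> str:
    parts = ["<constraints>"]
    for cid, text in constraints:
        parts.append(f'  <constraint id="{cid}">{text}</constraint>')
    parts.append("</constraints>")
    parts.append("<relevant_files>")
    for path, body in files.items():
        parts.append(f'  <file path="{path}">{body}</file>')
    parts.append("</relevant_files>")
    parts.append(f"<task>{task}</task>")
    return "\n".join(parts)
```

Centralizing the assembly means the privileged start-of-window position always holds the constraints, no matter how many files the middle section grows to contain.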

The reason Anthropic’s models have historically benchmarked well on long-context tasks is not just window size. It is context quality within the window. The needle-in-a-haystack evaluations that Claude 3 led at 200K were testing whether the model could actually find information at arbitrary positions in a long context. That quality advantage is the metric that matters at 1M, and whether it holds across the full length is the evaluation worth watching as benchmarks catch up to the new ceiling.

Latency and Cost Change the Architecture Decision

1M context at Claude’s pricing is not cheap. At typical API rates for frontier models, a fully-loaded 1M-token prompt runs to several dollars per call. For agentic workflows that invoke Claude repeatedly across a long task, the per-call cost multiplies fast.
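The multiplication is worth making explicit. Assuming a hypothetical input rate of $5 per million tokens (an assumption for arithmetic, not Anthropic's price list):

```python
# Illustrative per-call cost arithmetic. The rate is a hypothetical
# figure for the sake of the calculation, not a published price.

INPUT_RATE_PER_MTOK = 5.00   # USD per million input tokens, assumed

def call_cost(input_tokens: int, calls: int = 1) -> float:
    return input_tokens / 1_000_000 * INPUT_RATE_PER_MTOK * calls

print(f"${call_cost(1_000_000):.2f} per fully-loaded call")
print(f"${call_cost(1_000_000, calls=50):.2f} across a 50-step agent run")
```

A single fully-loaded call is tolerable; fifty of them in one agent run is a line item, which is why the agentic case pushes back toward retrieval.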

There is also a latency dimension that rarely appears in context window announcements. Prefill, the processing of the input prompt before the first output token can be generated, is not instant at 1M tokens. Even with chunked prefill handling the compute distribution across iterations, time-to-first-token for a fully-loaded 1M context prompt is qualitatively different from time-to-first-token at 10K tokens. For interactive use cases, a 1M context prompt is not a sub-second operation. The specific latency profile depends on infrastructure load and configuration, but it is a real factor in how you design around the capability.

This is why the right architecture question is not “should I use 1M context,” but “when should I use 1M context versus retrieval.” For a coding agent running many short-context subtasks, retrieval-augmented generation with targeted file reads is faster, cheaper, and often more accurate because it avoids the middle-of-context attention trough entirely. For a one-shot analysis where the value is seeing everything simultaneously, 1M context is the right tool.
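That decision reduces to a rough routing heuristic. The thresholds below are illustrative, not a recommendation from any provider:

```python
# Rough routing heuristic for "full context vs retrieval". The
# thresholds are illustrative assumptions, not published guidance.

def choose_strategy(corpus_tokens: int, expected_calls: int, interactive: bool) -> str:
    if corpus_tokens > 1_000_000:
        return "retrieval"       # does not fit even the new ceiling
    if expected_calls > 5 or interactive:
        return "retrieval"       # per-call cost and prefill latency dominate
    return "full-context"        # one-shot analysis that needs everything at once

# A one-shot codebase review vs. a 50-step coding agent:
print(choose_strategy(800_000, expected_calls=1, interactive=False))   # full-context
print(choose_strategy(200_000, expected_calls=50, interactive=False))  # retrieval
```

The point is not the specific numbers but the shape: call count and interactivity push toward retrieval long before the corpus stops fitting.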

What Changes for Developers Today

The most immediate practical change is for projects that were genuinely constrained at 200K. Medium-sized codebases, extended research tasks, compliance document review across large corpora: these can now be approached without retrieval scaffolding. That scaffolding had real overhead, including indexing pipelines, retrieval tuning, and the risk that relevant context was not included in what got retrieved. Removing it for the cases where full context is genuinely needed is a meaningful simplification.

The second change is competitive. With 1M context now available across Gemini, GPT-5.4, and Claude Opus 4.6 / Sonnet 4.6, the window size comparison between frontier models has converged. Developers evaluating which model to build on have one fewer dimension to weigh on context ceiling. The remaining question, and the one that has historically most differentiated Claude, is what happens to output quality when those million tokens are actually in the window. That is what the next round of benchmarks will tell us.
