
The Gap Between Benchmark and Production: Claude's 1M Context Goes GA

Source: simonwillison

Simon Willison noted that Anthropic has made one million token context windows generally available for claude-opus-4-6 and claude-sonnet-4-6. No waitlist, no special access request, no beta flag in your API header. It is just available.

The distinction between “available in beta” and “generally available” sounds like paperwork, but it matters in practice. Beta access means no SLA, potentially volatile pricing, and the implicit understanding that behavior may change. GA means Anthropic considers the capability production-ready, is committing to its pricing structure, and is willing to put it in front of every API customer without restriction. For anyone building real systems, that is the meaningful threshold.

Why This Took Longer Than the Headline Suggests

Google made one million tokens generally available in Gemini 1.5 Pro in May 2024, nearly two years ago. That gap reflects the difference between demonstrating long context in a controlled setting and shipping it with the reliability, cost structure, and quality benchmarks that production systems require.

Extending a transformer to handle million-token sequences requires solving several compounding problems simultaneously. Standard scaled dot-product attention runs in O(n²) time and memory relative to sequence length: doubling the context quadruples the compute rather than doubling it. Going from 200K to 1M tokens is a 5x increase in length but roughly a 25x increase in attention cost.
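The back-of-envelope arithmetic is worth making explicit:

```python
# Relative cost of O(n^2) attention when the context grows.
def attention_cost_ratio(new_len: int, old_len: int) -> float:
    """How much more attention compute a longer context costs."""
    return (new_len / old_len) ** 2

# 5x the tokens costs 25x the attention compute, not 5x.
print(attention_cost_ratio(1_000_000, 200_000))  # 25.0
```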

The practical solution stack involves Flash Attention (which avoids materializing the full attention matrix in GPU high-bandwidth memory by computing attention in tiles that fit in on-chip SRAM, reducing memory overhead from O(n²) to O(n)), Ring Attention (which distributes the sequence across multiple accelerators so no single device needs to hold the full KV cache), and architectural choices around Grouped Query Attention that reduce the size of the KV cache itself. Combining these into a stable serving system that can handle thousands of concurrent 1M-token requests is a different engineering challenge than publishing a proof-of-concept.
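A rough sizing exercise shows why the KV cache dominates at this length and why Grouped Query Attention matters. The dimensions below are hypothetical placeholders for illustration, not Claude's actual architecture, which is not public:

```python
# Illustrative KV-cache sizing. All dimensions are made-up placeholders.
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # Keys and values (the factor of 2), per layer, per KV head.
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Full multi-head attention (one KV head per query head) vs GQA at 8 KV heads.
full_mha = kv_cache_bytes(1_000_000, n_layers=80, n_kv_heads=64, head_dim=128)
gqa = kv_cache_bytes(1_000_000, n_layers=80, n_kv_heads=8, head_dim=128)
print(f"MHA: {full_mha / 2**30:.0f} GiB, GQA: {gqa / 2**30:.0f} GiB")
```

At these placeholder dimensions the cache for a single 1M-token request runs to terabytes without GQA, which is why no single accelerator can hold it and Ring Attention-style sharding enters the picture.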

Positional encoding is another piece. Standard learned absolute position embeddings do not extrapolate past their training length. Rotary Position Embeddings with extensions like YaRN allow models to handle sequences longer than anything seen during training, but that extrapolation degrades without specific long-context fine-tuning. Reaching 1M almost certainly required a dedicated training stage on sequences of appropriate length, not just a change in serving configuration.
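A minimal sketch of the rotary mechanism (illustrative NumPy, not any model's actual implementation) shows why RoPE is even defined past the training length: each position is just a rotation, and rotations exist for any position index, trained or not:

```python
import numpy as np

# Minimal RoPE sketch. Real implementations rotate query/key tensors inside
# attention; this just shows the position-dependent rotation itself.
def rope_rotate(x: np.ndarray, position: int, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive pairs of dims of x by position-dependent angles."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)  # per-pair rotation frequency
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

v = np.ones(8)
# The rotation is norm-preserving at any position, including ones far beyond
# training -- but, as the text notes, the *model's use* of those positions
# still degrades without long-context fine-tuning (which YaRN-style frequency
# rescaling mitigates).
print(np.linalg.norm(rope_rotate(v, 1_000_000)))
```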

Accepting and Using Are Different Problems

A model accepting 1M tokens and a model faithfully using 1M tokens are distinct capabilities. The Stanford NLP paper Lost in the Middle documented a systematic failure mode in 2023: models show U-shaped recall curves across long contexts, reliably retrieving information at the very beginning and end while missing content buried in the middle. This is not a quirk of one architecture; it follows from how attention patterns form during training when most sequences are short.

The standard benchmark for evaluating this is the Needle in a Haystack test, where a specific fact is inserted at a known position within a large irrelevant document and the model is asked to retrieve it. Passing NIAH at 200K is now table stakes. Demonstrating consistent performance across the full 1M range, at every position, is what separates a real production capability from a headline number.
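The harness itself is simple; rigor comes from sweeping depth and context length exhaustively. A hypothetical sketch, with `ask_model` standing in for a real API call:

```python
# Needle-in-a-Haystack sketch: plant a known fact at a controlled depth in
# filler text, then check whether the model's answer recovers it.
NEEDLE = "The secret launch code is AMBER-7."
FILLER = "The quick brown fox jumps over the lazy dog. " * 1000

def build_haystack(depth: float) -> str:
    """Place the needle at a fractional depth (0.0 = start, 1.0 = end)."""
    cut = int(len(FILLER) * depth)
    return FILLER[:cut] + NEEDLE + " " + FILLER[cut:]

def score(answer: str) -> bool:
    return "AMBER-7" in answer

# A real run sweeps both depth and total context length, producing the
# position-by-position recall grid the text describes:
for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    haystack = build_haystack(depth)
    # answer = ask_model(haystack, "What is the secret launch code?")
    # record (depth, score(answer))
```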

Anthropic’s history here is one of the stronger arguments in its favor. Claude 3 and 3.5 models were notable for high NIAH scores at 200K, outperforming competitors that technically supported the same window. If that same care for long-context quality carried through the extension to 1M, the GA announcement carries more weight than a comparable one from a vendor with a weaker track record. The evidence that matters, though, is published eval numbers rather than press releases.

How Big Is a Million Tokens, Concretely

A million tokens maps to roughly 750,000 English words, or about eight average-length novels in a single context. A dense technical codebase of 40,000 to 50,000 lines fits comfortably. An hour of meeting transcript runs roughly 8,000 to 10,000 words, meaning 75 to 90 hours of recordings could fit in a single request. The full text of Wikipedia’s English articles on a given topic, including supporting linked articles, becomes tractable to load directly.
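These figures are rules of thumb rather than exact tokenizer behavior, and the arithmetic behind them is easy to check:

```python
# Rough conversions behind the figures above. The words-per-token ratio is a
# common English-prose approximation, not exact tokenizer behavior.
TOKENS = 1_000_000
WORDS_PER_TOKEN = 0.75

words = TOKENS * WORDS_PER_TOKEN        # 750,000 words
novels = words / 95_000                 # assuming ~95k words per novel
transcript_hours = words / 9_000        # at 8-10k words per hour of speech

print(f"{words:,.0f} words, ~{novels:.0f} novels, ~{transcript_hours:.0f} hours")
```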

That scale is useful for humans in a few narrow scenarios, but it is most significant for automated pipelines. No person reads 750,000 words in a single session and then asks a question; agents do, and for them the context ceiling has been the binding constraint on what they can reason over in a single pass.

The Economics Require Prompt Caching

At standard Opus 4.6 input pricing, a full 1M-token context call is expensive. Without caching, long-context workflows are financially impractical at production scale.

Anthropic’s prompt caching changes the math considerably. Once a context prefix is cached using cache_control, subsequent requests that reuse the same prefix pay roughly 10% of the uncached input cost. For any workflow involving multiple questions against the same large document or codebase, the first call is expensive and every subsequent call becomes tractable.
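The shape of that cost curve is easy to model. The price below is a hypothetical placeholder (take real numbers from Anthropic's pricing page); what matters is the structure: one full-price call, then cached reads at roughly 10% of the input rate:

```python
# Sketch of the caching cost curve. P is a hypothetical placeholder price,
# not Anthropic's actual rate; cache_read_frac reflects the ~10% figure above.
def workflow_cost(context_tokens, n_queries, price_per_mtok, cache_read_frac=0.10):
    mtoks = context_tokens / 1_000_000
    first_call = mtoks * price_per_mtok  # pays the full uncached input price
    later_calls = (n_queries - 1) * mtoks * price_per_mtok * cache_read_frac
    return first_call + later_calls

P = 15.0  # hypothetical $/MTok input price
# 20 questions against the same 1M-token codebase, cached vs uncached:
print(workflow_cost(1_000_000, 20, P))
print(20 * 1.0 * P)  # every call at full price
```

With caching, twenty questions cost a few multiples of one call instead of twenty; the per-call headline price describes only the first request.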

import anthropic

client = anthropic.Anthropic()

# Placeholder: the concatenated source files you want the model to reason over.
large_codebase_content = open("codebase_dump.txt").read()

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=4096,
    system=[
        {
            "type": "text",
            "text": large_codebase_content,
            # Mark the prefix as cacheable; identical prefixes on
            # subsequent calls are billed at the cached-read rate.
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "List all database calls that do not use parameterized queries."}
    ]
)

The same cached prefix can be reused across different questions within the cache TTL. For document review, security audits, or any multi-turn analysis workflow, this makes the cost curve look very different from the per-call headline number. Designing around caching is not optional at this scale; it is the architecture.
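The cache only hits when the prefix is byte-identical across calls, so it helps to build requests through one helper that keeps the system block stable. A sketch, with a placeholder context string standing in for the real codebase dump:

```python
# Build identical request payloads so every call shares the cached prefix.
# Sketch only: the context string is a placeholder.
def build_request(question: str, cached_context: str) -> dict:
    return {
        "model": "claude-opus-4-6",
        "max_tokens": 4096,
        "system": [
            {
                "type": "text",
                "text": cached_context,                  # identical prefix
                "cache_control": {"type": "ephemeral"},  # marks it cacheable
            }
        ],
        "messages": [{"role": "user", "content": question}],
    }

context = "...concatenated codebase..."
r1 = build_request("List database calls without parameterized queries.", context)
r2 = build_request("Which modules import the deprecated auth helper?", context)
# Identical system blocks -> the second call reads the cached prefix.
assert r1["system"] == r2["system"]
```

Each payload is then sent with `client.messages.create(**build_request(...))`; within the cache TTL, every call after the first pays mostly the cached-read rate.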

Where This Actually Matters

The clearest wins are in software engineering agents and document intelligence. For agents doing codebase-wide analysis, the shift from 200K to 1M means a moderately complex monorepo fits in context without chunking. Chunking introduces its own failure modes: cross-file references get severed, global context about naming conventions and data flow disappears, and retrieval-augmented approaches require maintaining a separate embedding index that drifts from the actual code. Loading the full repository is slower on first call but produces qualitatively different results for tasks that require whole-program understanding.

Legal document analysis, technical due diligence, and systematic literature review fall into the same category. These tasks have historically required custom chunking pipelines, retrieval layers, and careful prompt engineering to maintain coherence across document boundaries. A 1M context window does not eliminate the need for thoughtful system design, but it removes a category of problem that was previously unavoidable.

OpenAI’s flagship models remain capped at 128K tokens as of this writing. For anyone choosing between providers for long-context applications, that gap is now hard to ignore when Anthropic and Google both offer 1M at GA.

Google had this headline number first, with Gemini 1.5 Pro shipping 1M context at GA in May 2024. The competitive question has always been whether quality holds at that length. Anthropic’s track record at 200K is a reasonable prior, but the NIAH and LOFT benchmark scores at 1M are what will actually differentiate providers in production decisions. The GA milestone puts Anthropic’s implementation to that test in earnest.
