One Million Tokens: What Anthropic's Context Expansion Actually Changes
Source: hackernews
Anthropic has made 1 million token context generally available for both Claude Opus 4.6 and Sonnet 4.6. That’s a 5x jump from the 200K context that defined the Claude 3.x generation, and it puts Anthropic on level footing with Gemini 1.5, which has had 1M context since early 2024. Getting here required solving some genuinely hard engineering problems, and the practical implications for developers building on top of these models are more nuanced than the headline number suggests.
The Scale of What 1M Tokens Means
One million tokens is roughly 750,000 words, or about 1,500 pages of dense prose. In software terms, that’s comfortably enough to hold a mid-sized monorepo’s worth of source files, a year’s worth of Slack message history, or all the documentation for a complex framework plus the code that implements it, all in a single context.
For practical comparison: GPT-4 Turbo shipped at 128K tokens, which already felt like a lot at the time. Claude 3’s 200K was a meaningful improvement. At 1M, you’re entering territory where the bottleneck genuinely shifts from “how much can I fit” to “how do I structure what I’m passing in.”
The Engineering Lift
Getting to 1M tokens is not a matter of flipping a configuration switch. Standard transformer attention has a memory cost that scales with the square of sequence length. A naive implementation at 1M tokens would require storing an attention matrix of 10^12 entries per layer, which is completely intractable.
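To make the quadratic blowup concrete, here is the back-of-envelope arithmetic for materializing a full attention score matrix at different context lengths. The fp16 assumption and per-layer-per-head framing are illustrative, not a claim about any particular model:

```python
BYTES_PER_ENTRY = 2  # fp16

def attention_matrix_bytes(seq_len: int) -> int:
    """Memory to store one n x n attention score matrix."""
    return seq_len * seq_len * BYTES_PER_ENTRY

for n in (128_000, 200_000, 1_000_000):
    gib = attention_matrix_bytes(n) / 2**30
    print(f"{n:>9,} tokens -> {gib:8,.0f} GiB per layer per head")
```

At 1M tokens that single matrix is around 1.8 TiB, which is why no production system ever materializes it.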
The practical solutions involve several overlapping techniques. FlashAttention, developed at Stanford and now standard across frontier model training and inference, computes attention in tiles, keeping the working set in fast SRAM rather than materializing the full attention matrix in HBM. This drops attention's memory footprint from O(n²) to O(n) (the compute remains quadratic) without changing the mathematical result.
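The core trick that makes tiling possible is the "online softmax": attention for a query can be accumulated one block of keys at a time, carrying only a running max, a running denominator, and a running weighted sum. The NumPy sketch below shows that idea for a single query vector; it is a toy illustration of the math, not FlashAttention itself, which fuses these steps into tiled GPU kernels:

```python
import numpy as np

def tiled_attention(q, K, V, tile=4):
    """Attention output for one query, processing keys in tiles.
    Only a running max (m), running softmax denominator (s), and
    running weighted-value accumulator (acc) are kept in memory;
    the full score row is never materialized at once."""
    d = q.shape[0]
    m = -np.inf
    s = 0.0
    acc = np.zeros_like(V[0], dtype=np.float64)
    for start in range(0, K.shape[0], tile):
        k_blk, v_blk = K[start:start + tile], V[start:start + tile]
        scores = k_blk @ q / np.sqrt(d)
        m_new = max(m, scores.max())
        scale = np.exp(m - m_new)        # rescale previous accumulators
        w = np.exp(scores - m_new)
        s = s * scale + w.sum()
        acc = acc * scale + w @ v_blk
        m = m_new
    return acc / s

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K = rng.standard_normal((16, 8))
V = rng.standard_normal((16, 8))

# Agrees with the naive full-matrix computation.
scores = K @ q / np.sqrt(8)
weights = np.exp(scores - scores.max())
ref = (weights / weights.sum()) @ V
assert np.allclose(tiled_attention(q, K, V), ref)
```

Because each tile only rescales the running accumulators, the result is bit-for-bit the same softmax attention, just computed in constant memory per query.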
On the KV cache side, grouped query attention (GQA) and multi-query attention (MQA) reduce the number of key-value heads that need to be stored per token. With GQA, query heads are divided into groups that each share a single KV head: a model with 64 query heads might keep only 8 KV heads, cutting cache memory eightfold without much perceptible quality loss on most tasks.
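The savings compound with sequence length, which is why GQA matters so much at 1M tokens. A quick calculation with a hypothetical model configuration (the layer count, head dimension, and precision below are illustrative placeholders, not Claude's actual architecture):

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per=2):
    """KV cache size: keys AND values (the 2x), per layer, per KV head,
    per token, at the given precision (fp16 by default)."""
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per

# Hypothetical 80-layer model: full multi-head attention vs. 8-way GQA.
full_mha = kv_cache_bytes(1_000_000, n_layers=80, n_kv_heads=64, head_dim=128)
gqa      = kv_cache_bytes(1_000_000, n_layers=80, n_kv_heads=8,  head_dim=128)
print(f"MHA: {full_mha / 2**30:,.0f} GiB   GQA: {gqa / 2**30:,.0f} GiB")
```

Under these assumptions the cache shrinks from multiple terabytes to a few hundred gigabytes, the difference between "impossible" and "expensive but servable."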
There’s also the question of positional encoding. Standard RoPE (Rotary Position Embedding) needs to be extended or modified for very long sequences since it was originally designed for shorter ranges. Techniques like YaRN (Yet Another RoPE extensioN) or dynamic NTK-aware scaling allow models to generalize to sequence lengths longer than those seen during training.
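The NTK-aware idea can be shown in a few lines: instead of interpolating positions, the RoPE base is raised so that low-frequency (long-wavelength) components stretch to cover the longer range while high-frequency components, which encode local order, stay nearly untouched. The recipe below follows the common open-source formulation; nothing here reflects what Anthropic actually ships:

```python
import numpy as np

def rope_inv_freq(head_dim, base=10000.0, scale=1.0, ntk_aware=False):
    """RoPE inverse frequencies for even dimensions 0, 2, ..., head_dim-2.
    With ntk_aware=True, the base is raised by scale**(d/(d-2)), the
    standard NTK-aware trick for extending context length by ~scale."""
    if ntk_aware:
        base = base * scale ** (head_dim / (head_dim - 2))
    return 1.0 / base ** (np.arange(0, head_dim, 2) / head_dim)

orig = rope_inv_freq(128)
extended = rope_inv_freq(128, scale=8.0, ntk_aware=True)  # ~8x context
```

The highest frequency (dimension 0) is identical in both, so token-adjacent relationships are preserved, while the lowest frequency is stretched to span the extended range.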
The Problem That Doesn’t Go Away
More context is not the same as better use of context. The 2023 “Lost in the Middle” paper by Liu et al. demonstrated a consistent pattern across large language models: performance on retrieval and reasoning tasks peaks for information placed at the start or end of the context window, and degrades noticeably for information buried in the middle. The effect becomes more pronounced as context length grows.
Anthropic has invested significantly in evaluating and improving long-context retrieval fidelity, and the Claude 3.x models showed strong results on “needle in a haystack” style evaluations at 200K. Whether those properties hold reliably at 1M, and particularly whether the “lost in the middle” degradation is mitigated, is something that will need to shake out through real-world use rather than internal benchmarks.
The practical takeaway: for tasks where the relevant information could appear anywhere in a long document, retrieval augmented generation (RAG) with a smaller focused context may still outperform naively stuffing everything into one 1M-token prompt. Long context and retrieval-based approaches are complementary, not mutually exclusive.
What This Looks Like at the API
Using 1M context through the Anthropic SDK is straightforward. The same messages.create call works; you just pass more content:
```python
import anthropic

client = anthropic.Anthropic()

with open("full_codebase.txt", "r") as f:
    codebase = f.read()

message = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=4096,
    messages=[
        {
            "role": "user",
            "content": (
                f"Here is the full codebase:\n\n{codebase}\n\n"
                "Identify all places where error handling is inconsistent "
                "or missing entirely. Focus on API boundary functions."
            ),
        }
    ],
)

print(message.content[0].text)
```
The catch is latency and cost. Input token pricing scales linearly, and at 1M tokens, even a single request carries a significant input cost. Time-to-first-token (TTFT) also increases with context length, since the model must process the full context before generating output. For latency-sensitive applications, this matters. Streaming helps with perceived responsiveness, but the model still needs to process the full input before the first token of output can begin.
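Because input pricing is linear in tokens, the cost of a maxed-out request is easy to estimate up front. The per-million-token prices below are placeholders for illustration, not Anthropic's actual 4.6 pricing:

```python
# USD per million input tokens -- HYPOTHETICAL numbers for illustration only.
PRICE_PER_MTOK_INPUT = {"opus-tier": 15.00, "sonnet-tier": 3.00}

def input_cost_usd(n_tokens: int, tier: str) -> float:
    """Linear input cost: tokens scaled by the per-million rate."""
    return n_tokens / 1_000_000 * PRICE_PER_MTOK_INPUT[tier]

for tier in PRICE_PER_MTOK_INPUT:
    print(f"{tier}: ${input_cost_usd(1_000_000, tier):.2f} per 1M-token request")
```

Whatever the real rates are, the shape of the conclusion holds: a full-context request costs the same as dozens of short ones, so prompt caching and careful batching pay off quickly.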
For Discord bot use cases specifically, this changes the calculus on conversation memory management. Instead of truncating history or using summarization chains to fit inside a 200K window, you can now retain much longer raw conversation histories. For a busy server with high message volume, even 1M tokens will eventually fill up, but the frequency of needing to summarize or truncate drops dramatically.
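Even with the larger window, a long-running bot still needs a trimming policy for the day the history overflows. A minimal sketch: walk backwards from the newest message and keep whatever fits in a token budget. The `count_tokens` heuristic here (roughly four characters per token) is a crude stand-in; a real bot would use the provider's token-counting endpoint:

```python
def count_tokens(text: str) -> int:
    """Crude ~4-chars-per-token heuristic; replace with a real tokenizer."""
    return max(1, len(text) // 4)

def trim_history(messages, budget_tokens=900_000):
    """Keep the most recent messages that fit inside the token budget,
    preserving chronological order in the returned list."""
    kept, used = [], 0
    for msg in reversed(messages):
        cost = count_tokens(msg["content"])
        if used + cost > budget_tokens:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))

history = [{"role": "user", "content": "x" * 40} for _ in range(5)]
assert len(trim_history(history, budget_tokens=25)) == 2
```

At a 900K budget the trim rarely fires at all, which is exactly the point of the paragraph above: the mechanism stays, but it stops being the thing you design around.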
How This Compares to Gemini
Gemini 1.5 Pro has offered 1M context (and 2M in some variants) since February 2024. Gemini 1.5 Flash also supports 1M tokens. Google built this capability into the Gemini 1.5 generation on top of a Mixture of Experts architecture, reportedly combined with attention optimizations that make very long contexts tractable, though the specifics are not publicly disclosed.
The comparison matters for developers choosing between providers. Gemini has had more time to gather production feedback at this scale, and their pricing for long-context inputs is competitive. On the other hand, Claude’s strength in nuanced reasoning and instruction-following on complex prompts is well-established. At 1M tokens, those reasoning qualities become more important, not less, since the tasks you’d bring to this context length tend to be complex by nature.
What Actually Changes for Agents
Agentic workflows stand to benefit most from this expansion. An agent that can maintain a full, uncompressed view of its working environment, tool call history, past reasoning steps, and current state without hitting memory limits will reason more consistently across long-running tasks. Compression and summarization introduce information loss; having the raw history available removes that as a failure mode.
For the kinds of multi-step automation I build, the transition from managing context as a scarce resource to treating it as generally abundant is genuinely useful. It shifts the design question from “what do I cut” to “what do I include and why,” which is a better problem to have. The architectural discipline of thinking carefully about what context actually matters doesn’t disappear, but the hard limits that forced it become softer constraints.
Million-token context being generally available in both Opus 4.6 and Sonnet 4.6 means it’s accessible at multiple price points. Sonnet sits a tier below Opus in capability but well below it in cost, which now gives developers a way to experiment with long-context workflows without paying Opus rates on every request. That’s probably the more practically significant part of this announcement for most teams.