One Million Tokens: What Changes When the Context Window Reaches This Scale
Source: simonwillison
Anthropic has made 1 million token context generally available for both claude-opus-4-6 and claude-sonnet-4-6. As Simon Willison noted when the release landed, the change requires no special access tier: it is available through the standard API to all users.
The raw number is easy to wave past. This post is about why 1M tokens is a different category from 200k, where the technical complexity sits, and what it opens up for developers building production applications on top of these models.
The Trajectory That Got Us Here
GPT-3 launched in 2020 with a 4,096-token context. Claude 1 arrived in early 2023 with 100,000 tokens, which at the time felt like a genuine step change for document processing use cases. The Claude 2 and Claude 3 families consolidated around 200,000 tokens as their standard ceiling. Google moved first to 1M with Gemini 1.5 Pro in early 2024, initially behind a waitlist before broader availability. OpenAI’s GPT-4 family has remained at 128,000 tokens.
Anthropic now brings claude-opus-4-6 and claude-sonnet-4-6 to parity with the highest production context lengths available from any major provider.
Why Long Context Is Architecturally Hard
The standard transformer self-attention mechanism has O(n²) time and memory complexity relative to sequence length: multiplying the sequence length by k multiplies the attention cost by k². Going from 200k to 1M tokens, a 5x increase in sequence length, means naive attention would cost roughly 25 times more compute. The inference economics simply do not work at that scale without substantial engineering.
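The scaling arithmetic is worth making concrete. A two-line sketch:

```python
def relative_attention_cost(new_len: int, base_len: int) -> float:
    """Naive O(n^2) attention: cost grows with the square of sequence length."""
    return (new_len / base_len) ** 2

# 200k -> 1M tokens: 5x the sequence length, 25x the attention compute.
print(relative_attention_cost(1_000_000, 200_000))  # 25.0
```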
Several techniques reduce this in practice. Flash Attention, developed by Tri Dao and colleagues at Stanford, restructures the attention computation to minimize memory bandwidth pressure by tiling matrix multiplication so it fits in fast on-chip SRAM rather than repeatedly reading and writing to slower HBM. Flash Attention 2 and 3 have pushed this further, and the technique has become standard across most large model inference stacks.
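The core numerical trick behind tiled attention is an online (streaming) softmax: process one KV tile at a time while carrying only a running max, a running normalizer, and a running output, so the full attention matrix is never materialized. Here is a toy single-query NumPy sketch of that idea; real Flash Attention fuses this into GPU kernels operating on SRAM-resident tiles, which this deliberately does not model.

```python
import numpy as np

def naive_attention(q, K, V):
    """Standard attention for one query: softmax(K @ q) @ V, all at once."""
    scores = K @ q
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

def blockwise_attention(q, K, V, block=4):
    """Same result, computed one KV tile at a time with an online softmax."""
    m = -np.inf                       # running max of scores seen so far
    denom = 0.0                       # running softmax normalizer
    out = np.zeros(V.shape[1])
    for start in range(0, len(K), block):
        k_tile, v_tile = K[start:start + block], V[start:start + block]
        scores = k_tile @ q
        m_new = max(m, scores.max())
        scale = np.exp(m - m_new)     # rescale previously accumulated results
        w = np.exp(scores - m_new)
        denom = denom * scale + w.sum()
        out = out * scale + w @ v_tile
        m = m_new
    return out / denom

rng = np.random.default_rng(0)
K, V, q = rng.normal(size=(16, 8)), rng.normal(size=(16, 8)), rng.normal(size=8)
assert np.allclose(naive_attention(q, K, V), blockwise_attention(q, K, V))
```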
For distributing attention across multiple accelerators, Ring Attention, from researchers at UC Berkeley, partitions key-value pairs across devices and passes them in a ring pattern. This enables near-linear scaling of the effective context window with hardware, rather than hitting a wall at the memory capacity of a single device.
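The communication pattern is easy to simulate. In this toy schedule (a sketch of the ring idea only, with the actual attention math omitted), each of N devices starts with one KV shard, processes whatever shard it currently holds, and passes it to its neighbor; after N steps every device has seen every shard exactly once, so no single device ever needs the full KV cache.

```python
def ring_schedule(num_devices: int):
    """Return, per device, the order in which it processes KV shards."""
    held = list(range(num_devices))         # shard currently on each device
    seen = [[] for _ in range(num_devices)]
    for _ in range(num_devices):
        for dev in range(num_devices):
            seen[dev].append(held[dev])     # process the shard held this step
        # rotate: each device hands its shard to the next device in the ring
        held = [held[(dev - 1) % num_devices] for dev in range(num_devices)]
    return seen

schedule = ring_schedule(4)
assert all(sorted(s) == [0, 1, 2, 3] for s in schedule)
```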
Position embeddings are the other piece. Early models used fixed sinusoidal or learned absolute positions that generalized poorly beyond their training length. Rotary Position Embeddings (RoPE) encode relative position information in a way that extrapolates more gracefully to longer sequences, and variants like YaRN and LongRoPE extend this by adjusting the rotation frequencies for sequences much longer than the training distribution. Anthropic has not published specifics of their implementation, but serving 1M tokens commercially implies production-grade versions of techniques in this family.
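A minimal NumPy sketch of RoPE shows the property that makes it work for long context: consecutive dimension pairs are rotated by position-dependent angles, so the dot product between a rotated query and key depends only on their relative offset, not their absolute positions.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary position embeddings to vector x at integer position pos.

    Each consecutive pair of dimensions is rotated by pos * freq, with a
    different frequency per pair; long-context variants like YaRN work by
    rescaling these frequencies.
    """
    d = len(x)
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per dim pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

# Defining property: same relative offset => same attention score,
# regardless of where the pair sits in the sequence.
rng = np.random.default_rng(1)
q, k = rng.normal(size=8), rng.normal(size=8)
assert np.isclose(rope(q, 5) @ rope(k, 9), rope(q, 100) @ rope(k, 104))
```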
What 1M Tokens Actually Unlocks
200,000 tokens is approximately 150,000 words, or around 600 pages. That fits most individual documents, most research papers, and a moderate-sized codebase with selective file inclusion.
1,000,000 tokens is approximately 750,000 words, or around 3,000 pages. Several thresholds are crossed at this scale that simply do not exist at 200k.
A typical medium-sized software project at 50,000 lines of code, assuming an average line length of around 40 characters and four characters per token, occupies roughly 500,000 tokens. The entire project fits in a single context, with room left over for instructions and output. Not a curated selection of relevant files, not a summary of the rest. You can ask about architectural patterns, dependency relationships, or code paths without deciding up front which files matter for the question. Before this, that decision was unavoidable, and the cost of getting it wrong was a model with an incomplete picture.
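That estimate is coarse by design. The characters-per-token ratio varies by programming language and tokenizer (use the provider's token counting endpoint for real numbers), but the same back-of-envelope assumptions as above can be expressed directly:

```python
def estimate_tokens(num_lines: int, chars_per_line: int = 40,
                    chars_per_token: int = 4) -> int:
    """Rough token estimate for a codebase dump; ratios are assumptions."""
    return num_lines * chars_per_line // chars_per_token

print(estimate_tokens(50_000))  # 500000 -- fits in 1M with room to spare
```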
For legal and compliance work, multiple substantial contracts or regulatory filings that previously required sequential processing can be analyzed together. The model can surface inconsistencies between documents or patterns that only emerge from reading them in relation to each other, rather than treating each document as an isolated unit.
For archival and research applications, a full corpus of papers from a narrow field published over several years can fit in a single context. Synthesis tasks become feasible without a retrieval layer: the model holds the full body of literature and reasons across it directly.
For extended conversation logging and personal knowledge management, several years of messages or notes fit within a single query. The question of what was decided about a given topic six months ago stops requiring search infrastructure and becomes something you can ask directly.
The Constraints That Remain
Cost is the first consideration. A 1M token prompt costs meaningfully more than a 200k prompt, and the gap scales linearly with input length at current pricing. For high-volume inference applications, the economics of long context versus retrieval-augmented generation still matter. RAG remains the right architecture for lookup-heavy tasks where relevant chunks can be identified cheaply. Long context is worth the cost when the task requires reasoning over distributed or interconnected content, where the relationships between sections carry meaning that cannot be pre-identified and extracted.
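The linear scaling is easy to sanity-check. The per-token price below is a placeholder, not Anthropic's actual rate (check the current pricing page before budgeting); the point is only the shape of the curve:

```python
import math

PRICE_PER_MILLION_INPUT_TOKENS = 3.00   # hypothetical $/Mtok, not a real rate

def prompt_cost(input_tokens: int) -> float:
    """Input cost scales linearly with prompt length."""
    return input_tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS

# A 1M-token prompt costs 5x a 200k-token prompt, every single request.
assert math.isclose(prompt_cost(1_000_000), 5 * prompt_cost(200_000))
```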
Latency is the second consideration. Time-to-first-token scales with context length. A million-token request will take longer to begin responding than a shorter one, even with efficient attention implementations. Applications with strict response-time budgets need to benchmark their specific input sizes rather than assuming production latency from shorter-context usage.
Retrieval quality is the third. Research documented in “Lost in the Middle” (Liu et al., 2023) showed that language models systematically retrieve information from the beginning and end of long contexts more reliably than from the middle. Claude’s needle-in-a-haystack evaluations have shown strong performance, and Anthropic has invested substantially in improving long-context retrieval, but the underlying pattern has not fully disappeared from the field. For applications where a specific piece of information must be recalled accurately from an arbitrary position in a 1M token input, building explicit retrieval tests into your evaluation suite before assuming production reliability is time well spent.
Using It From the API
The context increase requires no changes to how you call the API. Pass content through the standard messages array to claude-opus-4-6 or claude-sonnet-4-6, and 1M tokens is the default ceiling. The Anthropic API documentation covers the full message format.
```python
import anthropic

client = anthropic.Anthropic()

with open("full_codebase_dump.txt") as f:
    content = f.read()

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    messages=[
        {
            "role": "user",
            "content": f"{content}\n\nDescribe the overall architecture and flag any potential security issues."
        }
    ],
)

print(message.content[0].text)
```
A few patterns improve output quality at this scale. Providing explicit document structure with clear section headers and document boundaries helps the model navigate large inputs. Generating a hierarchical summary as a first pass before asking detailed questions about specific sections often outperforms sending the raw input and asking complex questions directly. And planting known facts at various positions in your context and checking whether the model retrieves them correctly is a useful validation step before deploying any application that depends on long-context recall.
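That last validation step is cheap to scaffold. This sketch builds the prompt side of a needle-in-a-haystack test: it plants a known fact at a chosen fractional depth of a long filler context. The model call and answer check are omitted, and the needle text here is made up for illustration.

```python
def build_haystack_prompt(filler_paragraphs, needle: str, depth: float) -> str:
    """Insert needle into the filler at fractional depth (0.0=start, 1.0=end)."""
    docs = list(filler_paragraphs)
    idx = round(depth * len(docs))
    docs.insert(idx, needle)
    return "\n\n".join(docs)

filler = [f"Background paragraph {i}." for i in range(100)]
needle = "The deploy password is kept in vault item 7731."  # hypothetical fact

# Sweep depths; in a real test, send each prompt and check the model's answer.
for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    prompt = build_haystack_prompt(filler, needle, depth)
    assert needle in prompt
```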
Where This Fits Competitively
The competitive dynamic between Anthropic and Google on context length has pushed real progress. Gemini 1.5 Pro reached 1M tokens in early 2024, and Google has experimented with 2M context in subsequent releases. OpenAI has not moved the GPT-4 family beyond 128k. For application categories where context length is the binding constraint, the choice of model provider has had real consequences, and developers building on the full Claude family now have comparable options to what Google has offered for the past year.
claude-opus-4-6 is the stronger choice for complex reasoning tasks over long inputs where depth of analysis matters. claude-sonnet-4-6 offers better cost and latency for applications where full reasoning depth is not required on every request. For most new projects, starting with Sonnet and escalating to Opus for specific tasks that benefit from deeper synthesis is a reasonable default.
The 1M GA release removes a real architectural constraint that has shaped how certain applications could be built. For use cases that have been scoped around the 200k limit, the limit is now gone.