Token counts are not a detail you can ignore when you’re paying per million tokens. A tokenizer change between model versions is one of the quieter ways your API bill can shift without any change on your end, and Claude 4.7’s tokenizer is getting attention on Hacker News for exactly this reason.
Let me go beyond the measurements and explain what drives these differences, how to instrument your own workloads, and where the cost pressure actually comes from.
How Tokenizers Work and Why They Change
Nearly every production LLM uses a subword tokenizer, most commonly Byte Pair Encoding (BPE) or a close variant. BPE starts with individual bytes or characters and iteratively merges the most frequent adjacent pair into a single token, building a vocabulary one merge at a time. The final vocabulary size, and the order in which pairs were merged, determine how any given string gets tokenized.
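To make the merge loop concrete, here is a minimal character-level sketch of BPE training and encoding. It is a toy for illustration only: the function names are made up for this example, it works on characters rather than bytes, and it is not how any production tokenizer (Anthropic's included) is implemented.

```python
from collections import Counter


def apply_merge(tokens: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_bpe(text: str, num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merge rules from `text` (toy, character-level)."""
    tokens = list(text)
    merges: list[tuple[str, str]] = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # Pick the most frequent adjacent pair; ties go to the pair seen first.
        best = max(pairs, key=pairs.get)
        if pairs[best] < 2:
            break  # nothing repeats, so further merges would not compress anything
        merges.append(best)
        tokens = apply_merge(tokens, best)
    return merges


def encode(text: str, merges: list[tuple[str, str]]) -> list[str]:
    """Tokenize `text` by replaying the learned merges in training order."""
    tokens = list(text)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens
```

Training on a different corpus yields a different merge list, and the same input string then encodes to a different number of tokens. That is the whole mechanism behind version-to-version drift.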
When Anthropic trains a new model, it may train a new tokenizer alongside it, or at minimum retune the merge rules against a refreshed corpus. A corpus that contains more code, more structured data, more multilingual text, or more API-style JSON than the previous training run will produce different merge priorities. The result is a vocabulary that carves up common patterns more efficiently in some areas and less efficiently in others.
This is not a flaw. It is the tokenizer learning the distribution of text it will actually see. But from a cost perspective, the changes are asymmetric: if your workload looks like the new training distribution, you pay less; if it diverges from it, you pay more.
Measuring Token Counts with the Anthropic SDK
Anthropic exposes a dedicated count_tokens endpoint that lets you measure token usage without sending a full inference request. This is the right tool for benchmarking tokenizer changes across model versions.
import anthropic

# Reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()


def count(model: str, text: str) -> int:
    """Return the input token count a model reports for one user message."""
    response = client.messages.count_tokens(
        model=model,
        messages=[{"role": "user", "content": text}],
    )
    return response.input_tokens


# One short sample per content type; swap in text from your own workload.
samples = {
    "prose": "The deployment pipeline failed at the integration step because...",
    "json": '{"event": "user.signup", "timestamp": 1713456000, "userId": "u_8f3k"}',
    "code": "def fibonacci(n: int) -> list[int]:\n a, b = 0, 1\n return [a := a + b - (b := a + b - a) for _ in range(n)]",
    "numbers": "192.168.1.1 port 8443 pid 23901 latency 142ms status 200",
}

# Old model first, new model second, so a positive delta means "more tokens".
models = ["claude-opus-4-5", "claude-opus-4-7"]

for label, text in samples.items():
    counts = {m: count(m, text) for m in models}
    delta = counts[models[1]] - counts[models[0]]
    print(f"{label}: {counts} | delta {delta:+d}")
Running this against real workloads gives you an honest picture before you migrate. The count_tokens call does not run inference or produce output tokens, so you can run it against a representative sample of your production traffic, subject to the endpoint's rate limits.
Where the Cost Differences Concentrate
Tokenizer efficiency is not uniform across content types. Based on how BPE vocabularies develop and what tends to shift between model generations, the variance concentrates in a few predictable areas.
Numbers and identifiers. BPE struggles with arbitrary numeric strings because they appear in so many combinations that few merge rules apply broadly. An IP address like 192.168.1.105 might tokenize to eight or nine tokens on one model and six on another, depending on whether the vocabulary learned common IP-like patterns. Log files, database IDs, and telemetry data all exhibit this sensitivity.
Code with dense symbol usage. Languages like Rust, C++, and regular expressions contain sequences of punctuation that are rare in prose but common in code. If the new tokenizer was trained on a corpus with more code, merge rules for ->, ::, =>, !=, and similar patterns improve. TypeScript generics, CSS selectors, and shell one-liners tend to see the largest deltas.
Non-Latin scripts. CJK characters, Arabic, Hebrew, and Indic scripts are a known pressure point. Each Unicode character may fall back to multiple byte-level tokens if the vocabulary lacks sufficient coverage. A tokenizer trained on a corpus with better multilingual representation will compress these scripts more efficiently. The inverse is also true: a corpus rebalanced toward English-heavy technical writing might regress on previously covered scripts.
Repeated whitespace and indentation. Python, YAML, and Markdown all use indentation semantically. Some tokenizers learn to merge runs of spaces into single tokens; others do not. A four-space indent that cost one token can cost four if a merge rule was dropped.
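The indentation and byte-fallback effects can be reproduced with a toy tokenizer. The sketch below uses greedy longest-match against a hand-written vocabulary, which is a simplification: real BPE applies merge rules in priority order, and the vocabularies and the function name here are invented for the example. It still shows how one missing vocabulary entry changes the count.

```python
def greedy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Toy longest-match tokenizer; unknown characters fall back to UTF-8 bytes."""
    max_len = max((len(v) for v in vocab), default=1)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # No vocabulary entry matched: emit one token per UTF-8 byte.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


# Hypothetical vocabularies: identical except for a four-space indent token.
base_vocab = {" ", "x", "=", "1"}
line = "    x = 1"
print(len(greedy_tokenize(line, base_vocab)))             # 9: each space is its own token
print(len(greedy_tokenize(line, base_vocab | {"    "})))  # 6: the indent is one token
```

The same function shows the non-Latin case: a CJK character that is missing from the vocabulary costs three byte tokens, and one token once the vocabulary covers it.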
Cost Arithmetic at Scale
For a rough sense of what tokenizer drift means at production scale, consider a workload that sends 50 million input tokens per day to Claude Opus 4.7. If a tokenizer change increases average token count by 8% across your content mix, that translates to 4 million additional input tokens per day. Taking $15 per million input tokens as an illustrative rate (check the current price list for your model), that is $60/day or roughly $22,000/year from a change you did not make.
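The same arithmetic as a small helper, so you can plug in your own volume, measured drift, and the rate you actually pay. The function name and parameters are made up for this sketch; the figures in the usage lines are the ones from the paragraph above.

```python
def extra_annual_cost(
    tokens_per_day: float,
    pct_increase: float,
    usd_per_million: float,
    days: int = 365,
) -> float:
    """Extra yearly spend caused by a percentage change in input token count.

    A negative `pct_increase` models a tokenizer that got more efficient,
    in which case the result is a (negative) saving.
    """
    extra_tokens_per_day = tokens_per_day * pct_increase / 100
    extra_usd_per_day = extra_tokens_per_day / 1_000_000 * usd_per_million
    return extra_usd_per_day * days


# 50M tokens/day, +8% drift, $15 per million input tokens
print(extra_annual_cost(50_000_000, 8, 15.0))  # 21900.0
```

Only input tokens are modeled here. If the new tokenizer also changes how many output tokens a response takes, that is a separate term at the output rate.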
The math inverts when the tokenizer becomes more efficient for your content type. Teams running heavily code-focused workloads may find Opus 4.7 cheaper per unit of work than its predecessor, not more expensive, because the new vocabulary compresses source code better.
This is why the original measurement post sparked 357 comments on HN: the outcome is workload-dependent, and the only honest answer is to measure.
Comparing Across Providers
For context, OpenAI’s tiktoken library handles tokenization for GPT-4o and the o-series models. GPT-4o uses the o200k_base vocabulary with roughly 200,000 tokens, up from the 100,277-token cl100k_base used by GPT-4. The larger vocabulary improves compression, particularly for code and multilingual content. Any vocabulary expansion across Claude generations would plausibly follow a similar logic.
You can compare tokenization behavior cross-provider with a simple harness:
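from pathlib import Path

import anthropic
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # GPT-4o
text = Path("sample.py").read_text(encoding="utf-8")

# disallowed_special=() encodes strings like "<|endoftext|>" as plain text
# instead of raising an error if they happen to appear in the file.
openai_tokens = len(enc.encode(text, disallowed_special=()))

# count_tokens reports the whole request, so it includes a small per-message
# overhead that a raw encode() call does not. Use a reasonably long sample.
anthropic_tokens = anthropic.Anthropic().messages.count_tokens(
    model="claude-opus-4-7",
    messages=[{"role": "user", "content": text}],
).input_tokens

print(f"OpenAI o200k_base: {openai_tokens}")
print(f"Anthropic Opus 4.7: {anthropic_tokens}")
print(f"Ratio: {anthropic_tokens / openai_tokens:.3f}")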
For typical Python source code, the ratio tends to sit close to 1.0, but it diverges for dense symbol usage, long numeric strings, and non-ASCII content.
What to Do With This Information
If you are running a cost-sensitive Claude integration, the practical response is straightforward. Pull a week of representative prompts from your logs, run them through count_tokens on both the old and new model, compute the distribution of deltas, and check whether the tail is in your favor or against you.
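One way to summarize those deltas, assuming you have already collected one token count per prompt for each model (for example via count_tokens). The function name, the output keys, and the nearest-rank percentile are choices made for this sketch, not a standard.

```python
import math
import statistics


def delta_summary(old_counts: list[int], new_counts: list[int]) -> dict[str, float]:
    """Summarize the per-prompt relative change in token count, in percent."""
    if not old_counts or len(old_counts) != len(new_counts):
        raise ValueError("need two non-empty lists of equal length")
    if min(old_counts) <= 0:
        raise ValueError("old token counts must be positive")

    deltas = sorted(
        100 * (new - old) / old for old, new in zip(old_counts, new_counts)
    )
    # Nearest-rank 95th percentile: the value at or below which 95% of prompts fall.
    p95_index = math.ceil(0.95 * len(deltas)) - 1

    return {
        "mean_pct": statistics.fmean(deltas),
        "median_pct": statistics.median(deltas),
        "p95_pct": deltas[p95_index],
        "worst_pct": deltas[-1],
        # Volume-weighted change: this is what the bill actually follows.
        "aggregate_pct": 100 * (sum(new_counts) - sum(old_counts)) / sum(old_counts),
    }
```

The mean and the aggregate can disagree. A handful of very long prompts that expanded will move the aggregate, and therefore the bill, much more than they move the per-prompt average, so look at both.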
If your workload skews toward content types that expanded in token count, the options are: compress inputs more aggressively (trim verbose JSON keys, strip redundant whitespace, abbreviate log lines before sending), cache at the application layer to reduce repeat sends of the same content, or evaluate whether a smaller model in the Claude 4 family covers your quality requirements at lower token cost.
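As an example of the first option, the sketch below re-serializes a JSON payload without optional whitespace and optionally shortens top-level keys. The helper name and the key mapping are hypothetical. Fewer characters does not guarantee fewer tokens, so confirm the saving with count_tokens before relying on it.

```python
import json
from typing import Optional


def compact_json(raw: str, key_map: Optional[dict[str, str]] = None) -> str:
    """Re-serialize JSON without optional whitespace.

    `key_map` optionally renames top-level keys, e.g. {"timestamp": "ts"}.
    Nested keys are left untouched in this sketch.
    """
    data = json.loads(raw)
    if key_map and isinstance(data, dict):
        data = {key_map.get(k, k): v for k, v in data.items()}
    # separators=(",", ":") drops the spaces json.dumps inserts by default;
    # ensure_ascii=False keeps non-ASCII text as-is instead of \uXXXX escapes.
    return json.dumps(data, separators=(",", ":"), ensure_ascii=False)


pretty = '{\n  "event": "user.signup",\n  "timestamp": 1713456000\n}'
print(compact_json(pretty))
print(compact_json(pretty, key_map={"timestamp": "ts"}))
```

Renaming keys only works if the prompt still tells the model what the short keys mean, and that explanation costs tokens too. It tends to pay off when the same schema repeats many times in one request.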
If your workload benefits from the new tokenizer, you can bank the savings or scale up context length without proportionally increasing cost, which is the more interesting case for long-document analysis and extended agentic runs.
Tokenizer changes are not going away. New generations of frontier models regularly recalibrate the vocabulary, and the costs follow. The teams that notice first are the ones who built measurement into their deployment process rather than waiting for a billing surprise.