· 6 min read ·

Paying 45% More for Words You Didn't Ask For: Token Inflation in Claude's Model Upgrades

Source: hackernews

There’s a quiet tax built into every LLM model upgrade, and it rarely appears in the release notes. When Anthropic ships a new version of Opus, the benchmark numbers go up, the capability announcements land, and the blog posts roll in. What gets less attention is that the new model often just… talks more. A lot more.

Bill Chambers’ token leaderboard tracks exactly this: how many output tokens different models produce for a standardized set of prompts. The finding that surfaced on Hacker News to considerable discussion is that Claude Opus 4.7 produces roughly 45% more output tokens than Claude Opus 4.6 when given identical inputs. That number generated 393 comments and 376 upvotes, which tells you this landed somewhere tender for a lot of API users.

What Token Inflation Actually Means

When people say a model has “inflated” token output, they mean that for a semantically equivalent response, the newer model uses more tokens. Not longer because it’s more thorough. Not longer because the task genuinely required it. Just longer, structurally, as a baseline behavior.

Anthropics pricing for Opus 4.7 (as of April 2026) sits at $15 per million output tokens. If your average task previously produced 1,000 output tokens per call on Opus 4.6, and the same task now produces 1,450 tokens on Opus 4.7, your output token costs have increased by 45% without your prompts changing by a single character. For teams running high volumes of agentic workflows, code generation pipelines, or document processing, that isn’t a rounding error.

Latency compounds this problem. Output tokens are generated sequentially; time-to-completion scales roughly linearly with token count for non-streaming consumers. A 45% token increase translates into a 45% slower wall-clock completion time for the same task, assuming similar tokens-per-second throughput. In agentic loops where a model might be called dozens of times to complete a multi-step task, that overhead accumulates fast.

The Verbosity Pattern Across the Industry

This isn’t specific to Anthropic or to the 4.6-to-4.7 transition. Token inflation has been a consistent pattern across the frontier model landscape as models scale.

OpenAI’s GPT-4o was noticeably more verbose than GPT-3.5-turbo on comparable tasks. Google’s Gemini 1.5 Pro produced significantly longer outputs than Gemini 1.0 for structured tasks. The phenomenon appears to be partly an artifact of RLHF and instruction tuning at scale: human raters in feedback pipelines tend to rate longer, more elaborated answers as higher quality, so models get trained toward verbosity as a proxy for quality. The model learns that hedging, summarizing, re-stating the question, and providing caveats correlates with positive reward signal.

This creates a misalignment between what makes a response feel thorough to a human rater in a feedback session and what makes a response economically useful to a developer who is paying per token and building something that has to work reliably at scale.

Are the Extra Tokens Doing Anything?

The more charitable interpretation of token inflation is that it reflects genuinely better reasoning. Chain-of-thought research has demonstrated that models perform better on complex tasks when they produce more intermediate reasoning steps. If Opus 4.7 is producing more tokens because it is working through problems more carefully, then the inflation might be justified.

The problem is that this is difficult to verify on a per-token basis, and the 45% figure appears to hold across a broad distribution of tasks, not just the complex reasoning cases where extended output would be expected. Simple extraction tasks, short-form generation, classification with explanations, all show elevated counts. That pattern is harder to defend as quality-driven.

Some developers have started adding explicit token budget instructions to system prompts. Something like Be concise. Limit your response to what is strictly necessary. or Do not re-state the question, do not summarize at the end, do not add caveats unless directly relevant. These instructions have real effect, but they also require active prompt engineering work on the developer’s side to counteract a default behavior the model shipped with.

Anthropics own extended thinking feature adds a separate thinking token budget that is billed and visible separately from the main response. That separation is useful because it lets you see and control the reasoning overhead explicitly. The problem with standard token inflation is that there is no analogous separation; the extra tokens are woven into the response body and hard to distinguish from substantive content without manual inspection.

The Cost Model Problem

Consider a concrete scenario. You have a pipeline that calls Claude Opus to review and annotate code changes across a large repository. At Opus 4.6’s token volumes, you budgeted $2,000 a month for this workload. You upgrade to Opus 4.7 because the code understanding benchmarks are better and you want the quality gains. Your bill climbs to $2,900 without the workload changing. The benchmark improvements might be real, but whether they justify a 45% cost increase is a question that requires careful measurement, not assumption.

The responsible path here is to benchmark output quality at the task level before committing to a model upgrade, not just read the release notes. Token leaderboards like Bill Chambers’ are useful precisely because they surface this kind of empirical information in a way that isn’t buried in marketing. The community response on Hacker News reflects that a lot of developers discovered this the hard way rather than ahead of time.

Practical Mitigations

If you are running meaningful workloads on Claude and are evaluating the 4.7 upgrade, a few approaches are worth considering before you flip the switch.

First, run your actual production prompt distribution through both models and measure output token counts directly. Do not rely on the general inflation number from a leaderboard; your specific prompts may be above or below the average.

Second, audit whether your tasks actually benefit from the additional verbosity. For agentic tool use where the model needs to reason through multi-step plans, more tokens may genuinely improve reliability. For classification or extraction tasks, the extra output is probably noise.

Third, invest in token budget instructions in your system prompt if you are token-sensitive. Instructions like respond in the minimum tokens necessary to fully address the task have measurable effect on modern instruction-tuned models. Combine these with output format constraints where the task allows it: requesting JSON or structured output rather than prose reduces the room for filler.

Fourth, consider whether you actually need Opus for every task in your pipeline. Sonnet 4.6 sits at a considerably lower price point and lower baseline verbosity. Routing simpler subtasks to Sonnet while reserving Opus for the genuinely complex steps can bring overall cost and latency back toward what you were used to.

What This Should Prompt

The broader takeaway from the leaderboard finding is that model version numbers are not a reliable proxy for cost behavior. A new model being better does not mean it will be cheaper to operate, and the token inflation problem suggests it may frequently be more expensive even at the same nominal per-token price. Evaluation frameworks that only test quality and ignore output length miss a real component of production cost.

For developers building on top of these APIs, output token counts belong in your performance monitoring the same way latency and error rates do. If you only alert on cost as a lagging indicator, a token inflation event on model upgrade will show up as a budget overrun rather than a detectable behavioral change. Tracking tokens-per-task as a metric in your observability stack gives you early warning when model behavior shifts under your workloads, whether from a model upgrade you chose or from a provider-side update you didn’t.

The 45% figure is a benchmark average across a standardized prompt distribution. It is a useful signal, not a definitive number for your specific case. Measure first, then decide whether the capability improvement justifies the cost, because the two are genuinely separable questions.

Was this interesting?