
Speculative Decoding Benchmarks Have Been Lying to You

Source: huggingface

Speculative decoding has become the dominant story in LLM inference optimization over the past two years. The premise is elegant: use a cheap draft model to speculatively generate several tokens ahead, then verify all of them in a single parallel forward pass through the target model. If enough drafts are accepted, you get multiple tokens per expensive step instead of one. The math works, the outputs are provably identical in distribution to greedy or sampled decoding from the target model, and the speedups reported in research papers look compelling.

The problem is that almost nobody is measuring it correctly.

NVIDIA’s SPEED-Bench is the first unified benchmark designed to address the fragmentation in how speculative decoding (SD) is evaluated. The core critique it levels at existing benchmarks is not subtle: prior work uses small prompt sets with limited semantic diversity, tests at batch size one, relies on non-production inference stacks, and sometimes uses random or synthetic tokens as inputs. Each of these choices individually distorts the results. Together, they paint a picture of speculative decoding performance that bears little resemblance to what you will observe in a real serving deployment.

The Metric That Matters, and Why It Is Not Enough

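The primary quality metric for speculative decoding is Acceptance Length (AL): the expected number of draft tokens accepted per verification step. If your draft model proposes four tokens and the target model accepts three of them on average, your AL is roughly three. (Some implementations also count the extra token the target model itself contributes at each step, so check the definition before comparing numbers across tools.) In ideal conditions, meaning drafting is free and a verification pass costs no more than an ordinary decoding step, this translates to a proportional throughput increase, because you are generating three tokens per forward pass instead of one.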
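To make the counting concrete, here is a minimal sketch of one verification step under greedy decoding. It is an illustration, not any engine's implementation: `verify_step` and `toy_target` are hypothetical names, and the target model is simulated by a function called once per position, whereas a real engine scores all draft positions in a single batched forward pass.

```python
from typing import Callable, List, Tuple


def verify_step(
    prefix: List[int],
    draft: List[int],
    target_next: Callable[[List[int]], int],
) -> Tuple[int, List[int]]:
    """Simulate one greedy verification step.

    Returns (number of accepted draft tokens, tokens actually emitted).
    `target_next` is a stand-in for the target model's greedy next-token choice.
    """
    context = list(prefix)
    emitted: List[int] = []
    for accepted, token in enumerate(draft):
        expected = target_next(context)
        if token != expected:
            # First mismatch: discard the rest of the draft and emit the
            # target's own token at this position instead.
            emitted.append(expected)
            return accepted, emitted
        emitted.append(token)
        context.append(token)
    # Every draft token matched, so the target contributes one extra token.
    emitted.append(target_next(context))
    return len(draft), emitted


def toy_target(context: List[int]) -> int:
    """Hypothetical target model: always predicts (last token + 1) mod 10."""
    return (context[-1] + 1) % 10


accepted, emitted = verify_step([1, 2, 3], [4, 5, 9, 7], toy_target)
print(accepted, emitted)  # 2 [4, 5, 6]
```

Note that the step emits one more token than it accepts, which is why reported acceptance lengths can differ by one depending on whether that extra token is counted.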

But AL alone does not predict real-world speedup. Two other factors intervene. First, the draft model has its own compute cost. N-gram speculation (which pattern-matches against the context to predict continuations) is nearly free, but a trained autoregressive drafter like EAGLE3 adds meaningful overhead. Second, the relationship between AL and end-to-end throughput is non-linear under batched serving conditions. At high concurrency, the target model’s forward pass is already compute-bound across many requests, so verifying several draft tokens per request is no longer close to free, and rejected drafts become wasted work on a saturated GPU.
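The first factor, draft cost, can be illustrated with a simplified model that is common in the speculative decoding literature. It is not SPEED-Bench's methodology, and the symbols are assumptions introduced here: `alpha` is a per-token acceptance probability (treated as independent across positions), `gamma` is the number of draft tokens per step, and `draft_cost_ratio` is the cost of one draft step relative to one target step. The model assumes verification costs the same as a normal target step, which roughly holds at batch size one and breaks down at high concurrency.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per verification step, including the target's extra token.

    Assumes each draft token is accepted independently with probability alpha.
    """
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, gamma: int, draft_cost_ratio: float) -> float:
    """Idealized speedup over plain decoding.

    One speculative step costs gamma draft steps plus one target step,
    measured in units of a single target step.
    """
    step_cost = gamma * draft_cost_ratio + 1.0
    return expected_tokens_per_step(alpha, gamma) / step_cost


# Same acceptance behaviour, different drafter costs (illustrative values only).
for cost in (0.0, 0.05, 0.25):
    print(cost, round(estimated_speedup(alpha=0.8, gamma=4, draft_cost_ratio=cost), 2))
```

Even in this optimistic model, a drafter that costs a quarter of a target step gives up a large share of the gain that the acceptance length alone would suggest.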

Existing benchmarks have largely ignored both complications. SpecBench, the most commonly cited prior benchmark, evaluates on six task categories with a relatively small prompt set at batch size one, using reference implementations rather than production inference engines. The acceptance lengths it reports may be accurate for that narrow slice, but they say little about what happens at batch size 32 or 512 on a production-grade engine such as TensorRT-LLM, vLLM, or SGLang.

How SPEED-Bench Is Structured

SPEED-Bench addresses this with two purpose-built dataset splits.

The qualitative split contains 880 prompts across 11 semantic categories: Coding, Math, Humanities, STEM, Writing, Summarization, Roleplay, RAG, Multilingual, Reasoning, and QA. These are drawn from 18 publicly available sources and selected using an embedding-based diversity maximization algorithm. The process embeds all candidate prompts using openai/text-embedding-3-small, then selects 80 prompts per category by minimizing average pairwise cosine similarity within each group. This is a meaningful methodological choice; random sampling from existing benchmarks tends to cluster around popular prompt types, whereas diversity maximization forces coverage of the full semantic surface.
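The selection step can be sketched with a simple greedy heuristic. This is an assumption about how such a selection might look, not the algorithm SPEED-Bench actually uses: `select_diverse` is a hypothetical helper, the embeddings are plain Python lists rather than real model outputs, and greedy selection only approximates the minimum average pairwise similarity.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_diverse(embeddings: Sequence[Sequence[float]], k: int) -> List[int]:
    """Greedily pick k indices whose embeddings are mutually dissimilar."""
    n = len(embeddings)
    if k <= 0:
        return []
    if k >= n:
        return list(range(n))
    sim = [[cosine(embeddings[i], embeddings[j]) for j in range(n)] for i in range(n)]
    # Seed with the prompt least similar to all the others on average.
    first = min(range(n), key=lambda i: sum(sim[i]) - sim[i][i])
    selected = [first]
    remaining = set(range(n)) - {first}
    while len(selected) < k:
        # Add the candidate with the lowest total similarity to the current selection.
        best = min(sorted(remaining), key=lambda i: sum(sim[i][j] for j in selected))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy 2-D "embeddings": indices 0 and 1 are near-duplicates.
toy = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(select_diverse(toy, 2))
```

On the toy data the near-duplicate pair is never chosen together when `k` is 2, which is the behaviour random sampling cannot guarantee.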

The throughput split addresses the serving regime problem. It provides 1,536 prompts per input sequence length (ISL) bucket, with buckets ranging from 1k to 32k tokens. Prompts are stratified into low-entropy, mixed-entropy, and high-entropy difficulty categories, and the framework tests concurrency from batch size 1 to 512. Critically, the framework integrates directly with production inference engines, handling tokenization and prompt formatting externally so that all systems process identical inputs. The reported metrics include output tokens per second, user-level TPS (a proxy for end-user latency), time-to-first-token, and end-to-end request time.
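The difference between aggregate and per-user throughput is easy to blur, so here is a rough sketch of how the four metrics could be derived from per-request timestamps. The definitions below are assumptions for illustration (the benchmark's exact formulas may differ), and `RequestTrace` and `summarize` are hypothetical names.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class RequestTrace:
    start: float          # time the request was submitted (seconds)
    first_token: float    # time the first output token arrived
    end: float            # time the last output token arrived
    output_tokens: int    # number of generated tokens


def summarize(traces: List[RequestTrace]) -> Dict[str, float]:
    """Compute serving metrics under the assumed definitions described above."""
    wall_time = max(t.end for t in traces) - min(t.start for t in traces)
    total_tokens = sum(t.output_tokens for t in traces)
    return {
        # System view: all generated tokens divided by total wall-clock time.
        "output_tps": total_tokens / wall_time,
        # User view: average per-request decode rate after the first token.
        "user_tps": mean(t.output_tokens / (t.end - t.first_token) for t in traces),
        "ttft": mean(t.first_token - t.start for t in traces),
        "e2e": mean(t.end - t.start for t in traces),
    }


traces = [
    RequestTrace(start=0.0, first_token=0.5, end=2.5, output_tokens=100),
    RequestTrace(start=0.0, first_token=1.0, end=3.0, output_tokens=100),
]
print(summarize(traces))
```

Raising concurrency usually pushes the system-level number up while the per-user number falls, which is why a benchmark that reports only one of them can mislead.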

What the Findings Actually Reveal

The most striking result in the SPEED-Bench paper is the variance in acceptance length across domains. For Llama 3.3 70B paired with EAGLE3 at batch size 32, the AL in Coding is 3.0, while in Multilingual it drops to 1.7. Roleplay sits at 2.04. The mean across all categories is 2.45. If you had benchmarked only on coding tasks, you would overestimate real-world acceptance length by roughly 22% (3.0 versus 2.45) compared to a mixed workload. This is not a rounding error; it changes whether speculative decoding is worth deploying at a given serving scale.

The pattern is explained by output entropy. Coding and math tasks have relatively low-entropy outputs: the next token in a Python function or a LaTeX equation is more predictable than the next word in a creative roleplay response. The draft model’s job is easier when the target distribution is more concentrated. Multilingual and open-ended writing tasks push acceptance lengths down because the sample space of plausible continuations is genuinely wide, and the draft model is wrong more often.

This domain dependence has a concrete implication. If your serving workload is primarily coding assistance, speculative decoding will likely deliver substantial throughput gains. If it is a general-purpose chat assistant handling mixed domains, the average acceptance length will be meaningfully lower, and your projections should account for that.
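One way to account for the domain mix is to blend the per-category numbers by your own traffic share. The sketch below uses only the three per-category values quoted above for Llama 3.3 70B with EAGLE3; the traffic mix is invented for illustration, and the simple weighted average treats each weight as that category's share of verification steps.

```python
from typing import Dict

# Per-category acceptance lengths quoted in this article (Llama 3.3 70B + EAGLE3).
AL_BY_CATEGORY: Dict[str, float] = {
    "Coding": 3.0,
    "Roleplay": 2.04,
    "Multilingual": 1.7,
}


def blended_al(al_by_category: Dict[str, float], mix: Dict[str, float]) -> float:
    """Weighted-average acceptance length for a given traffic mix."""
    total_weight = sum(mix.values())
    return sum(al_by_category[cat] * weight for cat, weight in mix.items()) / total_weight


def overestimate(benchmark_al: float, workload_al: float) -> float:
    """Relative amount by which a benchmark AL overstates the workload AL."""
    return benchmark_al / workload_al - 1.0


# Hypothetical chat-heavy traffic mix (assumed values, not from the benchmark).
chat_mix = {"Coding": 0.2, "Roleplay": 0.5, "Multilingual": 0.3}
workload_al = blended_al(AL_BY_CATEGORY, chat_mix)
print(round(workload_al, 2))
print(round(overestimate(AL_BY_CATEGORY["Coding"], workload_al), 2))
```

For this invented mix, a coding-only benchmark would overstate acceptance length by considerably more than it does against the all-category mean.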

The Vocabulary Pruning Side Effect

Perhaps the most practically important finding concerns EAGLE3’s vocabulary pruning optimization. The idea behind pruning is to reduce the size of the draft model’s output vocabulary, which speeds up draft token generation. On narrow benchmarks, this appears to be a straightforward win: faster drafting with negligible quality impact.

SPEED-Bench’s semantic diversity reveals what narrower evaluations miss. Vocabulary pruning has minimal impact in Coding and Math, where the relevant vocabulary is naturally constrained. But in Multilingual, RAG, and Summarization categories, it causes substantial degradation in acceptance rates. These are domains where the full vocabulary matters, because proper names, domain-specific terms, and multilingual tokens get pruned away. The result is that the draft model proposes tokens the target model would never emit in those contexts, acceptance rates fall, and the throughput gains from pruning are erased or reversed.

This is a clean example of why benchmark diversity matters. A vocabulary pruning optimization that looks universally beneficial on coding-heavy eval sets has significant caveats on realistic mixed workloads. Without the semantic coverage that SPEED-Bench enforces, that side effect would remain invisible until deployment.

Draft Strategy Comparison: N-Gram, Post-Trained, and Co-Trained

SPEED-Bench evaluates three model configurations that span the main architectural choices in speculative decoding today.

N-gram speculation on Llama 3.3 70B is the lowest-overhead option. It requires no additional model, makes predictions by matching token sequences against the recent context, and costs almost nothing to run. The tradeoff is acceptance length: it averages an AL of 1.41 across categories and delivers a 0.88x speedup in the benchmark conditions tested, meaning it is actually slower than baseline in this configuration. N-gram works best when the output is highly repetitive or closely mirrors the input, such as in copy-and-fill RAG scenarios or code generation from templates.
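The matching idea behind n-gram speculation fits in a few lines. This is a minimal sketch rather than the implementation used by any particular engine; `ngram_draft`, `ngram_size`, and `num_draft` are names chosen here for illustration.

```python
from typing import List


def ngram_draft(tokens: List[int], ngram_size: int = 2, num_draft: int = 4) -> List[int]:
    """Propose draft tokens by finding an earlier copy of the trailing n-gram.

    Looks for the most recent earlier occurrence of the last `ngram_size`
    tokens and proposes up to `num_draft` tokens that followed it.
    Returns an empty list when there is no match.
    """
    if len(tokens) <= ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Scan backwards, starting just before the trailing n-gram itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            follow = start + ngram_size
            return tokens[follow:follow + num_draft]
    return []


# The context repeats "5 6", so the tokens that followed it earlier become the draft.
print(ngram_draft([5, 6, 7, 8, 9, 5, 6]))  # [7, 8, 9, 5]
print(ngram_draft([1, 2, 3, 4]))           # []
```

When the context contains no earlier copy of the trailing tokens, there is nothing to propose, which is consistent with the low average acceptance length on open-ended categories.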

EAGLE3 on GPT-OSS 120B represents the post-hoc trained drafter approach. A small autoregressive model is trained separately using the target model’s hidden states as additional input features, which gives it much better prediction accuracy than a standalone small model. This achieves a mean AL of 2.25 and 1.34x speedup. EAGLE3 is the current state of the art for post-training approaches, but the vocabulary pruning caveats apply.

MTP (Multi-Token Prediction) on Qwen3-Next represents co-training. Rather than training a separate draft model after the fact, MTP integrates additional prediction heads directly into the base model training process. The SPEED-Bench results show this achieves the highest acceptance lengths, averaging 2.81 AL across categories. The tradeoff is architectural: you cannot simply apply MTP to an existing pretrained model. You need to train from scratch with the additional heads, which means committing to this inference strategy during initial model development. DeepSeek V3 and its successors use a similar approach, and it is increasingly standard in frontier model training.

The Random Token Problem

One methodological finding deserves particular attention for anyone building their own evaluation infrastructure. SPEED-Bench documents that using random tokens as benchmark inputs systematically overestimates throughput by approximately 23% when speculative decoding is enabled.

The failure mode is subtle. When a model receives random token sequences as input, it either defaults to predictable acknowledgment responses (which inflates acceptance length because the outputs are low-entropy and easy to draft) or anchors to a specific keyword and generates coherent-looking text in response to noise (which deflates acceptance length but in a non-representative way). Neither behavior resembles what happens with real prompts. Additionally, random tokens do not trigger realistic expert routing in Mixture-of-Experts models, making throughput measurements inaccurate even without speculative decoding involved.

This matters because synthetic benchmarks that use random or templated inputs to control ISL will produce acceptance length and throughput numbers that do not transfer to real workloads. SPEED-Bench handles ISL control through deterministic truncation and padding of real prompts, preserving semantic content while hitting the target sequence length.

What This Means for Infrastructure Engineers

If you are deciding whether to enable speculative decoding in a production deployment, the honest answer from SPEED-Bench is that it depends heavily on your workload and your serving regime. Coding-heavy workloads with low batch sizes benefit substantially. Mixed-domain chat endpoints at high concurrency are harder to predict, and the gains may be smaller than benchmark papers suggest.

The benchmark and its measurement framework are both open. The qualitative and throughput dataset splits are available at nvidia/SPEED-Bench on Hugging Face. The framework integrates with TensorRT-LLM, vLLM, and SGLang, so you can run it against your actual serving stack with your actual model, rather than trusting that someone else’s benchmark on different infrastructure will generalize.

That last point is probably the most important takeaway. Speculative decoding is not a universal accelerator. It is a workload-dependent optimization with a complex interaction between domain, concurrency, system software, and draft strategy. Any benchmark that does not capture that interaction is not measuring speculative decoding; it is measuring a best case.
