
The 23% Problem: How SPEED-Bench Exposes What Speculative Decoding Benchmarks Get Wrong

Source: huggingface

Speculative decoding has become one of the most consequential techniques in LLM inference engineering over the past two years. The core algorithm, introduced independently by Leviathan et al. and Chen et al. in late 2022 and early 2023, is straightforward: a small draft model proposes several tokens in sequence, and the large target model verifies them in a single parallel forward pass. Autoregressive decoding on modern GPUs is memory-bandwidth-bound rather than compute-bound at small to moderate batch sizes, so verifying five draft tokens costs nearly as much as generating one token autoregressively. Every accepted draft token is therefore close to free throughput.
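To make the draft-then-verify loop concrete, here is a minimal sketch under greedy decoding. The toy "models" are hypothetical functions from a token context to the next token, and the verification loop stands in for what a real engine would do in one batched forward pass. Sampling-based variants use a rejection-sampling rule instead of exact matching, which this sketch does not cover.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # greedy "model": context -> next token


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Run one draft-then-verify step and return the tokens it emits."""
    # Draft model proposes k tokens, one at a time.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(context + proposal))

    # Target checks each position. A real engine scores all k positions in
    # one batched forward pass; this loop only reproduces the outcome.
    emitted: List[int] = []
    for i in range(k):
        expected = target(context + proposal[:i])
        if proposal[i] != expected:
            emitted.append(expected)  # first mismatch: keep the target's token
            return emitted
        emitted.append(proposal[i])   # match: the draft token is accepted
    emitted.append(target(context + proposal))  # all accepted: bonus token
    return emitted


def generate(target: NextToken, draft: NextToken, prompt: List[int],
             k: int, max_new: int) -> Tuple[List[int], int]:
    """Generate max_new tokens; also return the number of verify steps."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(target, draft, out, k))
        steps += 1
    return out[len(prompt):len(prompt) + max_new], steps


# Toy models (hypothetical): the target counts upward mod 10; the draft
# agrees except that it guesses wrong after seeing a 4.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


tokens, steps = generate(toy_target, toy_draft, prompt=[0], k=3, max_new=12)
print(tokens, steps)
```

The emitted tokens are exactly what the target alone would produce under greedy decoding. What changes is the number of target passes: this toy run needs 4 verification steps for 12 tokens instead of 12 sequential target calls.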

The ecosystem has responded accordingly. vLLM, SGLang, and TensorRT-LLM all support multiple speculative decoding variants. EAGLE, EAGLE2, and EAGLE3 have pushed post-training draft head acceptance rates steadily higher. Models trained with multi-token prediction (MTP) heads baked in during pretraining, a technique formalized by Gloeckle et al. at Meta and deployed prominently in DeepSeek-V3, have demonstrated that training-time investment can outperform post-training approaches.

But performance claims in the literature have rested on benchmarks poorly suited to the diversity of real workloads. Most evaluation sets concentrate on tasks where speculative decoding performs well by construction: coding benchmarks, math reasoning, structured output. Evaluation tooling has varied from paper to paper, with different base models, tokenizers, and sampling parameters, and no unified head-to-head comparison between draft methods across a shared framework. NVIDIA’s SPEED-Bench, released on Hugging Face, addresses these gaps directly.

Why Random Tokens Break Everything

The most striking finding in SPEED-Bench is quantitative: benchmarks that generate random token sequences as synthetic prompts overestimate speculative decoding throughput by approximately 23%. The failure comes from two directions.

When presented with incoherent random input, models frequently recognize the input as noise and produce short, generic acknowledgment responses. These responses have highly predictable token sequences, so draft acceptance lengths inflate significantly. A model that responds with a request for clarification to every random prompt shows excellent draft acceptance because the response is repetitive and low-entropy, not because speculative decoding is working well on realistic inputs. SPEED-Bench’s measurements showed this trivial-response mode inflating acceptance length to 3.44 on random inputs.

Models sometimes take the opposite path, latching onto random tokens that resemble real words or topics and generating plausible but hallucinated responses anchored to noise keywords. These responses are higher-entropy and deflate acceptance lengths below realistic levels, dropping to around 1.877 in the observed cases. The net effect across both failure modes is the 23% throughput overestimation when speculative decoding is enabled.

There is a separate problem for mixture-of-experts architectures. In MoE models, the experts activated at each layer depend on the hidden states of the tokens reaching that layer. Random token sequences route through different expert subsets than semantically coherent text, which means throughput measurements using random inputs misrepresent baseline MoE performance even before speculative decoding enters the picture.

Semantic Diversity as a First-Class Constraint

SPEED-Bench’s qualitative split covers 880 prompts across 11 semantic categories: coding, math, humanities, STEM, writing, summarization, roleplay, RAG, multilingual, reasoning, and QA, with 80 samples per category. Prompts within each category are selected using openai/text-embedding-3-small embeddings to minimize average pairwise cosine similarity; the goal is maximum intra-category diversity, not just broad category coverage.
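One way to picture diversity-driven selection is a greedy pass that repeatedly adds the candidate least similar, on average, to what has already been chosen. The sketch below is an assumption about how such a selection could work, not SPEED-Bench's actual procedure; the function names, the choice of index 0 as the seed, and the tiny hand-written vectors (standing in for real embeddings) are all hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_diverse(embeddings: Sequence[Sequence[float]], n: int) -> List[int]:
    """Greedily pick n indices with low mean similarity to those already picked."""
    selected = [0]  # arbitrary seed; a real pipeline might choose it differently
    while len(selected) < n:
        best_idx, best_score = -1, math.inf
        for i in range(len(embeddings)):
            if i in selected:
                continue
            score = sum(cosine(embeddings[i], embeddings[j])
                        for j in selected) / len(selected)
            if score < best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
    return selected


def mean_pairwise_similarity(embeddings: Sequence[Sequence[float]],
                             indices: Sequence[int]) -> float:
    """Average cosine similarity over all distinct pairs in indices."""
    pairs = [(a, b) for k, a in enumerate(indices) for b in indices[k + 1:]]
    return sum(cosine(embeddings[a], embeddings[b]) for a, b in pairs) / len(pairs)


# Hypothetical 3-d "embeddings": items 0, 1 and 4 are near-duplicates.
vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.1, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.8, 0.2, 0.1],
]
picked = select_diverse(vectors, 3)
print(picked, mean_pairwise_similarity(vectors, picked))
```

Greedy selection does not guarantee the global minimum of average pairwise similarity, but it skips near-duplicates, which is the property the category design is after.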

This design choice matters because acceptance behavior is strongly distribution-dependent. The results across methods make this concrete (AL is acceptance length, the average number of tokens accepted per verification step):

Domain          N-Gram (Llama 3.3 70B)    EAGLE3 (GPT-OSS 120B)    MTP (Qwen3-Next)
Coding          1.54 AL                   2.46 AL                  3.34 AL
Math            1.43 AL                   2.46 AL                  3.13 AL
Roleplay        1.15 AL                   1.87 AL                  2.09 AL
Writing         1.33 AL                   1.98 AL                  2.46 AL
Mean speedup    0.88x                     1.34x                    1.20x

The N-gram baseline at 0.88x is a net slowdown at batch size 32. N-gram speculation works by copying n-gram matches from the input prompt into the draft sequence, so it performs well on document QA and summarization where output text echoes input, and poorly on open-ended generation where output diverges from any input text. The net slowdown result is not surprising in isolation, but it is the kind of result that disappears when evaluations concentrate on favorable domains.
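The copy mechanism behind N-gram speculation can be sketched in a few lines. This is an illustrative version, not any engine's implementation: the function name and parameters are hypothetical, and it searches the whole token sequence seen so far (prompt plus generated tokens) for the most recent earlier occurrence of the last n tokens.

```python
from typing import List


def ngram_draft(tokens: List[int], n: int, k: int) -> List[int]:
    """Propose up to k tokens by copying what followed an earlier match
    of the last n tokens. Returns [] when there is no earlier match."""
    if len(tokens) < n + 1:
        return []
    suffix = tokens[-n:]
    # Scan backwards so the most recent match wins. The start bound keeps
    # the suffix from matching itself and guarantees at least one token
    # follows the match.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            return tokens[start + n:start + n + k]
    return []


# Output that echoes the input: the draft is a copy of earlier text.
echoing = [5, 6, 7, 8, 9, 1, 5, 6]
print(ngram_draft(echoing, n=2, k=3))

# Open-ended output with no repeated bigram: nothing to propose.
open_ended = [1, 2, 3, 4, 5, 6]
print(ngram_draft(open_ended, n=2, k=3))
```

When no match exists the drafter proposes nothing, and when the copied continuation is wrong the verification work is wasted. Either way the step yields a single token, which fits the pattern of low acceptance lengths on roleplay and writing.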

MTP Heads vs. Post-Training Approaches

The gap between MTP heads and EAGLE3 deserves scrutiny. MTP heads are trained jointly with the base model during pretraining, as in the Gloeckle et al. framework and as deployed in DeepSeek-V3. EAGLE3 trains a separate draft head on the frozen target model’s hidden states using a multi-step consistency objective that reduces error accumulation across draft steps, an improvement over EAGLE2’s single-step training approach that compounded errors badly at longer draft lengths.

On coding tasks, MTP achieves 3.34 average accepted tokens (AL) versus EAGLE3’s 2.46, roughly 36% higher. On roleplay, the gap narrows to 2.09 versus 1.87, roughly 12%. Co-training produces a draft head more tightly coupled to the base model’s distribution, which pays off most in high-acceptance, low-entropy domains. In high-entropy domains, both methods struggle, and the coupling advantage from joint training shrinks because neither draft head can reliably predict the target model’s outputs.

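The practical implication: models released with MTP heads already trained in have a built-in acceptance-length advantage that is difficult to replicate post hoc by attaching a separate drafter. EAGLE3 closes much of that gap through better training methodology, but on the domains where speculative decoding provides the largest gains, the co-training advantage in acceptance length is consistent. Acceptance length is not the same as end-to-end speedup, however: the reported mean speedup is 1.20x for the MTP configuration versus 1.34x for EAGLE3, and the two were measured on different target models, so the comparison is not fully controlled.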

Vocabulary Pruning and Its Hidden Cost

EAGLE3 includes an optimization that prunes the vocabulary used during draft generation, reducing compute at the draft head. On coding and math domains, the impact is minimal; those domains use a relatively small and predictable token subset. On multilingual, RAG, and summarization domains, the impact is substantial: acceptance lengths drop noticeably when the pruned vocabulary excludes tokens common in those categories.
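A rough way to see why pruning hurts some domains: a drafter restricted to a pruned vocabulary can never propose a token outside it, so any position where the target's token falls outside the pruned set is a guaranteed rejection. The sketch below assumes the pruned vocabulary is chosen by token frequency on a calibration corpus, which the source does not state; the function names and token IDs are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set


def build_pruned_vocab(calibration_tokens: Iterable[int], size: int) -> Set[int]:
    """Keep the `size` most frequent token IDs seen in the calibration data."""
    counts = Counter(calibration_tokens)
    return {token for token, _ in counts.most_common(size)}


def coverage(pruned_vocab: Set[int], target_tokens: Iterable[int]) -> float:
    """Fraction of target-model output tokens the drafter could propose at all."""
    tokens = list(target_tokens)
    return sum(token in pruned_vocab for token in tokens) / len(tokens)


# Hypothetical token streams: calibration data dominated by "code-like" IDs.
calibration = [1, 1, 1, 2, 2, 3]
pruned = build_pruned_vocab(calibration, size=2)

code_output = [1, 2, 1, 2]          # stays inside the pruned vocabulary
multilingual_output = [1, 7, 8, 9]  # mostly tokens the drafter cannot emit

print(coverage(pruned, code_output))
print(coverage(pruned, multilingual_output))
```

Coverage is only an upper bound on per-token acceptance, but measuring it on a sample of production traffic is a cheap first check before enabling pruning.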

This finding would not appear under low-diversity benchmarking. An evaluation of EAGLE3 restricted to HumanEval and GSM8K would conclude that vocabulary pruning is essentially free. Teams deploying speculative decoding on multilingual workloads or document-heavy RAG pipelines should treat vocabulary pruning as a tunable parameter rather than a safe default, and benchmark their specific domain distribution before enabling it.

Throughput at Scale

The benchmark’s throughput split extends the qualitative findings to realistic serving conditions. It uses fixed input sequence length (ISL) buckets from 1k to 32k tokens with 1,536 prompts per bucket, split across low, mixed, and high-entropy difficulty levels. Batch sizes run up to 512, enabling Pareto curves that plot total output tokens per second against per-user tokens per second across concurrency levels.
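For readers who have not built one, a Pareto curve over these two metrics keeps only the configurations that no other configuration beats on both per-user and total throughput. The sketch below is a generic frontier computation, not code from the benchmark, and the measurement tuples are made-up numbers for illustration.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (per-user tokens/s, total output tokens/s)


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return points not dominated on both axes, ordered by per-user speed."""
    frontier: List[Point] = []
    best_total = -math.inf
    # Walk from the fastest per-user point downward; a point survives only
    # if it delivers more total throughput than every point seen so far.
    for point in sorted(points, key=lambda p: (-p[0], -p[1])):
        if point[1] > best_total:
            frontier.append(point)
            best_total = point[1]
    return frontier


# Hypothetical measurements at increasing concurrency levels.
measurements = [
    (100.0, 100.0),
    (90.0, 90.0),     # dominated: slower per user and lower total
    (80.0, 600.0),
    (50.0, 1500.0),
    (40.0, 1200.0),   # dominated by (50.0, 1500.0)
    (20.0, 3000.0),
]
print(pareto_frontier(measurements))
```

Plotting one frontier with speculation enabled and one without shows at which concurrency levels the technique still pays off.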

Speculative decoding’s efficiency profile changes substantially with batch size. At batch size 1, the GPU is memory-bandwidth-bound and draft verification costs little extra compute; this is where the technique provides the most benefit. At large batch sizes, the GPU becomes more compute-bound, and the overhead of generating and verifying drafts accumulates. Reporting a single headline throughput number without specifying batch size and ISL, as most published evaluations have done, conceals this tradeoff almost entirely.
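A back-of-the-envelope cost model shows how the same acceptance length can be a win at small batch sizes and a loss at large ones. This is a simplification introduced here, not a formula from SPEED-Bench: it assumes each speculative step costs one verification pass plus one drafter call per drafted token, all expressed relative to an ordinary decode step at the same batch size, and the numbers are hypothetical.

```python
def estimated_speedup(acceptance_length: float, draft_len: int,
                      verify_cost: float, draft_cost: float) -> float:
    """Estimated throughput with speculation relative to plain decoding.

    verify_cost: cost of the target pass over the drafted tokens, as a
                 multiple of one ordinary decode step.
    draft_cost:  cost of one drafter step, in the same units.
    """
    step_cost = verify_cost + draft_len * draft_cost
    return acceptance_length / step_cost


# Memory-bound regime (hypothetical): verifying 3 tokens costs ~5% extra.
print(estimated_speedup(2.0, draft_len=3, verify_cost=1.05, draft_cost=0.05))

# Compute-bound regime (hypothetical): the same verification costs 2.5x.
print(estimated_speedup(2.0, draft_len=3, verify_cost=2.5, draft_cost=0.05))
```

Under this model, speculation breaks even only when acceptance length exceeds the combined verification and drafting cost. A method with near-zero drafting cost can therefore still land below 1.0x once verification stops being nearly free.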

Integration with Production Engines

The benchmark integrates with TensorRT-LLM, vLLM, and SGLang through a pre-tokenization step that handles tokenizer differences externally, ensuring the same tokens reach each engine. This produces consistent measurements across frameworks that would otherwise differ in tokenization handling and timing instrumentation.

A representative run against TensorRT-LLM with EAGLE3:

mpirun -n 1 --oversubscribe python3 run.py \
  --model_dir meta-llama/Llama-3.3-70B-Instruct \
  --tokenizer meta-llama/Llama-3.3-70B-Instruct \
  --draft_model_dir yuhuili/EAGLE3-LLaMA3.3-Instruct-70B \
  --dataset speed \
  --dataset_path data/speed/qualitative \
  --tp_size 8 \
  --draft_length 3 \
  --output_length 4096 \
  --engine TRTLLM \
  --concurrency 32 \
  --show_progress

The dataset is available at nvidia/SPEED-Bench on Hugging Face and the measurement framework lives in NVIDIA’s Model Optimizer repository.

What This Changes

The speculative decoding field has had a measurement problem since its inception. Each paper has optimized for its chosen evaluation set, and because those sets concentrated on high-acceptance domains, the performance numbers have not translated cleanly to production systems serving diverse request distributions. Spec-Bench, an earlier attempt at standardization from Xia et al., covered six tasks but still lacked the semantic diversity mechanisms, ISL-bucketed throughput measurement, and multi-engine consistency that SPEED-Bench provides.

SPEED-Bench surfaces the 23% throughput overestimation from random token inputs, domain-specific degradation from vocabulary pruning, MoE routing artifacts from synthetic data, and the real cost of N-gram speculation at production batch sizes. These findings are not visible under the evaluation methodology that most of the literature has used.

Whether teams adopt SPEED-Bench directly or borrow its methodology, the benchmark makes explicit what has been missing from evaluation practice. The acceptance length measured on coding tasks is not the acceptance length a production assistant will see on a mixed workload of multilingual queries, document summarization, and open-ended generation. Measuring the gap between those two numbers is what SPEED-Bench was built to do, and the results suggest the gap has been larger than the field has acknowledged.
