How SPEED-Bench Rethinks Speculative Decoding Evaluation for Production Workloads

Speculative decoding is one of the more useful inference optimization techniques in current production LLM serving, and the algorithm behind it is genuinely clean. A cheap draft model generates several candidate tokens; the target model verifies all of them in a single parallel forward pass using a modified rejection sampling procedure that preserves the target distribution exactly. When the draft model is accurate, you get multiple tokens per target model invocation at approximately the cost of one. Methods like EAGLE-2 report mean accepted lengths of 4 to 5 tokens on code generation tasks with LLaMA-class models, translating to 3 to 4x end-to-end speedup at batch size 1 on A100 hardware.

The algorithm is clean; the evaluation landscape is not. Before SPEED-Bench, nearly every published result on speculative decoding methods came with hardware, task, and batch size conditions that differed from every other published result. EAGLE reported on Vicuna-13B using HumanEval and MT-Bench at batch size 1 on A100s. Medusa evaluated on LLaMA-2 with Alpaca-style instruction following. REST tested on code repositories. Each paper chose its benchmark conditions, and those conditions typically favored that paper’s method or were simply the most convenient available datasets. No shared evaluation surface existed, so comparing methods was effectively impossible.

NVIDIA’s SPEED-Bench addresses this with a two-split benchmark covering 11 semantic categories, realistic serving conditions up to batch size 512, and a unified measurement framework across TensorRT-LLM, vLLM, and SGLang. The findings surface systematic gaps between how speculative decoding has been evaluated and how it performs in deployment.

Two Splits, Two Different Questions

The qualitative split measures draft model accuracy across diverse semantic domains. It covers 880 prompts across 11 categories: coding, math, STEM, humanities, writing, summarization, roleplay, retrieval-augmented generation, multilingual, reasoning, and question answering. Each category contributes 80 prompts drawn from 24 source datasets, including HumanEvalPack and LiveCodeBench for coding, Humanity’s Last Exam and MMLU-Pro for math, RoleBench and CoSER for roleplay, and OPUS-100 and MCIF across 23 languages for multilingual coverage. Multiturn conversations go up to 5 turns, compared to SpecBench’s maximum of 2.

Prompt selection uses a diversity algorithm based on OpenAI’s text-embedding-3-small embeddings, iteratively selecting examples that minimize average pairwise cosine similarity within each category and running swap-based improvement passes afterward. The results are measurable: SPEED-Bench achieves an average pairwise similarity of 0.14 versus SpecBench’s 0.22. For individual categories the improvement is larger: multilingual drops from 0.36 to 0.06, coding from 0.33 to 0.16, writing from 0.35 to 0.18. Lower cosine similarity means more semantically varied inputs, which is what makes a benchmark representative of a real workload rather than a narrow distributional slice of one.
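The selection procedure described above can be sketched in a few lines. This is a minimal illustration, not SPEED-Bench's implementation: it assumes embeddings are already computed, uses toy 2-D vectors in place of text-embedding-3-small outputs, and seeds the selection with index 0, which is an arbitrary choice made for this example. The function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def avg_pairwise_similarity(embeddings, chosen):
    """Mean cosine similarity over all unordered pairs of chosen indices."""
    pairs = list(combinations(chosen, 2))
    if not pairs:
        return 0.0
    return sum(cosine(embeddings[i], embeddings[j]) for i, j in pairs) / len(pairs)


def select_diverse(embeddings, k):
    """Greedy selection followed by swap passes; returns sorted indices."""
    n = len(embeddings)
    chosen = [0]  # arbitrary seed (assumption for this sketch)
    while len(chosen) < k:
        rest = [i for i in range(n) if i not in chosen]
        # Add the candidate that keeps the average pairwise similarity lowest.
        best = min(rest, key=lambda c: avg_pairwise_similarity(embeddings, chosen + [c]))
        chosen.append(best)

    improved = True
    while improved:  # swap passes: stop once no single swap lowers the score
        improved = False
        current = avg_pairwise_similarity(embeddings, chosen)
        for pos in range(k):
            for cand in range(n):
                if cand in chosen:
                    continue
                trial = chosen[:pos] + [cand] + chosen[pos + 1:]
                score = avg_pairwise_similarity(embeddings, trial)
                if score < current - 1e-12:
                    chosen, current, improved = trial, score, True
    return sorted(chosen)


# Toy embeddings: two near-duplicates along each axis plus one diagonal vector.
toy = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (0.7, 0.7)]
picked = select_diverse(toy, 2)
print(picked, avg_pairwise_similarity(toy, picked))
```

On the toy data the procedure keeps one vector from each axis and skips the near-duplicates, which is the behavior the similarity numbers above are meant to capture at benchmark scale.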

The primary metrics in this split are conditional acceptance rates and acceptance lengths per category, which measure draft model quality independently of hardware, separating the algorithmic question from the system performance question.
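To make the two metrics concrete, here is one common way to compute them from per-step acceptance counts. The definitions are assumptions for illustration, not SPEED-Bench's exact formulas: acceptance length is taken as accepted draft tokens plus the one token the target model always contributes per verification step, and the conditional acceptance rate at position i is the fraction of steps that accept draft token i among those that accepted every earlier draft token.

```python
def acceptance_metrics(accepted_per_step, draft_len):
    """accepted_per_step[i] = draft tokens accepted at verification step i."""
    steps = len(accepted_per_step)
    # Each step emits the accepted drafts plus one token from the target model.
    acceptance_length = sum(a + 1 for a in accepted_per_step) / steps

    conditional = []
    for pos in range(draft_len):
        # Steps where every draft token before `pos` was accepted.
        reached = sum(1 for a in accepted_per_step if a >= pos)
        # Steps where the token at `pos` was accepted as well.
        passed = sum(1 for a in accepted_per_step if a >= pos + 1)
        conditional.append(passed / reached if reached else float("nan"))
    return acceptance_length, conditional


# Hypothetical trace: six verification steps with a draft length of 3.
trace = [3, 1, 0, 2, 3, 1]
mean_len, cond_rates = acceptance_metrics(trace, draft_len=3)
print(mean_len, cond_rates)
```

Neither number depends on how long a step takes, which is why they isolate draft quality from hardware.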

The throughput split answers the system question: what does speculative decoding do for serving throughput and per-user latency under production conditions? It uses input sequence length (ISL) buckets of 1k, 2k, 8k, 16k, and 32k tokens. Each bucket contains three entropy tiers (low, mixed, high) with 512 prompts each, for 5 × 3 × 512 = 7,680 samples in total. The key metrics are output tokens per second across the full batch and per-user tokens per second as a latency proxy. All sequences are truncated or padded using real content, not random tokens. That constraint has measurable consequences.

The Random Token Problem

SPEED-Bench finds that random token inputs overestimate speculative decoding throughput by approximately 23% when a draft model is enabled. The mechanism follows from how the acceptance step works.

The acceptance procedure computes, for each draft token x_i, the acceptance probability min(1, p(x_i | context) / q(x_i | context)), where p is the target model’s distribution and q is the draft model’s distribution. When inputs are random, the context has no structure the models can exploit, and both p and q produce atypical distributions over the vocabulary. The ratio between them does not behave the way it does for real text, producing inflated acceptance rates that do not reflect realistic inference conditions.
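The acceptance rule is easier to reason about in code. The following is a minimal sketch of the standard speculative sampling verification step, using plain dictionaries as token distributions; the function and argument names are hypothetical, and a real engine would run this on batched logits on the GPU.

```python
import random


def sample(dist, rng):
    """Draw one token from a dict of token -> non-negative weight."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens])[0]


def verify_draft(draft_tokens, q_dists, p_dists, rng):
    """Return the tokens emitted by one verification step.

    draft_tokens: K token ids proposed by the draft model.
    q_dists: K draft distributions, one per draft position.
    p_dists: K + 1 target distributions; the last follows the final draft token.
    """
    out = []
    for i, x in enumerate(draft_tokens):
        p, q = p_dists[i], q_dists[i]
        # Accept with probability min(1, p(x) / q(x)).
        if rng.random() < min(1.0, p.get(x, 0.0) / q[x]):
            out.append(x)
            continue
        # Rejected: resample from the residual max(0, p - q), then stop.
        residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
        out.append(sample(residual, rng))
        return out
    # Every draft token was accepted: take a bonus token from the target.
    out.append(sample(p_dists[len(draft_tokens)], rng))
    return out


rng = random.Random(0)
same = {0: 0.5, 1: 0.5}
# Draft and target agree exactly, so both draft tokens are accepted.
print(verify_draft([0, 1], [same, same], [same, same, same], rng))
# Target assigns zero probability to the draft token, so it is rejected.
print(verify_draft([0], [{0: 1.0}], [{1: 1.0}, {1: 1.0}], rng))
```

Resampling from the residual on rejection is what keeps the output distributed exactly as the target model's, and it also shows why the acceptance rate depends entirely on how p and q relate on the actual context.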

For mixture-of-experts models like Qwen3-Next, random inputs also fail to trigger realistic expert routing patterns. This causes both artificially fast target model forward passes and artificially inaccurate acceptance rate estimates, compounding the overestimation. SPEED-Bench documents two named failure modes specific to random inputs: “Trivial Response,” where the model produces degenerate output, and “Topic Latching,” where the model locks onto a repetitive pattern that happens to produce high acceptance rates without being realistic output.

Prior benchmarks that used random or synthetic token sequences for padding or input generation were measuring something that does not correspond to production behavior, and the 23% figure suggests the magnitude is large enough to affect method selection decisions.

Domain Coverage Changes Method Rankings

The qualitative split reveals how strongly speculative decoding performance varies across content types, in ways that single-task benchmarks structurally cannot capture.

The acceptance length results at batch size 32 with draft length 3 illustrate the pattern across three method families. For Qwen3-Next with natively co-trained MTP heads: coding yields an acceptance length of 3.34, math 3.13, writing 2.46, roleplay 2.09. For GPT-OSS 120B with EAGLE3: coding and math both achieve 2.46, writing 1.98, roleplay 1.87. For Llama 3.3 70B with n-gram speculation: coding 1.54, math 1.43, writing 1.33, roleplay 1.15.
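The gap between domains can be read directly off these figures. The short script below uses only the numbers quoted above and computes how much higher the coding acceptance length is than the roleplay one for each method; the dictionary layout and the helper name are choices made for this example.

```python
# Acceptance lengths quoted above (batch size 32, draft length 3).
acceptance_length = {
    "Qwen3-Next + MTP": {"coding": 3.34, "math": 3.13, "writing": 2.46, "roleplay": 2.09},
    "GPT-OSS 120B + EAGLE3": {"coding": 2.46, "math": 2.46, "writing": 1.98, "roleplay": 1.87},
    "Llama 3.3 70B + n-gram": {"coding": 1.54, "math": 1.43, "writing": 1.33, "roleplay": 1.15},
}


def relative_gap(high, low):
    """Percentage by which `high` exceeds `low`."""
    return (high / low - 1.0) * 100.0


for method, by_domain in acceptance_length.items():
    gap = relative_gap(by_domain["coding"], by_domain["roleplay"])
    print(f"{method}: coding exceeds roleplay by {gap:.0f}%")
```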

The pattern is consistent across methods: low-entropy structured domains like coding and math yield higher acceptance lengths than high-entropy open-ended tasks like roleplay and writing. In the figures above, coding exceeds roleplay by roughly 30-60%, depending on the method. The difference is large enough that a benchmark using only code generation tasks substantially overstates the acceptance length a mixed workload would see.

What changes across methods is the absolute level. N-gram speculation achieves a mean speedup of 0.88x at batch size 32, meaning it is a net slowdown of about 12% at that concurrency level. EAGLE3 on GPT-OSS 120B achieves 1.34x. MTP heads on Qwen3-Next reach 1.20x. These relative rankings, and the fact that one method produces a net slowdown at batch size 32, would not be visible from batch size 1 evaluations. Most speculative decoding papers evaluate exclusively at batch size 1, because that is where the technique’s advantage is most pronounced and easiest to measure.

At high batch sizes, arithmetic intensity of the target model forward pass increases, the GPU moves closer to compute-bound operation, and the memory-bandwidth advantage that speculative decoding exploits diminishes. The break-even batch size differs across hardware depending on the memory bandwidth to compute ratio, which varies significantly between A100 (2 TB/s HBM2e, 312 TFLOPS BF16), H100 (3.35 TB/s HBM3, 989 TFLOPS BF16), and lower-end inference GPUs. A benchmark that only reports A100 results at batch size 1 cannot predict H100 results at batch size 16.
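A back-of-the-envelope roofline calculation shows why the break-even point moves with hardware. The sketch below uses the peak figures quoted above and a deliberately crude model, all of which are assumptions: a dense model with BF16 weights, about 2 FLOPs per parameter per token, weights read once per step, and every draft token counted as one extra token in the verification pass. It ignores KV-cache reads, attention cost, MoE routing, and kernel efficiency, so real break-even points can differ substantially.

```python
# Peak figures quoted in the text above.
GPUS = {
    "A100": {"bandwidth_tb_s": 2.0, "bf16_tflops": 312.0},
    "H100": {"bandwidth_tb_s": 3.35, "bf16_tflops": 989.0},
}


def ridge_point(gpu):
    """FLOPs per byte above which a kernel stops being memory-bound."""
    return gpu["bf16_tflops"] / gpu["bandwidth_tb_s"]


def decode_intensity(tokens_per_step, bytes_per_param=2):
    """Rough FLOPs per weight byte for one forward pass over the weights."""
    return 2.0 * tokens_per_step / bytes_per_param


def is_memory_bound(batch_size, draft_len, gpu):
    """True if the verification pass sits below the ridge point in this model."""
    tokens = batch_size * (draft_len + 1)  # drafts plus one token per sequence
    return decode_intensity(tokens) < ridge_point(gpu)


for name, gpu in GPUS.items():
    print(f"{name}: ridge point ~{ridge_point(gpu):.0f} FLOPs/byte, "
          f"batch 64 with draft length 3 memory-bound: {is_memory_bound(64, 3, gpu)}")
```

Even in this simplified model, the same batch size and draft length land on different sides of the ridge point on the two GPUs, which is the reason results from one hardware and batch configuration do not transfer to another.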

Vocabulary Pruning as an Untracked Variable

Production inference systems frequently apply aggressive vocabulary pruning as a system optimization, restricting the output vocabulary at each step. For high-frequency, low-entropy domains like coding and math, this optimization has minimal impact because the output distribution already concentrates on a small vocabulary subset. By contrast, SPEED-Bench finds substantial degradation for multilingual inputs, RAG responses, and summarization outputs, where long-tail vocabulary items appear regularly.

This is a system-algorithm interaction that only surfaces when the evaluation covers diverse domains. A benchmark built primarily from coding and math tasks would not detect vocabulary pruning degradation at all, producing an optimistic picture of a system configuration that degrades in production on mixed content workloads.
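One way to see the effect before running a full benchmark is to measure how much of each domain's output a pruned vocabulary can still produce. The sketch below is an illustration under stated assumptions: it assumes pruning keeps the most frequent tokens from a calibration sample, which the description above does not specify, and the token ids are made-up toy values rather than measurements. The function names are hypothetical.

```python
from collections import Counter


def pruned_vocab(calibration_tokens, keep):
    """Keep the `keep` most frequent token ids from a calibration sample."""
    return {tok for tok, _ in Counter(calibration_tokens).most_common(keep)}


def coverage(tokens, vocab):
    """Fraction of tokens that the pruned vocabulary can still produce."""
    return sum(1 for t in tokens if t in vocab) / len(tokens)


# Toy token ids (illustrative only): the calibration data is dominated by ids 1-3.
calibration = [1, 1, 1, 2, 2, 3, 3, 4, 5, 6]
vocab = pruned_vocab(calibration, keep=3)

code_like = [1, 2, 1, 3, 2, 1]          # stays inside the frequent subset
multilingual_like = [1, 7, 8, 2, 9, 7]  # leans on long-tail ids

print(coverage(code_like, vocab), coverage(multilingual_like, vocab))
```

Any token outside the kept set cannot be emitted from the pruned vocabulary, so low coverage on a domain is an early warning that pruning will hurt there.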

Standardizing the Measurement Infrastructure

SPEED-Bench evaluates three speculative decoding architectures: EAGLE3 (a post-trained small transformer that consumes the target model’s hidden states as input), n-gram speculation (no training required), and MTP heads co-trained with the target model itself. These come with fundamentally different engineering constraints. EAGLE3 requires a separately trained draft head for each target model. MTP heads require access to the training pipeline. N-gram speculation works with any model but yields lower acceptance lengths. A benchmark that evaluates only one of these approaches is measuring a single point in a much larger design space.

The framework runs across TensorRT-LLM, vLLM, and SGLang using pre-tokenized sequences to remove tokenizer behavior as a confounding variable. Captured metrics include step latency, user-level tokens per second, output tokens per second, and fine-grained timing from streaming responses. This separates latency effects from throughput effects rather than collapsing both into a single wall-clock speedup ratio that conflates hardware and algorithm contributions.
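The two throughput metrics can be derived from token arrival timestamps on streaming responses. The sketch below shows one plausible set of definitions, which are assumptions rather than SPEED-Bench's exact formulas: aggregate output tokens per second is total tokens over the wall-clock span of the batch, and per-user tokens per second is measured from each request's first token to its last. The function name and the sample timestamps are hypothetical.

```python
def throughput_metrics(request_start, streams):
    """Compute aggregate and mean per-user token rates.

    request_start: time in seconds at which the batch was submitted.
    streams: one list of token arrival times (seconds) per concurrent request,
             each with at least two tokens spanning a non-zero interval.
    """
    end = max(ts[-1] for ts in streams)
    total_tokens = sum(len(ts) for ts in streams)
    output_tps = total_tokens / (end - request_start)

    # Per-user decode rate: tokens after the first, over time since the first.
    per_user = [(len(ts) - 1) / (ts[-1] - ts[0]) for ts in streams]
    return output_tps, sum(per_user) / len(per_user)


# Repeated timestamps model several accepted tokens arriving from one step.
streams = [
    [0.5, 1.0, 1.0, 1.5, 2.0],
    [0.5, 1.5, 1.5, 2.5],
]
batch_tps, user_tps = throughput_metrics(0.0, streams)
print(batch_tps, user_tps)
```

Keeping the two numbers separate matters because a configuration can raise aggregate throughput while leaving each individual user's stream slower, or the reverse.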

The dataset and evaluation framework are publicly available. Reproducibility is the precondition for the rest of this to matter: a benchmark where the conditions cannot be reproduced is not meaningfully different from another paper reporting favorable conditions for its method.

What Changes in Practice

If you are evaluating speculative decoding for a production serving system, single-task speedup numbers from published papers are not predictive of what you will see in deployment. The inputs users send, the batch sizes the system runs at, and the hardware in use all affect results enough to change method rankings. The 23% random token overestimation and the per-domain acceptance length variation are not edge cases: the first affects any measurement that relies on random token inputs, and the second is the ordinary spread of a mixed production workload.

SPEED-Bench does not resolve everything. Hardware coverage beyond NVIDIA’s own lineup, quantization interactions when target or draft models are INT8 or FP8, and the continuous batching complexity of mixed-length requests under real serving conditions all remain partially open evaluation problems. The benchmark also requires access to the specific models tested, which limits reproducibility to teams with similar infrastructure.

The core design choices, particularly the semantic diversity algorithm, the real-content-only throughput inputs, and the multi-engine standardized measurement framework, represent a meaningful step up in evaluation rigor from the prior state of per-paper ad-hoc benchmarking. Speculative decoding’s theoretical guarantees are well understood. Getting an accurate picture of when it helps in practice, by how much, and under what conditions, requires benchmarking infrastructure that takes the workload as seriously as the algorithm.