Diffusion Language Models Come for the Token-by-Token Bottleneck

Autoregressive generation has been the default mode for large language models since GPT-2, and the cost of that default is structural. Every token waits for the one before it, the KV-cache grows linearly, and the GPU spends most of its time moving weights around instead of doing math. There have been a lot of attempts to chip away at this: speculative decoding, Medusa heads, lookahead decoding, Mamba-style state-space models. NVIDIA’s Nemotron-Labs Diffusion release is the latest swing, and it’s interesting because it doesn’t ask you to throw away the autoregressive checkpoint you already trained.

What’s actually being shipped

NVIDIA released three text models at 3B, 8B, and 14B parameters, plus an 8B vision-language model, under the Nemotron Open Model License (the VLM ships under the more restrictive Source Code License). The training recipe is two-phase: 1.3T tokens of continued pretraining and 45B tokens of supervised fine-tuning, all on top of existing AR-pretrained Nemotron bases. Code lives in Megatron Bridge, and inference is wired into SGLang.

The interesting bit is that one checkpoint serves three different inference modes:

Plain autoregressive, for compatibility.
Diffusion (FastDiffuser), which refines 32-token blocks in parallel.
Self-speculation (LinearSpec), which uses the diffusion path to draft and the AR path to verify.

The headline numbers from the report: 6.4x speedup over AR in the quadratic self-speculation configuration, around 865 tokens/sec on a single B200, and a 1.2% average accuracy bump over Qwen3 8B at the 8B scale.

Diffusion for text, briefly

Diffusion in the image-generation sense (DDPM, Stable Diffusion) operates on continuous tensors, raw pixels in DDPM and VAE latents in Stable Diffusion, that are corrupted with Gaussian noise. Text is discrete, which is why the first attempts at diffusion language models (Diffusion-LM in 2022, then SUNDAE, SEDD, and Mercury Coder more recently) had to invent their own noise processes. The two main families are continuous embeddings with rounding, and absorbing-state discrete diffusion, where tokens get masked and progressively unmasked. Inception Labs’ Mercury Coder put the absorbing-state approach into a production-ish coding model last year and showed 1000+ tok/s on H100s.
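To make the absorbing-state idea concrete, here is a minimal sketch of the forward (noising) half of the process. It assumes a hypothetical MASK sentinel and a noise level t between 0 and 1 that acts as each token's independent masking probability; the names mask_tokens and MASK are illustrative and not taken from any of the papers above.

```python
import random

MASK = "<mask>"  # hypothetical sentinel standing in for the absorbing state


def mask_tokens(tokens, t, rng):
    """Forward process: replace each token with MASK independently with probability t."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be in [0, 1]")
    return [MASK if rng.random() < t else tok for tok in tokens]


rng = random.Random(0)
sentence = "the cat sat on the mat".split()
for t in (0.0, 0.5, 1.0):
    # t = 0.0 leaves the text intact, t = 1.0 masks everything
    print(t, mask_tokens(sentence, t, rng))
```

A denoising model is then trained to fill the masked positions back in, and generation runs the process in reverse, starting from a fully masked sequence.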

What Nemotron-Labs Diffusion does differently is the joint training objective. Instead of training a diffusion model from scratch, they take a pretrained AR Nemotron, then continue training with a loss that combines next-token prediction and a diffusion-style parallel denoising objective. The block-wise attention mask is the trick that makes this KV-cache compatible: within a 32-token block tokens can attend bidirectionally (so they can be refined together), but across blocks the attention is still causal. That preserves the prefix cache that every production inference stack depends on. The full mechanics are in the Efficient-DLM paper that the release is based on.
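The block-wise mask described above can be sketched in a few lines. This is one reading of the description, not the release's actual implementation: position i may attend to position j whenever j's block is the same as, or earlier than, i's block. The function name block_causal_mask is made up for illustration, and a tiny block size is used so the pattern is visible.

```python
def block_causal_mask(seq_len, block_size=32):
    """Return mask[i][j] == True when position i may attend to position j.

    Attention is bidirectional inside a block and causal across blocks.
    """
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Small example: 6 positions, blocks of 2 (the release uses 32-token blocks).
for row in block_causal_mask(seq_len=6, block_size=2):
    print("".join("x" if allowed else "." for allowed in row))
```

With block_size=1 this collapses to the ordinary causal mask, which is why the keys and values of finished blocks can be cached exactly as in an AR model.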

Tokens per forward pass is the right metric

The paper reports 2.6 tokens per forward pass in pure diffusion mode and 6+ in self-speculation. This is the number to watch, because decoding at small batch sizes is memory-bound: each pass costs roughly the same whether it yields one token or several, so wall-clock throughput tracks how many useful tokens you can extract per pass.
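The arithmetic behind that claim is simple enough to write down. The sketch below assumes a hypothetical relative_pass_cost parameter, meaning how much more expensive a parallel pass is than a plain AR pass; that factor and the pass latencies are illustrative values, not figures from the report.

```python
def tokens_per_second(tokens_per_pass, pass_ms):
    """Throughput when each forward pass takes pass_ms and yields tokens_per_pass tokens."""
    return tokens_per_pass * 1000.0 / pass_ms


def speedup_over_ar(tokens_per_pass, relative_pass_cost=1.0):
    """Speedup versus AR (1 token per pass) if a parallel pass costs relative_pass_cost AR passes."""
    return tokens_per_pass / relative_pass_cost


# Tokens-per-pass figures quoted above; the cost factors are assumptions.
for tpp in (1.0, 2.6, 6.0):
    for cost in (1.0, 1.25):
        print(f"tokens/pass={tpp} cost={cost} speedup={speedup_over_ar(tpp, cost):.2f}x")
```

The point of the cost factor is that tokens per pass is an upper bound on speedup: any extra work per pass, such as drafting plus verification, eats into it.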

For comparison, vanilla AR is 1.0 by definition. Standard speculative decoding with a small draft model typically lands at 2-3 accepted tokens per verification step, and the original Leviathan et al. paper reported a 2-3x wall-clock speedup on T5-XXL. Medusa heads (Cai et al.) get into the 2.3-3.6x range. EAGLE-2 pushes that to around 4x. Those published figures are speedups rather than tokens per pass, so the comparison is loose, but 6+ tokens per pass from a single checkpoint, with no separate draft model to maintain, is on the high end of what’s been published.

The self-speculation framing is what makes the number plausible. Instead of running a smaller model alongside the big one, the same network produces a parallel draft (via the diffusion path) and then verifies it (via the AR path). That avoids the memory overhead of a second model and the alignment problems that come with draft/target divergence after fine-tuning.
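The draft-then-verify loop can be sketched with toy stand-ins for the two paths. Everything here is hypothetical: toy_verifier plays the AR path (it just counts upward modulo 10), toy_drafter plays the diffusion path (it counts too, but gets the token after a 4 wrong), and verification is greedy. A real system would check all draft positions in one batched forward pass rather than in a Python loop.

```python
def toy_verifier(context):
    """Stand-in for the AR path: the 'correct' next token."""
    return (context[-1] + 1) % 10


def toy_drafter(context, k):
    """Stand-in for the diffusion path: drafts k tokens, wrong right after a 4."""
    out, last = [], context[-1]
    for _ in range(k):
        last = 0 if last == 4 else (last + 1) % 10
        out.append(last)
    return out


def verify_greedy(prefix, draft, next_token):
    """Accept the longest draft prefix matching the verifier, then add one verifier token."""
    accepted, context = [], list(prefix)
    for tok in draft:
        expected = next_token(context)
        if tok != expected:
            accepted.append(expected)  # correction replaces the first bad draft token
            return accepted
        accepted.append(tok)
        context.append(tok)
    accepted.append(next_token(context))  # bonus token when the whole draft is accepted
    return accepted


def generate(prefix, num_tokens, k=3):
    tokens, passes = list(prefix), 0
    while len(tokens) - len(prefix) < num_tokens:
        draft = toy_drafter(tokens, k)
        tokens.extend(verify_greedy(tokens, draft, toy_verifier))
        passes += 1  # counts verification passes only; drafting has its own cost
    return tokens[len(prefix):len(prefix) + num_tokens], passes


out, passes = generate([0], num_tokens=12)
print(out, "in", passes, "verification passes")
```

The output is identical to what the verifier alone would produce one token at a time; the drafter only changes how many passes it takes to get there.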

Why the AR-to-diffusion conversion matters

The choice to convert pretrained AR models rather than train from scratch is the most pragmatic part of this release. Diffusion LMs trained from zero have historically lagged AR models in raw quality at matched compute. Inception’s Mercury is impressive but it’s also a closed model and the public benchmarks are limited. By starting from an AR base, NVIDIA inherits the existing quality floor and only has to recover what’s lost during the diffusion conversion.

The 1.2% accuracy improvement over Qwen3 8B is small but it’s the right sign: the model isn’t worse for becoming a diffusion model. That’s a meaningful claim because parallel generation often carries a quality tax. Speculative decoding is exact (the verification step guarantees the output distribution matches the target model’s), and EAGLE and lookahead decoding are likewise designed to preserve the target model’s output, but Medusa’s typical-acceptance scheme trades some fidelity for speed depending on configuration.
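The exactness of speculative decoding comes from its accept-or-resample rule, which is short enough to show. This is a minimal sketch of the standard speculative sampling step for a single token, using plain dictionaries as distributions; the function name and the example distributions p and q are made up for illustration.

```python
import random


def accept_or_resample(token, p, q, rng):
    """One verification step of speculative sampling.

    token was sampled from the draft distribution q; p is the target distribution.
    The returned token is distributed according to p.
    """
    if rng.random() < min(1.0, p.get(token, 0.0) / q[token]):
        return token  # accepted
    # Rejected: resample from the residual max(0, p - q); choices() normalizes the weights.
    residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
    return rng.choices(list(residual), weights=list(residual.values()))[0]


# Hypothetical target (p) and draft (q) distributions over three tokens.
p = {"a": 0.5, "b": 0.3, "c": 0.2}
q = {"a": 0.2, "b": 0.6, "c": 0.2}

rng = random.Random(0)
counts = {t: 0 for t in p}
n = 20000
for _ in range(n):
    drafted = rng.choices(list(q), weights=list(q.values()))[0]
    counts[accept_or_resample(drafted, p, q, rng)] += 1
print({t: round(c / n, 3) for t, c in counts.items()})  # should be close to p
```

Even though every draft comes from q, the empirical frequencies track p, which is what "exact" means here.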

What this changes for inference stacks

If you’re running vLLM, TensorRT-LLM, or SGLang, the operational model for an AR transformer is well understood: paged KV cache, continuous batching, chunked prefill, prefix caching. Diffusion models break some of these assumptions. The block-wise attention is the concession that keeps prefix caching workable, but continuous batching gets more complicated when different requests are in different denoising stages of different blocks. SGLang’s main branch has the initial support; getting feature parity with the AR inference stack will take time.

The other practical question is latency variance. Speculative decoding has well-known throughput wins but its per-token latency is bursty: you either accept several tokens at once or you reject and waste the draft. Diffusion decoding has the same flavor, since each refinement step produces a variable number of finalized tokens. For chat workloads where inter-token latency and steady streaming matter, this can feel like a regression even when total throughput goes up.
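A small sketch shows why mean throughput hides this. It assumes every step takes the same fixed time and emits its finalized tokens as one burst; the step counts and the 20 ms step time are invented for illustration, not measurements.

```python
def inter_token_gaps(tokens_per_step, step_ms):
    """Gap in ms before each emitted token, when each step takes step_ms and emits a burst."""
    gaps, pending = [], 0.0
    for n in tokens_per_step:
        pending += step_ms
        if n > 0:
            gaps.append(pending)          # first token of the burst waits for the step(s)
            gaps.extend([0.0] * (n - 1))  # the rest arrive at the same instant
            pending = 0.0
    return gaps


# Hypothetical run: some steps finalize many tokens, one finalizes none.
gaps = inter_token_gaps([6, 1, 0, 5, 1], step_ms=20.0)
print("mean gap:", sum(gaps) / len(gaps), "ms")
print("max gap:", max(gaps), "ms")
```

The mean gap looks excellent while the worst gap is a full step or more, which is the stutter a streaming UI has to smooth over.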

Where it fits

The interesting comparison isn’t Nemotron-Labs Diffusion against GPT-4 or Llama 3. It’s against the other parallel-decoding approaches. Speculative decoding wins on simplicity and exactness. Medusa and EAGLE win on integration with existing AR stacks. Mamba and other SSMs win on long-context memory profile. Diffusion LMs win, if the numbers hold, on raw tokens-per-second under heavy throughput pressure.

For someone building a Discord bot or a coding assistant, the 6.4x figure is more relevant than the accuracy bump. Latency at the long tail is what users notice. If a single B200 can sustain 865 tok/s on an 8B-class model without quality loss, the economics of running a chatbot shift meaningfully, especially for batched async workloads where the diffusion mode’s parallelism gets fully exploited.

The release also makes diffusion LMs less of a research curiosity. Up to now the publicly downloadable diffusion text models were small and mostly academic. Having open-weight 8B and 14B checkpoints from a major lab, with a real inference path in SGLang and a permissive license on the text models, is what moves the technique from “interesting paper” to “someone might actually deploy this.” Whether the inference tooling catches up fast enough to make it the default is the next thing worth watching.