Chunked Prefill and the Latency-Throughput Trade-off in LLM Serving
Source: huggingface
The prefill phase and decode phase of LLM inference are different computational animals. Prefill processes an entire input prompt in one forward pass; because all tokens are present simultaneously, attention is fully parallelizable and the GPU runs compute-bound. Decode generates one token at a time, loading the accumulated KV cache for every prior token on each step. This is memory-bandwidth-bound. The arithmetic intensity, measured in FLOPs per byte moved, drops by an order of magnitude between the two phases.
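A back-of-envelope calculation makes the gap concrete. The sketch below estimates FLOPs per byte of weights moved for a single matmul-dominated forward pass, using illustrative numbers for a 7B-parameter model in fp16; the ~2 FLOPs per parameter per token rule of thumb and the exact figures are assumptions, not measurements.

```python
def arithmetic_intensity(tokens: int, params: float, bytes_per_param: int = 2) -> float:
    """FLOPs per byte of weights moved: ~2 FLOPs per parameter per token,
    while the weights are read once regardless of how many tokens share the pass."""
    flops = 2 * params * tokens
    bytes_moved = params * bytes_per_param
    return flops / bytes_moved

prefill = arithmetic_intensity(tokens=4096, params=7e9)  # whole prompt in one pass
decode = arithmetic_intensity(tokens=1, params=7e9)      # one new token per pass
print(f"prefill: {prefill:.0f} FLOPs/byte, decode: {decode:.0f} FLOPs/byte")
```

Decode's intensity rises with batch size, since concurrent requests share one weight read per step, but it remains far below prefill's for realistic batches.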
This asymmetry shapes what happens when you mix prefill and decode jobs in the same batch.
What Continuous Batching Fixes
Static batching, the naive approach, collects a group of requests, pads them all to the same length, runs a single forward pass, and repeats until every request in the group has finished. The problem is that shorter sequences finish early but hold their batch slots until the longest sequence completes. GPU utilization degrades as a function of length variance in the batch.
HuggingFace’s write-up on continuous batching from first principles, published in late November 2025, covers this mechanism well. Iteration-level scheduling, the formal name for what most people call continuous batching, solves the slot-holding problem by scheduling at the granularity of individual token generation steps rather than complete requests. When a sequence finishes, its slot is freed immediately and a new request can enter the batch at the next step.
The concept comes from the ORCA paper, published at OSDI 2022 by researchers at Seoul National University. ORCA demonstrated that treating the transformer iteration as the scheduling unit, rather than the full request, substantially improves GPU utilization under realistic workloads with variable-length outputs.
The throughput gains are real. vLLM, which paired iteration-level scheduling with PagedAttention for KV cache memory management, reported 2-24x throughput improvements over naively batched HuggingFace Transformers in its SOSP 2023 paper. The range reflects workload variance: short, uniform requests benefit less from continuous batching than mixed-length traffic does.
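The slot-freeing mechanism can be sketched in a few lines. This toy scheduler tracks only how many output tokens each request still needs, admits waiting requests into freed slots at every step, and counts iterations; it is illustrative only, with no model calls.

```python
from collections import deque

def continuous_batch(requests, max_batch: int) -> int:
    """requests: list of output lengths (tokens to generate).
    Returns the number of iterations needed under iteration-level scheduling."""
    waiting = deque(requests)
    active = []          # remaining tokens per in-flight request
    steps = 0
    while waiting or active:
        # admit new requests into freed slots before the next step
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        steps += 1
        active = [r - 1 for r in active if r - 1 > 0]  # one token per request per step
    return steps

print(continuous_batch([3, 10, 2, 5], max_batch=2))  # 10 steps: perfect packing
```

Static batching of the same traffic as [3, 10] then [2, 5] would take 15 steps, because the 3-token request holds its slot while the 10-token one finishes.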
KV Cache Memory and Its Coupling With Scheduling
Each concurrent request in a continuous batch needs to maintain its KV cache: the key and value tensors for every attention head at every layer, for every token processed so far. As batch size grows, total KV cache memory grows proportionally. Traditional implementations pre-allocated a contiguous block per request, sized to the maximum expected output length. This caused substantial fragmentation when requests ended before that maximum.
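The per-token cost is easy to estimate. The sketch below assumes a Llama-3.1-8B-like shape (32 layers, 8 KV heads under grouped-query attention, head dimension 128, fp16); those shape numbers are assumptions for illustration.

```python
def kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2) -> int:
    # key + value tensors for every KV head at every layer
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token()        # 131,072 bytes, i.e. 128 KiB per token
per_request = per_token * 4096          # a 4,096-token context
print(f"{per_token} B/token, {per_request / 2**30:.2f} GiB per 4k-token request")
```

At 0.5 GiB per 4k-token request, a few dozen concurrent long-context requests exhaust an 80 GB GPU's spare memory even before weights are counted, which is why cache fragmentation matters.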
vLLM’s PagedAttention addressed this by applying the virtual memory paging concept to KV cache storage. Rather than requiring contiguous allocation, PagedAttention stores KV cache in fixed-size non-contiguous blocks with a logical-to-physical mapping. The practical result is higher memory utilization and larger sustainable concurrent batch sizes.
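The mapping works like a page table. This toy version allocates fixed-size physical blocks from a free pool as a sequence grows and translates a logical token position to a (physical block, offset) pair; the class and its methods are hypothetical, though the 16-token block size matches vLLM's default.

```python
BLOCK = 16  # tokens per block

class BlockTable:
    def __init__(self, free_blocks):
        self.free = list(free_blocks)   # pool of physical block ids
        self.blocks = []                # logical block index -> physical block id

    def append_token(self, pos: int):
        if pos % BLOCK == 0:            # crossed a logical block boundary
            self.blocks.append(self.free.pop())

    def physical(self, pos: int):
        return self.blocks[pos // BLOCK], pos % BLOCK

table = BlockTable(free_blocks=range(100))
for pos in range(40):                   # 40 tokens -> 3 blocks, none contiguous
    table.append_token(pos)
print(table.physical(33))               # third physical block, offset 1
```

Because blocks return to the pool the moment a request finishes, memory lost to over-provisioned contiguous allocations disappears.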
Memory efficiency and scheduling efficiency are coupled: you cannot run large continuous batches without solving KV cache fragmentation, and solving fragmentation without good scheduling does not fully realize the available gains.
Prefill-Decode Interference
First-principles explanations of continuous batching typically stop at the scheduling mechanism, but the practical deployment concerns go further.
When a new request arrives in a continuous batching system, it must run its prefill phase before joining the decode pipeline. A long prompt, say 4,000 tokens, requires non-trivial GPU compute. During that prefill step, the GPU resources that in-flight decode requests would otherwise use are partially consumed by the prefill job.
In production, prefill-decode interference shows up as spikes in per-token decode latency that correlate with the arrival of long-context requests. The naive continuous batching formulation treats all iterations as equal-priority, which works when all active requests are in the decode phase but degrades when a large prefill joins an active batch.
This is measurable in deployments that serve mixed short-query and long-context traffic, such as a system handling both a short-message chat interface and a document summarization endpoint on the same serving infrastructure. The prefill cost of a long document lands on every other user’s decode latency for that step.
Chunked Prefill
Chunked prefill addresses the interference problem by splitting long prefill computations across multiple iterations. Instead of processing an entire prompt in one step, the scheduler divides it into segments and interleaves those segments with ongoing decode steps.
This increases time-to-first-token for any individual long-context request, since its prefill now spans multiple iterations. But it bounds the compute any single request can monopolize per step, and existing decode jobs see more predictable per-step latency.
vLLM added chunked prefill support in version 0.4. The relevant knobs are enable_chunked_prefill and max_num_batched_tokens (exposed as --enable-chunked-prefill and --max-num-batched-tokens on the server CLI), the latter setting an upper bound on total tokens processed per iteration:
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_chunked_prefill=True,
    max_num_batched_tokens=512,  # upper bound on tokens per iteration
)
With a chunk size of 512 tokens, an 8,192-token prefill runs across 16 iterations rather than one. Each iteration takes roughly the same wall-clock time, but no single iteration starves the decode requests in the batch.
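The interleaving policy can be sketched as a per-step token budget. The hypothetical helper below spends the budget on decode tokens first (one per active request) and gives whatever remains to the pending prefill, which is the general shape of a decode-prioritizing chunked scheduler, not vLLM's actual scheduler code.

```python
def plan_step(budget: int, num_decode: int, prefill_remaining: int):
    """Return (decode_tokens, prefill_chunk) for one iteration."""
    decode_tokens = min(num_decode, budget)          # one token per decode request
    prefill_chunk = min(prefill_remaining, budget - decode_tokens)
    return decode_tokens, prefill_chunk

# 512-token budget, 32 active decodes, 8,192-token prompt arriving:
print(plan_step(512, 32, 8192))   # (32, 480)
```

With 32 decodes sharing each budget, the prefill actually spans ceil(8192 / 480) = 18 steps rather than 16, but every one of those steps stays within the 512-token bound, which is the point.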
SGLang implements similar chunked context scheduling. TensorRT-LLM has had chunked context as a first-class feature since early versions, reflecting NVIDIA’s emphasis on production latency predictability.
The chunk size is a tunable parameter that reflects workload latency priorities. Smaller chunks reduce per-step latency variance for decode requests but increase time-to-first-token for long prompts. Larger chunks favor prompt throughput at the cost of decode stability. Choosing the right value requires knowing your traffic distribution: high-volume short-query workloads tolerate larger chunks; latency-sensitive mixed workloads benefit from smaller ones.
Speculative Decoding and Prefix Caching as Further Refinements
Chunked prefill is not the only layer built on top of continuous batching. Speculative decoding addresses a different inefficiency: the decode phase, even with a well-managed batch, is memory-bandwidth-bound because each token generation step loads the full model weights once. Speculative decoding uses a small draft model to generate multiple candidate tokens per step, then verifies them in parallel with the main model. When candidates are accepted, you get multiple tokens for roughly one forward pass of the main model.
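The verification step can be illustrated with a greedy token-matching sketch. Real systems verify against the target model's probability distribution using rejection sampling; this simplified version just accepts draft tokens while they agree with the (here simulated) target outputs and substitutes the target's token at the first mismatch.

```python
def verify(draft_tokens, target_tokens):
    """Greedy verification: accept draft tokens while they match the target,
    then take the target's token at the first disagreement."""
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)    # target's token replaces the first mismatch
            break
    return accepted

# Draft proposes 4 tokens; the target agrees on the first 2:
print(verify([5, 9, 3, 7], [5, 9, 8, 7]))   # [5, 9, 8] -> 3 tokens this step
```

Three tokens emerge from one target-model forward pass, which is where the bandwidth savings come from; a mismatch on the first token still yields one token, so the scheme never emits less than plain decoding.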
Prefix caching addresses another cost: requests that share a common prefix, such as a fixed system prompt prepended to every user message, re-run the same prefill computation on every request. Systems like vLLM and SGLang can cache the KV activations for a known prefix and reuse them across requests, converting repeated prefill work into a cache hit.
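A toy version of the lookup makes the saving visible. Production systems key cached KV blocks by hashes of token blocks; this sketch keys a dict on the whole prefix and reports how many tokens still need fresh prefill, with the function and cache structure being hypothetical stand-ins.

```python
cache = {}  # prefix tokens -> stand-in for cached KV blocks

def prefill_with_cache(tokens, system_prefix):
    """Return (cache_hit, tokens_needing_fresh_prefill)."""
    key = tuple(system_prefix)
    reused = key in cache
    if not reused:
        cache[key] = f"kv-for-{len(system_prefix)}-tokens"  # fake KV entry
    # only the suffix beyond the cached prefix needs fresh prefill
    return reused, len(tokens) - len(system_prefix)

sys_prompt = list(range(200))                               # 200-token system prompt
print(prefill_with_cache(sys_prompt + [7, 8], sys_prompt))  # (False, 2): cold start
print(prefill_with_cache(sys_prompt + [9], sys_prompt))     # (True, 1): cache hit
```

After the first request, every message sharing the 200-token system prompt pays prefill only for its own suffix, typically a small fraction of the total.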
These are all responses to specific costs that emerge from running a continuous batch at scale. Each one trades something: chunked prefill trades individual prompt latency for decode stability; speculative decoding trades extra compute on the draft model for better memory bandwidth utilization; prefix caching trades memory for repeated prefill avoidance. Understanding why each exists makes it easier to decide which ones a given deployment needs.
The Broader Pattern
Continuous batching restructures the latency-throughput trade-off rather than eliminating it. Higher throughput comes from overlapping more concurrent requests, at the cost of higher and less predictable per-request latency under load. Each subsequent refinement narrows a specific failure mode of the basic scheduler: chunked prefill handles prefill interference, speculative decoding addresses memory bandwidth in the decode phase, and prefix caching eliminates redundant prefill work.
The HuggingFace article is a solid grounding in the fundamentals, worth reading if you want to understand what iteration-level scheduling is doing mechanically. The engineering built on top of it is what determines whether those fundamentals translate into a deployment that behaves well under real traffic.