Serving LLMs at Scale: How Continuous Batching Rewired the Inference Stack
Source: huggingface
The HuggingFace blog recently published a first-principles walkthrough of continuous batching, originally released in November 2025. It is a good piece of pedagogical writing that covers KV caching, chunked prefill, and ragged batching in sequence. What it leaves underexplored is why the industry converged on this particular set of techniques, how they interact with each other under real load, and where the remaining friction lives. That is what I want to get into here.
The Problem That Static Batching Creates
To understand continuous batching, you have to sit with the static batching problem for a moment. When you batch N requests together in a conventional setup, you need rectangular tensors. Every sequence in the batch must have the same length, so you pad the shorter ones. This is already wasteful, but it becomes actively harmful when sequences finish at different times.
Imagine eight requests, each with up to 100 tokens of output. In practice, some finish at token 12, others at token 87. Under static batching, the GPU keeps running the finished ones as padding until the longest completes. The compute is just gone. The HuggingFace post puts a number on it: inserting a single new prompt into a batch of B = 8 with sequence length n = 100 incurs (n-1)(B-1) = 99 × 7 = 693 wasted padding tokens from alignment alone. Scale that to thousands of concurrent requests and you have a system that is often more than 50% idle.
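The post's accounting is easy to reproduce (the function name here is mine, not the post's):

```python
def padding_waste(batch_size: int, seq_len: int) -> int:
    """Padding tokens incurred when one fresh prompt joins a running
    batch, per the post's (n - 1)(B - 1) accounting: the new sequence
    forces alignment across the other B - 1 sequences of length n."""
    return (seq_len - 1) * (batch_size - 1)

print(padding_waste(8, 100))  # → 693
```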
The deeper issue is that LLM inference has two fundamentally different operational phases: prefill and decode. Prefill processes the entire prompt in parallel, making it compute-bound and arithmetically dense. Decode generates one token per step for every active sequence, making it memory-bandwidth-bound and compute-sparse. Static batching treats them identically, which suits neither.
Iteration-Level Scheduling: The Core Idea
The conceptual breakthrough came from the Orca paper (OSDI 2022) by Yu et al. at Seoul National University. They called it iteration-level scheduling: instead of holding a batch together for an entire request, release slots at every generation step. When a sequence emits its end-of-sequence token, that slot goes back to the scheduler immediately. A waiting request can fill it on the very next forward pass.
This sounds obvious in retrospect, but it required rethinking the serving architecture significantly. The batch composition changes every iteration. You cannot precompile static shapes for CUDA kernels if the tensor dimensions shift step-to-step. Orca reported up to 36.9x higher throughput than prior work on long-generation workloads, which is the kind of number that gets an idea adopted quickly.
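To see why releasing slots per iteration matters, here is a toy simulation (my own construction, not Orca's benchmark): static batching runs each group of requests until its longest member finishes, while iteration-level scheduling refills a slot on the step after it frees.

```python
def static_steps(lengths, batch_size):
    """Static batching: each group of batch_size requests runs until
    its longest member finishes; shorter ones idle as padding."""
    groups = [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]
    return sum(max(g) for g in groups)

def continuous_steps(lengths, batch_size):
    """Iteration-level scheduling: when a sequence finishes, a queued
    request takes its slot on the very next forward pass."""
    queue = list(lengths)
    slots = [queue.pop(0) for _ in range(min(batch_size, len(queue)))]
    steps = 0
    while slots:
        steps += 1
        slots = [s - 1 for s in slots]       # one decode token per slot
        slots = [s for s in slots if s > 0]  # release finished sequences
        while queue and len(slots) < batch_size:
            slots.append(queue.pop(0))       # refill immediately
    return steps

lengths = [12, 87, 30, 45, 9, 100, 60, 22] * 4   # 32 toy requests
print(static_steps(lengths, 8), continuous_steps(lengths, 8))
```

On this toy workload static batching burns 400 steps, while the iteration-level schedule finishes in well under that, bounded below by ceil(1460 / 8) = 183 steps of total work.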
KV Caching: The Memory Trade That Makes It Possible
Continuous batching would be impractical without the KV cache. During autoregressive generation, each new token needs to attend to every previous token. Without caching, producing token N means re-running attention over tokens 0 through N-1 at every step, so each step costs O(n²) and a full n-token generation costs O(n³).
The KV cache stores the key and value projections from each attention layer so they can be reused. Each step becomes O(n): you compute the Q, K, and V projections for the current token only, append the new K and V to the cache, and run attention against the full cached history. The cost is memory rather than compute.
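A toy single-head decode loop makes the bookkeeping concrete (numpy, illustrative shapes only; real implementations cache per layer and per head):

```python
import numpy as np

def attend(q, K, V):
    """Single-head attention for one new query vector against a history."""
    scores = q @ K.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

rng = np.random.default_rng(0)
d = 16
K_cache = np.zeros((0, d))
V_cache = np.zeros((0, d))
outputs = []
for step in range(5):
    # stand-ins for the current token's K, V, Q projections
    k, v, q = rng.normal(size=(3, d))
    K_cache = np.vstack([K_cache, k])   # append once, reuse forever:
    V_cache = np.vstack([V_cache, v])   # no reprocessing of old tokens
    outputs.append(attend(q, K_cache, V_cache))
```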
For a concrete sense of scale, the HuggingFace post works through Llama-2-7B:
Cache size per token = 2 (K and V) × L layers × H heads × A head_dim
                     = 2 × 32 × 32 × 128
                     = 262,144 values
At float16 (2 bytes each), that is 524,288 bytes, or 512 KB per token. (The post quotes 16 KB per token; that is a per-layer slice of the same number: 524,288 / 32 = 16,384 bytes.) For a 4096-token context, a single Llama-2-7B request therefore consumes 2 GB of KV cache. With 80 GB of VRAM on an A100 and model weights consuming roughly 13 GB in float16, you have around 60 GB usable for KV cache after activations and overhead, supporting maybe 30 simultaneous 4096-token requests before memory pressure hits. This ceiling defines how aggressive your scheduler can actually be.
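The budget arithmetic is worth having as a function (a sketch with Llama-2-7B's shapes; the 60 GB usable figure is the rough number left after weights and overhead):

```python
def kv_bytes_per_token(n_layers=32, n_heads=32, head_dim=128, dtype_bytes=2):
    """One K and one V vector per layer and head, per token."""
    return 2 * n_layers * n_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token()       # Llama-2-7B shapes
per_request = per_token * 4096         # one full 4096-token context
usable = 60 * 2**30                    # rough KV budget on an 80 GB A100
print(per_token // 1024, "KB/token;", usable // per_request, "concurrent requests")
# → 512 KB/token; 30 concurrent requests
```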
Chunked Prefill: Fitting Variable-Length Prompts
The prefill phase processes an entire prompt in one pass, which is efficient but creates a memory spike. A 32K-token prompt on Llama-2-7B would try to materialize an enormous set of intermediate activations all at once. Chunked prefill solves this by splitting the prompt into memory-budget-sized pieces and processing them sequentially, accumulating the KV cache incrementally.
The algorithm is straightforward:
For a prompt of length n, given memory budget m (tokens per chunk):

    num_chunks = ceil(n / m)
    kv_cache = empty
    for i in range(num_chunks):
        chunk = prompt[i*m : (i+1)*m]
        kv_cache = forward(chunk, prepend=kv_cache)
Each chunk prepends the KV states from prior chunks before running attention. The attention mask ensures each new chunk’s tokens can see everything before them but nothing after. This is standard causal masking applied incrementally.
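A numpy sketch shows the key property: chunked prefill with an accumulated cache produces exactly the same activations as one full pass (single head, random stand-in projections):

```python
import numpy as np

def causal_attend(Q, K, V, offset):
    """Attention where query i (global position offset + i) may see
    cached keys 0 .. offset + i, and nothing after."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    n_q, n_k = scores.shape
    future = np.arange(n_k)[None, :] > (offset + np.arange(n_q))[:, None]
    scores = np.where(future, -np.inf, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(1)
n, d, m = 10, 8, 4                     # prompt length, head dim, chunk budget
Q, K, V = rng.normal(size=(3, n, d))   # stand-in projections

full = causal_attend(Q, K, V, offset=0)          # one-shot prefill

outs, K_cache, V_cache = [], K[:0], V[:0]        # chunked prefill
for i in range(0, n, m):
    K_cache = np.vstack([K_cache, K[i:i + m]])   # cache grows chunk by chunk
    V_cache = np.vstack([V_cache, V[i:i + m]])
    outs.append(causal_attend(Q[i:i + m], K_cache, V_cache, offset=i))

assert np.allclose(full, np.vstack(outs))        # identical activations
```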
The subtle scheduling implication is that chunked prefill lets you mix prefill and decode sequences in the same batch. You can allocate, say, 1024 token slots: give 128 to a prefill chunk and 896 to decoding sequences. The GPU never goes idle waiting for a long prefill to complete before accepting more decodes. TGI (Text Generation Inference) and vLLM both use variants of this mixed-batch approach.
Ragged Batching: Eliminating the Rectangular Constraint
With continuous batching, sequences in the same batch have different lengths at every step. The classical approach pads them to a uniform length, but this reintroduces the waste we were trying to eliminate. Ragged batching (also called variable-length batching) concatenates sequences along the sequence dimension instead of padding them.
The attention mechanism needs modification to prevent tokens from different sequences interacting with each other. The solution is a block-diagonal attention mask:
Sequence A tokens: [a1, a2, a3]
Sequence B tokens: [b1, b2]
Concatenated: [a1, a2, a3, b1, b2]
Attention mask (causal, no cross-sequence):
a1 sees: [a1]
a2 sees: [a1, a2]
a3 sees: [a1, a2, a3]
b1 sees: [b1]
b2 sees: [b1, b2]
This block-diagonal structure is enforced by setting off-diagonal blocks to -inf (or False in boolean masks) before the softmax. The result is that the GPU processes a single heterogeneous sequence while respecting the isolation between requests. Flash Attention 2 has first-class support for variable-length inputs via its varlen API, which avoids materializing the full attention matrix for sparse masks.
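Constructing such a mask takes a few lines (a numpy sketch; real kernels like Flash Attention's varlen path work from cumulative sequence lengths instead of materializing this matrix):

```python
import numpy as np

def ragged_causal_mask(lengths):
    """True = may attend: causal within a sequence, blocked across
    sequences concatenated along the token axis."""
    seq_id = np.repeat(np.arange(len(lengths)), lengths)
    pos = np.concatenate([np.arange(n) for n in lengths])
    same_seq = seq_id[:, None] == seq_id[None, :]
    causal = pos[:, None] >= pos[None, :]
    return same_seq & causal

print(ragged_causal_mask([3, 2]).astype(int))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [0 0 0 1 0]
#  [0 0 0 1 1]]
```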
The Scheduler as Bottleneck
Here is where the HuggingFace post ends and the real engineering begins. Once you have iteration-level scheduling, chunked prefill, KV caching, and ragged batching working together, the scheduler becomes the critical path. It runs synchronously between every forward pass and must make decisions quickly enough not to add measurable latency.
The basic scheduling loop looks like this:
def schedule_next_batch(waiting_queue, running_sequences, memory_budget):
    batch = []
    # Decoding sequences always get a slot (1 token each)
    decode_slots = sum(1 for s in running_sequences if s.phase == 'decode')
    remaining = memory_budget - decode_slots
    batch.extend(running_sequences)
    # Fill what is left of the token budget with prefill chunks
    for request in waiting_queue:
        if remaining <= 0:
            break
        chunk_size = min(remaining, request.remaining_prefill_tokens)
        if chunk_size > 0:
            batch.append(request.next_chunk(chunk_size))
            remaining -= chunk_size
    return batch
vLLM’s scheduler is more sophisticated: it tracks memory pages explicitly (via PagedAttention), handles preemption when memory pressure spikes, and makes decisions about evicting KV cache pages to CPU memory. PagedAttention is the companion innovation to continuous batching: it applies the virtual memory and paging concept from operating systems to the KV cache, eliminating fragmentation and enabling fine-grained memory sharing for requests with common prefixes (a significant win for chatbot systems where every conversation starts with the same system prompt).
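The paging idea reduces to a page table plus reference counts. A toy sketch (my own simplification, not vLLM's actual block manager):

```python
class PagedKVCache:
    """Toy page table: the KV cache is carved into fixed-size blocks,
    and each sequence maps logical block index -> physical block id.
    Shared prefixes point at the same physical blocks (refcounted)."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}
        self.tables = {}            # seq_id -> list of physical blocks

    def allocate(self, seq_id, num_blocks):
        blocks = [self.free.pop() for _ in range(num_blocks)]
        for b in blocks:
            self.refcount[b] = 1
        self.tables[seq_id] = blocks

    def fork(self, parent, child):
        # share the parent's blocks, copy-on-write style
        self.tables[child] = list(self.tables[parent])
        for b in self.tables[child]:
            self.refcount[b] += 1

    def release(self, seq_id):
        for b in self.tables.pop(seq_id):
            self.refcount[b] -= 1
            if self.refcount[b] == 0:
                del self.refcount[b]
                self.free.append(b)

cache = PagedKVCache(num_blocks=8)
cache.allocate("system_prompt", 2)      # shared system-prompt prefix
cache.fork("system_prompt", "chat_a")   # two chats reuse those blocks
cache.fork("system_prompt", "chat_b")
print(len(cache.free))                  # → 6: no extra memory for the forks
```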
Where Friction Remains
Continuous batching is not a free lunch. Three problems persist in production deployments.
First, prefill and decode have different hardware requirements. Prefill is compute-bound (high arithmetic intensity), while decode is memory-bandwidth-bound (low arithmetic intensity, just one token per request per step). Running them together in mixed batches is a compromise that suits neither optimally. Some production systems, including certain configurations of TensorRT-LLM, use disaggregated prefill/decode: separate GPU pools handle each phase, and only KV cache tensors are transferred between them. This maximizes hardware utilization but adds latency from the transfer.
Second, the KV cache memory ceiling creates head-of-line blocking. When the cache is full, new requests cannot start prefill until running requests complete and free pages. The scheduler must decide whether to preempt a running request (expensive, requires eviction to CPU) or queue the new arrival. There is no universally correct answer, and the right policy depends on traffic patterns.
Third, continuous batching assumes you can predict memory requirements per request. In practice, requests generate variable numbers of tokens depending on model behavior, and users do not always specify max_tokens. Overcommitting KV cache pages leads to mid-request evictions; undercommitting wastes capacity. Systems like LMDeploy and vLLM have tunable parameters for this, but getting them right for a specific workload requires profiling.
Why This Matters Beyond Throughput
The throughput gains from continuous batching are well-documented: the Orca paper showed ~37x improvements on long-generation workloads, and vLLM's launch benchmarks reported up to 24x the throughput of naive Hugging Face Transformers serving (and up to 3.5x over TGI at the time). But there is a less-discussed consequence for pricing and accessibility.
Before continuous batching was widely deployed, LLM serving was expensive enough that most API providers priced heavily on compute time. With high GPU utilization from continuous batching, the cost per token dropped significantly. This is a structural shift: the per-token pricing model common today would be economically difficult to sustain on static-batching infrastructure.
For anyone building on top of inference APIs or deploying their own models, understanding these mechanics matters for capacity planning, cost modeling, and knowing what levers to pull when latency or throughput degrades. The HuggingFace walkthrough is a solid foundation. The rest is working through the scheduler code in vLLM or TGI, running load tests, and developing intuition for how prefill/decode ratios shift under different traffic patterns.