The Two-Phase Problem That Continuous Batching Had to Solve Twice
Source: huggingface
The HuggingFace post on continuous batching, originally published in November 2025, builds the concept carefully from the attention mechanism up through KV caching, ragged batching, and dynamic scheduling. It is a good foundation. What it stops short of explaining is why naive continuous batching, implemented exactly as described, still produces latency problems in production, and what serving systems actually do about it.
The issue is that LLM inference has two phases with fundamentally different computational profiles, and batching them together without care causes one to sabotage the other.
Prefill and Decode Are Not the Same Computation
When a request arrives at an LLM server, the system first processes the entire prompt in a single forward pass. This is the prefill phase. All prompt tokens are processed in parallel; the transformer sees a matrix of shape [batch_size, prompt_length, hidden_dim] and the GPU's matrix multiplication units run at high utilization. Prefill is compute-bound. On modern hardware, short prompts can even look length-insensitive: until the GPU's arithmetic units saturate, a 2048-token prompt finishes in nearly the same wall-clock time as a 1024-token prompt, because the extra tokens mostly fill otherwise idle compute units. Past saturation, prefill time grows roughly linearly with prompt length.
Decoding is different. Each step generates exactly one new token per sequence, so the batch shape for decode is [batch_size, 1, hidden_dim]. The GPU loads the entire model's weight matrices to multiply against a tiny input, so most of each step's time goes to moving weights from HBM to the compute units rather than to the multiplication itself. Decode is memory-bandwidth-bound. Doubling the batch size in decode roughly doubles throughput with little latency penalty, because the bottleneck is weight loading, not compute.
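The compute-bound versus memory-bound distinction can be made concrete with a back-of-the-envelope arithmetic-intensity estimate for a single weight matrix multiply. The sketch below uses an illustrative 7B-class hidden dimension of 4096 in fp16; `arithmetic_intensity` is a hypothetical helper, not any library's API. An A100's roofline ridge point sits around 160 FLOPs/byte, so prefill lands well into the compute-bound regime and decode well into the memory-bound one.

```python
# Rough arithmetic intensity (FLOPs per byte of HBM traffic) for one
# GEMM against a [hidden, hidden] weight matrix. Illustrative dims only.
def arithmetic_intensity(batch, tokens, hidden=4096, dtype_bytes=2):
    flops = 2 * batch * tokens * hidden * hidden       # multiply-accumulate
    weight_bytes = hidden * hidden * dtype_bytes       # weights loaded once
    act_bytes = 2 * batch * tokens * hidden * dtype_bytes  # activations in/out
    return flops / (weight_bytes + act_bytes)

prefill = arithmetic_intensity(batch=1, tokens=2048)   # many tokens per pass
decode = arithmetic_intensity(batch=8, tokens=1)       # one token per sequence
print(f"prefill ≈ {prefill:.0f} FLOPs/byte")           # → prefill ≈ 1024 FLOPs/byte
print(f"decode  ≈ {decode:.0f} FLOPs/byte")            # → decode  ≈ 8 FLOPs/byte
```

At ~8 FLOPs/byte, decode leaves the GPU's arithmetic units mostly idle while the memory system streams weights, which is exactly why adding decode sequences to the batch is nearly free.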
This distinction matters because the KV cache sits at the boundary. During prefill, you fill the KV cache from scratch for every prompt token. During decode, you append one entry per step and read back the entire cache for attention computation. The access patterns, memory pressure, and compute intensity are different enough that the two phases behave like different workloads sharing the same hardware.
The Interference Problem
Continuous batching, at its core, says: when a sequence finishes decoding, immediately admit a new request rather than waiting for the entire batch to drain. This solves the straggler problem from static batching. What it does not solve is what happens when a new request enters the batch and triggers prefill while other sequences are actively decoding.
A prefill for a 2048-token prompt can occupy the GPU for tens of milliseconds on a mid-sized model. During that time, every decoding sequence in the batch is blocked from generating its next token. From those sequences' perspective, their inter-token latency just spiked by however long the prefill took. At scale, when new requests arrive continuously, you end up with decoding sequences experiencing irregular per-token latency that correlates with the lengths of new requests being prefilled alongside them.
The effect shows up as tail latency. Median time-to-first-token looks reasonable because most requests complete quickly, but P99 inter-token latency for streaming responses jumps whenever a long prefill lands in the same batch. Users see this as jitter: tokens arrive in bursts with gaps between them rather than at a steady rate.
Chunked Prefill as a Scheduling Tool
The solution is to never let a single prefill monopolize a forward pass. Instead of processing an entire prompt in one step, chunked prefill breaks it into chunks of at most m tokens, interleaving those chunks with decode steps from active sequences.
The HuggingFace article describes chunked prefill as a memory management technique: if the prompt exceeds the available token budget, split it. That framing is correct but undersells what chunked prefill actually enables. With a small enough chunk size, you can bound the worst-case prefill contribution to any single forward pass. If the active decode sequences consume 100 tokens of the per-step budget and the chunk size is also 100, a new request contributes at most 100 tokens of prefill work per forward pass before the scheduler returns control to the decoding sequences.
This turns prefill scheduling into a knob. Setting a small chunk size minimizes decode latency jitter at the cost of longer time-to-first-token for the incoming request, since its prompt takes more forward passes to fully process. Setting a large chunk size reduces time-to-first-token for new requests at the cost of adding latency variance to existing decode streams. There is no universally correct setting; the right value depends on whether your workload is more sensitive to streaming latency or first-token latency.
vLLM exposes this via --max-num-batched-tokens, which bounds the total tokens processed per forward pass and indirectly controls how much prefill can land in a single step. TGI (Text Generation Inference) handles it through its max_batch_prefill_tokens parameter. Both systems implement the same underlying mechanism: a scheduler that tracks a token budget per step, fills it with decode sequences first, and uses remaining capacity for chunked prefill work.
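The decode-first, chunked-prefill loop described above can be sketched in a few lines. This is a minimal illustration of the mechanism, not vLLM's or TGI's actual scheduler; `Request` and `plan_step` are hypothetical names, and the budget and chunk sizes are arbitrary.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    prompt_len: int
    prefilled: int = 0          # prompt tokens already processed

def plan_step(decoding, waiting, token_budget=512, chunk_size=128):
    """Build one forward pass: decode tokens first, then prefill chunks.

    `decoding` holds sequences mid-generation (one token each per step);
    `waiting` is a deque of requests whose prompts are not fully prefilled.
    """
    plan, budget = [], token_budget
    # Every active decode sequence gets its single token first, so no
    # stream ever stalls behind a long prompt.
    for seq in decoding:
        if budget == 0:
            break
        plan.append((seq, 1))
        budget -= 1
    # Remaining budget goes to chunked prefill, bounded per request.
    while waiting and budget > 0:
        req = waiting[0]
        chunk = min(chunk_size, req.prompt_len - req.prefilled, budget)
        plan.append((req, chunk))
        req.prefilled += chunk
        budget -= chunk
        if req.prefilled == req.prompt_len:
            waiting.popleft()    # prompt done; it decodes next step
        else:
            break                # at most one partial chunk per step
    return plan
```

Because decode tokens are placed before any prefill work, the worst-case contribution of a new prompt to any single step is exactly one chunk, which is the bound the chunk-size knob controls.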
The Memory Accounting That Makes This Work
For chunked prefill to be safe, the server needs precise tracking of how much KV cache memory each sequence is using at any moment. A prefill chunk allocates cache entries for the tokens it processes; a decode step allocates one more entry. Admitting a new request requires knowing whether enough cache space exists for at least one chunk of its prompt.
For a model with L layers, H key-value heads, and head dimension d, storing the KV cache for a single token requires 2 × L × H × d × sizeof(dtype) bytes. For Llama-2-7B in float16, that works out to 2 × 32 × 32 × 128 × 2 = 524,288 bytes, or 512 KB per token. A 2048-token context therefore needs roughly 1 GB of KV cache. On an 80 GB A100 after loading 14 GB of model weights in float16, you have around 66 GB available for cache, enough for about 130,000 tokens worth of active contexts across all sequences in the batch.
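The sizing arithmetic above is mechanical enough to script. The helper below just encodes the same formula with the same Llama-2-7B numbers (32 layers, 32 KV heads, head dimension 128, fp16); the function name is illustrative.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    # K and V tensors (factor of 2), per layer, per head, per token.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token(32, 32, 128)      # 524,288 B = 512 KB
ctx_gib = per_token * 2048 / 2**30               # 2048-token context
capacity = ((80 - 14) * 2**30) // per_token      # 80 GB A100, 14 GB weights
print(per_token, f"{ctx_gib:.1f} GiB", capacity)  # → 524288 1.0 GiB 135168
```

Note how quickly the cache, not the weights, becomes the dominant memory consumer: at 512 KB per token, a few dozen long contexts exhaust an 80 GB card.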
PagedAttention, described in the vLLM paper by Kwon et al., handles this allocation non-contiguously. Rather than reserving a contiguous buffer for the maximum possible output length of each sequence, it allocates fixed-size pages (typically 16 tokens each) on demand. A sequence that generates 50 tokens gets 4 pages; one that generates 500 tokens gets 32 pages. Pages from different sequences can be scattered across GPU memory with no fragmentation penalty because the attention kernel uses a page table to locate them. This is directly analogous to how OS virtual memory maps virtual pages to physical frames, and for the same reason: contiguous allocation wastes space when allocation sizes are unpredictable.
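The page-table idea reduces to a small allocator. The toy class below captures the scheme in spirit, assuming 16-token pages; it is a sketch for intuition, not vLLM's block manager, and all names are made up.

```python
import math

class PagedKVAllocator:
    """Toy non-contiguous KV cache allocator (page = 16 tokens)."""

    def __init__(self, num_pages, page_size=16):
        self.page_size = page_size
        self.free = list(range(num_pages))   # pool of physical page ids
        self.tables = {}                     # seq_id -> list of physical pages

    def ensure(self, seq_id, num_tokens):
        """Grow a sequence's page table to cover num_tokens cache entries."""
        table = self.tables.setdefault(seq_id, [])
        needed = math.ceil(num_tokens / self.page_size)
        while len(table) < needed:
            if not self.free:
                raise MemoryError("cache full: preempt or swap a sequence")
            table.append(self.free.pop())    # any free page; no contiguity needed
        return table

    def release(self, seq_id):
        """Return a finished sequence's pages to the pool."""
        self.free.extend(self.tables.pop(seq_id, []))
```

The page table is what the attention kernel consults to find a sequence's scattered pages, exactly the virtual-to-physical mapping role a page table plays in an OS.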
The combination of PagedAttention and chunked prefill is what makes the scheduler’s token budget accounting practical. Without PagedAttention, you need to know in advance how many tokens a sequence will generate, which you do not. Without chunked prefill, you cannot safely admit long prompts without stalling the decode pipeline.
Scheduling Policy Choices
Once the mechanism exists, the interesting decisions are in scheduling policy. Which requests get admitted first? When memory pressure is high, which sequences get preempted?
Simple first-come-first-served admission works but can cause head-of-line blocking: one long, slow request holds cache space that could serve several short requests. Priority scheduling helps but requires the client to declare priority, which most do not. Some systems implement shortest-job-first heuristics, estimating output length from prompt characteristics to prefer requests likely to free their cache slots quickly.
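A shortest-job-first admission order is a one-liner once you have a length estimate. The sketch below assumes a caller-supplied `estimate` function; real estimators (from prompt characteristics, or a learned predictor) are workload-specific, and nothing here is any serving system's API.

```python
import heapq

def admit_order(pending, estimate):
    """Yield pending requests shortest-estimated-job-first.

    `estimate(req)` guesses total tokens (prompt + expected output);
    the index breaks ties so requests never need to be comparable.
    """
    heap = [(estimate(req), i, req) for i, req in enumerate(pending)]
    heapq.heapify(heap)
    while heap:
        _, _, req = heapq.heappop(heap)
        yield req
```

The usual SJF caveat applies: a bad estimator turns this back into head-of-line blocking with extra steps, so systems that use it keep the heuristic conservative.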
Preemption is the escape valve. If the scheduler admits too many sequences and runs out of cache space before they finish, it must either swap sequences' KV cache to CPU memory or discard it and recompute from scratch on re-admission (recompute is often faster than swap on modern hardware). vLLM implements both strategies. The scheduler decides which sequences to preempt based on priority and current cache usage, then re-admits them when space opens up.
These scheduling decisions collectively determine the serving system's behavior under load. Throughput, latency, and fairness are in tension, and the right configuration depends on your application. A code completion tool where users wait actively needs a fast first token and hates jitter. A batch summarization pipeline cares only about throughput. The parameters governing chunk size, admission policy, and preemption strategy are the knobs that expose this tradeoff space.
What This Means in Practice
Continuous batching as a concept is simple: retire finished sequences and admit new ones at every iteration. The implementation that actually delivers predictable latency at high throughput requires chunked prefill for phase isolation, PagedAttention for non-contiguous cache management, and a scheduler sophisticated enough to balance the token budget across competing requests.
The HuggingFace article covers the first-principles version of continuous batching well. The production version is built on top of those principles but adds layers that exist specifically because real workloads mix requests of wildly varying lengths, some of which are actively streaming to users who notice irregular pacing. The gap between the conceptual explanation and the operational system is where most of the engineering work lives.