What Comes After Continuous Batching: The Bottleneck Chain in LLM Serving
Source: huggingface
Continuous batching is now table stakes for any serious LLM serving system. The HuggingFace article on continuous batching from first principles, originally published in November 2025, walks through why static batching fails and how iteration-level scheduling fixes it. What the article necessarily compresses is the second-order effect: continuous batching, once deployed, immediately exposed a new bottleneck that required a different class of solution entirely, and that solution exposed yet another one. This post traces the full chain.
Why Static Batching Fails
LLM inference has two phases with fundamentally different computational profiles.
The prefill phase processes all input tokens in a single forward pass. Every token in the prompt attends to every other token; the computation scales with the square of sequence length but runs in parallel across the entire sequence. The GPU is doing dense matrix multiplications: compute-bound work.
The decode phase generates one token per forward pass, autoregressively. Each step produces a single new token. The per-step computation is small, but it requires reading the entire model’s weights from GPU memory to produce that one token. This is memory-bandwidth-bound: the ratio of floating-point operations to bytes transferred is very low.
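The memory-bound claim is easy to check with back-of-the-envelope arithmetic. A rough sketch, using illustrative figures for a 7B-parameter model in FP16 (not a benchmark):

```python
# Rough arithmetic-intensity estimate for ONE decode step of a 7B-class
# model in FP16. Illustrative numbers, not a measurement.
params = 7e9                  # model parameters
bytes_per_param = 2           # FP16
flops_per_token = 2 * params  # ~2 FLOPs per parameter per generated token
bytes_read = params * bytes_per_param  # every weight read once per step

intensity = flops_per_token / bytes_read  # FLOPs per byte transferred
print(f"arithmetic intensity: {intensity:.1f} FLOPs/byte")
# An A100 delivers ~312 TFLOPS FP16 against ~2 TB/s of HBM bandwidth, so it
# needs ~156 FLOPs/byte to stay compute-bound; decode's ~1 FLOP/byte is two
# orders of magnitude short of that.
```

Batching decode steps raises intensity almost for free: the same weight read serves every sequence in the batch, which is exactly why batching matters so much for this phase.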
Static batching groups multiple requests into a fixed batch and runs them together until all requests complete. The problem is that output lengths vary enormously. If one request in a batch of 16 generates 2000 tokens and the others finish after 50, those other 15 requests hold their completed KV caches in GPU memory for the entire remaining duration, and their GPU slots remain occupied by finished work that cannot serve anyone new.
Under realistic traffic with variable output lengths, GPU utilization in static batching systems falls to 20-40%. The GPU is not computing; it is waiting for the slowest request to drain the batch.
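A toy simulation makes the waste concrete. This assumes an exponential output-length distribution capped at 2000 tokens; the exact numbers are illustrative, but skewed length distributions of this shape are typical:

```python
import random

random.seed(0)

def static_batch_utilization(batch_size, num_batches, sample_len):
    """Fraction of slot-steps doing useful work when every batch
    runs until its longest request finishes (static batching)."""
    useful = total = 0
    for _ in range(num_batches):
        lengths = [sample_len() for _ in range(batch_size)]
        useful += sum(lengths)                 # tokens actually generated
        total += batch_size * max(lengths)     # slot-steps occupied
    return useful / total

# Skewed output lengths: most requests short, a few very long.
util = static_batch_utilization(
    batch_size=16, num_batches=1000,
    sample_len=lambda: min(int(random.expovariate(1 / 100)) + 1, 2000),
)
print(f"utilization: {util:.0%}")  # lands in the 20-40% range cited above
```

The gap between mean output length and the per-batch maximum is the whole story: every slot waits for the slowest request.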
What Iteration-Level Scheduling Actually Means at the Code Level
The Orca paper from OSDI 2022 named and formalized the fix: treat each forward pass iteration as the scheduling unit rather than the full request lifecycle. After every decode step, the scheduler checks for completed requests, removes them from the batch, and immediately admits new requests from the queue to fill the vacated slots.
This sounds simple, but it requires changing how the batch is represented in memory. In static batching, the batch is a fixed tensor of shape [batch_size, max_seq_len], padded to uniform length. The forward pass operates on this rectangular tensor. Continuous batching cannot use this representation because requests have different current sequence lengths.
Orca’s answer was selective batching. For layers where inter-request interaction does not matter (the feed-forward networks, layer norms, and projections), you concatenate all token representations along the batch/sequence dimension into a single flat tensor and process them together. For attention layers, where each request must attend only to its own KV cache, you process requests separately.
A simplified picture of what this looks like at the loop level:
```python
while running_requests or waiting_queue:
    # Admit new requests into slots vacated by finished ones.
    batch = scheduler.step(running_requests, waiting_queue)

    # Concatenate hidden states across requests: shape [total_tokens, hidden_dim]
    hidden = torch.cat([req.last_hidden for req in batch], dim=0)

    for layer_idx, layer in enumerate(model.layers):
        # Attention: per-request, each using its own KV cache
        attn_outputs = []
        for req in batch:
            q = layer.q_proj(hidden[req.token_slice])
            k, v = req.kv_cache[layer_idx]  # keys/values from prior steps
            attn_outputs.append(attention(q, k, v))
        hidden = torch.cat(attn_outputs, dim=0)

        # FFN: batched across all tokens from all requests
        hidden = layer.ffn(hidden)

    for req in batch:
        token = sample(hidden[req.token_slice][-1])  # last position's output
        req.generated_tokens.append(token)
        if token == EOS_TOKEN:
            running_requests.remove(req)
            yield req
```
This structure is what keeps utilization high: the batch is always full because the scheduler fills gaps immediately, iteration by iteration.
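The `scheduler.step` call is where the gap-filling happens. A minimal sketch, with hypothetical names, that caps only on batch size (a real scheduler also checks KV-cache headroom before admitting):

```python
from collections import deque

class Scheduler:
    """Minimal iteration-level scheduler: refill vacated slots each step."""
    def __init__(self, max_batch_size):
        self.max_batch_size = max_batch_size

    def step(self, running_requests, waiting_queue):
        # Admit waiting requests until the batch is full again.
        while waiting_queue and len(running_requests) < self.max_batch_size:
            running_requests.append(waiting_queue.popleft())
        return list(running_requests)

scheduler = Scheduler(max_batch_size=4)
running, waiting = [], deque(["r1", "r2", "r3", "r4", "r5", "r6"])
batch = scheduler.step(running, waiting)
print(batch)                   # ['r1', 'r2', 'r3', 'r4']
running.remove("r2")           # r2 hit EOS this iteration
batch2 = scheduler.step(running, waiting)
print(batch2)                  # ['r1', 'r3', 'r4', 'r5']: slot refilled
```

The decisive property is that admission happens between iterations, so a vacated slot is never idle for more than one forward pass.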
The Memory Problem That Continuous Batching Exposed
Once continuous batching raised GPU utilization toward 80-90%, the new bottleneck was KV cache memory. With many more requests in flight simultaneously, the GPU’s HBM filled up faster than before.
Naive KV cache management pre-allocates a contiguous region of memory per request at admission time, sized max_output_length × num_layers × 2 × num_kv_heads × head_dim × bytes_per_element. For LLaMA-2 70B in FP16 (80 layers, grouped-query attention with 8 KV heads of head dimension 128), this works out to roughly 0.3 MB per token. A system targeting 100 concurrent requests with a maximum output length of 2048 tokens reserves over 60 GB just for KV cache, on top of roughly 140 GB of weights, exceeding any single GPU’s capacity by a large margin.
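The pre-allocation arithmetic is easy to check. A small calculator, using LLaMA-2 70B’s published grouped-query attention configuration (80 layers, 8 KV heads of head dimension 128) as the example; note that GQA shrinks the per-token cost well below what a full multi-head layout would need:

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    # One key vector and one value vector per layer, per KV head.
    return num_layers * 2 * num_kv_heads * head_dim * bytes_per_elem

# LLaMA-2 70B: 80 layers, GQA with 8 KV heads of dimension 128, FP16.
per_token = kv_bytes_per_token(80, 8, 128)
print(f"{per_token / 2**20:.2f} MiB per token")  # 0.31 MiB

# Naive pre-allocation: 2048 tokens for each of 100 concurrent requests.
reserved = per_token * 2048 * 100
print(f"{reserved / 2**30:.0f} GiB reserved")    # 62 GiB
```

Without GQA (a full 8192-wide K and V per layer) the same model would need about 2.6 MB per token, which is why KV head count is the first lever serving systems reach for.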
Systems worked around this by setting conservative maximum output lengths and accepting high memory waste. The vLLM paper (SOSP 2023) measured that 60-80% of allocated KV cache memory went unused in practice, because actual outputs were much shorter than the reserved maximum.
vLLM introduced PagedAttention to solve this. The design is directly analogous to OS virtual memory paging.
KV cache is divided into fixed-size blocks, each holding the keys and values for a fixed number of tokens (the default is 16 tokens per block). A sequence’s logical KV cache is a list of block IDs; the actual physical blocks can be anywhere in GPU memory. A block table maps logical block numbers to physical GPU addresses:
Request A logical blocks: [block 0] → [block 1] → [block 2]
Physical GPU addresses: @0x1000 @0x4200 @0x0800 (non-contiguous)
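The allocator behind that mapping is simple in outline. A toy sketch (block contents omitted; only the logical-to-physical mapping is shown), with hypothetical names:

```python
class BlockAllocator:
    """Toy paged-KV allocator: per-request block tables over a shared
    physical pool. Real allocators also free blocks when requests finish."""
    def __init__(self, num_physical_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_physical_blocks))
        self.block_tables = {}   # request id -> list of physical block ids

    def append_token(self, req_id, position):
        table = self.block_tables.setdefault(req_id, [])
        if position % self.block_size == 0:   # current block just filled up
            table.append(self.free.pop())     # grab ANY free physical block
        return table

alloc = BlockAllocator(num_physical_blocks=64)
for pos in range(40):              # request "A" grows to 40 tokens
    alloc.append_token("A", pos)
print(alloc.block_tables["A"])     # 3 physical blocks, arbitrary addresses
```

Because `free.pop()` hands out whatever block is available, consecutive logical blocks land at unrelated physical addresses, which is exactly the situation the custom attention kernel must handle.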
A custom CUDA kernel handles the attention computation by using this block table to gather K and V tensors from non-contiguous addresses. The overhead is small, roughly 5-10% of attention computation time, and well worth the savings.
With PagedAttention, blocks are allocated one at a time as a sequence grows. Internal fragmentation is bounded by the block size: at most block_size - 1 slots wasted in the final block of any sequence. At 16 tokens per block, this is negligible. The vLLM paper reports around 4% KV cache waste, compared to 60-80% in prior systems. On an A100 80GB GPU serving LLaMA-13B, this translated to 24x higher throughput than naive HuggingFace Transformers and 3.5x higher than TGI.
PagedAttention also enables copy-on-write KV sharing. If two requests share an identical prompt prefix, they point to the same physical KV blocks for those tokens. When sequences diverge (beam search forks, or different continuations of the same system prompt), blocks are copied on demand. This is the foundation for prefix caching, which SGLang generalized via RadixAttention into a prefix trie that handles arbitrary shared prefixes across requests, yielding 5-10x throughput improvements on workloads with repeated common contexts.
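Copy-on-write over block tables reduces to reference counting. A minimal sketch of the mechanism (hypothetical names; real systems track refcounts inside the block manager):

```python
class SharedBlock:
    """Stand-in for one physical KV block; `data` represents its contents."""
    def __init__(self, data):
        self.data = data
        self.refcount = 1

def fork(block_table):
    """Fork a sequence: share every physical block, bumping refcounts."""
    for block in block_table:
        block.refcount += 1
    return list(block_table)

def write_block(block_table, idx, new_data):
    """Copy-on-write: copy the block only if another sequence still uses it."""
    block = block_table[idx]
    if block.refcount > 1:
        block.refcount -= 1
        block_table[idx] = SharedBlock(new_data)   # private copy
    else:
        block.data = new_data                      # exclusive: write in place

parent = [SharedBlock("kv-block-0"), SharedBlock("kv-block-1")]
child = fork(parent)                  # shares both blocks with parent
write_block(child, 1, "kv-block-1'")  # child diverges at block 1
print(child[0] is parent[0])   # True: shared prefix still one physical copy
print(child[1] is parent[1])   # False: diverged block was copied
```

Only the diverging block is duplicated; an arbitrarily long shared prefix stays physically deduplicated.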
The Latency Problem That PagedAttention Introduced
With continuous batching and PagedAttention working together, throughput was good. A new latency problem emerged that had been invisible before: prefill-induced head-of-line blocking.
Prefill is compute-intensive. A request with a 32,768-token context can take 200-400 milliseconds just to process the prompt before generating its first token. During that prefill iteration, every other request in the batch stalls, because the entire iteration budget is consumed by the single prefill operation.
For a service with mixed short and long inputs, this means that short requests that should return in tens of milliseconds instead wait hundreds of milliseconds whenever a long-context request enters the batch. Per-request tail latency degrades sharply even while aggregate throughput looks fine.
The fix, formalized in Sarathi-Serve from Microsoft Research in 2024, is chunked prefill. Instead of processing a long prompt in a single iteration, the prefill is split into fixed-size chunks (say, 512 tokens each). Each iteration processes one chunk of the prefill, interleaved with decode steps for all running requests:
Iteration 1: [Req A prefill tokens 0-511] + [Req B decode #140] + [Req C decode #89]
Iteration 2: [Req A prefill tokens 512-1023] + [Req B decode #141] + [Req C decode #90]
...
Iteration 64: [Req A prefill tokens 32256-32767] + [Req B decode #203] + ...
Iteration 65: [Req A decode #0] + [Req B decode #204] + ...
Decode requests make steady progress regardless of what new long-context requests are being admitted. Sarathi-Serve demonstrated up to 6x reduction in P99 time-to-first-token under mixed workloads, with no throughput regression. The key scheduling insight is that a chunk of prefill tokens and a set of decode tokens can be packed into the same forward pass iteration up to a configurable token budget, and the scheduler should prefer keeping decode requests unblocked over fast-pathing new request prefills.
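That token-budget packing can be sketched in a few lines. Names and budget values here are hypothetical choices for illustration:

```python
def pack_iteration(decode_reqs, prefill_req, token_budget=768, chunk=512):
    """Build one iteration's token batch: decodes first (one token each),
    then as much of the pending prefill as the leftover budget allows."""
    batch = [(req_id, 1) for req_id in decode_reqs]  # each decode = 1 token
    budget = token_budget - len(decode_reqs)
    req_id, remaining = prefill_req
    take = min(chunk, remaining, budget)
    if take > 0:
        batch.append((req_id, take))
    return batch, (req_id, remaining - take)

# Requests B and C are decoding; A arrives with a 32,000-token prompt.
batch, pending = pack_iteration(["B", "C"], ("A", 32000))
print(batch)    # [('B', 1), ('C', 1), ('A', 512)]
print(pending)  # ('A', 31488) tokens of prefill still queued
```

Because decode tokens are packed first, admitting A costs B and C nothing; A simply absorbs whatever budget is left each iteration.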
The Pattern
Static batching wasted compute because finished requests held batch slots. Iteration-level scheduling fixed that. High utilization from continuous batching then exhausted KV cache memory because pre-allocation was wasteful. PagedAttention fixed that. Dense batching of active requests then created prefill stalls because long prompts monopolized iterations. Chunked prefill addressed that.
Each solution was motivated by direct measurement, not speculation. Each required a fairly specific change to the forward pass loop or the memory allocator, rather than a general architectural overhaul. The changes compose: production systems today run continuous batching with paged KV cache, chunked prefill, and optionally prefix caching and speculative decoding layered on top.
The current frontier combines speculative decoding with continuous batching. A small draft model generates multiple candidate tokens per step; the large target model verifies them in a single forward pass. Under continuous batching, the verification step processes draft tokens and regular decode tokens together, which requires careful scheduling to avoid memory pressure. vLLM and SGLang both added speculative decoding support in 2024-2025, with speedups of 2-4x on latency-bound workloads where memory bandwidth, not compute, is the constraint.
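The verification control flow is worth seeing in miniature. A greedy-acceptance sketch: keep draft tokens until the first one the target model would not itself have produced, then take the target’s token at that position. (Real systems use rejection sampling to keep the target distribution exact; this shows only the control flow.)

```python
def verify_draft(draft_tokens, target_argmax):
    """Greedy acceptance: draft tokens are kept while they match what the
    target model produced at the same positions in ONE verification pass."""
    accepted = []
    for drafted, target in zip(draft_tokens, target_argmax):
        if drafted == target:
            accepted.append(drafted)
        else:
            accepted.append(target)   # target's correction ends the run
            break
    return accepted

# The target verifies 4 draft tokens in one forward pass; the third differs.
out = verify_draft([11, 42, 7, 99], [11, 42, 8, 55])
print(out)   # [11, 42, 8] -> three tokens advanced for one target pass
```

Whenever several draft tokens are accepted, the target model amortizes one weight read over multiple output tokens, which is precisely what helps when memory bandwidth is the binding constraint.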
Understanding what the batch is, how the scheduler interacts with the forward pass, and where memory goes during inference gives you the ground truth to reason about each of these problems independently. The HuggingFace first-principles article is a good entry point. The chain of problems and solutions it eventually connects to is where the interesting engineering lives.