
From Padding to PagedAttention: The Research Arc Behind Continuous Batching

Source: huggingface

The Hugging Face blog post on continuous batching, originally published in November 2025, builds up the concept carefully from attention fundamentals through KV caching, chunked prefill, and ragged batching. It is a good reference for understanding the mechanics. What it does not cover is how the research community arrived at these techniques, and what problems remained after each solution was adopted. That progression matters because the pattern repeats: remove one bottleneck, observe the next, repeat.

Static Batching and What It Wasted

Early LLM serving borrowed the batching model from classical ML inference. Collect N requests, process them together as a fixed batch, and return results when the entire batch finishes. The problem is fundamental to autoregressive generation: output length is variable. Request A finishes after 10 tokens; request B needs 200. In static batching, the GPU idles for the slot occupied by A, blocked until B and every other sequence in the batch complete.
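The waste is easy to quantify. A toy calculation, with invented output lengths, shows how much of the batch's token-slot budget static batching burns on finished sequences:

```python
# Hypothetical illustration: decode-slot utilization under static batching.
# The output lengths below are invented for the example.
output_lens = [10, 200, 45, 200, 80]       # tokens each request generates
batch_steps = max(output_lens)             # batch runs until the longest finishes
used = sum(output_lens)                    # slots doing useful work
total = batch_steps * len(output_lens)     # slots the GPU actually executes
print(f"utilization: {used}/{total} = {used / total:.1%}")
```

Here barely half the executed slots produce tokens; the rest are padding for sequences that already emitted their final token.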

The obvious fix is to admit new sequences as old ones finish. The implementation challenge is that doing this correctly requires rethinking how the transformer sees its inputs.

Orca: Iteration-Level Scheduling

The 2022 OSDI paper “Orca: A Distributed Serving System for Transformer-Based Generative Models” by Yu et al. formalized this approach. Rather than scheduling at the request level, Orca schedules at the iteration level: after every forward pass (one new token generated per sequence), the scheduler can retire finished sequences and immediately admit new ones from the queue.

This required a new execution model. Standard batched transformer inference assumes all sequences in a batch have uniform shape. Iteration-level scheduling breaks that assumption, since sequences are at different positions with different KV cache states. Orca’s solution was selective batching: operations that are independent across sequences, specifically feed-forward layers and projection matrices, batch normally across all active tokens concatenated into a single tensor; attention, which must reference each sequence’s own cached keys and values, is computed per-sequence.
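The scheduling side of this can be sketched in a few lines. This is a minimal toy model, not Orca's actual implementation: `step` stands in for a real forward pass, and the `Request` class and its fields are invented for the example.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    max_new_tokens: int
    generated: int = 0

def step(active: list, queue: deque, capacity: int) -> list:
    """One scheduler iteration: advance every active sequence by one
    token, retire finished sequences, admit new ones immediately."""
    for r in active:
        r.generated += 1                    # stand-in for the real forward pass
    finished = [r for r in active if r.generated >= r.max_new_tokens]
    active[:] = [r for r in active if r.generated < r.max_new_tokens]
    while queue and len(active) < capacity: # no waiting for the whole batch
        active.append(queue.popleft())
    return finished

queue = deque(Request(i, n) for i, n in enumerate([3, 1, 2, 4]))
active, done = [], []
while queue and len(active) < 2:            # seed the initial batch
    active.append(queue.popleft())
while active:
    done += step(active, queue, capacity=2)
print([r.rid for r in done])                # -> [1, 0, 2, 3]
```

Request 1 (one token) retires after the first iteration and its slot is refilled on the very same step, which is exactly the straggler wait that request-level scheduling cannot avoid.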

This concatenated-token execution model, where the batch dimension is a flat list of tokens rather than a uniform grid of sequences, is what every modern serving system uses today. The throughput results were 36.9x over prior static-batching baselines on OPT-30B class models, with near-linear scaling across GPUs. Solving the scheduling problem well revealed the next constraint.

vLLM and the Memory Fragmentation Problem

Even with iteration-level scheduling, the number of sequences you can run concurrently is bounded by GPU memory. The KV cache is expensive. For a transformer with L layers, H attention heads, and head dimension d, storing the key and value projections for every token requires 2 * L * H * d * sizeof(dtype) bytes per token. For Llama-2-7B in float16 (32 layers, 32 heads, head dimension 128), that works out to 512 KB per token. A single 4096-token sequence consumes 2 GB of KV cache, on a GPU already holding 14 GB of model weights.
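The formula is worth checking with concrete numbers (Llama-2-7B's dimensions are public; the helper function here is just for illustration):

```python
def kv_bytes_per_token(layers: int, heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """KV cache footprint per token: keys + values (factor of 2),
    across all layers and heads, at the given dtype width."""
    return 2 * layers * heads * head_dim * dtype_bytes

# Llama-2-7B: 32 layers, 32 heads, head dimension 128, float16
per_tok = kv_bytes_per_token(layers=32, heads=32, head_dim=128)
print(per_tok // 1024, "KB per token")                 # 512 KB
print(per_tok * 4096 / 2**30, "GiB for a 4096-token sequence")
```

At 512 KB per token, a single full-context sequence claims 2 GiB, which is why a 24 GB card holding 14 GB of weights supports only a handful of concurrent long sequences without careful cache management.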

Traditional systems pre-allocated contiguous memory for the maximum possible output length per sequence. A request expected to produce up to 2048 tokens gets 1 GB reserved immediately, regardless of how many tokens it actually generates. Actual memory utilization ran around 20-40%, with the remainder wasted on over-allocation and fragmentation between sequences.

vLLM, presented at SOSP 2023, addressed this with PagedAttention. The KV cache is divided into fixed-size blocks, 16 tokens by default, and sequences are assigned blocks on demand as generation proceeds. A block table per sequence maps logical block indices to physical GPU memory locations, analogous to a page table in an OS virtual memory system. Sequences never need contiguous allocation; they accumulate blocks as needed and release them when finished.
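The block-table mechanism can be sketched as a toy allocator. This is a simplification of what vLLM's block manager does (no reference counting, no copy-on-write); the class and method names are invented:

```python
BLOCK = 16  # tokens per KV cache block (vLLM's default)

class BlockAllocator:
    """Toy paged KV allocator: a free list of physical block ids plus a
    per-sequence block table mapping logical -> physical blocks."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.tables = {}                     # seq_id -> [physical block ids]

    def append_token(self, seq_id: str, pos: int) -> int:
        table = self.tables.setdefault(seq_id, [])
        if pos % BLOCK == 0:                 # crossed into a new logical block
            table.append(self.free.pop())    # allocate on demand, any location
        return table[pos // BLOCK]           # physical block holding this token

    def release(self, seq_id: str) -> None:
        self.free.extend(self.tables.pop(seq_id))  # blocks return to the pool

alloc = BlockAllocator(num_blocks=8)
for pos in range(40):                        # 40 tokens -> ceil(40/16) = 3 blocks
    alloc.append_token("seq-A", pos)
print(len(alloc.tables["seq-A"]))            # -> 3
alloc.release("seq-A")
print(len(alloc.free))                       # -> 8, nothing leaked
```

The key property is that a 40-token sequence holds exactly three blocks, not a 2048-token reservation, and the blocks it holds need not be adjacent in physical memory.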

Memory utilization rose to approximately 96%. Copy-on-write block sharing enables efficient beam search and parallel sampling, where multiple candidate sequences share a common prompt prefix. Sequences can be preempted by evicting KV cache blocks to CPU RAM and resuming later, or by recomputing the prompt from scratch for short sequences where recomputation is faster than the bandwidth cost of swapping. vLLM achieved 2-4x higher throughput than Orca-class systems at equivalent latency targets, and up to 24x higher request throughput than HuggingFace Transformers with static batching.

Hugging Face’s Text Generation Inference takes a similar paged approach, built with a Rust scheduler communicating over gRPC with a Python/CUDA model backend. It uses Flash Attention’s variable-length mode to handle ragged batches without padding, and is competitive with vLLM on most workloads while integrating more tightly with the HuggingFace model hub. NVIDIA’s TensorRT-LLM implements the same ideas under the name “in-flight batching,” with custom CUDA kernels that typically achieve 10-20% higher raw throughput on NVIDIA hardware, at the cost of reduced flexibility for new model architectures.

With memory fragmentation addressed, a more subtle problem became visible.

Prefill-Decode Interference

Continuous batching mixes two fundamentally different workloads in every batch. Prefill processes a prompt: N tokens in parallel, highly compute-intensive, with arithmetic intensity matching a matrix-matrix multiply. Decode generates tokens: one token per sequence per step, memory-bandwidth-bound, essentially a matrix-vector multiply per sequence.
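A back-of-the-envelope roofline estimate makes the contrast concrete. This ignores caches, kernel fusion, and attention itself, counting only a single [n, d] @ [d, d] projection; the function and numbers are illustrative:

```python
def intensity(n: int, d: int, dtype_bytes: int = 2) -> float:
    """Arithmetic intensity (FLOPs per byte) of an [n, d] @ [d, d] matmul:
    a crude model of one projection layer at batch-of-tokens size n."""
    flops = 2 * n * d * d                                # multiply-accumulate
    bytes_moved = (n * d + d * d + n * d) * dtype_bytes  # input + weights + output
    return flops / bytes_moved

d = 4096
print(f"prefill (n=4096): ~{intensity(4096, d):.0f} FLOPs/byte")  # compute-bound
print(f"decode  (n=1):    ~{intensity(1, d):.1f} FLOPs/byte")     # bandwidth-bound
```

Prefill lands in the thousands of FLOPs per byte, well above the roofline knee of a modern GPU; decode sits near one FLOP per byte, so every weight read buys almost no compute. No single batch composition serves both well.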

When a long prefill request joins a batch of decoding sequences, it dominates the step’s compute time. A 4096-token prefill might extend a forward pass from 20ms to 400ms, stalling every decoding sequence in the batch for that entire duration. Time-per-output-token for ongoing requests spikes unpredictably whenever a large prefill is admitted.

The standard fix, now built into vLLM (v0.3+) and TGI, is chunked prefill: break long prompts into chunks of K tokens and process one chunk per step, interleaved with decode steps. Decode latency impact becomes bounded rather than unbounded. The Sarathi-Serve paper from Microsoft Research (2024) showed that choosing the optimal chunk size improves the throughput-latency Pareto frontier by 1.33-2.6x over standard continuous batching, depending on model and hardware configuration.
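The chunking itself is trivial; the hard part, which Sarathi-Serve studies, is picking K. A minimal sketch (the function name and default chunk size are illustrative, not any system's actual API):

```python
def chunk_prefill(prompt_len: int, chunk: int = 512) -> list[int]:
    """Split a prompt into fixed-size chunks, one chunk per scheduler
    step, so decode steps can run in between. Chunk size is the knob:
    smaller bounds decode latency tighter, larger amortizes overhead."""
    return [min(chunk, prompt_len - start)
            for start in range(0, prompt_len, chunk)]

print(chunk_prefill(4096))   # -> eight 512-token chunks
print(chunk_prefill(1300))   # -> [512, 512, 276]
```

A 4096-token prompt now occupies eight short steps instead of one 400ms step, so no decoding sequence ever stalls for more than roughly one chunk's worth of compute.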

Microsoft’s DeepSpeed-FastGen generalizes this with Dynamic SplitFuse: rather than chunking prefill only, it fuses prefill chunks and decode tokens into batches targeting a fixed total token count. Every batch targets approximately the same compute cost, smoothing GPU utilization and tail latency rather than just bounding the worst case.
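The budget-filling idea can be sketched as a batch composer. This is a simplified model of the SplitFuse policy, not DeepSpeed-FastGen's code: real implementations also carry leftover prefill across steps and weigh prefill and decode tokens differently. All names here are invented:

```python
def compose_batch(decode_seqs: list[str],
                  prefill_queue: list[tuple[str, int]],
                  budget: int = 512) -> list[tuple[str, str, int]]:
    """Fill a fixed token budget: one decode token per running sequence
    first, then as much pending prefill as still fits."""
    batch = [("decode", sid, 1) for sid in decode_seqs]
    remaining = budget - len(batch)
    for rid, pending in prefill_queue:
        if remaining <= 0:
            break
        take = min(pending, remaining)       # split the prompt if needed
        batch.append(("prefill", rid, take))
        remaining -= take
    return batch

batch = compose_batch(decode_seqs=["a", "b", "c"],
                      prefill_queue=[("p1", 700), ("p2", 100)],
                      budget=512)
print(batch)  # 3 decode tokens, then 509 tokens of p1; p2 waits its turn
```

Every step executes exactly `budget` tokens (here 512), so step time stays roughly constant whether the work is mostly prefill or mostly decode, which is the utilization-smoothing property the section describes.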

The Frontier: Disaggregation

Chunked prefill reduces interference but does not eliminate it; prefill and decode still compete for the same hardware. The architectural response, now at the frontier of LLM serving research, is running prefill and decode on entirely separate GPU pools.

Prefill instances handle prompt processing in large compute-intensive batches optimized for high parallelism. Decode instances maintain running sequences in small batches optimized for memory-bandwidth efficiency. KV cache blocks transfer between pools via NVLink or RDMA after prefill completes.

Splitwise (Microsoft Research, ISCA 2024) reported 1.4x higher throughput with this approach and up to 4x reduction in P99 time-to-first-token. DistServe (Peking University, 2024) achieved 2x higher goodput, measured as the fraction of requests meeting their latency SLOs, compared to mixed serving with vLLM. Both systems also enable independent scaling of prefill and decode capacity, which matters when workload shape shifts, for instance when average prompt length grows without a corresponding increase in output length.

SGLang’s RadixAttention addresses a complementary problem: a radix tree structure for KV cache blocks enables sharing across requests with common prefixes at arbitrary granularity, reducing cache memory requirements by 30-70% for workloads with shared system prompts or few-shot examples. This is more general than vLLM’s prefix caching, which operates at request boundaries rather than at the token level.
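The core idea, shared-prefix detection at token granularity, can be shown with a plain trie. RadixAttention proper uses a compressed radix tree with eviction and maps nodes to KV cache blocks; this toy version only measures how many leading tokens a new request can reuse:

```python
class TrieNode:
    __slots__ = ("children",)
    def __init__(self):
        self.children: dict = {}

def insert(root: TrieNode, tokens: list) -> int:
    """Insert a token sequence into the trie; return how many leading
    tokens were already present (the reusable shared-prefix length).
    After the first miss every later lookup lands in a fresh node, so
    the count is exactly the matched prefix."""
    node, shared = root, 0
    for tok in tokens:
        if tok in node.children:
            shared += 1                      # still on a cached path
        else:
            node.children[tok] = TrieNode()  # extend the tree
        node = node.children[tok]
    return shared

root = TrieNode()
sys_prompt = list("You are a helpful assistant. ")
print(insert(root, sys_prompt + list("Translate: hi")))   # -> 0, cold cache
print(insert(root, sys_prompt + list("Summarize: bye")))  # -> 29, the shared prompt
```

With a shared 29-token system prompt, the second request's prefill only needs to compute the tokens after position 29; at production scale, with system prompts of hundreds of tokens, that is where the 30-70% cache savings come from.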

The Pattern

Each solution in this sequence exposed the next problem. Orca’s iteration-level scheduling proved that eliminating straggler waits yielded massive throughput gains, but was constrained by memory fragmentation. PagedAttention freed memory by eliminating over-allocation, but revealed that mixing prefill and decode in the same GPU created internal contention. Chunked prefill bounded that contention, but optimal performance requires complete separation of the two workloads.

This is how systems research tends to move: each optimization shifts the critical path. The Hugging Face first-principles article is a solid starting point for understanding what continuous batching is and how the attention mechanics support it. The research history behind it shows why the field converged on these designs, what tradeoffs each one accepted, and where the open problems are in 2026.
