The scheduling insight that makes modern LLM serving fast was formalized in a 2022 USENIX paper, but its implications continued to be unpacked for years afterward. HuggingFace’s retrospective on continuous batching, originally published in November 2025, walks through the mechanics from scratch. It is a useful occasion to trace how this idea evolved and where the current systems-level work is heading.
What Static Batching Gets Wrong
Traditional inference servers borrowed a batching model from image classification: collect a fixed set of requests, pad them to the same length, run a single forward pass, return results. This works well for workloads where outputs are a fixed, predictable size. Language model generation breaks that assumption immediately.
Autoregressive generation is fundamentally sequential. Each output token depends on all prior tokens. A batch of four sequences, where three finish after 50 tokens and one runs to 500, will keep the GPU busy on that single surviving sequence while the other three slots sit idle. The GPU is nominally processing a full batch but doing a quarter of the compute it could be doing. On realistic request distributions, where output lengths vary by an order of magnitude, GPU utilization under static batching can fall to 10-30% of theoretical throughput.
Padding compounds this. Sequences in a batch must all be the same length, so shorter sequences get padded to match the longest. This wastes both compute on padding tokens and KV cache memory pre-allocated for positions that will never hold real data.
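The waste from variable output lengths is easy to quantify. The sketch below (illustrative, not from the article) counts batch slot-steps that produce real tokens versus slot-steps the GPU spends holding finished sequences:

```python
# Hypothetical illustration of static-batching waste: the batch runs until
# the longest sequence finishes, so shorter sequences occupy idle slots.

def static_batch_utilization(output_lengths):
    """Fraction of batch slot-steps that produce real tokens when the
    whole batch runs until the longest sequence completes."""
    steps = max(output_lengths)           # iterations until the longest request ends
    useful = sum(output_lengths)          # real tokens generated across the batch
    total = steps * len(output_lengths)   # slot-steps the GPU is occupied
    return useful / total

# The example above: three requests finish at 50 tokens, one runs to 500.
print(static_batch_utilization([50, 50, 50, 500]))  # 650 / 2000 = 0.325
```

A skewed tail, where one request runs ten times longer than the rest, drags utilization toward the low end of the 10-30% range cited above.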
Iteration-Level Scheduling
The fix, formalized in the Orca paper (Yu et al., USENIX OSDI 2022), is to reschedule after every single token generation step rather than after every complete request. Orca calls this “iteration-level scheduling.” After generating one token, any sequences that produced an end-of-sequence token are removed from the batch, and new requests from the waiting queue fill the freed slots.
The batch composition changes continuously at every forward pass, rather than being fixed at submission time. Orca demonstrated up to 36.9x higher throughput than static batching systems on GPT-scale models running on A100 hardware.
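The scheduling loop itself is compact. The sketch below is a toy model of iteration-level scheduling, not Orca's implementation; `max_new_tokens` stands in for "until the model emits EOS", and the forward pass is reduced to decrementing a counter:

```python
from collections import deque

class Request:
    def __init__(self, rid, max_new_tokens):
        self.rid = rid
        self.remaining = max_new_tokens  # stand-in for "tokens until EOS"

def continuous_batching(requests, batch_size):
    """Iteration-level scheduling sketch: after every decode step, finished
    sequences leave the batch and queued requests fill the freed slots."""
    waiting = deque(requests)
    running, finished = [], []
    while waiting or running:
        # Refill freed slots before the next forward pass.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode iteration: each running sequence emits one token.
        for req in running:
            req.remaining -= 1
        # Sequences that hit EOS are evicted immediately, not at batch end.
        finished.extend(r for r in running if r.remaining == 0)
        running = [r for r in running if r.remaining > 0]
    return [r.rid for r in finished]

order = continuous_batching(
    [Request("a", 2), Request("b", 1), Request("c", 3)], batch_size=2)
print(order)  # ['b', 'a', 'c'] -- short requests exit early, freeing slots
```

The key difference from static batching is that eviction and admission happen inside the loop, once per token, rather than once per batch.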
One subtlety this immediately surfaces is the distinction between the prefill and decode phases. Prefill processes the entire input prompt in a single forward pass; it is compute-bound and handles all prompt tokens in parallel. Decode generates one token at a time and is memory-bandwidth-bound, with the bottleneck being the cost of reading the KV cache from GPU HBM on each step. These two phases have fundamentally different computational shapes, which makes mixing prefill and decode requests in the same batch non-trivial. Early implementations often ran them sequentially: prefill a new request, then insert it into the decode batch.
The Memory Problem That Came Next
Continuous batching at the scheduling level does not solve the memory problem. You can swap requests in and out dynamically, but if you pre-allocate contiguous KV cache memory per sequence, fragmentation remains severe.
The KV cache holds the attention keys and values for every past token in a sequence, and it is the dominant consumer of GPU HBM during inference. For a 13B-parameter model in float16, that works out to roughly 0.8 MB per token. A sequence running to 4096 tokens needs over 3 GB of KV cache on its own. With static allocation, you must reserve the maximum possible sequence length up front, because you do not know how long a completion will be. This causes both internal fragmentation (reserved memory that goes unused when sequences end early) and external fragmentation (gaps between allocations that prevent large contiguous reservations even when total free memory is sufficient).
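The 0.8 MB figure falls out of the model shape. Assuming the commonly cited LLaMA-13B configuration (40 layers, hidden size 5120, no grouped-query attention), the arithmetic is:

```python
# Back-of-the-envelope KV cache sizing for a 13B model in float16.
# Model shape is an assumption: LLaMA-13B's 40 layers, hidden size 5120.

def kv_bytes_per_token(num_layers, hidden_size, dtype_bytes=2):
    # 2 tensors per layer (K and V), each hidden_size wide, 2 bytes each in fp16.
    return 2 * num_layers * hidden_size * dtype_bytes

per_token = kv_bytes_per_token(num_layers=40, hidden_size=5120)
print(per_token / 2**20)           # ~0.78 MiB per token
print(4096 * per_token / 2**30)    # ~3.1 GiB for one 4096-token sequence
```

Models using grouped-query attention shrink this by the ratio of attention heads to KV heads, which is one reason newer architectures sustain longer contexts in the same HBM budget.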
The vLLM paper (Kwon et al., ACM SOSP 2023) solved this by applying a concept directly from operating systems: virtual memory paging. They call it PagedAttention.
PagedAttention: OS Paging Applied to KV Caches
In OS virtual memory, physical RAM is divided into fixed-size pages. A process’s virtual address space maps to physical pages through a page table, and those pages need not be contiguous in physical memory. This eliminates external fragmentation entirely and bounds internal fragmentation to at most one page per allocation.
PagedAttention applies the same structure to KV cache management. GPU HBM is divided into fixed-size blocks, typically 16 or 32 tokens each. Each sequence’s KV cache is stored across non-contiguous blocks, with a per-sequence block table mapping logical block indices to physical locations in memory. The attention kernel is rewritten to follow this indirection when reading and writing KV entries.
Blocks are allocated on demand as sequences grow, and freed immediately when sequences complete. Any free block serves any sequence. Memory waste is bounded to the last partially-filled block per sequence, which vLLM measured at around 4% total waste, compared to 20-40% waste in systems without paging.
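The bookkeeping reduces to a free list plus a per-sequence block table. The following is a minimal sketch of that structure, not vLLM's code; block size and pool size are illustrative:

```python
# Paged KV allocator sketch: fixed-size blocks, a shared free list, and a
# per-sequence block table mapping logical block index -> physical block.

BLOCK_TOKENS = 16  # tokens per block; vLLM typically uses 16 or 32

class PagedKVAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # any free block serves any sequence
        self.block_tables = {}               # seq_id -> [physical block ids]

    def append_token(self, seq_id, position):
        """On-demand allocation: grab a new physical block only when the
        sequence crosses a block boundary."""
        table = self.block_tables.setdefault(seq_id, [])
        if position % BLOCK_TOKENS == 0:     # first token of a new logical block
            table.append(self.free.pop())
        return table[position // BLOCK_TOKENS]  # physical block for this token

    def release(self, seq_id):
        """Sequence finished: its blocks return to the pool immediately."""
        self.free.extend(self.block_tables.pop(seq_id))

alloc = PagedKVAllocator(num_blocks=4)
for pos in range(17):                # 17 tokens span two 16-token blocks
    alloc.append_token("seq0", pos)
print(alloc.block_tables["seq0"])    # two non-contiguous physical blocks
```

The attention kernel's job is to follow `block_tables[seq_id]` instead of assuming one contiguous buffer; the allocator's job is just this free-list management.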
Fitting more concurrent sequences into available HBM translates directly to higher throughput. vLLM reported up to 24x throughput improvement over HuggingFace Transformers (naive batching) and up to 3.5x improvement over Orca alone, since PagedAttention allows the scheduler to maintain more concurrent requests than contiguous allocation permitted.
Block-based KV cache also enables copy-on-write semantics for beam search and parallel sampling. When a prompt fans out to N completions, the prompt’s KV cache blocks are shared via reference counting across all N branches. Blocks are physically copied only when a decode step writes diverging entries between branches, the same copy-on-write semantics as fork() in Unix processes. This can reduce KV cache memory for parallel sampling by close to N times.
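The copy-on-write discipline can be sketched with nothing more than reference counts. This is an illustrative toy (class and method names are invented; a real system would also copy the block's KV data on divergence):

```python
# Copy-on-write sketch for parallel sampling: branches share the prompt's
# blocks via refcounts; a block is duplicated only on the first diverging write.

class CowBlockPool:
    def __init__(self):
        self.refcount = {}   # physical block id -> number of sequences sharing it
        self.next_id = 0

    def alloc(self):
        bid = self.next_id
        self.next_id += 1
        self.refcount[bid] = 1
        return bid

    def fork(self, block_table):
        """Fan a prompt out to a new branch: share blocks, don't copy them."""
        for bid in block_table:
            self.refcount[bid] += 1
        return list(block_table)

    def write(self, block_table, idx):
        """Before a branch writes into a shared block, give it a private copy."""
        bid = block_table[idx]
        if self.refcount[bid] > 1:           # shared: copy-on-write triggers
            self.refcount[bid] -= 1
            block_table[idx] = self.alloc()  # real systems also memcpy the KV data
        return block_table[idx]

pool = CowBlockPool()
prompt = [pool.alloc(), pool.alloc()]    # two blocks of prompt KV cache
branch = pool.fork(prompt)               # second sample: zero bytes copied
pool.write(branch, 1)                    # branch diverges: only block 1 copied
print(prompt, branch)                    # tables now differ in the last slot
```

Until the first divergence, N parallel samples of a long prompt cost one copy of the prompt's KV cache plus N block tables, which is where the near-N× saving comes from.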
Chunked Prefill and Phase Disaggregation
With iteration-level scheduling and PagedAttention in place, the next bottleneck became visible. Long prompts cause prefill spikes: a single iteration that runs a large prefill, say 8,000 tokens, takes an order of magnitude longer than a normal decode step, adding unpredictable latency to every other in-flight request waiting for the current iteration to finish.
Chunked prefill, explored in the Sarathi-Serve paper (Agrawal et al., 2024) and later added to vLLM as an experimental feature, addresses this by splitting long prefills across multiple decode iterations. Each iteration interleaves a chunk of a new request’s prefill with the ongoing decode steps of existing requests. Sarathi reported up to 4x reduction in P99 decode latency at equivalent throughput levels.
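The scheduling change is simple to state: bound the prompt tokens admitted per iteration. A minimal sketch (chunk size and shapes are illustrative, not Sarathi-Serve's actual policy):

```python
# Chunked prefill sketch: each iteration carries at most `chunk` prompt tokens
# from the pending prefill alongside one decode step per running sequence,
# instead of one giant prefill-only iteration that stalls all decodes.

def schedule_iterations(prompt_len, num_decoding, chunk=512):
    """Yield (prefill_tokens, decode_tokens) per iteration until the new
    request's prompt is fully prefilled."""
    done = 0
    while done < prompt_len:
        step = min(chunk, prompt_len - done)
        done += step
        yield (step, num_decoding)  # decodes keep making progress every step

# An 8,000-token prompt no longer blocks 32 in-flight decodes for one long
# iteration; it spreads across ceil(8000 / 512) = 16 bounded iterations.
iters = list(schedule_iterations(8000, num_decoding=32))
print(len(iters))  # 16
```

Per-iteration latency is now capped by the chunk size rather than by the longest prompt in the queue, which is exactly what drives the P99 decode-latency reduction.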
A more architectural approach is prefill-decode disaggregation, where the two phases run on separate GPU pools. Prefill workloads are compute-bound and benefit from high FLOPs capacity; decode workloads are memory-bandwidth-bound and benefit from high HBM bandwidth. The Splitwise paper (Patel et al., 2024) from Microsoft Research explored this separation at the cluster level, and Moonshot AI (the company behind Kimi) implemented a similar architecture called Mooncake at production scale.
SGLang’s RadixAttention added another layer on top of these scheduling improvements: a radix tree data structure for tracking shared KV cache prefixes across requests, not just within a single request’s beam search. When multiple requests share a common prefix, as they do with shared system prompts or few-shot examples, the KV cache blocks for that prefix are computed once and reused. SGLang reported 5x throughput improvement on multi-turn workloads with shared prefixes.
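The core lookup can be illustrated with a plain trie over token ids (SGLang uses a radix tree with LRU eviction, which this sketch omits): a new request walks the tree to find how much of its prefix already has cached KV, and only the suffix needs to be computed.

```python
# Prefix-reuse sketch in the spirit of RadixAttention: a trie over token ids
# records which prefixes already have KV cache in memory. Toy version; real
# radix trees compress runs of tokens and track eviction.

class PrefixNode:
    def __init__(self):
        self.children = {}

def insert(root, tokens):
    """Record that this token sequence now has cached KV."""
    node = root
    for t in tokens:
        node = node.children.setdefault(t, PrefixNode())

def cached_prefix_len(root, tokens):
    """How many leading tokens of a new request already have cached KV."""
    node, n = root, 0
    for t in tokens:
        if t not in node.children:
            break
        node = node.children[t]
        n += 1
    return n

root = PrefixNode()
insert(root, [1, 2, 3, 4])                    # first request: shared system prompt
print(cached_prefix_len(root, [1, 2, 3, 9]))  # 3 -- second request skips 3 tokens
```

With a long shared system prompt and short per-request suffixes, most of each request's prefill disappears, which is the regime where the reported 5x multi-turn gains come from.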
HuggingFace’s TGI (Text Generation Inference) implements continuous batching with a router process managing the request queue and model shards running on GPU workers. The 4-8x throughput improvements they report on production workloads reflect this full stack of optimizations. NVIDIA’s TRT-LLM calls its implementation “in-flight batching” and ships a comparable block-based KV cache manager, reporting 2-4x improvements on Llama-2-70B on H100 hardware.
The Pattern These Systems Share
What makes continuous batching interesting from a systems programming perspective is how many of its component ideas are direct imports from classical OS and database research. Virtual memory paging, copy-on-write semantics, scheduling at the granularity of individual operations rather than whole jobs, the distinction between latency-bound and throughput-bound workloads: these were well-understood in CPU scheduling and memory management long before transformer models existed.
The contribution of this line of work was recognizing that the autoregressive decode loop, one token per forward pass, provides a natural scheduling boundary, then systematically applying the OS toolkit within that constraint. PagedAttention is paging. RadixAttention is a prefix trie with reference counting. Chunked prefill is time-slicing to prevent long jobs from monopolizing the processor. Prefill-decode disaggregation is heterogeneous scheduling based on resource profile.
The HuggingFace article works through the derivation from first principles, and that is its main value: showing the reasoning chain rather than just presenting the solution. Each optimization in this stack emerged from a clearly identified bottleneck left by the previous one, and each borrowed a known tool from systems design to address it. Reading through the history is more useful than looking only at what current systems do, because the pattern of how these ideas compound is what transfers to the next problem.