
LLM Inference Scheduling Is Just OS Memory Management Again

Source: huggingface

The Hugging Face post on continuous batching, originally published in November 2025, does a solid job explaining the mechanics from scratch. But the more interesting story is not just what continuous batching is. It is why the field needed it, and why the solutions that followed look so much like problems computer scientists solved forty years ago in operating systems.

LLM inference has an unusual structure. Every request goes through two distinct phases. During the prefill phase, the model processes the entire prompt in a single forward pass. All prompt tokens are computed in parallel; this phase is compute-bound and relatively fast per token. During the decode phase, the model generates output tokens one at a time, each requiring its own forward pass. The decode phase is memory-bandwidth-bound, not compute-bound: each step streams the full model weights (and the growing KV cache) from HBM to produce a single token, so most of the GPU's arithmetic capacity sits idle.
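A back-of-envelope roofline calculation makes the compute- vs. bandwidth-bound distinction concrete. The hardware numbers below are illustrative (roughly A100-class), not measurements:

```python
# Back-of-envelope arithmetic intensity for a 7B-parameter model in fp16.
# Hardware numbers are assumed, roughly A100-class, for illustration only.
PARAMS = 7e9
BYTES_PER_PARAM = 2            # float16 weights
PEAK_FLOPS = 312e12            # dense fp16 tensor-core peak
PEAK_BW = 2.0e12               # HBM bandwidth, bytes/s

def arithmetic_intensity(tokens_per_pass):
    # ~2 FLOPs per parameter per token; weights are read once per pass.
    flops = 2 * PARAMS * tokens_per_pass
    bytes_moved = PARAMS * BYTES_PER_PARAM
    return flops / bytes_moved

machine_balance = PEAK_FLOPS / PEAK_BW   # FLOPs the GPU can do per byte moved
for tokens in (1, 64, 1024):             # decode step vs. batched prefill
    ai = arithmetic_intensity(tokens)
    bound = "memory-bound" if ai < machine_balance else "compute-bound"
    print(f"{tokens:5d} tokens/pass: {ai:7.1f} FLOP/byte ({bound})")
```

A single-token decode step sits at about 1 FLOP per byte moved, two orders of magnitude below the machine balance, which is why decode throughput tracks HBM bandwidth rather than FLOPs.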

This two-phase structure creates a scheduling problem that naive batching handles badly.

The Static Batching Trap

In traditional batch inference, you collect a group of requests, pad all their sequences to the same length, and run them through the model together. This works well for classification or embedding tasks, where output length is fixed. For generative tasks, it falls apart for two reasons.

First, padding is expensive. If one request in your batch of eight has a 1000-token prompt and the others have 50-token prompts, you are computing 950 dummy tokens per short sequence. The wasted compute scales with batch size and prompt length variance, which in practice is extreme.

Second, batch blocking kills throughput. A batch cannot release finished sequences until every sequence in the batch has generated its end-of-sequence token. A request that finishes in 20 tokens holds GPU memory and compute resources until the slowest request in the batch finishes, even if that takes 500 tokens. With unpredictable output lengths, the slowest request determines the latency for everyone.

GPU utilization under static batching tends to sit at 30-40% for generative workloads. The GPU is largely idle, waiting.
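A toy accounting of both effects, with made-up prompt and output lengths, shows where the token-slots go:

```python
# Toy accounting of wasted work under static batching: pad every prompt to the
# longest, and hold every slot until the slowest sequence finishes.
# All lengths below are illustrative, not measured.
prompts = [1000, 50, 50, 50, 50, 50, 50, 50]   # prompt tokens per request
outputs = [500, 20, 30, 25, 40, 20, 35, 30]    # generated tokens per request

pad_to = max(prompts)
padding_waste = sum(pad_to - p for p in prompts)    # dummy prompt tokens

run_for = max(outputs)
blocking_waste = sum(run_for - o for o in outputs)  # idle decode slots

useful = sum(prompts) + sum(outputs)
total = len(prompts) * pad_to + len(outputs) * run_for
print(f"padding waste:  {padding_waste} token-slots")
print(f"blocking waste: {blocking_waste} token-slots")
print(f"utilization:    {useful / total:.0%}")
```

With one long outlier in each dimension, fewer than a fifth of the batch's token-slots do useful work, which is consistent with the 30-40% utilization figure above.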

Continuous Batching: Iteration-Level Scheduling

The fix came from a 2022 OSDI paper: Orca: A Distributed Serving System for Transformer-Based Generative Models (Yu et al.). The insight is to schedule at the granularity of a single forward pass rather than a full request lifetime.

Every iteration, the scheduler builds a fresh batch:

Each forward pass:
  1. Include all in-progress decode requests (one new token each)
  2. Fill remaining compute budget with prefill chunks from the waiting queue
  3. Run one forward pass on this combined batch
  4. For any request that emitted <eos>: free its memory, admit a new request
  5. Repeat

There is no padding because sequences of different lengths are concatenated into a single ragged input. Each token knows its position within its own sequence via a position ID array, and a block-sparse attention mask prevents tokens from different sequences from attending to each other. A completed request exits immediately, and a waiting request can enter at the very next iteration. No slot-waiting.
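The admit/decode/free loop above can be simulated in a few lines. This sketch tracks only how many tokens each request still needs to generate, ignoring prefill cost and memory limits; MAX_BATCH is an assumed slot count:

```python
from collections import deque

# Minimal simulation of iteration-level scheduling. Each request is reduced to
# a number of decode steps; real schedulers also budget prefill chunks and KV
# memory. MAX_BATCH is an assumed slot count for illustration.
MAX_BATCH = 4

def continuous_batching(output_lens):
    waiting = deque(enumerate(output_lens))   # (request id, tokens to go)
    running, finish_step, step = {}, {}, 0
    while waiting or running:
        # Admit waiting requests into any free slots before this forward pass.
        while waiting and len(running) < MAX_BATCH:
            rid, todo = waiting.popleft()
            running[rid] = todo
        step += 1                             # one forward pass, one token each
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:             # emitted <eos>: free the slot now
                del running[rid]
                finish_step[rid] = step
    return finish_step

print(continuous_batching([20, 500, 20, 20, 20]))
```

Request 4 starts as soon as the three short requests free their slots at step 20 and finishes at step 40; under static batching it could not even start until the 500-token straggler finished.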

Orca reported a 36.9x throughput improvement over request-level scheduling at the same 99th-percentile latency. Hugging Face’s Text Generation Inference (TGI) implemented the same idea around the same time. The throughput gains are real and reproducible.

But continuous batching exposed the next bottleneck: the KV cache.

The KV Cache Is the Real Bottleneck

During attention, each token produces Key and Value vectors that every subsequent token needs to attend over. Without caching these, each decode step would require recomputing K and V for the entire sequence history, making inference quadratic in sequence length. The KV cache stores these vectors in GPU memory so that each decode step only computes K and V for the new token.

For a model like Llama-2-7B (32 layers, 32 heads, 128 head dimensions, float16), the KV cache consumes:

2 × 32 layers × 32 heads × 128 dims × 2 bytes = 524,288 bytes ≈ 512 KB per token

At a 4096-token context, that is 2 GB of KV cache per sequence. With 16 concurrent sequences, you have consumed 32 GB just for KV state, before accounting for model weights.
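The same arithmetic as a small script, using the Llama-2-7B-style numbers above:

```python
# KV cache footprint per token: 2 (K and V) x layers x heads x head_dim x bytes.
# Defaults match the Llama-2-7B-style config above, in float16.
def kv_bytes_per_token(layers=32, heads=32, head_dim=128, dtype_bytes=2):
    return 2 * layers * heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token()        # 524288 bytes = 512 KB per token
per_seq_gb = per_token * 4096 / 2**30   # full 4096-token context
print(f"{per_token} B/token, {per_seq_gb:.0f} GB/sequence, "
      f"{16 * per_seq_gb:.0f} GB for 16 sequences")
```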

The allocation strategy for this cache matters enormously. TGI and early systems used contiguous pre-allocation: reserve a block of GPU memory large enough for the maximum possible sequence length upfront. This is simple but brutal on memory efficiency. Most of the reserved memory sits unused because you cannot know at request admission how long the output will be. The vLLM paper measured 60-80% KV cache memory waste in contiguous-allocation systems.

PagedAttention: Virtual Memory for the GPU

The vLLM team’s response was PagedAttention (Kwon et al., SOSP 2023). The analogy to OS virtual memory is not superficial.

In OS virtual memory, physical RAM is divided into fixed-size pages. Processes use a virtual address space, and a page table maps virtual pages to physical frames. Physical frames are allocated on demand, not upfront. Non-contiguous physical storage is fine because the page table handles translation.

PagedAttention does the same thing for GPU KV cache:

  • KV cache is divided into fixed-size blocks, each holding K and V vectors for a fixed number of tokens (e.g., 16 tokens per block).
  • A block table maps logical sequence positions to physical GPU memory blocks.
  • Blocks are allocated on demand as tokens are generated.
  • When a sequence completes, its blocks are freed and returned to the pool immediately.
  • Physical blocks need not be contiguous.
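A minimal sketch of the block-table mechanics, assuming 16 tokens per block and a hypothetical free-list allocator (not vLLM's actual implementation):

```python
# Sketch of a paged KV allocator: a free list of fixed-size physical blocks and
# a per-sequence block table mapping logical block index -> physical block id.
# Class and method names are hypothetical, for illustration only.
class PagedKVAllocator:
    def __init__(self, num_blocks, block_tokens=16):
        self.block_tokens = block_tokens
        self.free = list(range(num_blocks))   # pool of physical block ids
        self.tables = {}                      # seq -> list of physical blocks
        self.lengths = {}                     # seq -> tokens stored

    def append_token(self, seq):
        n = self.lengths.get(seq, 0)
        table = self.tables.setdefault(seq, [])
        if n % self.block_tokens == 0:        # last block full: allocate on demand
            table.append(self.free.pop())
        self.lengths[seq] = n + 1

    def free_sequence(self, seq):             # return all blocks to the pool at once
        self.free.extend(self.tables.pop(seq))
        del self.lengths[seq]

alloc = PagedKVAllocator(num_blocks=8)
for _ in range(40):                           # 40 tokens -> ceil(40/16) = 3 blocks
    alloc.append_token("req-0")
print(alloc.tables["req-0"], len(alloc.free))
```

The sequence's three blocks need not be adjacent in physical memory, and the worst-case waste is one partially filled block per sequence, which is where the sub-4% fragmentation figure comes from.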

The result: near-zero fragmentation. The vLLM paper reported internal fragmentation under 4% versus 60-80% in prior systems. vLLM achieved 2-4x higher throughput than HuggingFace Transformers and up to 2.2x higher throughput than TGI, measured on LLaMA-13B with ShareGPT workload traces on A100 GPUs.

The improvement comes almost entirely from being able to fit more concurrent sequences into the same GPU memory budget.

Copy-on-Write: An Old Friend Returns

OS paging brought another trick along: copy-on-write, the mechanism that lets fork() share physical memory between parent and child processes until one of them writes.

Beam search and parallel sampling both generate multiple candidate sequences from a shared prefix. In a contiguous-allocation system, each candidate gets its own complete copy of the KV cache from the shared prompt, even though those copies are identical. With PagedAttention’s block table, multiple sequences can point to the same physical KV blocks for the shared prefix. A copy is only made when the sequences diverge, at the moment a block needs to be written.

This is copy-on-write semantics applied to GPU memory, solving a problem that naive implementations handle with redundant memory copies.
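A refcount-based sketch of the same idea, with hypothetical class and method names:

```python
# Copy-on-write over KV blocks: forked candidates share physical blocks via
# refcounts; a block is copied only when written while shared. Names are
# illustrative, not any serving system's actual API.
class CowBlocks:
    def __init__(self):
        self.next_id = 0
        self.refs = {}                        # physical block id -> refcount

    def alloc(self):
        bid, self.next_id = self.next_id, self.next_id + 1
        self.refs[bid] = 1
        return bid

    def fork(self, table):                    # share the prefix: bump refcounts
        for bid in table:
            self.refs[bid] += 1
        return list(table)

    def write(self, table, i):                # copy block i only if it is shared
        if self.refs[table[i]] > 1:
            self.refs[table[i]] -= 1
            table[i] = self.alloc()
        return table[i]

pool = CowBlocks()
parent = [pool.alloc(), pool.alloc()]         # two-block shared prompt prefix
child = pool.fork(parent)                     # e.g. a second beam candidate
pool.write(child, 1)                          # diverge: child copies block 1 only
print(parent, child)
```

Both candidates still share block 0; only the block that was actually written got duplicated, exactly as `fork()` shares pages until the first write.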

RadixAttention: Shared Libraries for Prompts

The most recent evolution takes the analogy further. SGLang’s RadixAttention (Zheng et al., 2024) generalizes KV cache sharing beyond a single request’s beam search candidates.

Many workloads share long common prefixes: a system prompt used by thousands of users, a document that multiple queries reference, a chain-of-thought template applied to many examples. Standard continuous batching with PagedAttention still recomputes and allocates fresh KV cache for each request’s copy of the shared prefix.

RadixAttention maintains a radix tree over the KV cache. Every stored prefix is a node in the tree. When a new request arrives, the system walks the tree to find the longest matching prefix and reuses those physical blocks directly, allocating only for the divergent suffix. This is conceptually similar to how shared libraries work: multiple processes map the same physical pages for a shared .so file, and private memory starts only at the point of divergence.

For chat workloads where a system prompt might consume 2000 tokens, RadixAttention can eliminate the prefill cost entirely for the shared portion on cache hits.
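The longest-prefix lookup can be sketched with a plain token trie; a real radix tree compresses runs of tokens into single edges, but the matching logic is the same:

```python
# Sketch of prefix reuse: store served token sequences in a trie and, for a new
# request, reuse the longest cached prefix; only the suffix needs prefill.
# A real radix tree also tracks KV block handles and an eviction policy.
class PrefixCache:
    def __init__(self):
        self.root = {}                        # token id -> child node (a dict)

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def match(self, tokens):                  # length of longest cached prefix
        node, n = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node, n = node[t], n + 1
        return n

cache = PrefixCache()
system_prompt = list(range(2000))             # stand-in for a 2000-token prompt
cache.insert(system_prompt)
request = system_prompt + [9001, 9002, 9003]  # same system prompt, new user turn
hit = cache.match(request)
print(f"reused {hit} tokens, prefill only {len(request) - hit}")
```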

Chunked Prefill and the Remaining Friction

One problem that PagedAttention does not fully solve is the prefill stall. A long prompt, say 10,000 tokens, monopolizes the GPU for one complete forward pass. During that pass, all decode-phase requests are blocked. Latency spikes for everyone sharing the serving instance.

Chunked prefill addresses this by splitting large prompts into chunks of m tokens, processing one chunk per iteration alongside ongoing decode requests. This trades a higher time-to-first-token for the chunked request against lower latency variance for all other users. Most production systems make chunk size configurable, with values around 512-2048 tokens being common.
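Splitting the 10,000-token example into assumed 512-token chunks:

```python
# Chunked prefill: one chunk of the long prompt is processed per iteration,
# so ongoing decode requests get a token out of every forward pass instead of
# stalling behind one monolithic 10,000-token prefill. Chunk size is assumed.
def prefill_schedule(prompt_len, chunk=512):
    return [min(chunk, prompt_len - i) for i in range(0, prompt_len, chunk)]

chunks = prefill_schedule(10_000)
print(len(chunks), chunks[0], chunks[-1])   # 20 iterations; last chunk is partial
```

The long request now needs 20 iterations to reach its first token instead of one, but every co-scheduled decode request advances on each of those 20 passes.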

Speculative decoding sits at the other end of the pipeline: a small draft model generates k candidate tokens in parallel, the large target model verifies all k in one forward pass, and accepted tokens are emitted. This improves the compute utilization of the memory-bandwidth-bound decode phase by amortizing each forward pass across more output tokens. Combining speculative decoding with continuous batching and PagedAttention is where current research is active.
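A toy version of the draft-and-verify loop under greedy acceptance, with random stand-ins for both models:

```python
import random

# Toy speculative decoding step: a draft model proposes k tokens, the target
# model "verifies" them in one pass; tokens up to the first mismatch are kept,
# plus one token from the target itself. Both models here are random stand-ins
# over a 4-token vocabulary, purely to show the control flow.
random.seed(0)
def draft(prefix, k):  return [random.randrange(4) for _ in range(k)]
def target(prefix, k): return [random.randrange(4) for _ in range(k + 1)]

def speculative_step(prefix, k=4):
    guess = draft(prefix, k)
    truth = target(prefix, k)               # k verifications + 1 bonus token
    accepted = []
    for g, t in zip(guess, truth):
        if g != t:                          # first mismatch ends acceptance
            break
        accepted.append(g)
    accepted.append(truth[len(accepted)])   # target's token at the mismatch
    return accepted                         # 1..k+1 tokens per target pass

print(speculative_step([1, 2, 3]))
```

Each target forward pass now emits between 1 and k+1 tokens instead of exactly one, which is what raises the arithmetic intensity of the bandwidth-bound decode phase.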

What the Pattern Reveals

The trajectory from Orca through PagedAttention to RadixAttention follows the same progression OS memory management followed from the 1960s through the 1980s: fixed partitions to dynamic allocation to paging to copy-on-write to shared libraries. The constraints are different (HBM bandwidth instead of disk swap, attention masks instead of MMU hardware), but the shape of the problem is nearly identical.

This suggests something useful about where inference optimization goes next. The OS path after shared libraries included demand paging, memory-mapped files, huge pages, and NUMA-aware allocation. The LLM serving analog of demand paging, where KV cache blocks are evicted to CPU memory under pressure and restored on demand, already exists in some systems. NUMA-aware KV cache placement across multi-GPU configurations is an active area.

The memory management intuition is a productive frame for this work. The algorithms are different in detail, but the underlying reasoning transfers.
