KV Cache Fragmentation and the Virtual Memory Solution That PagedAttention Brought
Source: huggingface
Continuous batching, described well in the HuggingFace post from November 2025, improves GPU utilization by re-evaluating batch composition at every forward pass instead of holding a batch together until its slowest request finishes. The Orca paper (OSDI 2022) first formalized this as iteration-level scheduling and reported up to a 36.9x throughput improvement over static batching baselines. Most explanations stop at the scheduling layer, but continuous batching exposes a second problem that it cannot itself resolve: dynamic KV cache allocation at high concurrency causes severe memory fragmentation, and addressing that problem is where the more interesting engineering lives.
The Fragmentation Problem Continuous Batching Exposes
KV cache memory is the binding resource in LLM serving. For Llama-2-7B, storing the key and value projections across all layers consumes approximately 512 KB per token in float16: 2 × 32 layers × 32 heads × 128 head dim × 2 bytes. A 4096-token active sequence therefore occupies roughly 2 GB of KV cache. On an A100 with 80 GB total, after the 14 GB of float16 model weights plus runtime overhead, you have roughly 60 GB available for KV cache, which is enough for a few dozen simultaneous long-context sequences in theory.
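The sizing arithmetic can be checked in a few lines. This is back-of-envelope only; the 6 GB figure for activations and runtime overhead is an assumption, not a measured value.

```python
# Back-of-envelope KV cache sizing for Llama-2-7B in float16,
# using the architecture constants cited above.
layers, kv_heads, head_dim, dtype_bytes = 32, 32, 128, 2

# K and V each store (heads * head_dim) values per layer per token.
bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
print(bytes_per_token // 1024, "KB per token")            # 512 KB

seq_bytes = 4096 * bytes_per_token
print(seq_bytes / 2**30, "GiB per 4096-token sequence")   # 2.0 GiB

# 80 GB A100, 14 GB fp16 weights, ~6 GB assumed activations/overhead.
kv_pool_bytes = (80 - 14 - 6) * 2**30
print(kv_pool_bytes // seq_bytes, "concurrent 4096-token sequences")  # 30
```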
Under naive contiguous allocation, the server must reserve the maximum possible sequence length at request admission, because output length is not known in advance. A server configured for a 4096-token ceiling reserves 4096 token-slots per request regardless of how many tokens the response actually uses. A request that generates 50 output tokens after a short 100-token prompt occupies roughly 150 slots of actual content and leaves 3946 reserved but unused. The vLLM paper (Kwon et al., SOSP 2023) measured the real-world consequence: 60-80% of KV cache memory wasted under contiguous allocation in production workloads.
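The waste from max-length reservation is easy to quantify with the numbers from the example above:

```python
# Waste under contiguous max-length reservation (numbers from the text).
max_len = 4096             # per-request reservation ceiling
prompt, output = 100, 50   # short request from the example

used = prompt + output
reserved_unused = max_len - used
print(used, "slots used,", reserved_unused, "reserved but unused")
print(f"{reserved_unused / max_len:.1%} of the reservation wasted")  # 96.3%
```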
Continuous batching sharpens this problem rather than alleviating it. Static batching keeps batch composition fixed, so allocations and deallocations happen in synchronized waves. Continuous batching allows requests to enter and exit at arbitrary times, which means allocations and deallocations occur asynchronously across the full KV cache pool. The physical memory fragments exactly as a heap allocator fragments when allocations of varying sizes are interleaved with frees, without compaction.
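A toy first-fit allocator makes the external fragmentation concrete. This is an illustrative sketch of the heap-allocator analogy, not vLLM's actual memory manager: three requests reserve contiguous slot ranges, two finish, and a new request fails despite sufficient total free space.

```python
# Toy first-fit contiguous allocator over KV cache token-slots.
def first_fit(free_list, size):
    """Return (start, new_free_list), or (None, free_list) if no hole fits."""
    for i, (start, length) in enumerate(free_list):
        if length >= size:
            rest = [(start + size, length - size)] if length > size else []
            return start, free_list[:i] + rest + free_list[i + 1:]
    return None, free_list

free = [(0, 6144)]                    # one 6144-slot contiguous KV region
a, free = first_fit(free, 2048)       # request A reserves 2048 slots
b, free = first_fit(free, 2048)       # request B reserves 2048 slots
c, free = first_fit(free, 2048)       # request C reserves 2048 slots

# A and C complete asynchronously; B is still decoding in the middle.
free = sorted(free + [(a, 2048), (c, 2048)])

total_free = sum(length for _, length in free)   # 4096 slots free overall...
d, _ = first_fit(free, 4096)                     # ...but no contiguous 4096-slot hole
print(total_free, d)                             # 4096 None
```

With page-granularity allocation, the same 4096 free slots would be usable regardless of where B's blocks sit.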
PagedAttention: Virtual Memory for the KV Cache
The solution vLLM introduced is PagedAttention, which applies the virtual memory abstraction from operating systems to KV cache management. The OS virtual memory analogy maps precisely onto this problem. Physical RAM is divided into fixed-size pages. Each process has a virtual address space backed by a page table that maps virtual pages to physical frames. Allocation happens at page granularity. Fragmentation is bounded because the worst-case waste per allocation is one frame minus one byte.
PagedAttention works identically. Physical GPU KV cache memory is divided into fixed-size blocks, each holding 16 tokens of KV data. Each active sequence has a logical block table mapping to physical blocks, allocated one at a time as the sequence grows. When a sequence completes, its physical blocks are freed immediately, without any batch-level coordination.
Sequence of 7 tokens (block_size=4):
Logical block 0 -> Physical block 7 (tokens 0-3, full)
Logical block 1 -> Physical block 2 (tokens 4-6, partial)
Physical block layout after this sequence completes:
Block 0: [free]
Block 1: [free]
Block 2: [free] <- returned immediately
...
Block 7: [free] <- returned immediately
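The bookkeeping in the diagram can be sketched in a few lines. Class and method names here are illustrative, not vLLM's actual API; block_size=4 matches the diagram rather than the production default of 16.

```python
# Minimal sketch of PagedAttention-style block-table bookkeeping.
BLOCK_SIZE = 4

class BlockManager:
    def __init__(self, num_physical_blocks):
        self.free_blocks = list(range(num_physical_blocks))
        self.tables = {}    # seq_id -> list of physical block ids
        self.lengths = {}   # seq_id -> tokens written so far

    def append_token(self, seq_id):
        """Grow a sequence by one token, allocating a block on a boundary."""
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:   # last block is full (or this is token 0)
            self.tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = n + 1

    def free(self, seq_id):
        """Sequence finished: return its physical blocks immediately."""
        self.free_blocks.extend(self.tables.pop(seq_id))
        del self.lengths[seq_id]

mgr = BlockManager(num_physical_blocks=8)
for _ in range(7):                 # the 7-token sequence from the diagram
    mgr.append_token("seq0")
print(mgr.tables["seq0"])          # two physical blocks: one full, one partial
mgr.free("seq0")
print(len(mgr.free_blocks))        # all 8 blocks free again
```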
Internal fragmentation, the only remaining source of waste, occurs only in the last physical block of each sequence, where the final few token-slots may go unused. vLLM measured fragmentation below 4% under this scheme, compared to 60-80% under contiguous allocation. The block table does add an indirection layer: every attention computation must traverse the table to locate physical block addresses rather than computing offsets into a contiguous buffer. This requires custom attention kernels, which vLLM ships alongside the allocator. The overhead is small relative to the memory efficiency gains.
Copy-on-Write Prefix Sharing
The OS analogy extends into copy-on-write semantics. When an OS forks a process, the child initially shares all of the parent’s physical memory pages read-only. A physical copy is made only when either party writes to a shared page. This keeps fork cheap when the child immediately calls exec without modifying much state.
PagedAttention supports the same mechanism for shared KV cache prefixes. Two requests with an identical system prompt can reference the same physical KV blocks for that prefix, read-only. The blocks are only copied when the sequences diverge.
System prompt: 500 tokens, occupying 32 physical blocks
Sequence A (system prompt + user query A):
Blocks 0-31 -> Physical blocks 100-131 (system prompt, shared)
Block 32 -> Physical block 150 (A's unique content)
Sequence B (same system prompt + user query B):
Blocks 0-31 -> Physical blocks 100-131 (same physical blocks)
Block 32 -> Physical block 151 (B's unique content)
In a chatbot deployment where every conversation starts with a 500-token system prompt, those 500 tokens exist in physical KV cache memory exactly once regardless of how many concurrent conversations are active. At roughly 512 KB per token (the Llama-2-7B figure above), 500 tokens cost about 250 MB of KV cache; without sharing, 100 concurrent users would collectively hold roughly 25 GB of identical data. The savings compound as system prompts grow longer, which they do in agentic and RAG deployments where the context includes retrieved documents or tool definitions.
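The copy-on-write mechanics reduce to reference counting per physical block. The sketch below uses illustrative names, not vLLM's internal API: forking increments a refcount, and a write to a shared block triggers a copy while an exclusively owned block is written in place.

```python
# Copy-on-write sketch for shared prefix blocks.
class CowPool:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def alloc(self):
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        self.refcount[block] += 1   # fork: no copy yet
        return block

    def write(self, block):
        """Return a block safe to mutate, copying if it is shared."""
        if self.refcount[block] == 1:
            return block            # exclusive owner: write in place
        self.refcount[block] -= 1   # shared: break the link
        return self.alloc()         # caller copies the block's KV payload

pool = CowPool(num_blocks=8)
prefix = [pool.alloc() for _ in range(2)]     # system-prompt blocks
seq_a = [pool.share(b) for b in prefix]       # A references them read-only
seq_b = [pool.share(b) for b in prefix]       # so does B
new_b = pool.write(seq_b[-1])                 # B diverges in its last block
print(new_b != seq_b[-1], pool.refcount[seq_b[-1]])   # True 2
```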
SGLang’s RadixAttention (Zheng et al., 2024) generalizes this from a single designated prefix to a tree of cached prefixes maintained across requests. Agentic workloads often involve many calls that branch from overlapping histories: multi-turn conversations, a common tool-call preamble, or a recurring few-shot example block shared by a family of requests. A radix tree over token sequences records every cached prefix and evicts cold branches with an LRU policy, so each new request reuses the KV cache for its longest matching cached prefix rather than only a statically configured system prompt, which can substantially reduce KV cache pressure in these workloads.
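The matching logic can be sketched with a plain trie over token ids; this is a simplification in the spirit of RadixAttention (no path compression, no LRU eviction, hypothetical names), showing only how a new request finds its longest cached prefix.

```python
# Simplified prefix-reuse lookup: a trie over token ids records which
# prefixes already have resident KV cache.
class PrefixTree:
    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def longest_cached_prefix(self, tokens):
        """Number of leading tokens whose KV cache is already resident."""
        node, n = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node, n = node[t], n + 1
        return n

tree = PrefixTree()
tree.insert([1, 2, 3, 4, 5])   # cached: system prompt + turn 1
tree.insert([1, 2, 3, 9, 9])   # cached: same prompt, a different turn
print(tree.longest_cached_prefix([1, 2, 3, 9, 7]))   # 4 tokens reusable
```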
Where the Architecture Is Heading
PagedAttention and prefix sharing bring KV cache memory efficiency close to optimal for single-machine serving. The remaining structural inefficiency requires a different kind of solution.
Prefill and decode phases have fundamentally different hardware profiles. Prefill processes the entire prompt in a single pass and is compute-bound: the attention and projection matmuls over the full input sequence saturate the GPU's arithmetic units. Decode generates one token per step and is memory-bandwidth-bound: each step streams the model weights and the accumulated KV cache from HBM to produce a single output token, so its arithmetic intensity falls far below the ratio needed to reach peak TFLOPS. Running both phases in a mixed continuous batch means neither reaches optimal utilization, because the hardware is being asked to do two incompatible things at once.
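A rough roofline comparison shows how far apart the two phases sit. These are illustrative back-of-envelope numbers for a 7B model on an A100 (312 TFLOPS fp16, ~2 TB/s HBM), not measured values, and they ignore the KV cache read for simplicity.

```python
# Arithmetic intensity (FLOPs per byte of HBM traffic), per phase.
params = 7e9
weight_bytes = params * 2                 # float16 weights

# Decode: one token per step must stream all weights from HBM.
flops_per_token = 2 * params              # ~2 FLOPs per parameter per token
decode_intensity = flops_per_token / weight_bytes             # ~1 FLOP/byte

# Prefill: a 2048-token prompt reuses each weight across all positions.
prefill_intensity = (2048 * flops_per_token) / weight_bytes   # ~2048 FLOPs/byte

# A100 ridge point: peak compute / peak memory bandwidth.
ridge = 312e12 / 2.0e12                   # ~156 FLOPs/byte

print(decode_intensity, prefill_intensity, ridge)
# Decode sits far below the ridge (bandwidth-bound);
# prefill sits far above it (compute-bound).
```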
The research direction is disaggregation: separate GPU pools for prefill and decode, connected by high-bandwidth interconnect to transfer KV cache blocks between stages. DistServe (Zhong et al., UC San Diego, 2024) and Splitwise (Patel et al., Microsoft Research, 2024) both follow this pattern. Prefill machines are provisioned for compute; decode machines are provisioned for memory bandwidth. Each pool scales independently according to traffic distribution. KV cache blocks are transferred via RDMA after the prefill phase completes, making the block-addressed paging scheme a prerequisite rather than an optimization.
This is the logical extension of the PagedAttention model. Treating KV cache as discrete blocks with well-defined addressing made them a transferable unit within a single machine. Disaggregation treats those same blocks as network objects, moving them between specialized machines the same way NUMA-aware allocators move pages between memory domains. The abstraction that solved fragmentation on a single GPU also happens to be the right abstraction for distributing work across a cluster.
Practical Configuration Implications
For anyone deploying vLLM or TGI today, the memory arithmetic above translates into concrete parameter choices. Three settings dominate the serving behavior:
--max-num-batched-tokens: the token budget per forward pass. Too small and prefill throughput degrades; too large and a single long prefill monopolizes a forward pass, spiking decode latency for in-flight requests.

--max-model-len: the maximum admitted sequence length. Reducing this from 32768 to 4096 frees eight times the KV cache per sequence slot, allowing proportionally more concurrent requests at the cost of rejecting long-context inputs.

--gpu-memory-utilization: the fraction of GPU memory the server may claim; the KV cache pool is sized from what remains after weights and activations. Setting this too high leaves insufficient headroom for CUDA kernels, temporary buffers, and anything else running on the device.
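A concrete starting point might look like the following vLLM invocation. The flags are the real vLLM engine arguments discussed above; the model name and specific values are illustrative assumptions for short-output chat traffic on a single 80 GB GPU, not universal recommendations.

```shell
# Example vLLM launch tuned for short-output chat traffic.
vllm serve meta-llama/Llama-2-7b-chat-hf \
  --max-model-len 4096 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.90
```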
The right values depend on your workload’s output-length distribution. A deployment serving short API calls has different optima than one handling long-document summarization. The HuggingFace post builds a solid foundation for understanding why the mechanisms exist; the fragmentation numbers explain why the defaults are set where they are and when changing them is worth the tradeoff.