The core insight of continuous batching, iteration-level scheduling, is conceptually simple and now well-understood. HuggingFace’s article on the topic, originally from November 2025, builds it from first principles. What the conceptual treatment doesn’t cover is that the three major open-source serving systems, vLLM, HuggingFace TGI, and NVIDIA’s TensorRT-LLM, have made different implementation choices around KV cache management that produce meaningfully different production behavior. The differences are substantial enough to affect system selection.
The KV Cache Allocation Problem
Continuous batching requires the scheduler to evict completed sequences and admit new ones after every forward pass. Eviction is straightforward when a sequence finishes cleanly. The harder case is preemption: when memory is tight and a running sequence needs to be temporarily suspended to make room for a higher-priority or newly admitted request. What happens to that sequence’s KV cache determines how expensive preemption is.
Two broad strategies exist. The first allocates KV cache as contiguous tensors, one per sequence, with maximum-length reservations. Eviction means discarding the buffer and recomputing from scratch when the sequence resumes, or blocking admission of new requests until memory frees. The second, block-based or paged allocation, divides the KV cache into fixed-size blocks and maps logical sequence positions to physical block addresses through a block table. Preemption means swapping blocks to CPU RAM; resumption means swapping them back, with no recomputation required.
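The contrast can be sketched in a few lines of Python. Names, and the 16-token block size, are illustrative; this is not either system's actual code:

```python
BLOCK_SIZE = 16  # tokens per physical block (vLLM's default is also 16)

def contiguous_reservation(max_seq_len: int, bytes_per_token: int) -> int:
    """Contiguous strategy: reserve the worst case at admission time."""
    return max_seq_len * bytes_per_token

def logical_to_physical(block_table: list[int], token_pos: int) -> tuple[int, int]:
    """Paged strategy: indirect through the block table per token position."""
    return block_table[token_pos // BLOCK_SIZE], token_pos % BLOCK_SIZE
```

For a sequence whose block table is [7, 2, 9], token position 20 resolves to physical block 2 at offset 4; the contiguous strategy would instead have reserved a max-length buffer up front regardless of how long the sequence actually runs.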
vLLM: PagedAttention and the Block Table
vLLM’s core contribution was PagedAttention (SOSP 2023), a custom CUDA attention kernel that operates on non-contiguous memory. The kernel accepts a block table as an argument, mapping logical token positions to physical blocks in GPU HBM. This makes block-based allocation viable without requiring contiguous memory per sequence.
The data structure is straightforward. A BlockSpaceManager maintains a pool of physical blocks, each holding KV states for B tokens (the default block size is 16). Each sequence has a BlockTable, a list of physical block IDs. When a sequence grows, the manager allocates a new block and appends its ID. When a sequence finishes, its blocks are freed immediately.
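A minimal sketch of this bookkeeping, with class and method names as simplified stand-ins for vLLM's actual BlockSpaceManager (which also handles watermarks, swapping, and copy-on-write):

```python
class BlockManager:
    """Toy pool of physical KV cache blocks plus per-sequence block tables."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical block IDs
        self.block_tables = {}                      # seq_id -> list of block IDs

    def append_token(self, seq_id: int, seq_len: int) -> None:
        """Allocate a new physical block when the sequence crosses a block boundary."""
        table = self.block_tables.setdefault(seq_id, [])
        if seq_len > len(table) * self.block_size:
            table.append(self.free_blocks.pop())

    def free(self, seq_id: int) -> None:
        """On completion, return all of the sequence's blocks to the pool immediately."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
```

A 32-token sequence holds exactly two blocks; freeing it makes both available to any waiting request in the same iteration.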
Preemption uses this structure directly. When the scheduler determines that memory is insufficient to continue running all active sequences, it preempts the lowest-priority one: its physical blocks are copied from GPU HBM to CPU DRAM, the GPU blocks are freed, and the sequence moves to a swapped queue. When resources free up, the blocks swap back and the sequence resumes without recomputation. (vLLM also supports a recompute preemption mode that simply discards the blocks; which mode applies depends on configuration and request type.) Over PCIe 4.0 x16 (roughly 25-32 GB/s), a 512-token sequence in a 7B-class model carries about 256 MB of KV state, roughly 0.5 MB per token at fp16, and swaps back in on the order of 10 ms, typically cheaper than recomputing the prefill, with the advantage growing with model size and context length.
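A back-of-envelope estimate of the swap cost, assuming LLaMA-7B-like shapes (32 layers, 32 KV heads of dimension 128, fp16) and ~25 GB/s effective PCIe 4.0 x16 bandwidth. These are rough estimates for intuition, not measurements:

```python
def kv_bytes_per_token(layers=32, kv_heads=32, head_dim=128, dtype_bytes=2):
    """KV state per token: a key and a value vector per head, per layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes  # 2 = K and V

def swap_time_ms(num_tokens, bandwidth_gbps=25.0, **shape):
    """One-way transfer time over the PCIe link at the assumed bandwidth."""
    total_bytes = num_tokens * kv_bytes_per_token(**shape)
    return total_bytes / (bandwidth_gbps * 1e9) * 1e3

# ~0.5 MB per token, ~256 MB for 512 tokens, ~11 ms one way at 25 GB/s
```

Models with grouped-query attention (fewer KV heads) shrink the per-token figure proportionally, which makes swapping correspondingly cheaper.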
A secondary benefit is copy-on-write semantics for parallel sampling and beam search. When a sequence branches into multiple beams, the beams initially share physical blocks by reference count. Only when a beam writes new KV entries does the block manager copy the shared block. This avoids duplicating the entire prompt’s KV cache for each beam, which matters at long context lengths.
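A toy version of the reference-counting logic, with the actual block-content copy elided (names are illustrative simplifications, not vLLM's):

```python
class CowPool:
    """Reference-counted block pool with copy-on-write forking for beams."""

    def __init__(self, num_blocks: int):
        self.refcount = {}                  # block_id -> number of beams sharing it
        self.free = list(range(num_blocks)) # demo only: parent blocks come from here too

    def fork(self, block_table: list[int]) -> list[int]:
        """A new beam shares its parent's blocks by bumping refcounts, O(1) per block."""
        for b in block_table:
            self.refcount[b] = self.refcount.get(b, 1) + 1
        return list(block_table)

    def write(self, block_table: list[int], idx: int) -> int:
        """Before writing a shared block, copy it so sibling beams are unaffected."""
        b = block_table[idx]
        if self.refcount.get(b, 1) > 1:
            self.refcount[b] -= 1
            new_b = self.free.pop()
            self.refcount[new_b] = 1
            block_table[idx] = new_b        # the KV data copy would happen here
        return block_table[idx]
```

Forking is cheap regardless of prompt length; only the first write to a shared block pays for a copy, and only for that one block.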
vLLM’s effective KV cache waste is under 4%, bounded to at most one partially-filled block per sequence. In practice this enables sustained batch sizes that are 4-8x larger than contiguous allocation on the same hardware. Benchmarks on LLaMA-13B on an A100-80GB show up to 24x throughput improvement over native HuggingFace Transformers inference and 3.5x over TGI on serving workloads with variable-length outputs. The gap versus TGI is almost entirely attributable to memory utilization.
TGI: Contiguous Allocation with Flash Attention
HuggingFace TGI implements continuous batching with a Rust-based server and Python model execution, using Flash Attention 2 for efficient attention computation. The KV cache is allocated as contiguous tensors per sequence: when a request arrives, TGI allocates a buffer sized to the configured maximum sequence length. Buffers are fixed at admission time.
The tradeoff is simplicity versus memory efficiency. Contiguous allocation means Flash Attention 2’s standard variable-length kernels work directly without modification; there is no block table to maintain. The scheduling implementation is simpler because there is no block manager, and preemption in the default configuration means recompute rather than swap.
The memory cost is real. On a workload where some requests generate 50 tokens and others 1,000, a request generating 50 tokens still holds its full pre-allocated buffer until it finishes. Effective batch size is bounded by the worst-case allocation per slot, not by actual utilization. Measured KV cache utilization on mixed workloads is typically 20-40%, that is, 60-80% of the allocated cache is wasted, compared to vLLM's under-4% waste.
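The arithmetic behind this gap, assuming a 2048-token configured maximum and 16-token blocks. The figures are for intuition, not benchmark data:

```python
BLOCK_SIZE = 16
MAX_LEN = 2048  # assumed configured maximum sequence length

def contiguous_waste(actual_len: int) -> float:
    """Contiguous: the whole max-length buffer is held regardless of output length."""
    return (MAX_LEN - actual_len) / MAX_LEN

def paged_waste(actual_len: int) -> float:
    """Paged: at most one partially filled block per sequence."""
    allocated = -(-actual_len // BLOCK_SIZE) * BLOCK_SIZE  # ceil to block boundary
    return (allocated - actual_len) / allocated

# A 1,000-token completion wastes ~51% of its contiguous buffer but under 1%
# of its paged allocation; a 50-token completion wastes ~98% contiguous.
```

Note that paged waste is also bounded in absolute terms: never more than BLOCK_SIZE - 1 tokens of KV state per sequence, however long the configured maximum.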
TGI’s strength is operational simplicity and latency on uniform workloads. The server is straightforward to deploy, Flash Attention 2 gives good single-request latency, and for applications where output lengths are predictable and bounded, the memory inefficiency is modest. A workload where every request generates 200-300 tokens will use TGI’s allocated buffers fairly efficiently; a workload mixing 20-token and 2,000-token outputs will waste significant capacity.
TensorRT-LLM: In-Flight Batching with Production Configuration
NVIDIA’s TensorRT-LLM calls the same technique in-flight batching and implements it through a C++ executor API with paged KV cache support. The configuration surface is more explicit:
- max_num_tokens: total token budget per forward pass across all sequences, combining prefill and decode. This is the primary knob for bounding iteration compute cost and tail latency.
- kv_cache_free_gpu_mem_fraction: fraction of the GPU memory left free after model weight loading that is allocated to the paged KV cache pool, defaulting to approximately 0.9.
- max_batch_size: upper limit on concurrent sequences in the batch.
- max_queue_delay_microseconds: maximum time a request can wait in the queue before admission, controlling admission latency under sustained load.
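As a rough illustration of how kv_cache_free_gpu_mem_fraction translates into token capacity. Only the config key name comes from TensorRT-LLM; the accounting below is a simplification, assuming an 80 GB GPU, a 13 GB fp16 engine, and ~0.5 MB of KV state per token:

```python
def kv_pool_tokens(total_gpu_gb: float, weights_gb: float,
                   free_mem_fraction: float = 0.9,
                   kv_bytes_per_token: int = 524288) -> int:
    """Tokens of KV capacity implied by the free-memory fraction (sketch)."""
    free_bytes = (total_gpu_gb - weights_gb) * 1e9   # memory left after weights
    pool_bytes = free_bytes * free_mem_fraction      # fraction given to the KV pool
    return int(pool_bytes // kv_bytes_per_token)

# e.g. kv_pool_tokens(80, 13) is roughly 115,000 tokens of KV capacity
```

That token capacity, not max_batch_size, is usually what actually bounds concurrency on long-context workloads.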
The paged KV cache in TensorRT-LLM operates on similar principles to vLLM, though block size and specific allocation policies differ. NVIDIA reports 2-5x throughput improvements over static batching on LLaMA-2-70B on H100 with in-flight batching enabled. Direct comparison with vLLM on identical hardware is workload-dependent.
TensorRT-LLM requires TensorRT engine compilation for each model before deployment, an additional step that vLLM and TGI skip. This compilation step produces CUDA kernels tuned to the specific model architecture and hardware target, which is where the throughput advantage on NVIDIA hardware comes from. The operational overhead is a meaningful deployment cost for organizations that frequently update or experiment with models.
What the Differences Mean in Practice
The choice between these systems depends on which constraints are binding for your workload.
High output length variance is where vLLM’s PagedAttention advantage is most pronounced. If your workload mixes short and long outputs, contiguous allocation wastes GPU memory in proportion to the variance; paged allocation does not. The custom PagedAttention CUDA kernel adds marginal overhead versus Flash Attention 2’s highly optimized standard path, but the throughput gain from running larger effective batch sizes is much larger than the per-step kernel overhead.
Uniform and predictable output lengths reduce the memory utilization gap between vLLM and TGI. If requests consistently generate within a narrow range, TGI’s simpler allocation strategy wastes proportionally less memory, and its Flash Attention 2 path gives competitive per-step latency. The operational simplicity of avoiding a block manager and preemption scheduler is a real advantage for teams that want a reliable, easy-to-reason-about deployment.
Production NVIDIA deployments targeting strict latency SLOs benefit from TensorRT-LLM’s explicit configuration surface. The max_num_tokens parameter in particular gives a hard bound on per-iteration compute cost that the scheduler-level configuration in vLLM and TGI does not provide as directly. This matters when you need to guarantee P99 time-to-first-token under a specific millisecond budget.
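A sketch of the bound max_num_tokens implies, assuming roughly 2 FLOPs per parameter per token of dense fp16 compute and a given sustained throughput. The figures are illustrative, not NVIDIA guidance, and ignore tensor parallelism and memory-bound decode steps:

```python
def iteration_time_ms(max_num_tokens: int, params_b: float,
                      sustained_tflops: float) -> float:
    """Upper bound on per-iteration compute time implied by the token budget."""
    flops = 2 * max_num_tokens * params_b * 1e9   # ~2 FLOPs per param per token
    return flops / (sustained_tflops * 1e12) * 1e3

# e.g. a 7B model at ~400 sustained TFLOPS with max_num_tokens=2048 caps
# each iteration at roughly 72 ms of compute, whatever mix of prefill and
# decode the scheduler admits.
```

Because a decode step for every running sequence must fit inside that same budget, lowering max_num_tokens trades prefill throughput for a tighter tail-latency bound on tokens already streaming.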
The broader point, which the HuggingFace article correctly emphasizes on the scheduling side, is that continuous batching without paged KV allocation leaves substantial throughput on the table. The scheduling insight from the ORCA paper enabled iteration-level admission and eviction; the PagedAttention insight from the vLLM paper made the memory budget flexible enough to actually fill the batch at all times. Both changes are necessary; neither alone is sufficient.