
What the Orca Paper Left Unsolved and How PagedAttention Finished the Job


The HuggingFace post on continuous batching from first principles, originally published in November 2025, walks through the scheduling mechanics clearly: how naive batching pads to the longest sequence, how that wastes compute, how ragged batching and chunked prefill address the inefficiency. It covers the scheduler-level story well. The memory side, which determines how many sequences can actually fit in GPU memory at once, gets less attention, and in practice it is at least as important.

Both problems had to be solved for high-throughput LLM serving to become viable. They were solved by different papers, in different years, and understanding how they interact is what makes the full picture coherent.

Why LLM Batching Is Different from Standard Inference

Standard deep learning inference batching is straightforward. A batch of fixed-size inputs goes through a stateless forward pass and produces outputs. Image classification, BERT encoding, embedding generation: the model holds no per-sample state between calls, and batching is mostly a question of fitting inputs into memory.

Autoregressive LLM generation breaks every part of that model. Each output token attends to all previous tokens through the attention mechanism. Every decode step requires reading back the key-value (KV) cache, which stores the K and V matrices for all prior tokens, for every sequence in the batch. Output lengths are unknown at request time. A batch that starts with eight sequences can, after 100 decode steps, have five sequences that finished and three still running. Under request-level scheduling, those three sequences hold their GPU slots until completion, and new requests queue behind them.

The waste compounds: padding to match the longest sequence wastes token-level compute; idle slots from finished sequences waste capacity; waiting requests accumulate in queue while GPUs run at partial utilization.
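To make the padding cost concrete, here is a toy calculation with hypothetical sequence lengths; the numbers are illustrative, not measurements:

```python
# Illustrative only: token-slot waste when a static batch is padded to the
# longest sequence. Lengths (prompt + output tokens) are hypothetical.
lengths = [2048, 512, 128, 64, 900, 300, 1500, 80]

padded = max(lengths) * len(lengths)   # slots the padded batch allocates
useful = sum(lengths)                  # slots doing real work
waste = 1 - useful / padded
print(f"padded slots: {padded}, useful: {useful}, wasted: {waste:.0%}")
```

With this mix of lengths, roughly two thirds of the batch's token slots are padding, before accounting for the idle slots left by sequences that finish early.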

The Orca Paper and Iteration-Level Scheduling

The Orca paper (Yu et al., OSDI 2022) named and solved the scheduling side of this problem. Orca’s key contribution was scheduling at iteration granularity rather than forming a batch and running it to completion: after every single forward pass, the scheduler runs. Sequences that produced an <eos> token are removed immediately. New requests from the queue are admitted. The next forward pass runs over the recomposed batch.

This is what “continuous batching” means precisely: the batch composition changes every iteration, early-finishing sequences are replaced without delay, and GPU utilization stays high because there is never a gap waiting for the slowest request in a static batch. Orca’s experiments on GPT-3 scale models reported up to 36.9x higher throughput compared to static batching baselines at the same latency SLO.
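A minimal sketch of what iteration-level scheduling looks like, with the model stubbed out as a scripted token source and a hypothetical batch-size cap; this illustrates the idea, not Orca's implementation:

```python
from collections import deque

MAX_BATCH = 4   # hypothetical cap on concurrent sequences
EOS = -1

def model_step(seq):
    # Stub standing in for one forward pass: pop this sequence's next token.
    return seq["tokens"].pop(0)

def serve(queue, max_iters=100):
    running, finished = [], []
    for _ in range(max_iters):
        # Admit new requests every iteration, not once per batch.
        while queue and len(running) < MAX_BATCH:
            running.append(queue.popleft())
        if not running:
            break
        # One forward pass over the current batch (simulated per sequence).
        for seq in running:
            seq["out"].append(model_step(seq))
        # Remove finished sequences immediately; their slots free up now.
        finished += [s for s in running if s["out"][-1] == EOS]
        running = [s for s in running if s["out"][-1] != EOS]
    return finished

# Six requests with different (pre-scripted) output lengths.
reqs = deque({"id": i, "tokens": [7] * n + [EOS], "out": []}
             for i, n in enumerate([2, 5, 1, 3, 4, 2]))
done = serve(reqs)
print([d["id"] for d in done])   # short requests finish and free slots early
```

The point of the sketch is the loop structure: admission and eviction happen inside the iteration loop, so a slot vacated by a short request is refilled on the very next forward pass.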

Iteration-level scheduling solved GPU idling from uneven sequence lengths; it did not address how GPU memory was allocated for the KV caches those sequences required.

The KV Cache Fragmentation Problem

For a model like LLaMA-2-7B (32 layers, 32 attention heads, head dimension 128), the KV cache stores two float16 values per dimension per head per layer per token. That works out to 16 KB per token per layer, or approximately 512 KB per token across all 32 layers. A single 2048-token sequence requires roughly 1 GB of KV cache. With 80 GB of GPU memory and the model weights occupying around 14 GB at float16, the remaining capacity sounds generous, until you account for how that memory gets allocated.
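The arithmetic can be checked directly:

```python
# Back-of-envelope KV cache sizing for LLaMA-2-7B at float16.
layers, heads, head_dim = 32, 32, 128
bytes_per_value = 2                                   # float16

# Two tensors (K and V), each heads * head_dim values, per layer, per token.
per_token = 2 * layers * heads * head_dim * bytes_per_value
print(per_token // 1024, "KiB per token")             # 512 KiB

seq_len = 2048
print(per_token * seq_len / 2**30, "GiB per sequence")  # 1.0 GiB
```

At ~1 GB per full-length sequence, the ~66 GB left after weights supports only a few dozen maximum-length sequences even under perfect allocation.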

Before the vLLM paper, KV caches were allocated contiguously per sequence. Since output length is unknown at request time, systems had to pre-allocate for the maximum possible length (wasting memory for short outputs) or allocate incrementally and risk that contiguous blocks of the required size were unavailable due to fragmentation from freed sequences. Measurements from the vLLM authors (Kwon et al., SOSP 2023) showed 60-80% of KV cache memory being wasted in practice from fragmentation, over-reservation, and internal waste combined.
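A toy first-fit allocator makes the fragmentation failure mode concrete: after two sequences finish, total free memory is ample, but no contiguous span is large enough for a new request. Sizes here are arbitrary and purely illustrative:

```python
# Toy contiguous allocator over 16 "token slots".
MEM = 16
used = [False] * MEM

def alloc(n):
    # First-fit search for n contiguous free slots.
    run = 0
    for i, u in enumerate(used):
        run = 0 if u else run + 1
        if run == n:
            start = i - n + 1
            for j in range(start, i + 1):
                used[j] = True
            return start
    return None                        # no contiguous span large enough

for _ in range(4):
    alloc(4)                           # four running sequences fill memory

# Two non-adjacent sequences finish, freeing their spans.
for j in range(0, 4):  used[j] = False
for j in range(8, 12): used[j] = False

free_total = sum(not u for u in used)
result = alloc(6)
print(free_total)   # 8 slots free in total...
print(result)       # ...but a 6-slot request still fails: None
```

Half the memory is free, yet the allocation fails because the free slots are split across two 4-slot holes: exactly the external fragmentation that paging eliminates.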

Effective batch sizes were far smaller than the hardware could theoretically support. The scheduling improvement from Orca was real, but fragmentation in the memory allocator kept batch sizes low enough to limit what the scheduler could actually deliver.

PagedAttention: OS Paging Applied to KV Caches

PagedAttention, the central contribution of vLLM, applies a standard OS abstraction to this problem. Instead of requiring contiguous KV cache storage, the cache is divided into fixed-size blocks (pages), each holding the K and V vectors for a fixed number of tokens, typically 16 to 32. Each sequence has a block table mapping logical block indices to physical block locations in GPU memory, exactly as a process’s page table maps virtual addresses to physical memory frames.

As a sequence generates tokens, new physical blocks are allocated on demand from a free pool. When a sequence finishes, its blocks are immediately returned to the pool. Physical blocks from different sequences interleave freely in GPU memory; the block table handles the indirection. Memory waste drops to under 4%: only the last partially-filled block per sequence represents unused space.
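A minimal sketch of the block-table idea, with the actual KV tensors abstracted away; the block size and pool size are hypothetical, and this is not vLLM's allocator:

```python
BLOCK_SIZE = 16                     # tokens per block (typical: 16-32)

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))   # pool of physical block ids
        self.tables = {}                      # seq_id -> [physical block ids]

    def append_token(self, seq_id, pos):
        # A new physical block is taken from the pool only when the
        # sequence crosses a block boundary; blocks need not be adjacent.
        table = self.tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:
            table.append(self.free.pop())
        # The block table translates logical position -> physical location.
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

alloc = BlockAllocator(num_blocks=64)
for pos in range(40):                # a 40-token sequence
    alloc.append_token("seq-0", pos)

n_blocks = len(alloc.tables["seq-0"])
print(n_blocks)                      # 3 blocks: ceil(40 / 16)

# When the sequence finishes, every block returns to the pool at once.
alloc.free.extend(alloc.tables.pop("seq-0"))
print(len(alloc.free))               # 64: nothing is stranded
```

The only waste is the tail of the last block (here, 8 of 48 reserved token slots), which is where the under-4% figure comes from.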

A secondary benefit follows for parallel sampling workloads. When generating multiple outputs from the same prompt (beam search, top-k parallel sampling), the prompt’s KV cache blocks are shared via reference counting across all derived sequences. A copy-on-write triggers only when a derived sequence needs to write to a shared block. The vLLM paper reported up to 10x higher throughput than HuggingFace Transformers for beam search workloads, largely from this block sharing.
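The reference counting and copy-on-write can be sketched in the same toy style; block contents are omitted, and only the bookkeeping is shown:

```python
# Hedged sketch of refcounted block sharing with copy-on-write.
class SharedBlocks:
    def __init__(self):
        self.refcount = {}            # physical block id -> reference count
        self.next_id = 0

    def new_block(self):
        bid = self.next_id
        self.next_id += 1
        self.refcount[bid] = 1
        return bid

    def fork(self, table):
        # A derived sequence reuses the parent's blocks: just bump refcounts.
        for bid in table:
            self.refcount[bid] += 1
        return list(table)

    def write(self, table, idx):
        # Copy-on-write: copy only if another sequence still shares the block.
        bid = table[idx]
        if self.refcount[bid] > 1:
            self.refcount[bid] -= 1
            table[idx] = self.new_block()
        return table[idx]

pool = SharedBlocks()
prompt = [pool.new_block() for _ in range(4)]      # shared prompt blocks
beam_a, beam_b = pool.fork(prompt), pool.fork(prompt)

pool.write(beam_a, 3)                 # beam A diverges in its last block
print(beam_a[3] != beam_b[3])         # True: A got a private copy
print(pool.refcount[beam_b[3]])       # 2: still shared by prompt and beam B
```

Forking costs nothing but refcount increments, which is why generating many candidates from one prompt is so cheap under this scheme.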

The combined effect: vLLM achieves 2-4x higher throughput than HuggingFace Transformers at the same latency, and up to 2.2x higher throughput than Orca’s implementation, which had retained contiguous KV cache allocation.

Chunked Prefill: Handling Phase Asymmetry

A third problem emerges at scale. The prefill and decode phases have fundamentally different computational profiles. Prefill processes all input tokens in parallel; it is compute-bound and runs at high arithmetic intensity. Decode generates one token per sequence per step; it is memory-bandwidth-bound with low arithmetic intensity. A thousand-token prompt can take tens to hundreds of milliseconds to prefill. During that time, the decode pipeline stalls, and P99 latency spikes for all ongoing sequences.

Chunked prefill addresses this by splitting large prompts into fixed-size chunks (typically 256-512 tokens) and interleaving one chunk per iteration with ongoing decode steps. A 1024-token prompt becomes four chunks spread over four iterations instead of one long stall. Decode keeps running, P99 latency stays predictable, and the prefill completes incrementally. There is a small throughput cost since chunked prefill is slightly less efficient than monolithic prefill, but the latency smoothness is worth it for most production workloads. vLLM enables this by default via --enable-chunked-prefill.
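The chunking itself is simple arithmetic; a sketch with a hypothetical chunk size of 256:

```python
CHUNK = 256   # tokens of prefill work admitted per iteration (hypothetical)

def chunks(prompt_len, chunk=CHUNK):
    # Split a prompt into [start, end) token ranges, one per iteration.
    return [(i, min(i + chunk, prompt_len))
            for i in range(0, prompt_len, chunk)]

plan = chunks(1024)
print(plan)   # four chunks; each iteration runs one chunk plus the
              # decode steps of every running sequence
```

Each iteration's latency is then bounded by one chunk of prefill plus one decode step per running sequence, instead of by the full prompt length.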

Prefix Caching and KV Block Reuse

The block structure of PagedAttention creates a natural extension: requests that share a common prefix can share the same physical KV cache blocks without recomputing them. A system prompt used across all requests is computed once; its blocks are pinned and reused by every subsequent request. SGLang’s RadixAttention extends this to arbitrary tree-shaped prompt reuse using a radix tree to track which block sequences have been computed and can be shared across branching conversation structures.
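One common way to implement this, sketched here with toy token lists and a hash of the full token prefix as the cache key; this is a simplification of what production systems do:

```python
import hashlib

BLOCK = 4          # tokens per block (small, for illustration)
cache = {}         # prefix hash -> cached block (simulated by its own key)

def cached_blocks(tokens):
    """Count leading full blocks already cached; 'compute' missing ones."""
    hits, missing = 0, []
    for i in range(BLOCK, len(tokens) + 1, BLOCK):
        # Keying on the whole prefix means a block is reused only when
        # every token before it also matches.
        key = hashlib.sha256(str(tokens[:i]).encode()).hexdigest()
        if key in cache and not missing:
            hits += 1
        else:
            missing.append(key)
    for key in missing:
        cache[key] = key            # stand-in for computing and storing KV
    return hits

system_prompt = list(range(12))               # 3 full blocks of shared prefix
first = cached_blocks(system_prompt + [99])
second = cached_blocks(system_prompt + [42])
print(first)    # 0: the first request computes everything
print(second)   # 3: the shared system-prompt blocks are reused
```

Because the key covers the entire prefix, two requests share blocks exactly up to the point where their prompts diverge, which is the behavior a radix-tree index generalizes to whole trees of prompts.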

Cloud providers have made this visible at the API level. Anthropic’s prompt caching, Google’s context caching, and OpenAI’s cached input tokens all reflect the same underlying mechanism. The KV blocks for the cached prefix are retained between calls; savings compound with prefix length.

The Current Framework Landscape

All major serving frameworks now implement some combination of continuous batching and paged memory management, with different optimization targets.

vLLM is the most widely deployed open-source option: PagedAttention, continuous batching, chunked prefill, prefix caching, speculative decoding, and multi-modal support are all included. It is the default starting point for self-hosted LLM deployment.

TGI (Text Generation Inference) from HuggingFace powers their inference API. Written in Rust for the HTTP layer with Python model execution, it supports flash attention and paged attention and integrates quantization (GPTQ, AWQ, bitsandbytes) directly. It prioritizes deployment simplicity.

TensorRT-LLM from NVIDIA calls its implementation “in-flight batching,” the same concept under a different name. Built on TensorRT’s kernel fusion and FP8 quantization for H100 hardware, it typically achieves the highest raw throughput on NVIDIA accelerators due to hardware-specific optimizations.

SGLang from Berkeley targets LLM program workloads rather than single-request serving. RadixAttention makes it efficient for agent loops, multi-turn conversations, and workloads where prompt structures branch and share prefixes heavily.

Where the Remaining Constraints Are

Two directions are drawing active research attention. Disaggregated serving runs prefill and decode on separate GPU clusters sized for their respective computational profiles: compute-bound prefill on fewer, more powerful GPUs; memory-bandwidth-bound decode on more GPUs with large memory budgets. Papers including Splitwise (Microsoft) and DistServe (PKU) quantify the throughput gains; the bottleneck shifts to KV cache transfer bandwidth between clusters.

KV cache compression addresses memory footprint directly. INT8 or FP8 quantization of the KV cache roughly halves its size with minimal quality degradation, effectively doubling the batch size that fits in a given GPU memory budget. Grouped-query attention (GQA), used in LLaMA-2-70B with 8 KV heads against 64 query heads, similarly reduces KV cache size per token at the architecture level, reducing the memory pressure that PagedAttention manages.
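The back-of-envelope numbers, using LLaMA-2-70B's published shapes (80 layers, 8 KV heads, head dimension 128):

```python
# Per-token KV cache size for LLaMA-2-70B, with and without GQA and FP8.
layers, head_dim = 80, 128

def kv_bytes_per_token(kv_heads, bytes_per_value):
    return 2 * layers * kv_heads * head_dim * bytes_per_value  # K and V

mha_fp16 = kv_bytes_per_token(64, 2)   # hypothetical: full 64 KV heads
gqa_fp16 = kv_bytes_per_token(8, 2)    # actual GQA configuration
gqa_fp8 = kv_bytes_per_token(8, 1)     # GQA plus 1-byte quantized KV

print(mha_fp16 // 1024, "KiB/token without GQA")   # 2560 KiB
print(gqa_fp16 // 1024, "KiB/token with GQA")      # 320 KiB
print(gqa_fp8 // 1024, "KiB/token with GQA + FP8") # 160 KiB
```

GQA alone cuts the per-token footprint 8x, and quantization halves it again, which is why the two compose so well with PagedAttention's block-level management.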

The HuggingFace blog post is a clean entry point to the scheduling side of this problem. The memory management side, from contiguous allocation to paged blocks to prefix sharing, is what determines whether the scheduling efficiency translates to real throughput in a production deployment. Orca set the scheduling foundation; PagedAttention built the memory layer on top of it. Neither alone would have been sufficient.
