
Roofline Models and Ragged Batches: The Hardware Logic Behind Continuous Batching

Source: huggingface

The HuggingFace post on continuous batching, originally published in November 2025, builds the concept carefully from attention fundamentals through KV caching, chunked prefill, and ragged batching. It explains what the technique does. What it does not explain is why GPUs benefit so dramatically from it, and why the specific design choices in systems like vLLM and TGI look the way they do. The answer starts with the roofline model, and it shapes everything from KV cache layout to preemption policy.

The GPU Roofline and Why Decode Sits in the Wrong Place

Modern GPUs are characterized by two numbers: peak compute throughput (FLOP/s) and memory bandwidth (bytes/s). Their ratio is the arithmetic intensity threshold above which a workload becomes compute-bound rather than memory-bound. For an NVIDIA A100-80GB running BF16 matrix multiplications, the peak throughput is 312 TFLOP/s and the HBM bandwidth is 2 TB/s, giving a threshold of 156 FLOP/byte. Operations below that intensity finish their arithmetic before the data arrives; they are waiting on memory, not on math.
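The ridge point falls out of the two spec-sheet numbers directly. A minimal sketch, using the A100 peak figures quoted above (real chips sustain somewhat less than peak):

```python
# Roofline ridge point for an A100-80GB (peak numbers from the text).
peak_flops = 312e12   # BF16 tensor-core throughput, FLOP/s
hbm_bw = 2e12         # HBM bandwidth, bytes/s

# FLOP/byte: workloads below this intensity are memory-bound,
# above it they are compute-bound.
ridge = peak_flops / hbm_bw
print(ridge)  # 156.0
```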

Autoregressive decode sits far below this threshold at small batch sizes. For a 7B parameter model stored in BF16, loading all model weights requires reading approximately 14 GB from HBM per decode step. The compute performed per batch item is roughly 2 × 7×10^9 = 14 GFLOP. At batch size 1, arithmetic intensity is 14 GFLOP / 14 GB = 1 FLOP/byte, about 156x below the compute-bound threshold. The GPU finishes each multiply-accumulate operation and then waits for the next chunk of weights to arrive.

The fix is straightforward: process more batch items per weight load. Arithmetic intensity scales linearly with batch size because the compute scales with batch size while the weight reads stay constant. At batch size 32, intensity rises to 32 FLOP/byte; at 156 it reaches the compute-bound threshold on an A100. Most production deployments target batch sizes between 32 and 256, which means they occupy the memory-bound region and benefit from every increment in effective batch size.
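The scaling is easy to verify numerically. A sketch under the same simplifying assumption as the text (weight reads dominate memory traffic; activations and KV reads ignored):

```python
def decode_intensity(batch_size, params=7e9, dtype_bytes=2):
    """Arithmetic intensity of one decode step, in FLOP/byte.
    FLOPs grow with the batch; weight traffic does not."""
    flops = 2 * params * batch_size       # ~2 FLOP per parameter per token
    weight_bytes = params * dtype_bytes   # BF16 weights read once per step
    return flops / weight_bytes

print(decode_intensity(1))    # 1.0  -> deep in the memory-bound region
print(decode_intensity(32))   # 32.0
print(decode_intensity(156))  # 156.0 -> reaches the A100 ridge point
```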

This is the hardware case for continuous batching. Anything that allows you to run more concurrent sequences within your GPU memory budget directly translates to higher arithmetic intensity, better GPU utilization, and higher throughput.

KV Cache Memory Is the Constraint on Batch Size

The obstacle is KV cache memory. During autoregressive decode, the attention mechanism requires access to the key and value tensors for every token generated so far. These are stored for each sequence, each layer, and each attention head. For a model with L layers, H heads, head dimension D, and current sequence length T, each sequence requires 2 × L × H × D × T elements of storage.

For LLaMA-2-7B (32 layers, 32 heads, head dimension 128) at 2048 tokens in BF16, that is 2 × 32 × 32 × 128 × 2048 × 2 bytes = 1 GB per sequence. Running 32 concurrent sequences at maximum length costs 32 GB in KV cache, plus 14 GB for the model weights, totaling 46 GB of the 80 GB available on an A100-80GB.
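The sizing formula above is simple enough to check directly. A sketch plugging in the LLaMA-2-7B shape from the text:

```python
def kv_cache_bytes(layers, heads, head_dim, seq_len, dtype_bytes=2):
    # 2x for K and V, stored per layer, per head, per token position.
    return 2 * layers * heads * head_dim * seq_len * dtype_bytes

# LLaMA-2-7B: 32 layers, 32 heads, head dim 128, at 2048 tokens in BF16.
per_seq = kv_cache_bytes(32, 32, 128, 2048)
print(per_seq / 2**30)               # 1.0 GiB per sequence
print(32 * per_seq / 2**30 + 14)     # 46.0 GB total with 14 GB of weights
```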

Static batching makes this worse than it has to be. To handle variable-length outputs, a static batcher must pre-allocate KV cache for the maximum expected sequence length for every slot. A request that generates 60 tokens still holds 2048 tokens worth of KV cache allocation until the entire batch finishes. On a mixed workload where some requests generate 50 tokens and others generate 1,000, the short requests consume full-length cache allocations for most of their lifetime. Effective batch size is constrained by worst-case output length, not by actual utilization.
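The waste is easy to quantify on a toy version of the mixed workload described above (the 50/1,000 split is the text's example; the exact request counts here are assumed for illustration):

```python
max_len = 2048                         # pre-allocated KV slots per batch slot
outputs = [50] * 16 + [1000] * 16      # hypothetical mixed workload

allocated = len(outputs) * max_len     # what static batching reserves
used = sum(outputs)                    # what the requests actually need

print(used / allocated)                # ~0.26: three quarters of the cache idles
```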

Iteration-Level Scheduling as a Memory Policy

The Orca paper (OSDI 2022) reframed the scheduling unit. Rather than treating a request as atomic from arrival to completion, it schedules at the granularity of a single decode iteration: after every forward pass, retire finished sequences and admit new requests from the queue. A sequence that generates 60 tokens holds its KV cache allocation for 60 iterations, then releases it. A sequence generating 1,000 tokens runs for 1,000 iterations, then releases. There is no cross-contamination; no short request subsidizes a long one.

Allocation also becomes incremental. Instead of pre-allocating 2048 tokens of KV cache up front, you allocate one token’s worth per iteration, growing the allocation as the sequence generates output. Peak KV cache usage reflects actual token counts, not maximum theoretical counts. On a workload where average output length is 200 tokens and maximum is 2,000, the difference in sustainable batch size is substantial.
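The scheduling loop itself is small. A minimal sketch of iteration-level admission and retirement; the `Seq` class and its fields are illustrative, not any engine's real API:

```python
from collections import deque

class Seq:
    def __init__(self, name, tokens_to_generate):
        self.name = name
        self.remaining = tokens_to_generate

def continuous_batch(requests, max_batch):
    """Orca-style iteration-level scheduling: after every decode
    iteration, retire finished sequences and admit from the queue."""
    queue, active, order = deque(requests), [], []
    iterations = 0
    while queue or active:
        while queue and len(active) < max_batch:  # admit every iteration
            active.append(queue.popleft())
        iterations += 1                           # one forward pass
        for s in active:
            s.remaining -= 1                      # each sequence emits a token
        for s in [s for s in active if s.remaining == 0]:
            active.remove(s)                      # retire: KV cache freed now
            order.append(s.name)
    return order, iterations

order, iters = continuous_batch(
    [Seq("short", 3), Seq("long", 8), Seq("waiting", 2)], max_batch=2)
print(order, iters)  # ['short', 'waiting', 'long'] 8
```

Note that "waiting" is admitted the moment "short" retires, three iterations in; a static batcher would have held that slot empty until "long" finished.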

The throughput consequence is direct. vLLM’s 2023 benchmark results showed 2-4x throughput improvements over static batching systems on LLaMA-class models at equivalent latency targets, driven primarily by the higher sustainable batch sizes that iteration-level scheduling enables.

PagedAttention: Eliminating Internal Fragmentation

vLLM extended iteration-level scheduling with PagedAttention, which applies virtual memory concepts to KV cache layout. Even with incremental allocation, contiguous memory allocation for growing sequences causes fragmentation: when a contiguous block runs out, you either copy the sequence’s KV cache to a larger region or pre-allocate extra space to avoid future copies. Both choices waste memory.

PagedAttention divides KV cache into fixed-size pages, defaulting to 16 tokens each. Each sequence maintains a page table mapping logical token positions to physical pages. Pages can be anywhere in HBM; the attention kernel dereferences the page table at access time. When a sequence needs more space, allocate an available page from the free pool, regardless of location. When a sequence finishes, return its pages to the pool immediately.
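The bookkeeping side of this can be sketched in a few lines. This is toy host-side logic only (the real kernel dereferences page tables on the GPU); class and method names are hypothetical:

```python
BLOCK = 16  # tokens per page, the default cited in the text

class PagePool:
    """Toy PagedAttention bookkeeping: fixed-size pages, a free list,
    and a per-sequence page table (logical block -> physical page)."""
    def __init__(self, num_pages):
        self.free = list(range(num_pages))
        self.tables = {}  # seq_id -> list of physical page ids

    def append_token(self, seq_id, pos):
        table = self.tables.setdefault(seq_id, [])
        if pos % BLOCK == 0:          # crossed a page boundary:
            table.append(self.free.pop())  # grab any free page, anywhere
        return table[pos // BLOCK]    # physical page holding this token

    def release(self, seq_id):
        # Pages return to the pool immediately; no compaction needed.
        self.free.extend(self.tables.pop(seq_id))

pool = PagePool(num_pages=8)
for t in range(40):                   # a 40-token sequence: ceil(40/16) = 3 pages
    pool.append_token("seq0", t)
print(len(pool.tables["seq0"]), len(pool.free))  # 3 5
pool.release("seq0")
print(len(pool.free))                 # 8
```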

The vLLM team reported KV cache waste below 4% with PagedAttention, compared to 20-30% waste typical of contiguous allocation schemes. At large batch sizes, that reclaimed memory translates directly into additional concurrent sequences, which in turn raises arithmetic intensity and throughput.

The Prefill-Decode Asymmetry

Looking beyond pure decode reveals a second structural problem. Prefill, the step where the model processes all input tokens in parallel, is compute-bound: all input positions attend to each other simultaneously, generating many FLOP per weight byte loaded. Decode is memory-bound as described above. The two phases have different optimal execution characteristics.

With naive continuous batching, when a new request arrives its prefill step runs as part of the current batch iteration. A long prompt, say 4,000 tokens, consumes an entire forward pass for prefill alone, adding hundreds of milliseconds of latency to every other sequence currently in the decode phase. Time-to-first-token for the new request improves, but the tail latency of existing requests spikes.

Chunked prefill, now available in both vLLM (via --enable-chunked-prefill) and TGI, addresses this by splitting long prefill operations across multiple decode iterations. A 4,000-token prompt might process 256 tokens per iteration, interleaved with decode steps for active sequences. This keeps per-iteration latency bounded and prevents prefill from starving the decode pipeline.
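The chunking arithmetic for the text's 4,000-token example can be sketched as follows (the helper name is hypothetical; real engines interleave these ranges with decode steps inside the scheduler):

```python
def chunked_prefill_schedule(prompt_len, chunk=256):
    """Split one long prefill into per-iteration token ranges."""
    return [(start, min(start + chunk, prompt_len))
            for start in range(0, prompt_len, chunk)]

chunks = chunked_prefill_schedule(4000)
print(len(chunks))   # 16 iterations: 15 full chunks plus one of 160 tokens
print(chunks[-1])    # (3840, 4000)
```

Each 256-token chunk keeps per-iteration latency near that of a normal decode step, instead of one 4,000-token pass that stalls every active sequence.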

What the Numbers Look Like

On an A100-80GB running LLaMA-2-13B (26 GB of weights in BF16), the memory bandwidth ceiling places an upper bound on decode throughput regardless of batch size. At 2 TB/s bandwidth and 26 GB per iteration, the maximum decode rate is approximately 2,000 / 26 ≈ 77 forward passes per second. At batch size 32, that translates to a theoretical ceiling of about 2,450 tokens per second.
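The ceiling calculation, using the figures above (idealized: weight reads only, no KV traffic or overheads):

```python
weights_gb = 26    # LLaMA-2-13B in BF16
bw_gb_s = 2000     # A100 HBM bandwidth
batch = 32

passes_per_s = bw_gb_s / weights_gb   # every decode step re-reads all weights
tokens_per_s = passes_per_s * batch
print(round(passes_per_s, 1), round(tokens_per_s))  # 76.9 2462
```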

In practice, vLLM with continuous batching on LLaMA-2-13B achieves roughly 1,500-2,000 tokens per second under sustained load, which is close to the memory bandwidth ceiling. Static batching systems on the same hardware at comparable latency SLOs typically achieve 400-600 tokens per second, limited by the lower effective batch sizes that fixed pre-allocation allows.

Where It Still Gets Complicated

Continuous batching does not solve preemption cleanly. If the scheduler overcommits and KV cache runs out mid-batch, it must either swap sequences to CPU memory (slow) or evict and recompute from scratch (expensive). Both options add unpredictable tail latency under high load. Conservative scheduling policies mitigate this but reduce maximum utilization.

Disaggregated serving, where separate GPU pools handle prefill and decode, takes the prefill-decode asymmetry seriously at the infrastructure level rather than patching it within a single server. Papers like Splitwise and DistServe explore this direction, though operational complexity is significant and the throughput gains depend heavily on workload characteristics.

The core design of continuous batching is not complicated: retire sequences when they finish, allocate KV cache incrementally, keep the batch full. The complexity is entirely in the gap between that principle and the practical constraints of HBM capacity, memory fragmentation, and prefill-decode interference. PagedAttention and chunked prefill are the two most important engineering responses to that gap, and both follow directly from the same arithmetic that makes high batch sizes valuable in the first place.
