· 6 min read ·

The Scheduling Problem Hiding Inside LLM Serving Throughput

Source: huggingface

The HuggingFace blog on continuous batching, published in November 2025, frames the topic from first principles: start with KV caching, layer in scheduling, fix the memory fragmentation. Looking back at it now, what stands out is how cleanly the entire stack maps onto operating systems concepts from five decades ago. The LLM serving community spent years optimizing model weight computation, then rediscovered that the real bottleneck was a resource management problem that OS designers solved in the 1970s.

Prefill and Decode Are Different Hardware Problems

Every autoregressive inference request runs through two distinct phases. Prefill processes the entire prompt in a single forward pass, saturating compute. Decode generates one token per forward pass, using a cached representation of all prior tokens; it is memory-bandwidth-bound, with arithmetic intensity close to 1 FLOP per byte, well below the hardware roofline on any modern accelerator.

On an A100 80 GB, peak FP16 compute is 312 TFLOPS and HBM bandwidth is 2 TB/s. Decode on LLaMA-7B reads roughly 14 GB of model weights every forward pass. At 2 TB/s, that is around 7ms of memory transfer per decode step, minimum, regardless of batch size. You can add more sequences to the decode batch and the per-step time barely increases, because the weights are being read regardless; additional sequences just consume slightly more of the compute budget while memory transfer stays flat. This is the core fact that makes batching in the decode phase nearly free, up to the point where compute becomes the binding constraint.

That compute crossover happens at batch sizes of roughly 200 to 500 for LLaMA-7B on an A100, depending on sequence length. Below that threshold, more concurrent sequences give throughput gains at nearly zero latency cost.

The Convoy Problem Returns

Static batching groups requests into fixed batches, runs each batch until all sequences finish, then starts the next batch. The failure mode is straightforward. Output lengths vary widely. A request generating 10 tokens finishes 200 iterations before one generating 2,000 tokens. In a static batch, the finished requests sit idle, their GPU slots padding with zeros, until the longest sequence is done. Meanwhile, requests in the queue wait even though the hardware has available capacity.

CPU schedulers solved this problem with preemptive multitasking in the early 1970s. The convoy effect was well understood: short jobs blocked behind long jobs in batch queues, starving interactive processes. The fix was to preempt at regular intervals and let the scheduler decide what runs next.

The Orca paper from OSDI 2022 applied the same logic to LLM serving and called it iteration-level scheduling. Rather than waiting for a batch to complete before admitting new requests, the scheduler checks after every forward pass whether any sequence has finished (generated an end-of-sequence token or reached its maximum length). Finished sequences are evicted immediately; new requests from the queue fill their slots on the next iteration. The batch stays full. Orca reported a 23.5x throughput improvement over FasterTransformer at the same latency SLO on GPT-3-scale models. The gain is largest when output length variance is high, which matches real workloads: on the ShareGPT dataset, mean output length is around 350 tokens with a standard deviation near 400.

KV Cache as a Memory Allocation Problem

Continuous scheduling solves the convoy problem but leaves memory management unaddressed. The key-value cache is what makes decode efficient, and its memory footprint is substantial. For each token in each sequence, the model stores Key and Value tensors for every attention head in every layer.

For LLaMA-2-7B in float16:

2 (K + V) × 32 layers × 32 heads × 128 head_dim × 2 bytes = 524,288 bytes

Roughly 512 KB per token. On an 80 GB A100 with 14 GB consumed by model weights, about 64 GB is available for KV cache, supporting approximately 125,000 cached tokens across all concurrent sequences. That budget evaporates quickly with long outputs and large batch sizes.

The naive allocation strategy reserves the maximum possible KV cache for every request at admission time, assuming full use of the max_new_tokens budget. Most requests do not approach that budget. A sequence configured for 2,000 output tokens that actually generates 150 ties up 1,850 tokens worth of reserved memory for its entire lifetime in the batch. Pre-allocation utilization in typical workloads runs between 20% and 40% of available memory.

vLLM’s PagedAttention from SOSP 2023 addresses this directly by applying OS virtual memory concepts to KV cache management. KV cache is divided into fixed-size blocks, each holding cache for 16 tokens by default. Blocks are allocated on demand as tokens are generated; there is no reservation at request start. A block table maps logical block indices per sequence to physical block locations in GPU memory, exactly as a page table maps virtual pages to physical frames. Physical blocks can be non-contiguous; the block table handles the translation inside the modified attention kernel.

The memory access pattern becomes less sequential than a contiguous cache layout, but the savings are significant. vLLM reports KV cache utilization above 96%, compared to 20-40% for pre-allocation. From the vLLM paper, throughput on LLaMA-13B was 14 to 24x higher than HuggingFace Transformers and roughly 2 to 4x higher than an Orca-style system without paged allocation.

Block-based allocation also enables memory sharing across sequences. When multiple requests share a common prefix, the KV cache blocks for that prefix can be shared at no additional memory cost. Beam search variants from the same root sequence share blocks until their outputs diverge, at which point copy-on-write semantics separate them. The vLLM paper measured up to 55% memory reduction for beam search workloads.

Ragged Batching and Chunked Prefill

Two refinements address remaining inefficiencies that neither iteration-level scheduling nor PagedAttention fully resolve.

Ragged batching eliminates padding overhead when a new prefill request joins an existing decode batch. Adding a 100-token prefill to a batch of 8 decoding sequences costs, in the padded approach:

(100 - 1) × (8 - 1) = 693 padding tokens

693 wasted compute tokens just to maintain a rectangular tensor shape. The HuggingFace blog derives this formula clearly. Ragged batching concatenates all token sequences into a flat buffer without padding and uses a block-diagonal attention mask to prevent cross-sequence attention. Each token attends only to tokens in its own sequence; the packed layout removes all padding overhead.

Chunked prefill addresses what Sarathi-Serve, a 2024 paper from Microsoft Research and CMU, quantified as the prefill stall. A long prompt processed in a single forward pass takes tens to hundreds of milliseconds. Every decoding sequence in the batch is blocked during that time, increasing their per-token latency. The fix is to split prefill requests into chunks of at most m tokens and interleave those chunks with decode steps. This bounds the maximum stall any decoding sequence can experience from a newly arriving prefill. Time-to-first-token increases slightly for the chunked request since its prompt processes over multiple iterations, but tail latency for active sequences drops substantially.

The Systems Framing

The HuggingFace blog treats continuous batching as an inference engineering technique, which it is. The deeper observation is that everything in this stack has an OS counterpart. Iteration-level scheduling is preemptive multitasking. PagedAttention is demand paging with a page table. Prefix caching is shared memory segments for common data. Copy-on-write on diverging beam search branches is copy-on-write fork semantics. Chunked prefill is a time-slice bound on long-running tasks.

None of these parallels are accidental. LLM serving is a resource management problem: a GPU is a scarce resource shared among many concurrent tenants with heterogeneous request durations and memory needs. Operating systems spent decades developing the abstractions for exactly this kind of problem. The LLM serving literature largely rediscovered them under different names.

The HuggingFace blog is worth reading for how it derives the padding formula and the memory budget scheduling approach from first principles. Pair it with the Orca paper and the vLLM paper for the experimental validation, and the full picture comes together as something that any systems programmer will find immediately recognizable.

Was this interesting?