
The Scheduling Problem at the Heart of LLM Inference

Source: huggingface

A retrospective piece published on HuggingFace in November 2025 walks through continuous batching from first principles, covering KV caching, chunked prefill, and ragged batching as a coherent unit. It is a good explanation of the mechanics. What it understandably compresses is the history: why these techniques exist in the order they do, and how each one was made necessary by the failure mode of the previous approach.

The story starts with a straightforward problem.

The Cost of Thinking in Requests

Before 2022, most LLM serving systems used what is now called static batching or request-level batching. The model receives a fixed batch of requests, processes them together from the first prompt token to the last output token, and only then accepts new requests. This maps cleanly onto how neural network training works, and it simplifies the implementation considerably.

The problem is that language model inference is not symmetric. Different requests generate different numbers of output tokens, and you cannot know in advance how many tokens any given request will produce. When you batch eight requests together and one of them outputs 600 tokens while the other seven finish at 50, those seven slots sit idle for the remaining 550 decode steps, waiting for the longest sequence to terminate.
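The arithmetic behind that example can be checked in a few lines. This is a minimal sketch: `static_batch_utilization` is a hypothetical helper name, and it counts decode slot-steps only, ignoring prefill cost.

```python
def static_batch_utilization(output_lengths):
    """Fraction of decode slot-steps that produce a token under static batching.

    Assumes every slot is held until the longest request in the batch finishes.
    """
    steps = max(output_lengths)               # the batch runs until the longest request ends
    total_slot_steps = steps * len(output_lengths)
    useful_slot_steps = sum(output_lengths)   # each request is useful only for its own length
    return useful_slot_steps / total_slot_steps


# Seven requests stop at 50 tokens, one runs to 600.
lengths = [50] * 7 + [600]
utilization = static_batch_utilization(lengths)
print(f"{utilization:.1%} of slot-steps do useful work")
```

For this batch, about one slot-step in five produces a token; the rest is spent waiting on the longest sequence.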

This is head-of-line blocking applied to transformer inference. The practical consequence is that GPU compute utilization in static batching systems commonly sits at 10 to 40 percent for realistic chat workloads. The hardware is paid for, powered on, and doing nothing useful for the majority of its time.

The Orca paper, published at OSDI 2022 by researchers at Seoul National University and FriendliAI, named the problem and the fix. The key insight was simple to state: schedule at the granularity of a single forward pass rather than at the granularity of a full request. Every iteration, the scheduler checks which sequences have finished, removes them from the batch, and promotes waiting sequences into the freed slots. Completed sequences release their compute resources immediately instead of holding them until the slowest sibling finishes.

Orca called this iteration-level scheduling. Against FasterTransformer, the dominant static batching framework of the time, the paper reported a 36.9x throughput improvement at the same latency level on a GPT-3 175B model.
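The difference between the two scheduling granularities can be illustrated with a toy simulation. This is a sketch under simplifying assumptions, not Orca's scheduler: every forward pass is treated as one unit of time, prefill is ignored, and the function names and workload are made up for illustration.

```python
from collections import deque


def static_batching_steps(output_lengths, slots):
    """Decode steps needed when each batch runs until its longest request ends."""
    steps = 0
    for start in range(0, len(output_lengths), slots):
        steps += max(output_lengths[start:start + slots])
    return steps


def iteration_level_steps(output_lengths, slots):
    """Decode steps needed when finished requests are replaced every iteration."""
    waiting = deque(output_lengths)
    running = []  # remaining tokens for each active request
    steps = 0
    while waiting or running:
        # Promote waiting requests into any free slots before the forward pass.
        while waiting and len(running) < slots:
            running.append(waiting.popleft())
        # One forward pass: every running request emits one token.
        running = [remaining - 1 for remaining in running]
        # Drop finished requests so their slots are free on the next iteration.
        running = [remaining for remaining in running if remaining > 0]
        steps += 1
    return steps


# Hypothetical workload: two long requests among many short ones, 4 slots.
workload = [600, 50, 50, 50, 50, 50, 50, 600, 50, 50, 50, 50]
static_steps = static_batching_steps(workload, slots=4)
continuous_steps = iteration_level_steps(workload, slots=4)
print(static_steps, continuous_steps)
```

With this workload the iteration-level scheduler finishes in a little over half the steps, because short requests cycle through the slots while the two long ones keep running.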

What Iteration-Level Scheduling Does to Tensor Shapes

The scheduling insight is easy to understand. The implementation challenge it creates is less obvious.

Standard batched transformer inference assumes all sequences in a batch have the same length. This makes the tensor math clean: you have a batch dimension, a sequence length dimension, and a model dimension, and they are all uniform. When sequences have different lengths, you pad the shorter ones with dummy tokens to match the longest. This padding is wasteful, but it keeps the tensor shapes regular.

Continuous batching breaks this assumption in two ways. First, sequences join and leave the batch at different points in their lifecycles, so the batch composition changes every step. Second, a batch might contain some sequences in the prefill phase (processing their prompt) and others in the decode phase (generating one token at a time), and these have fundamentally different computational shapes.

The solution is to remove the batch dimension entirely and concatenate all sequences into a single token stream. A batch of three sequences with lengths 4, 2, and 6 becomes a tensor of 12 tokens. Attention masks prevent tokens from different sequences from attending to each other. This is what the HuggingFace article calls ragged batching, and it is the same approach Orca described as selective batching: flatten all tokens across all sequences for the linear projection layers, then handle attention per-sequence using a cumulative sequence length array to track boundaries.
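A minimal sketch of the packing step, in plain Python rather than tensors. The helper names are hypothetical, the token ids are placeholders, and the mask assumes causal (decoder-only) attention.

```python
from itertools import accumulate


def pack_sequences(sequences):
    """Concatenate token lists and record cumulative boundaries."""
    tokens = [tok for seq in sequences for tok in seq]
    # cu_seqlens[i] is the packed offset where sequence i starts;
    # the last entry is the total token count.
    cu_seqlens = [0] + list(accumulate(len(seq) for seq in sequences))
    return tokens, cu_seqlens


def block_causal_mask(cu_seqlens):
    """mask[q][k] is True when packed query q may attend to packed key k."""
    total = cu_seqlens[-1]
    # Map each packed position to the index of the sequence that owns it.
    owner = []
    for seq_id in range(len(cu_seqlens) - 1):
        owner.extend([seq_id] * (cu_seqlens[seq_id + 1] - cu_seqlens[seq_id]))
    # Allowed only within the same sequence, and only looking backwards.
    return [[owner[q] == owner[k] and k <= q for k in range(total)]
            for q in range(total)]


# Three sequences of lengths 4, 2, and 6.
seqs = [[11, 12, 13, 14], [21, 22], [31, 32, 33, 34, 35, 36]]
tokens, cu_seqlens = pack_sequences(seqs)
mask = block_causal_mask(cu_seqlens)
print(len(tokens), cu_seqlens)
```

The dense mask is only for illustration. Production kernels avoid materializing it and work from the boundary array directly.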

FlashAttention’s variable-length attention kernel (exposed in Python as flash_attn_varlen_func) operationalizes this in practice. TGI uses it directly. The kernel takes the concatenated token tensor and a cu_seqlens array encoding where each sequence starts and ends, then computes attention correctly across the packed representation without any padding overhead.
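What such a kernel computes can be written as a slow reference in pure Python. This is a sketch of the semantics only, not the fused kernel: it assumes a single head, causal self-attention, and queries and keys of equal length within each segment (the prefill case). Both function names are hypothetical.

```python
import math


def causal_attention(q, k, v):
    """Single-head causal attention over one sequence of row vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, q_i in enumerate(q):
        # Position i attends to keys 0..i only.
        scores = [scale * sum(a * b for a, b in zip(q_i, k[j])) for j in range(i + 1)]
        peak = max(scores)                       # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        norm = sum(weights)
        out.append([sum(w * v[j][d] for j, w in enumerate(weights)) / norm
                    for d in range(len(v[0]))])
    return out


def varlen_attention_reference(q, k, v, cu_seqlens):
    """Apply causal attention independently to each packed segment."""
    out = []
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        out.extend(causal_attention(q[start:end], k[start:end], v[start:end]))
    return out


# Three packed tokens: one sequence of length 2, one of length 1.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = varlen_attention_reference(q, k, v, cu_seqlens=[0, 2, 3])
print(out[2])  # the lone token of the second sequence sees only itself
```

Because the third token starts a new segment, its output is its own value vector, untouched by the two tokens packed before it.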

The Memory Problem That Emerged

Iteration-level scheduling increases throughput by keeping more sequences running concurrently. More concurrent sequences means more KV cache memory, and this created a new bottleneck that the Orca paper did not fully resolve.

The KV cache stores the key and value projections for every token in every layer. For a model like LLaMA-2-7B with 32 layers, 32 attention heads, and a head dimension of 128, the KV cache costs 16 kilobytes per token per layer in float16, or 512 kilobytes per token across all 32 layers. A sequence capped at 2048 tokens therefore requires about 1 gigabyte of KV cache allocation. On a 40-gigabyte A100, that means roughly 37 concurrent sequences at most, and that is before accounting for the model weights themselves.
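The sizing can be reproduced directly. The per-layer cost is 16 KiB per token; summed over all 32 layers it comes to 512 KiB per token, which is what makes a 2048-token sequence cost about 1 GiB. The sketch below assumes full multi-head attention (no grouped-query sharing) and treats the A100's 40 GB as decimal gigabytes; the function name is hypothetical.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache that one token occupies across all layers."""
    # The factor of 2 covers the separate key and value tensors.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


per_layer = kv_cache_bytes_per_token(num_layers=1, num_kv_heads=32, head_dim=128)
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
per_sequence = per_token * 2048        # sequence capped at 2048 tokens
a100_bytes = 40 * 10**9                # 40 GB, decimal gigabytes assumed
max_sequences = a100_bytes // per_sequence

print(per_layer // 1024, "KiB per token per layer")
print(per_token // 1024, "KiB per token across all layers")
print(per_sequence / 2**30, "GiB per 2048-token sequence")
print(max_sequences, "sequences if the whole GPU held only KV cache")
```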

The deeper problem is fragmentation. If you allocate KV cache at maximum sequence length upfront, and most sequences finish at a few hundred tokens, you have reserved gigabytes of memory that will never be used. The vLLM authors measured that existing systems used only about 20 to 40 percent of their KV cache memory for actual token state, which puts the waste at 60 to 80 percent. Less available memory means smaller effective batch sizes, which directly limits how much continuous batching can help.

The vLLM paper from SOSP 2023 addressed this with PagedAttention. The idea borrows from OS virtual memory: divide the KV cache into fixed-size blocks, each holding a small number of tokens (16 by default in vLLM), and allocate blocks on demand as sequences generate tokens. A page table maps each sequence’s logical token positions to physical blocks scattered across GPU memory. When a sequence finishes, its blocks are immediately returned to a free pool.
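The bookkeeping can be sketched as a toy allocator. This is an illustration of the idea, not vLLM's implementation: the class and method names are hypothetical, and it tracks block ownership only, without storing any actual key or value tensors.

```python
class PagedKVAllocator:
    """Toy block allocator: logical token positions map to physical blocks."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}                      # seq_id -> list of physical block ids
        self.num_tokens = {}                        # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve space for one more token, allocating a block only when needed."""
        count = self.num_tokens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if count == len(table) * self.block_size:   # every allocated block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.num_tokens[seq_id] = count + 1

    def physical_slot(self, seq_id, position):
        """Translate a logical token position into (physical block, offset)."""
        if not 0 <= position < self.num_tokens[seq_id]:
            raise IndexError(position)
        block = self.block_tables[seq_id][position // self.block_size]
        return block, position % self.block_size

    def free(self, seq_id):
        """Return every block of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id))
        del self.num_tokens[seq_id]


alloc = PagedKVAllocator(num_blocks=8)
for _ in range(20):
    alloc.append_token("a")    # 20 tokens need two 16-token blocks
for _ in range(5):
    alloc.append_token("b")    # 5 tokens need one block
print(alloc.block_tables, len(alloc.free_blocks))
```

The physical block ids assigned to a sequence need not be contiguous, which is the point: any free block can serve any sequence.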

With PagedAttention, memory waste drops below 4 percent, because only the last partially-filled block per sequence is wasted. The practical result is that 2 to 4 times as many sequences fit in the same GPU memory, which compounds directly with the throughput gains from iteration-level scheduling. vLLM benchmarked at up to 24x higher throughput than unoptimized HuggingFace Transformers serving on LLaMA-7B and LLaMA-13B.
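The gap between the two allocation strategies follows from simple counting. The sketch below uses made-up sequence lengths and hypothetical function names; it measures reserved-but-unused token slots once every sequence has finished.

```python
import math


def preallocated_waste(lengths, max_len):
    """Fraction of reserved slots left unused when every sequence reserves max_len."""
    reserved = max_len * len(lengths)
    return 1 - sum(lengths) / reserved


def paged_waste(lengths, block_size=16):
    """Fraction of reserved slots left unused when blocks are allocated on demand."""
    # Each sequence rounds up to a whole number of blocks, so it wastes
    # fewer than block_size slots regardless of its length.
    reserved = sum(math.ceil(n / block_size) * block_size for n in lengths)
    return 1 - sum(lengths) / reserved


# Hypothetical final lengths (prompt plus output) for six sequences.
lengths = [130, 260, 310, 420, 515, 900]
waste_static = preallocated_waste(lengths, max_len=2048)
waste_paged = paged_waste(lengths)
print(f"preallocated at 2048 tokens: {waste_static:.1%}")
print(f"paged, 16-token blocks: {waste_paged:.1%}")
```

The paged figure is bounded by one partial block per sequence, so it shrinks further as sequences get longer.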

The Remaining Problem: Prefill Interference

Even with continuous batching and PagedAttention working together, there is a third problem that only becomes apparent at scale. Prefill and decode are computationally asymmetric.

Prefill processes all prompt tokens in a single forward pass. For a 32,000-token prompt, the attention cost of that pass grows quadratically with the sequence length, and the pass can take several hundred milliseconds or more. Decode generates one token per forward pass and takes perhaps 20 to 50 milliseconds. When a new long-prompt request enters the batch, its prefill step dominates the entire forward pass, stalling all the currently running decode sequences for the duration.

This produces jitter: decode sequences that were returning tokens steadily every 30ms suddenly pause for 500ms while a large prefill runs. From the perspective of users watching those responses stream, the effect is noticeable.

Chunked prefill, introduced in the Sarathi paper in 2023, developed further in Sarathi-Serve (OSDI 2024), and later added to vLLM in v0.4.0, addresses this by splitting long prefills across multiple iteration steps. Instead of processing a 32,000-token prompt in one pass, the scheduler processes, say, 2,048 tokens per step alongside the normal decode batch. The partial KV cache from each chunk accumulates across steps, and the sequence enters the decode phase only once the full prompt has been processed.
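The chunking itself is just a walk over the prompt in fixed-size ranges. A minimal sketch, with a hypothetical helper name, using the 32,000-token prompt and 2,048-token chunk from the example above:

```python
def prefill_chunks(prompt_len, chunk_size):
    """Yield (start, end) token ranges that cover a prompt in order."""
    for start in range(0, prompt_len, chunk_size):
        yield start, min(start + chunk_size, prompt_len)


chunks = list(prefill_chunks(prompt_len=32_000, chunk_size=2_048))
print(len(chunks), "steps; last chunk covers", chunks[-1])

# When chunk (start, end) runs, its queries attend to all `end` tokens seen
# so far: `start` of them come from the KV cache built by earlier chunks.
cached_before_last = chunks[-1][0]
```

Each chunk costs roughly as much as a moderately sized prefill, so running decodes see a bounded delay per step instead of one long stall.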

Sarathi-Serve’s evaluation centers on tail time-between-tokens latency, the metric these stalls inflate, rather than on time-to-first-token, and reports that chunking sustains higher request load under a fixed latency target than unchunked continuous batching. The tradeoff is scheduling complexity: the scheduler must track prefill progress per-sequence and interleave chunked prefill work against the decode budget every step.

In vLLM’s production configuration, you control this with --enable-chunked-prefill and --max-num-batched-tokens. The latter serves double duty as both the per-step token budget and the effective chunk size for prefill.
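How a single token budget can act as both the per-step cap and the prefill chunk size is easiest to see in a toy step planner. This is a sketch under stated assumptions, not vLLM's scheduler: `plan_step` is a hypothetical function, it gives decodes priority, and it hands whatever budget remains to pending prefills in insertion order.

```python
def plan_step(num_decoding, pending_prefills, token_budget):
    """Compose one forward pass under a shared token budget.

    pending_prefills maps request id -> prompt tokens not yet processed.
    Returns (decode_tokens, {request id: prefill tokens scheduled this step}).
    """
    # Decodes go first: each running sequence needs exactly one token slot.
    decode_tokens = min(num_decoding, token_budget)
    remaining = token_budget - decode_tokens
    scheduled = {}
    for req_id, left in pending_prefills.items():
        if remaining == 0:
            break
        take = min(left, remaining)   # the chunk is whatever budget is left
        scheduled[req_id] = take
        remaining -= take
    return decode_tokens, scheduled


# One 32,000-token prompt arrives while 48 sequences are decoding.
pending = {"long_prompt": 32_000}
steps = 0
while pending:
    _, scheduled = plan_step(num_decoding=48, pending_prefills=pending, token_budget=2_048)
    for req_id, take in scheduled.items():
        pending[req_id] -= take
        if pending[req_id] == 0:
            del pending[req_id]
    steps += 1
print(steps, "steps to finish the prefill alongside 48 decodes")
```

With 48 decode tokens taken out of a 2,048-token budget, each step advances the prefill by 2,000 tokens, so the effective chunk size shrinks as the decode batch grows.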

The Stack, Not the Technique

The HuggingFace article presents these mechanisms correctly as parts of a unified approach. What is worth emphasizing is that they were not designed together. Each one was forced into existence by the failure mode of the previous approach.

Static batching wasted GPU compute through head-of-line blocking. Iteration-level scheduling fixed that, but made tensor shapes irregular. Ragged batching handled irregular shapes, but required enough concurrent sequences to make the scheduling effective. Effective concurrency was limited by KV cache fragmentation. PagedAttention fixed fragmentation, which revealed that long-prompt prefill was now the remaining bottleneck. Chunked prefill addressed that.

At each layer, the fix worked by relaxing a constraint the previous layer had treated as given. The result in 2025 is that “continuous batching” in production systems like vLLM, TGI, and TensorRT-LLM refers to this entire stack operating together: iteration-level scheduling, ragged token batching, paged KV cache management, and chunked prefill. The HuggingFace article is a useful entry point into understanding why the pieces fit together the way they do.
