
How Iteration-Level Scheduling Unlocked LLM Throughput

Source: huggingface

The naive approach to batched LLM inference groups requests together, pads them to a uniform length, and waits for every sequence in the batch to finish before returning results. It works, and it wastes a substantial fraction of GPU compute on padding while doing so. A HuggingFace technical post from November 2025 builds continuous batching from attention mechanics upward, and it is worth reading on its own terms. What I want to do here is look at the same technique from a different direction: as a scheduling problem whose shape is determined entirely by the constraints the KV cache and autoregressive decoding impose.

Why Static Batching Breaks Under Real Workloads

GPU compute is most efficient with uniform tensor shapes. Static batching satisfies this by padding all sequences to the same length, so every forward pass operates on tensors of equal dimensions. The hidden cost is straightforward: any padding token without real input behind it is wasted computation, and the waste compounds as the batch grows.

The more damaging problem is temporal. Sequences in the same batch finish generating at different times, because prompts vary in length and outputs vary in how many tokens the model produces before emitting an end-of-sequence token. When fast sequences finish while slow ones are still running, the GPU keeps computing forward passes on a partially drained batch. Utilization degrades without any observable signal until you start measuring tokens-per-second against a clock.

With dynamic scheduling, where new requests join the batch as old ones finish, the padding cost has a specific shape. If a new prompt of n tokens joins a batch whose other B − 1 sequences are each decoding one token per step, equalizing lengths requires (n − 1)(B − 1) padding tokens. For a batch of eight sequences receiving a 100-token prompt, that is 693 padding tokens per forward pass, before the new prompt generates a single output token. The cost grows with the product of prompt length and batch size: linear in each, but multiplied together on every forward pass until the prompt is absorbed.
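To make the arithmetic concrete, here is a minimal sketch of that padding count (the helper name is illustrative, not from any serving library):

```python
def padding_tokens(batch_size: int, prompt_len: int) -> int:
    """Padding needed to merge a new prompt into a running batch of
    batch_size sequences: the other batch_size - 1 sequences each
    contribute one decode token per step, so each must be padded with
    prompt_len - 1 extra positions to match the new prompt's length."""
    return (prompt_len - 1) * (batch_size - 1)

# The example from the text: a batch of eight, 100-token prompt.
print(padding_tokens(8, 100))  # → 693 wasted positions per forward pass
```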

The KV Cache as the Foundation

Continuous batching depends entirely on the KV cache, so it helps to be concrete about what the cache is and why it exists before looking at how scheduling uses it.

Attention computes softmax(QK^T / sqrt(d)) V, where Q, K, and V are projections of the input sequence. The QK^T term costs O(n^2 d) operations for a sequence of length n. During autoregressive decoding, where the model generates one token at a time, recomputing this from scratch at every step would make long sequences extremely expensive. The fix is to cache the K and V projections from all previous tokens and only compute attention for the new token, reducing per-step complexity from O(n^2) to O(n).
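A toy single-head decode loop shows the mechanic. This is a sketch, not a real model: the K, V, and Q vectors are random stand-ins for learned projections, and the cache is a plain list rather than a per-layer tensor.

```python
import math
import random

def attend(q, K, V):
    """Single-head attention: one query vector against cached keys/values."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]  # softmax over cached positions
    # Weighted sum of cached value vectors
    return [sum(w * v[i] for w, v in zip(weights, V)) for i in range(d)]

random.seed(0)
d = 4
K_cache, V_cache = [], []

# Autoregressive decode: each step projects only the newest token, appends
# its K/V to the cache, and the new query attends over the full cache —
# O(t) work per step instead of recomputing the O(t^2) attention matrix.
for step in range(5):
    k_new = [random.gauss(0, 1) for _ in range(d)]
    v_new = [random.gauss(0, 1) for _ in range(d)]
    q_new = [random.gauss(0, 1) for _ in range(d)]
    K_cache.append(k_new)
    V_cache.append(v_new)
    out = attend(q_new, K_cache, V_cache)

print(len(K_cache))  # → 5: one cached K vector per generated token
```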

The memory cost is the other side of this tradeoff. For a model with L layers, H attention heads, and head dimension A, each token requires 2 × L × H × A cached values: a key and a value vector at every layer. For Llama-2-7B, with 32 layers, 32 heads, and a head dimension of 128, that works out to approximately 512 KB per token in float16. Across a context window of several thousand tokens, the cache consumes a meaningful fraction of GPU memory. This memory budget is the primary resource the continuous batching scheduler has to manage, because every active sequence holds a slice of it for as long as it is generating output.
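The budget arithmetic is worth doing once by hand. A quick calculation, using Llama-2-7B's published shape (the function name is my own):

```python
def kv_bytes_per_token(layers: int, heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """KV cache footprint of one token: a key and a value vector
    (heads * head_dim values each) at every layer."""
    return 2 * layers * heads * head_dim * bytes_per_value

# Llama-2-7B: 32 layers, 32 heads, head dimension 128, float16 (2 bytes)
per_token = kv_bytes_per_token(32, 32, 128)
print(per_token // 1024, "KiB")         # → 512 KiB per token
print(4096 * per_token / 2**30, "GiB")  # a full 4096-token context → 2.0 GiB
```

Two gigabytes for a single full-length sequence is why the scheduler treats cache memory, not batch slots, as the scarce resource.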

Chunked Prefill: Fitting Long Inputs Into a Fixed Budget

Before the scheduling loop can operate, there is a sub-problem: what happens when an incoming prompt is larger than the available memory budget for a single forward pass?

The answer is chunked prefill. You split the prompt into chunks of size m, process each chunk in a separate forward pass, and carry the accumulated KV cache states into the next chunk. A prompt of length n requires ⌈n/m⌉ passes to fully prefill. The attention mask for each subsequent chunk is adjusted so that tokens in the new chunk can attend to all previously cached tokens, while maintaining the causal property that no token sees forward into positions it has not yet reached.
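The control flow is simple enough to sketch. This toy version stands in a list for the per-layer KV tensors and a trivial function for the model's forward pass; the names are illustrative:

```python
import math

def chunked_prefill(prompt_tokens, chunk_size, process_chunk):
    """Prefill a long prompt in ceil(n / m) passes, carrying the
    accumulated cache between passes. process_chunk(chunk, cache)
    appends the chunk's K/V states to the cache."""
    cache = []
    n_passes = 0
    for start in range(0, len(prompt_tokens), chunk_size):
        process_chunk(prompt_tokens[start:start + chunk_size], cache)
        n_passes += 1
    assert n_passes == math.ceil(len(prompt_tokens) / chunk_size)
    return cache

# Toy "forward pass": each chunk's tokens can attend to everything
# already in the cache; here we just append their states.
def toy_pass(chunk, cache):
    cache.extend(chunk)

cache = chunked_prefill(list(range(10)), chunk_size=4, process_chunk=toy_pass)
print(len(cache))  # → 10 cached positions, built in ceil(10/4) = 3 passes
```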

This makes arbitrarily long prompts tractable within a fixed memory budget, at the cost of more forward passes per prefill. It is structurally similar to how a database query planner breaks a large sort or hash join into passes that fit within a working memory limit, carrying intermediate state forward between passes rather than materializing the full result at once.

Ragged Batching: Getting Rid of Padding

Traditional batching requires uniform tensor shapes because GPUs expect them. Ragged batching sidesteps this by concatenating sequences together into a single packed sequence and using attention masks to isolate prompts from each other.

The attention mask is a boolean matrix where entry (i, j) is True if token j is allowed to influence token i. For a single prompt, this produces a causal lower-triangular mask: each token attends only to previous tokens in the same sequence. For a ragged batch of multiple prompts concatenated together, each prompt gets its own causal mask block, and all cross-prompt interactions are set to False. The GPU sees a single long sequence; the attention pattern enforces complete isolation between prompts.
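The mask construction can be written in a few lines. A minimal sketch for a packed batch, assuming boolean masking as described above (real kernels build this as a tensor, not nested lists):

```python
def ragged_causal_mask(prompt_lengths):
    """Boolean mask for a packed sequence: entry [i][j] is True only when
    tokens i and j belong to the same prompt and j <= i (causal)."""
    owner = []  # map each packed position to the index of its prompt
    for p, length in enumerate(prompt_lengths):
        owner.extend([p] * length)
    n = len(owner)
    return [[owner[i] == owner[j] and j <= i for j in range(n)]
            for i in range(n)]

# Two prompts of lengths 2 and 3 packed into one 5-token sequence:
mask = ragged_causal_mask([2, 3])
for row in mask:
    print("".join("1" if x else "." for x in row))
# 1....
# 11...
# ..1..
# ..11.
# ..111
```

The result is two causal lower-triangular blocks along the diagonal with every cross-prompt entry False: the GPU processes one packed sequence, but no prompt can see another.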

The result is that every computation touches a real token. There are no padding positions, no wasted matrix multiplications, and no quadratic cost when new requests join a live batch. The tradeoff is that ragged tensors are dynamically shaped, which makes it harder to apply CUDA graphs or torch.compile, both of which expect static shapes to capture and replay computation graphs. Systems using continuous batching typically fall back to eager execution or use more intricate techniques to recover some of that performance, though the throughput gains from eliminating padding waste generally outweigh the loss.

The Scheduler Loop

With chunked prefill and ragged batching in place, the continuous batching loop becomes:

while requests_pending:
    batch = []

    # Decoding sequences always run: each contributes exactly one new token
    for seq in decode_phase:
        batch.append(seq)

    # Fill the remaining memory budget with prefill work
    remaining_budget = memory_budget - len(decode_phase)
    for req in prefill_queue:
        if remaining_budget <= 0:
            break
        chunk = req.next_chunk(max_tokens=remaining_budget)
        batch.append(chunk)
        remaining_budget -= len(chunk)

    outputs = model.forward(batch)

    # Retire completed sequences, admit new requests immediately
    for seq in list(decode_phase):  # snapshot: retiring mutates decode_phase
        if outputs[seq].token == EOS:
            retire(seq)
            admit_next_from_prefill_queue()

Decoding sequences always get priority, because each contributes exactly one token to the forward pass at minimal memory cost. They must make progress on every iteration or the system stalls. Prefill work fills the remaining capacity, chunked to fit the budget. When a sequence finishes, it is retired immediately and a new request takes its slot, so the batch stays full and GPU utilization stays high.

This is iteration-level scheduling, a term from the original Orca paper by Yu et al. (2022), which introduced fine-grained scheduling for transformer inference. The core observation in that paper is that the natural unit of scheduling for autoregressive models is not a complete request but a single forward pass. Operating system schedulers made an analogous transition from batch processing to preemptive time-slicing for the same reason: finer scheduling granularity yields better resource utilization when jobs have variable runtime, and the overhead of finer decisions is small relative to the savings.

What Production Systems Build On Top

This scheduling approach is the foundation of vLLM and Text Generation Inference, the two most widely deployed open-source LLM serving systems. Both maintain continuously updated batches at iteration granularity, which lets them serve thousands of concurrent requests on a single GPU node without the throughput collapse that static batching produces under heterogeneous load.

vLLM extends the approach with PagedAttention, which manages KV cache memory in non-contiguous pages borrowed from virtual memory systems in operating systems. This addresses a problem continuous batching exposes but does not solve: even with dynamic scheduling, long sequences can fragment the KV cache badly enough that the memory budget shrinks faster than you would expect from raw utilization numbers. PagedAttention is the natural next layer on top of continuous batching, and the HuggingFace post mentions it as a forthcoming topic in their series.

The Lesson in the Architecture

What makes continuous batching interesting from a systems design perspective is that it composes three techniques that are individually unremarkable. Caching computed results, processing large inputs in chunks, and packing variable-length data to eliminate padding are all standard practice in storage systems, network stacks, and query engines. The insight was in identifying which unit of granularity to apply them at. In transformer inference, that unit is the forward pass, not the request, and making that shift changes the scheduling problem from one with quadratic waste to one with near-constant utilization.

That kind of insight, where the right abstraction boundary was always present but required a specific context to become visible, shows up often enough in systems work to be worth noticing when it appears.
