
The Straggler Problem: How Iteration-Level Scheduling Fixed LLM Serving

Source: huggingface

The key insight behind continuous batching is simple enough to state in one sentence: you should not wait for the slowest sequence in a batch to finish before processing new requests. Everything that follows is about what it actually takes to act on that insight. The HuggingFace post from November 2025 builds the concept from first principles, and it is worth using that as a starting point to trace the full chain of problems and solutions that followed.

Static Batching and the Straggler Problem

In standard neural network inference, batching is straightforward. Group N inputs, run them through the model together, return N outputs. This works well for image classifiers or BERT-style encoders where every input produces exactly one output of a fixed size. Autoregressive language models break this assumption in two ways.

First, each inference step produces one token, and the model must be called again to produce the next one. A 500-token completion requires 500 sequential forward passes. Second, different sequences produce different numbers of tokens, and you cannot know in advance how many steps any particular request will need.

Static batching handles this by treating the batch as a unit: all requests enter together, all run until the longest sequence is done, and all exit together. The shorter sequences finish early and then sit idle, holding their GPU memory allocation until the last sequence generates its stop token. If 15 out of 16 requests finish at token 100 but one runs to token 1,000, all 16 slots are locked for the entire 1,000-step run. The throughput for those 15 completed sequences is zero for the remaining 900 steps.
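A quick way to quantify the waste is to count slot-steps: batch slots multiplied by decode steps. A minimal sketch, using the 16-request example above (the helper name is illustrative):

```python
def static_batch_waste(output_lengths):
    """Fraction of decode slot-steps wasted when a static batch
    waits for its longest sequence to finish."""
    steps = max(output_lengths)                  # batch runs until the straggler stops
    total_slots = steps * len(output_lengths)    # slot-steps the GPU is committed to
    useful = sum(output_lengths)                 # slot-steps that produce tokens
    return 1 - useful / total_slots

# 15 requests finish at token 100, one runs to 1,000
lengths = [100] * 15 + [1000]
print(f"{static_batch_waste(lengths):.0%}")  # → 84% of slot-steps idle
```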

The problem compounds when you consider memory. To handle the worst-case output length, you must pre-allocate KV cache memory for every sequence up to the maximum sequence length. The KV cache stores the key and value tensors from every attention layer for every token processed so far. For a model with L layers, H attention heads, head dimension D, and batch size N at maximum length T:

memory = 2 × N × T × L × H × D × bytes_per_element

For LLaMA 2 70B (80 layers, 64 heads, head dimension 128), a batch of 16 sequences at 4096 max length in fp16 requires roughly 170 GB for the cache alone. The model weights need another 140 GB. You are reserving all of that up front regardless of how long each sequence actually runs.
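Plugging the formula into code makes the arithmetic easy to check (the function name is illustrative):

```python
def kv_cache_bytes(batch, max_len, layers, heads, head_dim, bytes_per_elem=2):
    # leading 2 accounts for storing both keys and values
    return 2 * batch * max_len * layers * heads * head_dim * bytes_per_elem

# LLaMA 2 70B: 80 layers, 64 heads, head dim 128, fp16 (2 bytes per element)
b = kv_cache_bytes(batch=16, max_len=4096, layers=80, heads=64, head_dim=128)
print(f"{b / 1e9:.0f} GB")  # → 172 GB
```

This assumes one KV head per attention head; LLaMA 2 70B actually uses grouped-query attention with 8 KV heads, which divides the cache size by 8. Either way, the reservation is made up front.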

Iteration-Level Scheduling

The paper that changed this was Orca: A Distributed Serving System for Transformer-Based Generative Models, published at OSDI 2022 by Yu et al. from Seoul National University. The core contribution was scheduling at the iteration level rather than the request level.

After each decode step, the serving system inspects which sequences just produced a stop token. Those sequences are immediately evicted from the batch. New requests from the waiting queue are inserted to fill those slots. The batch changes composition at every step.
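In sketch form, the per-iteration loop looks like this (the model interface and request fields are hypothetical, not Orca's actual API):

```python
from collections import deque

def serve(model, waiting: deque, max_batch: int):
    """Iteration-level scheduling: the batch changes composition every step."""
    running = []
    while running or waiting:
        # fill free slots from the waiting queue before each step
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # one decode iteration for the whole batch
        next_tokens = model.step(running)
        for req, tok in zip(running, next_tokens):
            req.tokens.append(tok)
        # evict finished sequences immediately; their slots free up now,
        # not when the whole batch drains
        running = [r for r in running if r.tokens[-1] != r.eos]
```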

Orca reported throughput improvements of 10x to 36x over FasterTransformer, the dominant serving framework at the time. The specific gains depended on workload variance: high variance in output lengths is exactly where static batching suffers most and where iteration-level scheduling gains the most.

PagedAttention: Making Dynamic Batching Memory-Efficient

Continuous batching exposed a problem that static batching had simply avoided through fixed allocations. When sequences enter and leave the batch at every step, you need to allocate and free KV cache memory dynamically.

The naive approach is to pre-allocate a contiguous block per sequence sized to the maximum possible output length. This wastes memory for short sequences and causes fragmentation: freed blocks from short sequences are rarely the right size for incoming ones.

vLLM, published at SOSP 2023 by Kwon et al. from Berkeley, solved this with PagedAttention. The idea is borrowed from OS virtual memory. Rather than one large contiguous block per sequence, vLLM divides the KV cache into fixed-size pages (16 tokens per page by default). Each sequence gets a page table mapping logical token positions to physical pages. Pages are allocated on demand and freed immediately when a sequence finishes.
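Allocation under this scheme reduces to a free list of fixed-size pages plus a per-sequence page table. A minimal sketch (class and method names are illustrative):

```python
class PageAllocator:
    """On-demand paging for per-sequence KV cache block tables."""
    def __init__(self, num_pages, block_size=16):
        self.block_size = block_size
        self.free_pages = list(range(num_pages))
        self.tables = {}   # seq_id -> list of physical page ids
        self.lengths = {}  # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:             # crossing a block boundary
            table.append(self.free_pages.pop())  # any free page will do
        self.lengths[seq_id] = n + 1

    def free(self, seq_id):
        # pages return to the pool the moment a sequence finishes
        self.free_pages.extend(self.tables.pop(seq_id))
        del self.lengths[seq_id]
```

Because any free page will do, there is no requirement that a sequence's pages be contiguous, which is what eliminates the fragmentation problem.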

# Conceptual illustration of PagedAttention's key lookup for one sequence
def paged_attention(query, key_cache, value_cache, block_table, num_tokens, block_size):
    # key_cache, value_cache: [num_blocks, block_size, num_heads, head_dim]
    # block_table: logical_block_idx -> physical_block_idx for this sequence
    num_logical_blocks = (num_tokens + block_size - 1) // block_size
    for logical_block_idx in range(num_logical_blocks):
        physical_block = block_table[logical_block_idx]
        k = key_cache[physical_block]   # gather from a non-contiguous location
        v = value_cache[physical_block]
        # attend query to these k, v with a custom CUDA kernel

The custom CUDA kernels are the hard part. Standard attention assumes the KV tensors for a given sequence are stored in contiguous memory. PagedAttention requires kernels that gather from arbitrary physical locations at runtime. vLLM measured less than 4% memory waste under realistic workloads, compared to 60-80% waste with static allocation.

Prefill and Decode Are Different Problems

A request’s life has two distinct phases. During prefill, the entire prompt is processed in one forward pass, computing attention across all input tokens in parallel. This phase is compute-bound: the GPU’s arithmetic units are the bottleneck. During decode, the model generates tokens one at a time, and each step reads the full KV cache for all prior tokens. This phase is memory-bandwidth-bound: the GPU’s memory bus is the bottleneck.
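A rough roofline calculation makes the distinction concrete: in fp16, each weight read from memory (2 bytes) yields about 2 FLOPs per token sharing the forward pass, so arithmetic intensity scales with the number of tokens processed together. The GPU figures below are round A100-class numbers, used here as assumptions:

```python
def arith_intensity(tokens_per_pass, flops_per_weight=2, bytes_per_weight=2):
    # FLOPs performed per byte of weight traffic in one forward pass
    return tokens_per_pass * flops_per_weight / bytes_per_weight

# assumed A100-class balance point: ~312 TFLOPS fp16 / ~2 TB/s HBM
balance = 312e12 / 2e12  # ≈ 156 FLOP/byte

print(arith_intensity(1))     # decode, 1 token:     1 FLOP/byte -> bandwidth-bound
print(arith_intensity(4096))  # prefill, 4096 tokens: well past the balance point
```

Decode at batch size 1 sits two orders of magnitude below the balance point, which is why batching many decode sequences together is the main lever for decode throughput.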

These phases interact badly in a continuous batching system. When a long prompt is being prefilled, it consumes a disproportionate share of the forward pass compute, stalling decode steps for other in-flight sequences. If you prioritize decode throughput and deprioritize prefill, new requests wait longer before their first token appears, increasing time-to-first-token.

Chunked prefill addresses this by breaking prefill computation into fixed-size chunks. Instead of processing a 4,096-token prompt in one massive forward pass, the system processes 512 tokens per step and interleaves each chunk with decode steps for other sequences. This smooths the compute demand and reduces the latency spikes that long prompts cause for in-flight requests.
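One way to sketch the scheduling decision (chunk size and request fields are illustrative; real schedulers also enforce a total token budget per step):

```python
def schedule_step(prefill_queue, decode_batch, chunk_size=512):
    """Build the workload for one forward pass: at most one prefill chunk,
    interleaved with one decode token per in-flight sequence."""
    work = [(seq, 1) for seq in decode_batch]   # each decoder contributes 1 token
    if prefill_queue:
        req = prefill_queue[0]
        chunk = min(chunk_size, req.remaining)  # never more than one chunk per step
        work.append((req, chunk))
        req.remaining -= chunk
        if req.remaining == 0:
            prefill_queue.pop(0)        # prompt fully prefilled;
            decode_batch.append(req)    # the request now joins the decode batch
    return work
```

A 4,096-token prompt thus spreads across eight steps instead of monopolizing one, bounding the stall it can impose on in-flight decodes.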

HuggingFace’s Text Generation Inference (TGI) adopted continuous batching early and implemented chunked prefill as part of its scheduler. What was a research contribution in 2022 is now the default behavior in every serious serving framework: vLLM, TGI, SGLang, and others all build on iteration-level scheduling as their foundation.

Disaggregation as the Logical Conclusion

If prefill and decode have fundamentally different compute characteristics, they might be better served by different hardware. Prefill benefits from high arithmetic throughput. Decode benefits from high memory bandwidth and the ability to batch many sequences simultaneously. Running both on the same GPU is a compromise that optimizes for neither.

Disaggregated serving separates these onto distinct GPU pools. A prefill cluster processes incoming prompts and populates the KV cache; the filled cache is transferred to a decode cluster that handles token generation. Mooncake from Moonshot AI demonstrated this at scale, reporting meaningful improvements in time-to-first-token and overall throughput by allocating resources proportional to the actual demand from each phase.

SGLang contributes RadixAttention, which caches KV state across requests sharing a common prefix. If many requests use the same system prompt, the KV cache for that prefix is computed once and reused, reducing the marginal compute cost for each new request to just its unique input tokens.
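The mechanism can be sketched as a trie keyed on token ids. This is a simplified sketch; the real structure also compresses paths and evicts entries under memory pressure:

```python
class PrefixCache:
    """Trie over token ids; a node can hold cached KV state for the
    prefix ending there (represented here by an opaque handle)."""
    def __init__(self):
        self.root = {}

    def insert(self, tokens, kv_handle):
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})
        node["kv"] = kv_handle   # cache KV state for this exact prefix

    def longest_prefix(self, tokens):
        # return (matched_length, kv_handle) for the longest cached prefix,
        # so only tokens[matched_length:] need to be prefilled
        node, best = self.root, (0, None)
        for i, t in enumerate(tokens):
            if t not in node:
                break
            node = node[t]
            if "kv" in node:
                best = (i + 1, node["kv"])
        return best
```

A shared system prompt inserted once then matches every request that begins with it, and prefill cost drops to the suffix that follows.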

What Each Fix Revealed

This is a common pattern in systems work: the first fix is obvious in retrospect, and it always makes the next bottleneck visible.

Static batching wasted throughput on idle sequences. Continuous batching fixed that and revealed memory fragmentation. PagedAttention fixed fragmentation and revealed prefill-decode interference. Chunked prefill mitigated interference but left hardware underutilized in mixed workloads. Disaggregation allows each phase to run on hardware matched to its actual requirements, but introduces KV cache transfer costs and scheduling complexity across pools.

The serving stack for LLMs is several layers deep into this process. The HuggingFace article builds continuous batching from scratch using clean abstractions, and that is genuinely useful for understanding why iteration-level scheduling is the right primitive. But the implementation path from that abstraction to a production system running at scale passes through all of the above, and each layer is non-trivial to get right.

The bottlenecks have not run out. Speculative decoding, which uses a small draft model to propose tokens verified by the main model in parallel, introduces its own scheduling complexity: you now have variable-length verify steps that interact with continuous batching in non-obvious ways. Mixture-of-experts models add routing decisions that affect which experts’ weights need to be in cache. The fundamental insight from Orca remains sound; the problems it surfaces keep compounding.
