The problem with naive LLM serving is structurally identical to a problem operating systems solved decades ago. You have a shared resource under contention from multiple consumers, each of whom needs it for a different and unpredictable amount of time. Early OS schedulers handled this by running each process to completion before scheduling the next one, which meant short jobs waited behind long ones and the CPU sat idle when a process blocked. Modern schedulers preempt processes at fixed time intervals and rotate through the queue. LLM serving arrived at the same conclusion by a different path.
Naive LLM serving collects a batch of requests, pads them all to the same token length, and runs every decode step on the full batch until every sequence has generated its final token. A request that generates 10 tokens holds its slot until the 500-token request in the same batch finishes. As sequences complete, the effective batch shrinks. GPU utilization falls in a staircase to zero, then the cycle restarts. The waste is measurable and direct.
The fix is iteration-level scheduling, now broadly called continuous batching. A November 2025 HuggingFace post walks through the implementation from first principles, covering KV caching, chunked prefill, and ragged batching as the three mechanisms. Those details are worth understanding, but to see why the abstraction is the right one and what it costs, it helps to know where it came from and what it created downstream.
The Orca Paper
The formal origin is the Orca paper (Yu et al., OSDI 2022, Seoul National University and FriendliAI). It framed LLM serving as a scheduling problem and proposed iteration-level scheduling as the solution.
Orca’s key observation is that transformer inference has two structurally different phases. The prefill phase processes the full prompt in a single forward pass: O(n^2 * d) compute, highly parallelizable. The decode phase generates one token at a time: O(n * d) per step, memory-bandwidth-bound, and requiring many passes. These phases have different resource profiles and wildly different completion times across requests, which makes static batching a poor fit.
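The compute-versus-bandwidth split can be made concrete with a rough roofline-style sketch. Under the simplifying assumption that a forward pass is dominated by one read of the fp16 weights, arithmetic intensity reduces to tokens processed per pass (the 7B parameter count and function name are illustrative):

```python
def arithmetic_intensity(tokens_per_pass: int) -> float:
    """FLOPs per byte moved, under the rough model that each pass
    performs 2 * params FLOPs per token and reads the fp16 weights once."""
    params = 7e9                            # illustrative 7B-parameter model
    flops = 2 * params * tokens_per_pass    # one multiply-add per weight per token
    bytes_moved = 2 * params                # fp16 weights read once per pass
    return flops / bytes_moved

prefill = arithmetic_intensity(2048)   # whole prompt in one pass
decode = arithmetic_intensity(1)       # one token per pass
```

The parameter count cancels: intensity equals tokens per pass, which is why a full-prompt prefill can saturate compute while single-token decode is starved by memory bandwidth.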
The scheduler Orca proposed is straightforward in concept. At the boundary of every forward pass, check which sequences generated an end-of-sequence token. Remove them. Fill their slots with new requests from the queue. Proceed. The batch is never locked in for the full duration of any request; it is continuously refreshed at the finest granularity the system supports.
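In code, the core loop is short. A minimal sketch, with a toy token generator standing in for a real forward pass (all names are hypothetical):

```python
from collections import deque

EOS = -1  # hypothetical end-of-sequence token id

def fake_decode_step(batch):
    """Stand-in for one forward pass: appends one token per sequence."""
    for seq in batch:
        done = len(seq["tokens"]) >= seq["target_len"]
        seq["tokens"].append(EOS if done else 0)

def serve(requests, max_batch=4):
    """Iteration-level scheduling: refresh the batch after every pass."""
    queue, batch, finished = deque(requests), [], []
    while queue or batch:
        # Fill free slots with waiting requests before each pass.
        while queue and len(batch) < max_batch:
            batch.append(queue.popleft())
        fake_decode_step(batch)
        # Retire sequences that emitted EOS; their slots free up immediately.
        finished.extend(s for s in batch if s["tokens"][-1] == EOS)
        batch = [s for s in batch if s["tokens"][-1] != EOS]
    return finished

reqs = [{"id": i, "tokens": [], "target_len": n} for i, n in enumerate([2, 10, 3, 7, 1])]
done = serve(reqs)  # short requests finish without waiting on the 10-token one
```

The point of the sketch is the placement of the retire-and-refill logic: it runs at every pass boundary, not at batch completion.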
Against FasterTransformer, NVIDIA’s static batching baseline for GPT-3-class models, Orca reported up to 36.9x higher throughput. That number is large because the baseline is genuinely bad when output lengths vary, which they always do in production. The gains scale directly with output length variance.
The Three Mechanisms
The HuggingFace post identifies three components that make iteration-level scheduling work in practice. They compose as a stack rather than functioning independently.
KV caching is the foundation. Without it, every decode step would recompute attention over the full context from scratch, making compute cost O(n^2) per token generated. KV caching stores the Key and Value projections from all previous tokens so each new decode step only computes the projections for the single new token and attends over the cache. The cost is memory. For Llama-2-7B, each token occupies roughly 512 KB of KV cache across 32 layers (2 for K and V x 32 layers x 32 heads x 128 head dim x 2 bytes per float16). For a 4096-token context that is 2 GB per sequence. KV cache memory becomes the binding constraint on batch size.
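Evaluating the parenthetical formula directly shows where the memory goes (defaults taken from Llama-2-7B's published configuration: 32 layers, 32 heads, head dimension 128, float16):

```python
def kv_cache_bytes_per_token(n_layers=32, n_heads=32, head_dim=128, dtype_bytes=2):
    # K and V each store n_heads * head_dim values per layer.
    return 2 * n_layers * n_heads * head_dim * dtype_bytes

per_token = kv_cache_bytes_per_token()   # 524288 bytes = 512 KB per token
per_seq = per_token * 4096               # 2 GiB for a full 4096-token context
```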
Ragged batching eliminates padding waste. Naive batching pads all sequences to the length of the longest one in the batch. With one 500-token sequence and seven 50-token sequences, you compute on 3,150 tokens that contribute nothing to the output. Ragged batching concatenates sequences into a single flat tensor with a block-diagonal attention mask where each sequence attends only to its own tokens. Padding waste drops to zero at the cost of more complex attention kernels.
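The mask construction can be sketched in a few lines (pure Python for clarity; real implementations build the masking logic into fused attention kernels rather than materializing a dense boolean matrix):

```python
def block_diagonal_mask(seq_lens):
    """Causal block-diagonal mask for a ragged batch packed into one flat tensor.

    Position i may attend to position j only if both belong to the same
    sequence and j <= i (causal)."""
    total = sum(seq_lens)
    mask = [[False] * total for _ in range(total)]
    start = 0
    for n in seq_lens:
        for i in range(n):
            for j in range(i + 1):  # self and earlier tokens in the same sequence
                mask[start + i][start + j] = True
        start += n
    return mask

m = block_diagonal_mask([3, 2])  # 5x5 mask: a 3x3 and a 2x2 causal block
```

Cross-sequence entries stay False, so concatenated sequences cannot attend to each other despite sharing one tensor.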
Chunked prefill manages the interference between new requests and in-flight decodes. A long prompt processed as a single prefill pass blocks all currently-decoding sequences for its full duration. Chunked prefill splits the prompt into fixed-size pieces, one chunk per iteration, which bounds the latency hit on decoding sequences. The individual request takes more passes to complete prefill, but no single pass monopolizes the batch.
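A sketch of the resulting schedule, assuming each iteration carries one prefill chunk plus one decode token for every in-flight sequence (function and field names are hypothetical):

```python
def chunked_prefill_schedule(prompt_len, chunk_size, n_decoding):
    """Interleave prefill chunks with decode steps for sequences already running.

    Returns the per-iteration work list: each iteration carries one prefill
    chunk plus one decode token per in-flight sequence."""
    iterations = []
    done = 0
    while done < prompt_len:
        size = min(chunk_size, prompt_len - done)
        iterations.append({"prefill_tokens": size, "decode_tokens": n_decoding})
        done += size
    return iterations

sched = chunked_prefill_schedule(prompt_len=1000, chunk_size=256, n_decoding=8)
# 4 iterations; the decoders emit a token in each one instead of stalling
# for the duration of a single 1000-token prefill pass.
```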
The Memory Problem Continuous Batching Creates
Iteration-level scheduling surfaces a new problem. If the batch composition changes every forward pass, you need a KV cache allocator that handles sequences of varying and unpredictable lengths efficiently. Naive contiguous allocation fragments badly: the researchers behind vLLM (Kwon et al., SOSP 2023) found that 60 to 80 percent of KV cache memory was wasted due to fragmentation and over-reservation under contiguous allocation.
Their solution, PagedAttention, applies virtual memory paging to the KV cache. Physical memory is divided into fixed-size blocks, each holding key and value vectors for a fixed token count, typically 16 tokens per block. Sequences get a logical page table mapping to physical blocks, allocated on demand as the sequence grows. Fragmentation drops below 4 percent. The same block structure enables prefix sharing: two requests with an identical system prompt can reference the same physical KV cache blocks read-only, halving memory use for that prefix.
Sequence A (length 7, block_size=4):
Logical block 0 -> Physical block 7 (tokens 0-3)
Logical block 1 -> Physical block 2 (tokens 4-6, partially filled)
Sequence B (shares system prompt with A):
Logical block 0 -> Physical block 7 (same physical block, read-only)
Logical block 1 -> Physical block 9 (diverges here)
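The bookkeeping behind that diagram can be sketched as a block allocator with reference counts (a toy sketch, not vLLM's actual block manager):

```python
BLOCK_SIZE = 16  # tokens per physical block, vLLM's typical default

class BlockAllocator:
    """Minimal paged KV-cache bookkeeping: physical blocks are handed out
    on demand and reference-counted so a block holding a shared prefix
    can be mapped read-only by multiple sequences."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def allocate(self):
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        self.refcount[block] += 1   # another sequence maps the same block

    def release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.free.append(block)  # reclaimed only when no sequence maps it

alloc = BlockAllocator(num_blocks=64)
prefix_block = alloc.allocate()        # holds the shared system prompt's KV
seq_a = [prefix_block, alloc.allocate()]
alloc.share(prefix_block)              # sequence B reuses it read-only
seq_b = [prefix_block, alloc.allocate()]
```

Releasing one sequence's reference leaves the shared block live for the other, which is what makes prefix sharing safe under continuous batching's constant churn.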
Serving LLaMA-13B on an A100, vLLM achieved 23 to 24x throughput improvement over HuggingFace Transformers and 3.5x over TGI. Numbers vary with output length distribution and hardware configuration, but the order-of-magnitude gains over static batching are consistent across independent benchmarks.
The Tension Continuous Batching Does Not Resolve
Iteration-level scheduling solves GPU idleness but introduces a different conflict. Prefill is compute-bound. Decode is memory-bandwidth-bound. Running them together in the same batch means neither achieves optimal utilization. A long prefill increases time-to-first-token for all in-flight decoding sequences. A large running batch increases inter-token latency for all sequences by increasing memory traffic per step. Chunked prefill softens the prefill problem without eliminating the underlying resource conflict.
The field is moving toward prefill-decode disaggregation: separate GPU pools for each phase, each optimized for its resource profile. DistServe (UC San Diego, 2024) and Splitwise (Microsoft Research, 2024) both proposed this architecture. Prefill machines process prompts and transfer KV cache to decode machines over RDMA. Each pool is sized independently for the actual workload distribution. The phases no longer compete for batch slots or GPU resources, at the cost of network transfer overhead and cluster heterogeneity.
Stanford’s SGLang (2024) took a complementary approach. Its RadixAttention maintains a radix tree over token sequences, generalizing KV cache reuse from a single static system prompt to any shared prefix structure across requests, including multi-turn conversation histories and branching agentic calls. In agentic workloads where many requests overlap heavily, this substantially reduces KV cache pressure and indirectly eases the prefill-decode tension.
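The lookup idea can be sketched with a plain token trie (RadixAttention proper uses a compressed radix tree with cache eviction; names here are hypothetical):

```python
class TrieNode:
    __slots__ = ("children", "block")
    def __init__(self):
        self.children = {}
        self.block = None  # handle to cached KV for the token ending here

def insert(root, tokens):
    """Record a request's token sequence so later requests can reuse its KV."""
    node = root
    for t in tokens:
        node = node.children.setdefault(t, TrieNode())

def longest_cached_prefix(root, tokens):
    """Length of the longest prefix whose KV is already cached."""
    node, n = root, 0
    for t in tokens:
        if t not in node.children:
            break
        node = node.children[t]
        n += 1
    return n

root = TrieNode()
insert(root, [1, 2, 3, 4, 5])                     # first turn of a conversation
hit = longest_cached_prefix(root, [1, 2, 3, 9])   # new request diverges at token 9
```

Only the tokens past the matched prefix need fresh prefill; everything before the divergence point reuses cached KV blocks.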
The Current Landscape
vLLM is the default open-source serving stack, with TGI as the HuggingFace-native alternative. Both implement continuous batching with chunked prefill and paged KV cache management. NVIDIA’s TensorRT-LLM calls the same concept in-flight batching and provides highly optimized CUDA kernels at the cost of a considerably more complex build system. SGLang is competitive for structured generation and multi-turn agentic workloads. All of them treat iteration-level scheduling as a baseline assumption, not a differentiator.
Speculative decoding adds another layer of complexity to the scheduler. A small draft model proposes multiple tokens per step; the main model verifies them in a single pass. The scheduler has to account for variable token acceptance rates when planning batch composition, since accepted tokens extend sequence length unexpectedly and rejected ones require regeneration. vLLM, TGI, and TensorRT-LLM all added speculative decoding support in 2024, with varying approaches to integrating it with continuous batching.
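The verification step can be sketched for the greedy case, where draft tokens are accepted until the first disagreement with the main model (sampling-based speculative decoding uses an acceptance-probability test instead; this is the simpler variant):

```python
def verify(draft_tokens, target_tokens):
    """Greedy-verification sketch: accept draft tokens that match the main
    model's predictions; on the first mismatch, take the main model's token
    and discard the rest of the draft."""
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)  # correction from the main model
            break
    return accepted

out = verify(draft_tokens=[5, 8, 2, 9], target_tokens=[5, 8, 7, 1])
# accepts [5, 8], then emits the main model's 7: out == [5, 8, 7]
```

The number of tokens a sequence gains per step now ranges from one to the full draft length, which is exactly the variability the batch scheduler has to plan around.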
The scheduling insight from the 2022 Orca paper is now infrastructure. The questions being actively worked out are the second-order problems: how to disaggregate prefill and decode efficiently at cluster scale, how much prefix sharing can reduce KV cache memory demands in real workloads, and how schedulers should handle preemption when memory fills. The HuggingFace post describes the mechanism at the level you need to understand what existing systems are doing. The design space above those mechanisms is where the research is active, and continuous batching is the foundation everything else is built on.