· 7 min read ·

How Continuous Batching Turned LLM Serving Into an OS Scheduling Problem

Source: huggingface

The transformer decode loop runs one step at a time. Each forward pass produces exactly one new token per sequence, and the process repeats until the sequence hits an end-of-sequence token or a length limit. This sequential, variable-length nature is what makes LLM serving fundamentally different from most inference workloads, and it is the root cause of why naive batching strategies fall apart under real load.

Static Batching and Its Costs

In a conventional deep learning inference server, you collect a set of requests into a batch, run a forward pass, and return results. This works acceptably when inputs and outputs have similar sizes. With language models, they rarely do.

A static batch waits until every sequence has finished generating before it can admit new requests. If you have eight sequences in a batch and seven finish after 50 tokens but one runs to 500 tokens, those seven slots sit idle for 450 decode steps. The GPU is reading KV cache memory and running attention for a single remaining sequence while seven slots deliver nothing.

Padding compounds this. GPU tensor operations require uniform shapes, so you pad every sequence in a batch to match the longest one. Short sequences carry dead weight through every forward pass. For a batch with one 500-token sequence and seven 50-token sequences, the majority of compute in each step touches padding.

For prompt processing (the prefill phase), static batching is tolerable because you process all input tokens in parallel. The real damage accumulates in decoding, which is sequential by nature and produces wildly variable output lengths across requests.

The Orca Insight: Schedule at the Iteration Level

The fix came from the Orca paper (OSDI 2022), by Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun at Seoul National University and FriendliAI. The paper introduced what they called iteration-level scheduling: instead of scheduling at the granularity of complete requests, schedule at the granularity of individual decode steps.

The key observation is that after each forward pass, the server knows exactly which sequences are complete. Rather than waiting for the full batch to drain, it can immediately remove finished sequences and admit new ones. The batch changes composition after every single step.

In rough pseudocode, the scheduler looks like this:

running = []
waiting = deque(incoming_requests)

while True:
    # One decode step for all running sequences
    outputs = model.forward(running)

    # Release finished sequences
    for seq in running[:]:
        if seq.is_done():
            running.remove(seq)
            deliver(seq.result())

    # Admit new sequences up to capacity
    while len(running) < MAX_BATCH_SIZE and waiting:
        seq = waiting.popleft()
        if has_kv_cache_capacity(seq):
            running.append(seq)

This is a scheduler in exactly the sense that an OS process scheduler is: a tight loop that decides which work to run next based on current resource availability. Idle slots get filled without ceremony. There is no fixed batch boundary to wait for.

The throughput gains are most pronounced when request lengths vary widely, which describes most real workloads. Orca reported order-of-magnitude throughput improvements over static FIFO batching in high-variance configurations. When all requests are exactly the same length, continuous batching offers no advantage, but uniform length distributions do not appear in practice.

KV Cache: The Resource the Scheduler Has to Manage

The scheduler’s most constrained resource is KV cache memory. Every token a transformer processes generates key and value tensors cached for future attention steps. These caches are large. For a 70B parameter model running at 16-bit precision, each token’s KV cache across all transformer layers typically runs to a few hundred kilobytes. A 4,096-token sequence in such a model can require over a gigabyte of KV state.

In static batching, you can preallocate KV cache for the maximum sequence length because you know the batch composition upfront. In continuous batching, sequences enter and leave asynchronously, and you generally do not know how long any given output will be. You have to allocate KV cache incrementally as tokens are generated and free it exactly when sequences complete.

The original Orca system handled this with a straightforward contiguous allocator. It worked, but it introduced fragmentation: freed regions of different sizes left gaps that could not be reused efficiently for new requests of different lengths. A long sequence might be blocked not because total free memory was insufficient but because no contiguous block large enough existed.

PagedAttention and the Virtual Memory Analogy

vLLM’s PagedAttention (SOSP 2023, Woosuk Kwon et al.) addressed fragmentation by applying an idea that operating systems solved decades ago. Instead of requiring contiguous physical memory for each sequence’s KV cache, PagedAttention divides the KV cache into fixed-size pages (commonly 16 tokens per page) and maintains a block table mapping logical positions in a sequence to physical pages.

Sequence position:  [0-15]    [16-31]   [32-47]   [48-63]
Block table:           ↓         ↓         ↓         ↓
Physical block:    [block 7] [block 2] [block 15] [block 31]

The physical blocks are non-contiguous. Fragmentation shrinks from O(sequence length) to O(block size). Freed blocks return to a pool and get reassigned to any new sequence regardless of their previous occupant. The vLLM team reported KV cache memory utilization in the 80-90% range with PagedAttention, versus the 20-40% typical of contiguous allocation under realistic request distributions.

PagedAttention also enables copy-on-write for beam search and parallel sampling. When multiple candidate sequences share a common prefix, they can share the physical pages for that prefix rather than duplicating them. This halves memory usage for two-beam search and scales further for larger beam widths and multi-sample generation.

The scheduler built on top of PagedAttention also handles preemption. If KV cache runs out while sequences are mid-generation, vLLM can swap lower-priority sequences out to CPU DRAM and resume them later, rather than dropping the requests entirely. This is again the same vocabulary as virtual memory: eviction, swap space, page faults, just applied to a different resource.

The Prefill/Decode Tension

Continuous batching exposed a tension that static batching had obscured: prefill and decode have fundamentally different computational profiles, and mixing them in the same batch creates interference.

Prefill is compute-bound. Processing a 2,000-token prompt runs attention over 2,000 tokens in parallel, saturating GPU arithmetic units. Modern H100 GPUs can process large prefills in tens of milliseconds because the parallelism matches what the hardware is designed for.

Decode is memory-bandwidth-bound. Each step processes a single new token but reads the entire KV cache for all running sequences. As sequences grow longer, decode steps take more time because more data moves between GPU high-bandwidth memory and the compute units. A 4,000-token sequence requires reading four times as much KV cache per decode step as a 1,000-token sequence.

When a large prefill request enters the batch alongside running decode sequences, it can delay the next decode step by a significant margin. From the perspective of a user mid-conversation, this shows up as a sudden stall between tokens, which is perceptible even when average throughput is high. The prefill computation monopolizes GPU arithmetic capacity for the duration of the forward pass.

The response to this is chunked prefill, which vLLM, TGI, and SGLang have all adopted. Instead of processing a full prompt in one forward pass, the scheduler splits it into chunks (typically 512 or 1,024 tokens) and interleaves those chunks with decode steps from running sequences:

Step N:   [prefill chunk 1/4 of req A] + [decode of req B, req C, req D]
Step N+1: [prefill chunk 2/4 of req A] + [decode of req B, req C, req D]
Step N+2: [prefill chunk 3/4 of req A] + [decode of req B, req C, req D]
Step N+3: [prefill chunk 4/4 of req A] + [decode of req B, req C, req D]
Step N+4: [decode of req A, req B, req C, req D]

This spreads the prefill cost across four steps rather than concentrating it in one, which smooths decode latency for existing sequences. The tradeoff is that req A’s time-to-first-token increases slightly, since it must wait through four interleaved steps before it begins decoding. Chunk size becomes a tunable parameter: larger chunks favor raw throughput, smaller chunks favor tail latency for sequences already running.

The HuggingFace Post in Context

The HuggingFace blog post on continuous batching, published in November 2025, offers a clean first-principles walkthrough of the core mechanics. Looking at it from early 2026, it captures a moment when the field had absorbed the Orca insight and incorporated it into every serious serving stack, with the scheduling refinements still being worked out.

The open frontier is disaggregated prefill and decode: running prefill on one pool of GPUs and decode on another, transferring KV cache tensors across high-speed interconnects like NVLink or InfiniBand between the two pools. This lets you scale compute capacity for prefill independently from memory bandwidth capacity for decode, which matters as context windows extend to hundreds of thousands of tokens and the two phases diverge further in their resource demands. Projects like Mooncake from Moonshot AI explore this architecture, and similar disaggregated designs appear in DeepSeek’s published infrastructure work.

Prefix caching adds another dimension. If multiple users send requests that share a long common prefix (a system prompt, a document being analyzed, a code file), the server can avoid recomputing KV cache for that prefix entirely. SGLang’s RadixAttention generalizes this to arbitrary tree-structured prefix sharing rather than requiring exact prefix matches, which matters for multi-turn conversations and agentic workloads where context evolves incrementally.

The Orca scheduler, refined through PagedAttention’s memory management and chunked prefill’s interference mitigation, is now table stakes for any production serving system. What remains is making the scheduler smarter about tradeoffs between time-to-first-token, throughput, tail latency, and memory efficiency as context windows grow and workload patterns diversify. The problem is still, at its core, the same kind of problem that operating systems have been refining for sixty years: who runs next, for how long, and on which resources.

Was this interesting?