Most write-ups on LLM inference stop at “continuous batching lets you add new requests between decode steps” and leave it there. The HuggingFace piece on continuous batching from first principles, originally published November 2025, does a good job building the intuition from scratch. But the scheduling insight is only half the story. Understanding why it required a memory management revolution to actually work at scale is what makes the whole picture click.
The Problem with Static Batching
Before continuous batching, transformer serving systems used static batching: assemble a group of requests, pad them all to the longest sequence length, run the forward pass until every sequence in the batch finishes, then take the next batch. The implementation is straightforward and the GPU sees large, dense matrix operations.
The problem is structural. When one request in your batch needs 20 tokens and another needs 800, the 20-token request finishes and its GPU slot sits idle for the rest of the batch. The system waits for the longest sequence before processing anything new. This is head-of-line blocking at the batch level, and in production workloads where output lengths vary by orders of magnitude, it destroys utilization.
Padding makes it worse. Every sequence is extended to the length of the longest one, so compute cycles are spent on tokens that contribute nothing. GPU memory is allocated for the maximum possible output length upfront, reserving space for outputs that may never materialize.
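The waste is easy to quantify. A sketch with illustrative output lengths (the numbers are made up, but the spread is typical of production traffic):

```python
# Illustrative output lengths for one static batch; the GPU runs the whole
# batch until the longest sequence finishes.
output_lens = [20, 150, 800, 45]

max_len = max(output_lens)              # batch runs until the longest finishes
scheduled = max_len * len(output_lens)  # decode slots the GPU actually executes
useful = sum(output_lens)               # slots that produce real tokens
waste = 1 - useful / scheduled

print(f"{scheduled} slots scheduled, {useful} useful, {waste:.0%} wasted")
# → 3200 slots scheduled, 1015 useful, 68% wasted
```

With one 800-token straggler, two thirds of the batch's decode slots do nothing, before counting any padding in the prompt dimension.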
The Orca Insight
The 2022 OSDI paper “Orca: A Distributed Serving System for Transformer-Based Generative Models” introduced what they called iteration-level scheduling. The idea is that a transformer forward pass generates exactly one new token per sequence in the batch. After each iteration, you can check which sequences have finished (emitted EOS or hit max length), evict those from the batch, and immediately admit new waiting requests. The batch composition changes after every single forward pass rather than being fixed for the duration of a generation run.
This changes the fundamental resource model. Instead of allocating a batch slot that stays occupied until the slowest sequence finishes, you fill the GPU continuously with work. Short requests complete quickly and free their slots; long requests run as long as they need without blocking short ones. Orca reported up to 36.9x throughput improvement over FasterTransformer on OPT-13B at equivalent latency service-level objectives.
The scheduling logic itself is not complicated:
after each forward pass:
    for each sequence in batch:
        if sequence.is_finished():
            free sequence's resources
            move to completed queue
    while waiting_queue and resources_available:
        admit next request into batch
    run next forward pass
What makes this hard in practice is the phrase “free sequence’s resources.” Those resources are the KV cache, and managing them efficiently turned out to be a separate, substantial problem.
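As a toy illustration of the scheduling half (nothing like vLLM's actual code), the loop above can be made runnable in a few lines: each request is just a count of tokens left to generate, and the "GPU" is a fixed number of batch slots.

```python
from collections import deque

# Toy continuous-batching loop: each request is a count of tokens still to
# generate; the GPU has a fixed number of batch slots. All names and numbers
# are illustrative.
def serve(requests, max_batch_size):
    waiting = deque(requests)      # output lengths of pending requests
    running = []                   # remaining tokens per active sequence
    iterations = 0
    while waiting or running:
        # Admit new work whenever a slot is free (the Orca insight).
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        # One forward pass: every active sequence emits one token.
        running = [r - 1 for r in running]
        iterations += 1
        # Evict finished sequences, freeing their slots immediately.
        running = [r for r in running if r > 0]
    return iterations

# Short requests slot in behind the long one instead of waiting for it.
print(serve([20, 800, 20, 20], max_batch_size=2))  # → 800
```

Static batching at the same batch size would need 800 + 20 = 820 iterations for this workload ({20, 800} then {20, 20}); iteration-level scheduling hides the short requests entirely inside the long one's runtime.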
The Memory Fragmentation Problem
During a transformer forward pass, each layer computes Key and Value matrices for every token in the context. These are cached so that on subsequent decode steps, only the new token's K/V needs to be computed. For LLaMA-2-13B, the cached K and V vectors across all 40 layers come to about 800KB per token in fp16, so a single 4096-token sequence needs roughly 3.4GB. An A100 with 80GB of HBM holds about 26GB of model weights in fp16, leaving around 54GB for KV cache, which translates to a hard ceiling of roughly sixteen full-context sequences in flight simultaneously.
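The arithmetic is worth doing once. A back-of-envelope sizing under the model shapes above (40 layers, hidden size 5120, fp16, no grouped-query attention):

```python
# Back-of-envelope KV cache sizing for LLaMA-2-13B
# (40 layers, hidden size 5120, fp16 = 2 bytes, no grouped-query attention).
layers, hidden, dtype_bytes = 40, 5120, 2

per_token = 2 * layers * hidden * dtype_bytes  # K and V, every layer
per_seq = per_token * 4096                     # full 4096-token context

hbm = 80 * 10**9                               # A100 80GB
weights = 13 * 10**9 * dtype_bytes             # 13B params in fp16
budget = hbm - weights                         # left over for KV cache

print(f"{per_token / 1024:.0f} KB/token, {per_seq / 1e9:.1f} GB/seq, "
      f"max {budget // per_seq} full-context sequences")
# → 800 KB/token, 3.4 GB/seq, max 16 full-context sequences
```

Sixteen concurrent sequences is the best case; any fragmentation eats directly into that number.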
The fragmentation problem is this: if you allocate a contiguous memory block for each sequence’s KV cache, you have to choose the size upfront. Allocate for the maximum possible output length and you waste memory on sequences that finish early. Allocate on demand and you end up with external fragmentation as sequences of different lengths are freed and reallocated. The vLLM team measured that systems before their work wasted 60 to 80 percent of KV cache memory to fragmentation.
Continuous batching raises the stakes because you are constantly admitting and evicting sequences. A system that wastes most of its KV cache memory hits its batch ceiling far below the hardware limit, negating much of the throughput gain from iteration-level scheduling.
PagedAttention: The Missing Piece
The vLLM paper (SOSP 2023) solved this with PagedAttention. Borrowing from virtual memory, it stores KV cache in non-contiguous blocks, each holding K and V vectors for a fixed number of tokens (16 tokens per block by default). A sequence’s KV cache is a list of block pointers rather than a contiguous buffer. Memory is allocated one block at a time as tokens are generated, and freed one block at a time when a sequence finishes.
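The bookkeeping can be sketched in a few lines (illustrative only, not vLLM's BlockManager): a free list of physical block ids plus a per-sequence block table mapping logical block index to physical block.

```python
BLOCK_SIZE = 16  # tokens per KV block, vLLM's default

class PagedKVAllocator:
    """Toy paged KV allocator: a free list of physical block ids and a
    per-sequence block table. Illustrative, not vLLM's BlockManager."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.tables = {}              # seq_id -> list of physical block ids

    def append_token(self, seq_id, pos):
        # A new physical block is needed only every BLOCK_SIZE tokens.
        table = self.tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:
            table.append(self.free.pop())
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE  # (block, offset)

    def free_seq(self, seq_id):
        # Blocks return to the pool individually; no contiguity needed.
        self.free.extend(self.tables.pop(seq_id))

alloc = PagedKVAllocator(num_blocks=64)
for pos in range(40):                  # a 40-token sequence
    block, offset = alloc.append_token("req-0", pos)
print(alloc.tables["req-0"])           # three blocks cover 40 tokens
```

Internal waste is bounded by one partially filled block per sequence, at most 15 tokens here, and external fragmentation disappears because every allocation is exactly one block.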
The attention kernel is modified to do indirect addressing through a block table, looking up physical GPU memory locations from logical token positions. The overhead is small; the memory waste drops to under 4 percent. The practical consequence is that vLLM can run batch sizes close to the theoretical maximum given available HBM, rather than a fraction of it. They reported 2 to 24x throughput improvement over static-batching HuggingFace Transformers, with larger gains on longer sequences where fragmentation was previously worst.
PagedAttention also enables copy-on-write for beam search and parallel sampling: sequences sharing the same prompt share KV blocks until they diverge, at which point blocks are copied. This substantially reduces memory for sampling workloads without any changes to the scheduling logic.
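The copy-on-write mechanics reduce to reference counts on blocks. A sketch (again illustrative, with data movement elided): forking a sequence shares its blocks and bumps each count; a write to a shared block first copies it.

```python
# Sketch of copy-on-write KV blocks via reference counting (illustrative).
class CowBlocks:
    def __init__(self):
        self.next_id = 0
        self.refcount = {}            # physical block id -> number of sharers

    def alloc(self):
        bid, self.next_id = self.next_id, self.next_id + 1
        self.refcount[bid] = 1
        return bid

    def fork(self, table):
        # Parallel samples over one prompt share every prompt block.
        for bid in table:
            self.refcount[bid] += 1
        return list(table)

    def write(self, table, i):
        # Copy-on-write: a shared block is duplicated before mutation.
        if self.refcount[table[i]] > 1:
            self.refcount[table[i]] -= 1
            table[i] = self.alloc()   # fresh private copy (data copy elided)
        return table[i]

pool = CowBlocks()
prompt = [pool.alloc(), pool.alloc()]              # two prompt blocks
sample_a, sample_b = pool.fork(prompt), pool.fork(prompt)
pool.write(sample_a, 1)                            # sample_a diverges here
print(sample_a, sample_b)                          # → [0, 2] [0, 1]
```

Two parallel samples over a 1000-token prompt pay for the prompt's KV blocks once instead of twice, which is where the memory savings for sampling workloads come from.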
What the Scheduler Actually Does Each Iteration
vLLM’s scheduler (vllm/core/scheduler.py) maintains three queues: waiting for requests whose prefill has not started, running for active sequences, and swapped for sequences whose KV blocks have been evicted to CPU memory under memory pressure. Each iteration it runs roughly as follows:
# 1. Try to run all running sequences one more decode step.
#    Preempt lowest-priority sequences if memory is tight.
for seq_group in running_queue:
    if not can_append_slots(seq_group):
        victim = running_queue.pop()   # lowest priority
        preempt(victim)                # swap to CPU or discard and recompute

# 2. Promote swapped sequences if memory has freed up.
for seq_group in swapped_queue:
    if free_blocks >= needed_blocks:
        swap_in(seq_group)

# 3. Admit new requests from waiting queue.
while waiting_queue and budget_remaining:
    candidate = waiting_queue.peek()
    if free_blocks >= ceil(candidate.prompt_len / block_size):
        admit(candidate)
The preemption policy is a tradeoff: swap KV blocks to CPU RAM (expensive if done frequently, but preserves work) or discard and recompute later (cheap in memory, but pays the prefill cost again). vLLM defaults to recomputation for requests with a single sequence and falls back to swapping for sequence groups that share blocks, such as beam search, where recomputation is not straightforward. HuggingFace's Text Generation Inference uses a token budget constraint, the sum of input plus generated tokens across all sequences in the batch, to bound memory usage and handle mixed prefill-decode batches.
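A token-budget admission check of the kind TGI uses can be sketched as follows (the dict fields and function name are illustrative, not TGI's API):

```python
# TGI-style token-budget admission check (sketch; names are illustrative).
def can_admit(batch, candidate, token_budget):
    # The budget counts input + generated tokens across the whole batch,
    # plus the candidate's input and worst-case output.
    used = sum(s["input_len"] + s["generated"] for s in batch)
    needed = candidate["input_len"] + candidate["max_new_tokens"]
    return used + needed <= token_budget

batch = [{"input_len": 512, "generated": 40},
         {"input_len": 100, "generated": 900}]
candidate = {"input_len": 1000, "max_new_tokens": 200}
print(can_admit(batch, candidate, token_budget=4096))  # → True: 1552 + 1200 <= 4096
```

Because the budget reserves the candidate's full `max_new_tokens` upfront, this scheme trades some admission headroom for a guarantee that admitted sequences never run out of memory mid-generation.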
The Prefill-Decode Asymmetry
One issue that iteration-level scheduling surfaces is a latency spike when new requests are admitted. Prefilling a 1000-token prompt is compute-intensive and takes roughly as long as generating 10 to 20 tokens. When a prefill is inserted into a running batch, it adds latency to every other sequence in the batch for that iteration.
Sarathi-Serve (2023) addressed this with chunked prefill: split a long prompt into fixed-size chunks and process one chunk per iteration alongside decode tokens. This smooths out latency spikes at the cost of slightly higher time-to-first-token for the new request. vLLM adopted the technique behind its --enable-chunked-prefill flag, and it has since become default behavior for latency-sensitive production deployments.
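The batch-composition rule is simple to sketch (illustrative, not Sarathi-Serve's scheduler): each iteration packs the decode tokens first, then fills the remainder of a fixed token budget with the next chunk of any pending prefill.

```python
# Sketch of chunked-prefill batch composition (illustrative numbers):
# decode tokens get priority, prefill fills the leftover token budget.
def plan_iteration(num_decode_seqs, prefill_remaining, token_budget):
    decode_tokens = num_decode_seqs          # one token per running sequence
    chunk = min(prefill_remaining, token_budget - decode_tokens)
    return decode_tokens, chunk

remaining = 1000                             # new request's prompt length
steps = 0
while remaining > 0:
    _, chunk = plan_iteration(num_decode_seqs=48, prefill_remaining=remaining,
                              token_budget=512)
    remaining -= chunk
    steps += 1
print(steps)  # → 3: the 1000-token prefill is spread over ceil(1000 / 464) iterations
```

Instead of one iteration that stalls 48 decoding sequences behind a 1000-token prefill, three iterations each carry a bounded 464-token chunk, so per-token decode latency stays nearly flat.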
Where the Frontier Has Moved
The latest direction is disaggregated serving: rather than mixing prefill and decode on the same GPU, route them to separate hardware pools. Prefill is compute-bound and benefits from dense tensor cores; decode is memory-bandwidth-bound and benefits from wide memory buses. Running both on the same hardware means each phase underserves the other’s bottleneck. Papers from 2024, including DistServe (OSDI 2024) and Splitwise (ISCA 2024), show meaningful goodput improvements by separating the two phases and transferring KV cache over NVLink or InfiniBand between clusters.
This is the natural endpoint of the logic that started with Orca. Once you decompose generation into scheduling units at the iteration level, it becomes clear that prefill and decode are fundamentally different workloads, and the same hardware optimization profile cannot serve both optimally. The scheduling insight opened a door; continuous batching, PagedAttention, chunked prefill, and disaggregated serving are successive steps through it.