LLM Inference as a Queueing Problem: The Theory Behind Continuous Batching
Source: huggingface
The Hugging Face writeup on continuous batching, originally published in November 2025, explains the mechanics clearly. What it leaves implicit is the theoretical frame: LLM inference is a queueing problem, and every optimization in the serving stack maps to a known result in queueing theory. That frame makes it easier to reason about tradeoffs that pure implementation descriptions leave opaque.
Requests Are Jobs with Unknown Service Times
A production LLM serving endpoint looks like this from a systems perspective: requests arrive according to some arrival process, each request requires a variable amount of compute proportional to prompt length plus output length, and GPU capacity is the shared resource. This is a G/G/1 queue in queueing theory notation: general arrival distribution, general service time distribution, one server.
The key property of LLM inference that makes scheduling hard is that service times are unknown at admission time. You know the prompt length, which determines prefill cost, but decode length is unknowable until the model emits an end-of-sequence token. This makes LLM serving similar to web request scheduling in the early 2000s, where response size was unknown at connection time, and it creates the same failure mode.
Head-of-Line Blocking
Static batching produces head-of-line blocking, a well-studied problem in packet switching and web serving. In a static batch, the slowest request determines when all GPU resources are released. A request that needs 500 more tokens keeps the whole batch alive while requests that finished after 10 tokens sit in slots that cannot be reclaimed. New work is blocked behind the long request even though the GPU could have admitted it.
This is precisely the same failure mode as HTTP/1.1 processing requests on a single connection in order. The solution in networking was HTTP/2 multiplexing and QUIC streams. The solution in LLM serving is continuous batching.
The Orca paper (Yu et al., OSDI 2022) named this iteration-level scheduling. The scheduler makes a new batching decision at every forward pass, not once per request group. Completed requests exit immediately; waiting requests enter at the next available slot. This is equivalent to what queueing theory calls a work-conserving scheduler: the server is never idle when there is work waiting, and no slot is held by a finished job.
Orca reported a 36.9x throughput improvement at the same 99th-percentile latency compared to static batching. The magnitude reflects how badly the baseline was broken, not a breakthrough in compute efficiency.
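A toy discrete-time simulation (hypothetical request lengths and batch size, not Orca's code) makes the work-conserving property concrete: the static scheduler holds every slot until the slowest request finishes, while the iteration-level scheduler refills freed slots on every step.

```python
# Toy simulation: static vs. iteration-level (continuous) batching.
# Each job needs `length` decode steps; the GPU runs at most B jobs per step.
# Lengths and B are illustrative; this is a sketch, not Orca's scheduler.

def static_batching(lengths, B):
    """Admit B jobs, run until ALL of them finish, then admit the next B."""
    steps = 0
    for i in range(0, len(lengths), B):
        steps += max(lengths[i:i + B])  # batch held until slowest job ends
    return steps

def continuous_batching(lengths, B):
    """Re-decide membership every step: finished jobs exit, waiting jobs enter."""
    pending = list(lengths)
    running, steps = [], 0
    while pending or running:
        while pending and len(running) < B:          # refill freed slots now
            running.append(pending.pop(0))
        running = [r - 1 for r in running if r > 1]  # one decode step each
        steps += 1
    return steps

lengths = [500, 10, 10, 10] * 4   # a few long jobs among many short ones
print(static_batching(lengths, B=4))      # slots sit idle behind long jobs
print(continuous_batching(lengths, B=4))  # work-conserving: no held slots
```

The static scheduler pays the 500-step cost four times over; the continuous scheduler's makespan approaches the work-conserving lower bound of total tokens divided by batch size.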
Little’s Law and the Throughput-Latency Tradeoff
Little’s Law states that in a stable queueing system, the average number of jobs in the system equals the arrival rate multiplied by the average time each job spends in the system:
L = λ × W
Where L is average concurrency (jobs in system), λ is throughput (jobs per second), and W is average latency per job. This identity holds for any stable queue regardless of arrival or service distributions.
For a fixed GPU memory budget, continuous batching raises the achievable L by eliminating memory wasted on padding and on finished requests that still hold their slots. The identity then tells you what that buys. At a fixed average time-in-system W, a higher sustainable L means a higher throughput λ = L / W. Equivalently, at a fixed arrival rate λ, requests that would have queued for a batch slot are admitted immediately, so the time each one spends in the system falls.
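A quick arithmetic check of the identity, with illustrative numbers rather than measurements:

```python
# Little's Law: L = lambda * W. Numbers are illustrative only.
lam = 10.0        # arrival rate the system absorbs: 10 requests/second
W = 2.0           # average time each request spends in the system: 2 s
L = lam * W       # average concurrency the system must sustain
assert L == 20.0

# If better memory packing lets the same GPU sustain twice the
# concurrency at the same per-request time W, throughput doubles:
L2 = 40.0
lam2 = L2 / W
assert lam2 == 20.0   # 20 requests/second at the same latency
```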
This is why the throughput improvement from continuous batching is real and not just a benchmark artifact. You are not squeezing more FLOPs from the hardware; you are eliminating artificial constraints on how many requests can occupy the system simultaneously.
The Utilization Cliff
Queueing theory also predicts a failure mode that practitioners hit quickly after deploying continuous batching. As server utilization approaches 100%, mean latency diverges. The classic result for M/M/1 queues is:
W = 1 / (μ - λ)
Where μ is the service rate and λ is the arrival rate. As λ approaches μ, W approaches infinity. In practice, latency grows superlinearly before that, and tail latency (p99, p999) spikes earlier.
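Plugging numbers into the formula (an illustrative service rate, not a measured one) shows how sharply mean latency diverges near saturation:

```python
# Mean time in system for an M/M/1 queue: W = 1 / (mu - lam).
# Service rate mu is fixed; watch W diverge as utilization rho -> 1.
mu = 100.0  # server completes 100 requests/second (illustrative)

for rho in (0.50, 0.80, 0.90, 0.95, 0.99):
    lam = rho * mu
    W = 1.0 / (mu - lam)
    print(f"rho={rho:.2f}  W={W * 1000:6.1f} ms")
# Going from 50% to 99% utilization multiplies mean latency by 50
# (20 ms -> 1000 ms); tail percentiles blow up even earlier.
```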
LLM serving has a harder version of this problem. The decode phase is memory-bandwidth-bound, not compute-bound. As the batch fills with in-progress decode requests, memory bandwidth becomes the bottleneck and per-token latency for all requests increases. Running at 95% GPU memory utilization does not mean 5% overhead; it can mean 2-3x higher token latency because the attention computation is bandwidth-saturated.
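A back-of-envelope sketch, with assumed hardware and model numbers (roughly A100-class HBM bandwidth and a 7B fp16 model), shows why the bottleneck shifts to KV-cache reads as the batch grows:

```python
# Why decode is memory-bandwidth-bound: assumed numbers, illustration only.
hbm_bandwidth = 2.0e12   # ~2 TB/s HBM bandwidth (A100 80GB class, assumed)
weight_bytes = 14e9      # 7B parameters in fp16

# Every decode step must stream the full weights once, regardless of
# batch size, so the per-step floor for a single request is:
step_time = weight_bytes / hbm_bandwidth   # ~7 ms per token
tokens_per_sec = 1.0 / step_time           # ~143 tok/s at batch size 1

# Batching amortizes weight traffic across requests -- until KV-cache
# reads, which grow with batch size and context length, saturate the
# same bandwidth. Batch size at which KV reads match weight traffic:
kv_bytes_per_token = 0.5e6   # ~0.5 MB of fp16 KV cache per token (7B)
context = 2048               # average in-flight sequence length (assumed)
break_even = weight_bytes / (kv_bytes_per_token * context)
print(round(break_even, 1))  # ~13.7: past this, KV reads dominate
```

Under these assumptions, a batch of just a dozen or so 2k-context requests already moves as many KV bytes per step as the weights themselves, which is why per-token latency for everyone rises as the batch fills.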
In vLLM, you can observe this empirically by watching the --gpu-memory-utilization target alongside the avg_generation_throughput metric. A common mistake is setting memory utilization to 0.95 or higher to maximize concurrency, then finding that p99 latency degrades badly under load. The system is operating past the knee of the utilization curve.
A more principled approach is to pick a latency SLO, say 50ms per output token, and find the maximum throughput that satisfies it under realistic load. This is standard capacity planning, but it is often skipped because benchmark results are measured at throughput maximums rather than SLO-bounded operating points.
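Using the crude M/M/1 formula as the capacity model (real planning would use measured latency curves, and the numbers here are illustrative), the SLO-bounded operating point falls out directly:

```python
# SLO-bounded capacity planning with the crude M/M/1 model:
# find the highest arrival rate whose mean latency stays under the SLO.
# W = 1/(mu - lam) <= SLO  =>  lam <= mu - 1/SLO.
mu = 100.0    # service rate, requests/second (illustrative)
slo = 0.050   # 50 ms mean-latency target

lam_max = mu - 1.0 / slo
print(lam_max)   # 80.0: operate at 80% utilization, not the throughput max
```

The point generalizes even though the formula does not: the SLO pins you to an operating point well below the raw throughput maximum, and benchmarks measured at the maximum overstate usable capacity.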
The Prefill Discontinuity
One thing the word “continuous” undersells is what happens during the prefill phase. A long prompt, say 10,000 tokens for a RAG pipeline retrieving several documents, requires one full forward pass with all 10,000 tokens. During that pass, every ongoing decode request is stalled. Time-to-first-token for the long-prompt request is slow; latency variance for all other users spikes.
Chunked prefill breaks the long prompt into chunks of a fixed size, typically 512 to 2048 tokens, and interleaves prefill chunks with decode steps across multiple iterations. This smooths the latency impact at the cost of slowing down the chunked request’s prefill completion. The tradeoff is configurable in vLLM via --enable-chunked-prefill and --max-num-batched-tokens.
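The chunking itself is simple; a sketch with the 10,000-token example (a hypothetical helper, not vLLM's implementation):

```python
def chunk_prompt(prompt_len, chunk_size):
    """Split a prompt into prefill chunks of at most chunk_size tokens."""
    full, rem = divmod(prompt_len, chunk_size)
    return [chunk_size] * full + ([rem] if rem else [])

# A 10,000-token prompt with 2,048-token chunks becomes five forward
# passes; in each one the scheduler also advances ongoing decodes, so no
# decode request stalls for the entire 10,000-token prefill.
chunks = chunk_prompt(10_000, 2_048)
print(chunks)   # [2048, 2048, 2048, 2048, 1808]
```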
The deeper fix, which is becoming common in production deployments, is disaggregated inference: separate pools of GPUs handle prefill and decode, each tuned for their respective workload. Prefill is compute-bound and benefits from high-FLOP cards; decode is memory-bandwidth-bound and benefits from high-HBM-bandwidth cards. DistServe (Zhong et al., 2024) implements this disaggregation and reports significantly better tail latency at equal throughput compared to co-located prefill and decode.
From a queueing theory perspective, this is pipeline decomposition: splitting a service with a heterogeneous resource bottleneck into stages that each optimize for a single resource constraint. It is the same reasoning that motivates separating CPU-bound and I/O-bound work into different thread pools in a web server.
PagedAttention and the Memory Queue
The KV cache adds a second queueing dimension that continuous batching alone does not address. Each in-flight request holds GPU memory proportional to the tokens it has generated so far; that memory cannot be released until the request completes. If the KV cache fills up, the scheduler must either pause new request admission or evict in-flight requests to CPU memory, both of which hurt throughput.
For a model like Llama-2-7B with 4096-token context, each in-flight request consumes roughly 2 GB of KV cache. A single A100 with 80 GB has enough space for around 24 concurrent requests at maximum context, after accounting for model weights.
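The arithmetic behind that roughly 2 GB figure, using standard Llama-2-7B dimensions in fp16:

```python
# KV cache size per request for Llama-2-7B at full 4096-token context.
layers, hidden, dtype_bytes = 32, 4096, 2   # fp16 = 2 bytes per element
# K and V each store one hidden-size vector per layer per token:
bytes_per_token = 2 * layers * hidden * dtype_bytes   # 524,288 B = 0.5 MB
per_request = bytes_per_token * 4096                  # full context window
print(per_request / 2**30)                            # 2.0 GiB exactly
```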
The vLLM team’s PagedAttention (Kwon et al., SOSP 2023) addresses fragmentation. Rather than pre-allocating contiguous memory for the maximum possible sequence length, it divides KV cache into fixed-size blocks (16 tokens per block is typical) and allocates them on demand. Completed requests release blocks immediately. vLLM measured 60-80% KV cache waste in contiguous-allocation systems; PagedAttention reduced internal fragmentation below 4%. The throughput improvement, 2-4x over TGI at the time, comes almost entirely from fitting more concurrent requests into the same memory budget.
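A toy allocator in the PagedAttention style (illustrative only, not vLLM's implementation) shows where the fragmentation win comes from: blocks are allocated only when a sequence crosses a block boundary, and a finished sequence returns all of its blocks immediately.

```python
# Toy block-based KV allocation: free list of fixed-size blocks plus a
# per-sequence block table mapping logical to physical blocks.
BLOCK_TOKENS = 16   # tokens per block, as in the text

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.tables = {}   # seq_id -> list of physical block ids

    def append_token(self, seq_id, pos):
        """Allocate a block only when the sequence crosses a boundary."""
        if pos % BLOCK_TOKENS == 0:
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        return self.tables[seq_id][-1]

    def release(self, seq_id):
        """A finished sequence returns all of its blocks immediately."""
        self.free.extend(self.tables.pop(seq_id))

alloc = BlockAllocator(num_blocks=1024)
for pos in range(40):              # 40 tokens -> ceil(40/16) = 3 blocks
    alloc.append_token("req-1", pos)
print(len(alloc.tables["req-1"]))  # 3: worst-case waste is < one block
alloc.release("req-1")
print(len(alloc.free))             # 1024: all blocks back in the pool
```

Internal fragmentation is bounded by one partially filled block per sequence, instead of the gap between actual and maximum sequence length under contiguous pre-allocation.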
Tuning for Your Workload
Workload distribution matters more than any single configuration parameter. A chatbot with short average output lengths operates differently from a code generation service where outputs can run thousands of tokens. For the chatbot workload, throughput is typically limited by prefill capacity; many short decode sequences complete quickly, freeing slots, but arrivals are bursty. For the code generation workload, long decode sequences accumulate and memory pressure becomes the bottleneck.
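Little's Law makes the contrast concrete; with illustrative arrival rates and completion times (not measurements), the two workloads can demand similar concurrency while stressing entirely different resources:

```python
# L = lambda * W for the two workload shapes. Illustrative numbers.
chat_lam, chat_W = 50.0, 1.0    # chatbot: 50 req/s, ~1 s per short reply
code_lam, code_W = 2.0, 30.0    # codegen: 2 req/s, ~30 s per long output

chat_L = chat_lam * chat_W      # 50 concurrent, dominated by prefill churn
code_L = code_lam * code_W      # 60 concurrent, each holding a large KV cache
print(chat_L, code_L)
```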
A few practical parameters in vLLM that map directly to the queueing model:
# Maximum number of tokens across all requests in a single forward pass
--max-num-batched-tokens 4096
# Maximum number of concurrent sequences
--max-num-seqs 256
# GPU memory fraction reserved for KV cache
--gpu-memory-utilization 0.85
# Enable chunked prefill to interleave with decode
--enable-chunked-prefill
Setting --max-num-batched-tokens controls how much compute you allocate to prefill versus decode in each iteration. A lower value reduces prefill stalls at the cost of slower time-to-first-token for long prompts. Setting --gpu-memory-utilization below 0.90 leaves headroom to avoid the utilization cliff under burst traffic.
What the Queueing Frame Gives You
The practical upshot is that continuous batching is not a magic throughput multiplier; it is a necessary baseline that removes artificial constraints, after which normal queueing theory governs what happens. Once you have continuous batching and PagedAttention in place, the system behaves like a well-designed work-conserving queue, and the same tools apply: utilization measurement, arrival distribution modeling, SLO-based capacity planning, and pipeline decomposition for heterogeneous workloads.
The Hugging Face post is a good introduction to the mechanics. The queueing frame adds the predictive model that lets you reason about what happens as load increases, why tail latency degrades faster than mean latency, and how to set operating parameters without relying on trial and error.