Why your LLM inference server is wasting a quarter of its GPU time

Source: huggingface

If you have ever profiled an inference server running a serious model, the result is depressing in a familiar way. The GPU, the most expensive component in the rack, sits idle for a surprising fraction of every step while the CPU shuffles tensors around. Hugging Face’s recent post on asynchronous continuous batching puts a number on it: in their 8K-token, batch-32, 8B-parameter benchmark on an H200, the GPU was idle 24% of total runtime waiting for the CPU to assemble the next batch. The fix is not a new kernel or a smarter scheduler. It is a careful application of CUDA primitives that have existed for over a decade.

I want to dig into why this overhead is structural to continuous batching, how the async version works, and where this sits relative to vLLM and TensorRT-LLM, which solved similar problems years ago in different ways.

The serial trap in continuous batching

Continuous batching, as introduced by Orca (OSDI ’22) and popularized by vLLM, is the technique of splicing requests in and out of a running batch at the granularity of individual decode steps. Instead of waiting for the slowest sequence to finish before starting new work, you swap a finished sequence out and a queued one in on the very next iteration. The throughput gains over static batching are well-documented; the original vLLM paper reports 2 to 4 times higher throughput than earlier serving systems at comparable latency.

What the throughput numbers obscure is the per-step bookkeeping. Every iteration, the CPU has to:

  1. Inspect which sequences finished or hit a stop token.
  2. Pull new requests off the queue and decide which fit into the next batch.
  3. Update KV cache slot mappings and position IDs.
  4. Build new input tensors and push them to the GPU.

In the naive synchronous loop, the GPU launches forward(), the CPU calls torch.cuda.synchronize() or reads back logits to do sampling, then steps 1 through 4 happen, and only then does the next forward() get dispatched. The GPU is twiddling its thumbs through every CPU phase. At small batch sizes the forward pass dominates and you do not notice. At batch 32 with an 8B model on an H200, the forward is short enough that the CPU work becomes a meaningful chunk.
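To make the serial dependency concrete, here is a minimal pure-Python sketch of the synchronous loop. Everything in it is hypothetical: `Seq`, `fake_forward`, `serve`, and the `EOS` id are illustrative stand-ins, not the Transformers scheduler, and the "forward pass" is a function that emits a dummy token. The point is only the shape of the loop: the bookkeeping and the forward call strictly alternate.

```python
from collections import deque
from dataclasses import dataclass

EOS = 0  # hypothetical stop-token id


@dataclass
class Seq:
    rid: int        # request id
    tokens: list    # prompt plus generated tokens
    max_new: int    # generation budget
    generated: int = 0


def fake_forward(batch):
    """Stand-in for forward() plus sampling: one new token per sequence."""
    # Emit EOS on the token that exhausts the budget, otherwise dummy token 7.
    return [EOS if s.generated + 1 >= s.max_new else 7 for s in batch]


def serve(requests, max_batch=2):
    queue, running, finished, steps = deque(requests), [], {}, 0
    while queue or running:
        # Step 2: admit queued requests while the batch has room.
        while queue and len(running) < max_batch:
            running.append(queue.popleft())
        # Steps 3 and 4 (KV slot mapping, tensor building) would happen here.
        new_tokens = fake_forward(running)  # the CPU waits for this result
        steps += 1
        # Step 1: record outputs and retire sequences that hit the stop token.
        still_running = []
        for seq, tok in zip(running, new_tokens):
            seq.tokens.append(tok)
            seq.generated += 1
            if tok == EOS:
                finished[seq.rid] = seq.tokens
            else:
                still_running.append(seq)
        running = still_running
    return finished, steps


demo = [Seq(0, [1], 1), Seq(1, [2], 3), Seq(2, [3], 2)]
finished, steps = serve(demo, max_batch=2)
```

With these three toy requests and a batch limit of two, the loop finishes in 3 steps because request 2 takes the slot request 0 frees after the first step; static batching would need 5. In the real server, though, every trip through the CPU portion of this loop is time the GPU spends idle.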

The Hugging Face profile shows a 300.6-second total run with 24% GPU idle. That gap is what the async version targets.

The mechanics: streams, events, and a second slot

The solution will be familiar to anyone who has written a CUDA-based render loop. Run independent work on separate streams and use events for ordering. The post describes three streams:

  • An H2D stream for host-to-device input transfers.
  • A compute stream for the forward pass.
  • A D2H stream for moving logits and output tokens back.

Operations submitted to different non-default streams can execute concurrently on the device. The trap, mentioned briefly in the post but worth underscoring, is the default stream. PyTorch’s default stream has implicit synchronization with all other streams in the legacy semantics, which means a single stray operation on it serializes everything. You have to make sure every transfer and kernel uses an explicit non-default stream and that pinned memory is used for the host buffers so non-blocking transfers actually overlap.

CUDA events handle the ordering. The H2D stream records an event when the input copy finishes; the compute stream waits on that event before launching the forward; the compute stream records its own completion event; the D2H stream waits on that before copying logits back. None of these waits block the CPU, which is the whole point. The CPU only synchronizes at the very end of a step, when it needs the previous step’s outputs to build the next batch.
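The event chain can be sketched in PyTorch roughly as follows. This is my own toy version under stated assumptions, not the code from the post: `run_pipelined_step` is a hypothetical name, a random lookup table stands in for the model, and the function returns `None` when PyTorch or a CUDA device is unavailable so the file still imports anywhere.

```python
try:
    import torch
except ImportError:  # keep the sketch importable without PyTorch
    torch = None


def run_pipelined_step(batch=4, seq=8, vocab=32):
    """One H2D -> compute -> D2H step on three streams; None without CUDA."""
    if torch is None or not torch.cuda.is_available():
        return None
    dev = torch.device("cuda")
    h2d, compute, d2h = (torch.cuda.Stream() for _ in range(3))
    copied, computed = torch.cuda.Event(), torch.cuda.Event()

    # Pinned host buffers are needed for non_blocking copies to overlap.
    host_in = torch.randint(0, vocab, (batch, seq)).pin_memory()
    host_out = torch.empty((batch,), dtype=torch.long).pin_memory()
    dev_in = torch.empty((batch, seq), dtype=torch.long, device=dev)
    table = torch.randn(vocab, vocab, device=dev)  # toy stand-in for a model
    torch.cuda.synchronize()  # finish setup before the pipelined part starts

    with torch.cuda.stream(h2d):
        dev_in.copy_(host_in, non_blocking=True)
        copied.record()                 # recorded on the current stream (h2d)
    with torch.cuda.stream(compute):
        compute.wait_event(copied)      # GPU-side wait; the CPU keeps going
        next_tok = table[dev_in[:, -1]].argmax(dim=-1)
        computed.record()
    with torch.cuda.stream(d2h):
        d2h.wait_event(computed)
        next_tok.record_stream(d2h)     # tell the allocator d2h also uses it
        host_out.copy_(next_tok, non_blocking=True)

    d2h.synchronize()                   # the only CPU-side wait in the step
    expected = table.cpu()[host_in[:, -1]].argmax(dim=-1)
    return bool(torch.equal(host_out, expected))


result = run_pipelined_step()
```

Between the three `with` blocks the CPU never blocks; it only enqueues work and dependencies. In a real loop, the time before `d2h.synchronize()` is where the next batch would be prepared.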

The second piece is dual buffering. If the GPU is processing batch N while the CPU is writing batch N+1, those two batches cannot share input tensors. You need two slots and you alternate. This costs roughly 2x the input and output tensor memory, which sounds bad until you remember that with FlashAttention the attention mask is no longer materialized and the input tensors are tiny compared to the KV cache. The doubled cost is in the noise.
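The alternation itself is simple enough to show with plain lists. The sketch below is a hypothetical `DoubleBuffer` helper of my own, with Python lists standing in for preallocated tensors; it only illustrates the indexing, not real device memory.

```python
class DoubleBuffer:
    """Two preallocated input slots: one being written, one being read."""

    def __init__(self, capacity):
        self.slots = [[0] * capacity, [0] * capacity]
        self.step = 0

    def write_slot(self):
        # The CPU prepares the upcoming batch here.
        return self.slots[self.step % 2]

    def read_slot(self):
        # The GPU consumes the previously prepared batch here.
        return self.slots[(self.step - 1) % 2]

    def advance(self):
        self.step += 1


buf = DoubleBuffer(capacity=4)
buf.write_slot()[:] = [11, 12, 13, 14]   # CPU builds batch 0
buf.advance()
in_flight = buf.read_slot()              # GPU starts on batch 0
buf.write_slot()[:] = [21, 22, 23, 24]   # CPU builds batch 1 in the other slot
```

Because the write slot and the read slot are never the same object within a step, preparing batch 1 cannot clobber the inputs the in-flight batch 0 is still reading.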

The third piece, and the one I find most elegant, is the carry-over mechanism. A sequence that lives across multiple batches has a problem: the token it generates in batch N is the input it needs for batch N+1, but the CPU is preparing batch N+1 before batch N’s output exists. The solution is to fill those positions with placeholder zeros during CPU prep, then have a tiny on-GPU operation copy the real tokens into place after the forward completes. The CPU never has to wait for the token; the GPU patches its own input tensor on-device before the next forward pass reads it. This is the kind of trick you only think of after you have stared at a profiler for too long.
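Here is how I picture the carry-over, as a pure-Python sketch. The function names, the `PLACEHOLDER` constant, and the `(request id, known token)` layout are all my assumptions for illustration; on a GPU the second function would presumably be a single indexed copy between device tensors rather than a loop.

```python
PLACEHOLDER = 0


def prepare_next_inputs(next_batch, prev_batch_ids):
    """CPU side: build batch N+1 before batch N's tokens exist.

    next_batch is a list of (request_id, known_token_or_None) pairs.
    Returns the input ids plus (src, dst) pairs to patch later.
    """
    prev_pos = {rid: i for i, rid in enumerate(prev_batch_ids)}
    input_ids, carry = [], []
    for dst, (rid, known_token) in enumerate(next_batch):
        if rid in prev_pos:
            # Token not produced yet: reserve the position, remember the source.
            input_ids.append(PLACEHOLDER)
            carry.append((prev_pos[rid], dst))
        else:
            # New request: its first input token is already known.
            input_ids.append(known_token)
    return input_ids, carry


def apply_carry_over(input_ids, prev_outputs, carry):
    """GPU-side stand-in: copy batch N's outputs into batch N+1's inputs."""
    for src, dst in carry:
        input_ids[dst] = prev_outputs[src]
    return input_ids


# Batch N held a, b, c; c reaches its length limit, d is newly admitted.
ids, carry = prepare_next_inputs([("b", None), ("d", 42), ("a", None)],
                                 prev_batch_ids=["a", "b", "c"])
patched = apply_carry_over(list(ids), prev_outputs=[101, 102, 103], carry=carry)
```

In this example `ids` starts as `[0, 42, 0]` and `patched` ends as `[102, 42, 101]`. The CPU only ever computed index pairs, which depend on who is in the batch, not on what the model produced.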

The numbers and what they mean

The results: total runtime drops from 300.6 seconds to 234.5 seconds, and GPU utilization climbs from 76.0% to 99.4%. That is a 22% wall-clock reduction against a predicted ceiling of 24% if CPU overhead were perfectly hidden. The remaining 0.6% idle is the unavoidable sync point at step boundaries plus the first-step cold start.

A 22% cut in wall-clock time, which works out to about 28% more throughput, with no kernel changes, no quantization, and no speculative decoding is the kind of optimization that pays for itself instantly on any GPU fleet. At H200 cloud prices around 4 dollars per hour, finishing the same workload in 22% less time saves roughly 90 cents for every GPU-hour the synchronous server would have burned.
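The arithmetic behind these figures is quick to check. The inputs below are the numbers quoted above; the 4 dollars per hour is the rough price assumed in the text, not a quoted rate.

```python
sync_s, async_s = 300.6, 234.5   # measured wall-clock times, in seconds
idle_fraction = 0.24             # GPU idle share in the synchronous run
price_per_hour = 4.0             # assumed H200 price, dollars per GPU-hour

time_saved = 1 - async_s / sync_s          # fraction of wall-clock removed
speedup = sync_s / async_s                 # throughput ratio, same workload
ceiling_s = sync_s * (1 - idle_fraction)   # runtime if all idle time vanished
saving_per_hour = price_per_hour * time_saved

print(f"time saved: {time_saved:.1%}")
print(f"throughput: {speedup:.2f}x")
print(f"ideal runtime: {ceiling_s:.1f} s")
print(f"saving: ${saving_per_hour:.2f} per synchronous GPU-hour")
```

Note the difference between the two ratios: removing 22% of the runtime means the same hardware completes about 1.28 times as much work per hour, so the throughput gain is larger than the time reduction.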

How does this compare to vLLM and TRT-LLM?

This is where the context gets interesting. vLLM has had a version of this for a while. Its scheduler runs in a separate process from the model executor, communicating over shared memory and ZMQ, which naturally decouples request management from forward execution. The V1 architecture announcement leaned even harder into this, pulling the tokenizer, detokenizer, and request manager off the critical path. NVIDIA’s TensorRT-LLM takes a different route and bakes a lot of the batching logic into a C++ runtime that talks directly to CUDA, sidestepping the Python overhead entirely.

The Hugging Face implementation is interesting because Transformers is, by design, the framework people reach for when they want flexibility over peak throughput. Most production deployments graduate to vLLM, TGI, or TRT-LLM precisely because vanilla Transformers has historically left performance on the table. Closing the gap with a Python-side async loop, while keeping the model code unmodified, is the right move; it means research code and small-scale serving get a free 20% without anyone needing to port their custom architecture to a different inference engine.

The code lives in the ContinuousBatchingAsyncIOs class in the Transformers repo. It is worth reading if you ever need to build something similar for a non-LLM workload; the pattern of streams, events, and dual buffers generalizes to any pipeline where CPU coordination overhead becomes a meaningful fraction of step time.

The lesson that keeps coming back

Every few years a team publishes a result that boils down to: we found the place where the CPU was blocking the GPU, and we stopped doing that. The DALI data loading library was this for training. CUDA Graphs was this for kernel launch overhead. Async continuous batching is this for inference scheduling. The pattern is always the same: profile, find the serial dependency that does not need to be serial, and break it with streams and events.

The fact that a 22% improvement was sitting on the floor of one of the most-deployed inference frameworks in 2026 says something about how much performance work remains to be done in the LLM stack. The flashy stuff gets the headlines: speculative decoding, paged attention, MoE routing tricks. The boring stuff, getting the CPU out of the GPU’s way, often delivers larger and more reliable wins.
