The GPU Idle Problem: What 16 RL Libraries Teach Us About Training Efficiency
Source: huggingface
Here is a number that should make you uncomfortable: when training a 32B model with reinforcement learning using long chain-of-thought rollouts, your training GPUs can sit completely idle for 60% of wall-clock time. The model is busy generating tokens. The GPUs are waiting.
Hugging Face published a thorough survey of 16 open-source RL training libraries, and what strikes me most is not any individual library’s clever trick — it is how independently all these teams converged on the same core insight.
The Problem Is Structural
Synchronous RL training works like this: generate a batch of rollouts, wait, train on them, repeat. For short completions this is fine. For long reasoning traces — 8K tokens from a 32B model — a single batch can take nearly an hour to generate. Your H100s are not computing gradients during that hour. They are expensive space heaters.
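The idle fraction falls straight out of the arithmetic. Here is a minimal sketch; the timings are illustrative assumptions, not measurements from the survey:

```python
# Back-of-envelope utilization for a synchronous RL loop.
# Assumed timings: ~50 min to generate a batch of long rollouts,
# ~33 min to train on it. The phases alternate, so training GPUs
# are busy only during the training phase.
gen_time_s = 3000.0    # generation phase (training GPUs idle)
train_time_s = 2000.0  # training phase (inference GPUs idle)

train_gpu_util = train_time_s / (gen_time_s + train_time_s)
print(f"training-GPU utilization: {train_gpu_util:.0%}")  # 40% busy, 60% idle
```

With these assumed timings you recover the 60% idle figure: the longer the rollouts get relative to the gradient step, the worse the ratio becomes.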
Everyone Reached the Same Answer
The solution every library landed on independently:
- Separate inference and training onto different GPU pools
- Connect them through a rollout buffer
- Sync weights asynchronously so neither side blocks the other
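The three pieces above reduce to a producer-consumer queue with a weight-version counter. The sketch below shows the shape of the pattern; in a real system the two workers run concurrently on separate GPU pools, and all names and thresholds here are illustrative, not any library's API:

```python
# Minimal sketch of async disaggregation: inference produces into a
# bounded rollout buffer, training consumes from it, and a version
# counter bounds how stale training data is allowed to get. Run
# sequentially here so the outcome is deterministic.
import queue

BUFFER_DEPTH = 8     # buffer depth: one of the design axes libraries differ on
MAX_STALENESS = 2    # max weight-version lag tolerated in training data

rollout_buffer = queue.Queue(maxsize=BUFFER_DEPTH)
weight_version = 0

def inference_worker(num_batches):
    # Stands in for the generation pool; tags each rollout batch with
    # the weight version it was sampled under.
    for i in range(num_batches):
        rollout_buffer.put({"tokens": [i], "version": weight_version})

def training_worker():
    # Drains the buffer, dropping rollouts that are too stale to use.
    global weight_version
    trained = dropped = 0
    while not rollout_buffer.empty():
        batch = rollout_buffer.get()
        if weight_version - batch["version"] > MAX_STALENESS:
            dropped += 1            # too stale: skip instead of training on it
            continue
        trained += 1                # ...gradient step would go here...
        weight_version += 1         # async weight sync bumps the version
    return trained, dropped

inference_worker(4)
print(training_worker())  # (3, 1): the fourth batch exceeded MAX_STALENESS
```

The knobs in this sketch are exactly the design axes the survey compares: `BUFFER_DEPTH` is how deep the buffer is, `MAX_STALENESS` is how stale you let training data get, and where the version bump happens encodes the weight-sync policy.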
The differences between libraries show up in the details: how deep the buffer is, how stale you are willing to let your training data get, how you handle a weight update arriving mid-sequence, whether you use Ray for orchestration or roll your own. These are not cosmetic choices — the survey shows they translate to 2–5x differences in wall-clock time for the same training run.
The Design Axes That Matter
A few things jumped out at me from the systems perspective:
LoRA weight sync is massively underused. Syncing full model weights over NCCL takes 100–500ms. Syncing only LoRA adapters takes under 1ms. Only 8 of the 13 libraries that support LoRA actually do adapter-only sync. That is free throughput being left on the table.
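The payload gap is easy to see with back-of-envelope arithmetic. The shapes and tensor names below are illustrative assumptions (a single fp16 projection with rank-16 adapters), not any library's checkpoint format:

```python
# Rough payload comparison: full-weight sync vs. LoRA adapter-only sync.
# Assumed: one 4096x4096 fp16 weight plus rank-16 LoRA A/B matrices.
import math

def tensor_bytes(shape, bytes_per_elem=2):  # fp16 = 2 bytes per element
    return math.prod(shape) * bytes_per_elem

hidden, rank = 4096, 16
shapes = {
    "layer.weight": (hidden, hidden),   # full weight: what naive sync ships
    "layer.lora_A": (rank, hidden),     # adapter matrices: all that actually
    "layer.lora_B": (hidden, rank),     #   changed during LoRA training
}

full = sum(tensor_bytes(s) for s in shapes.values())
adapters = sum(tensor_bytes(s) for k, s in shapes.items() if "lora_" in k)
print(f"full: {full/1e6:.1f} MB, adapters: {adapters/1e6:.2f} MB "
      f"({full/adapters:.0f}x smaller payload)")
```

For this single layer the adapter payload is roughly two orders of magnitude smaller, which is consistent with the 100–500ms vs. sub-millisecond sync times the survey reports.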
MoE support is becoming the differentiator. Libraries built on DeepSpeed ZeRO can load MoE models, but without Expert Parallelism they collapse all expert computation onto every device during each forward pass — which defeats the entire point of sparse models. As frontier models trend toward MoE architectures, this is increasingly a hard constraint.
The data contract is growing. The survey surfaces a subtle correctness issue: for MoE models, the expert routing decisions made during generation need to be replayed during training. Right now, no open-source library handles this. The data tuple is expanding from (tokens, logprobs) to (tokens, logprobs, expert_routing, sampling_mask).
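The expanded contract can be pictured as a per-rollout record. The field names below are my own illustration of the tuple the survey describes, not a standard schema:

```python
# Sketch of the expanding per-rollout data contract. The first two
# fields are what libraries carry today; the last two are what MoE
# correctness demands: replay the same experts at training time that
# were routed to at generation time.
from dataclasses import dataclass, field

@dataclass
class Rollout:
    tokens: list[int]                   # generated token ids
    logprobs: list[float]               # sampling log-probs from generation
    expert_routing: list[list[int]] = field(default_factory=list)
    sampling_mask: list[bool] = field(default_factory=list)

r = Rollout(tokens=[1, 2], logprobs=[-0.5, -1.2])
print(r.expert_routing)  # [] — empty until the generation engine records routing
```

The point of making this a typed record rather than a bare tuple is that the contract is still growing: new fields can default to empty until both sides of the buffer learn to produce and consume them.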
Why This Matters Beyond Research
I mostly work on things much smaller than 32B parameter models, but the underlying problem — keeping compute busy while something else is doing slow I/O-bound work — is a classic systems design challenge. The async disaggregation pattern these libraries use is essentially a producer-consumer queue with careful staleness accounting. The RL specifics are layered on top of a fundamentally familiar architecture.
If you are building or fine-tuning models at any scale, the survey is worth reading carefully. The comparison tables alone save hours of digging through individual READMEs.