
The Hidden Bottleneck in RL Training: Why Your GPUs Are Idle Half the Time

Source: Hugging Face

If you’ve spent any time watching RL training runs, you’ve probably noticed the wall-clock time doesn’t match your intuitions. A batch of 32K-token rollouts on a 32B model can take nearly four hours on a single H100 — and during most of that time, your training hardware is sitting idle. Hugging Face recently published a thorough survey of 16 open-source RL libraries to understand how the ecosystem is solving this, and the findings are worth sitting with.

The Problem Is Simple; The Solution Is Not

Synchronous RL training alternates between two phases: generate rollouts, then train on them. The problem is that autoregressive generation dominates wall-clock time, so your training GPUs sit idle more than 60% of the time, waiting on inference.

All 16 libraries surveyed converged on the same conceptual fix — disaggregate inference from training:

  • An inference pool (usually vLLM or SGLang) generates rollouts continuously
  • A bounded rollout buffer decouples the two phases
  • A training pool processes batches asynchronously and pushes updated weights back
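The shape of that architecture can be sketched in a few lines. This is an illustrative toy, not any library's actual API: `generate_rollout`, `inference_loop`, and `training_loop` are stand-in names, and the real systems run the two pools on separate GPU clusters rather than threads.

```python
import queue
import threading

BUFFER_DEPTH = 4          # bounded: inference blocks when the trainer falls behind
rollout_buffer = queue.Queue(maxsize=BUFFER_DEPTH)
policy_version = 0        # bumped each time the trainer pushes updated weights

def generate_rollout(version):
    # Stand-in for vLLM/SGLang generation; tags each sample with the
    # policy version it was generated under.
    return {"tokens": [1, 2, 3], "version": version}

def inference_loop(n_rollouts):
    for _ in range(n_rollouts):
        # put() blocks once the buffer is full, so staleness stays bounded
        rollout_buffer.put(generate_rollout(policy_version))

def training_loop(n_batches):
    global policy_version
    for _ in range(n_batches):
        batch = rollout_buffer.get()   # blocks until a rollout is ready
        policy_version += 1            # "weight sync": inference sees the new version

producer = threading.Thread(target=inference_loop, args=(8,))
consumer = threading.Thread(target=training_loop, args=(8,))
producer.start(); consumer.start()
producer.join(); consumer.join()
print("final policy version:", policy_version)  # → 8
```

The bounded queue is doing double duty here: it decouples the two loops while capping how far generation can run ahead of training.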

The architecture is intuitive. The interesting engineering is everything that falls out of it.

Staleness Is a Real Systems Problem

Once generation and training run concurrently, the rollout buffer accumulates samples generated under older policy weights. This “staleness” is what makes async RL hard — and the survey’s breakdown of how libraries handle it is the most useful part.

Three strategies exist: reject samples older than some version threshold, bound queue depth so staleness is architecturally impossible, or apply importance sampling corrections to reweight off-policy samples. The production-grade systems (PRIME-RL, AReaL) combine all three rather than committing to any single approach.
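The first and third strategies are simple enough to sketch concretely. The sample dicts, the staleness threshold, and the clip value below are invented for illustration; the second strategy (a bounded queue) is structural rather than per-sample, so it doesn't appear here.

```python
import math

MAX_STALENESS = 2   # reject rollouts more than 2 policy versions old (assumed value)

def filter_stale(samples, current_version, max_staleness=MAX_STALENESS):
    # Strategy 1: version-threshold rejection
    return [s for s in samples if current_version - s["version"] <= max_staleness]

def importance_weight(logp_current, logp_behavior, clip=5.0):
    # Strategy 3: reweight off-policy samples by pi_current / pi_behavior,
    # clipped for numerical stability
    return min(math.exp(logp_current - logp_behavior), clip)

samples = [
    {"version": 10, "logp": -1.20},
    {"version": 9,  "logp": -1.35},
    {"version": 6,  "logp": -2.10},  # too stale: dropped
]
kept = filter_stale(samples, current_version=10)
weights = [importance_weight(-1.10, s["logp"]) for s in kept]
print(len(kept), [round(w, 3) for w in weights])  # → 2 [1.105, 1.284]
```

Combining all three, as PRIME-RL and AReaL do, means the importance weights only ever have to correct mild off-policyness, since the queue bound and version threshold have already discarded the worst cases.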

What caught my attention was the weight sync interrupt granularity. Most libraries pause generation at coarse boundaries — per-batch or per-HTTP-request — meaning hundreds of milliseconds of idle time at every weight update. Only PipelineRL achieves per-forward-pass weight swaps, slipping updated weights in between token decode steps.
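The difference in granularity comes down to where the generation loop checks for new weights. This is a hedged sketch of the idea, not PipelineRL's implementation: `pending_weights` and the decode loop are illustrative stand-ins, and a real system swaps tensors in place rather than an integer.

```python
pending_weights = None   # set by the trainer when a new checkpoint is ready
active_version = 0

def maybe_swap_weights():
    global pending_weights, active_version
    if pending_weights is not None:
        active_version = pending_weights   # in reality: load tensors in place
        pending_weights = None

def decode(max_new_tokens, publish_at):
    global pending_weights
    versions = []
    for step in range(max_new_tokens):
        if step == publish_at:
            pending_weights = active_version + 1  # trainer publishes mid-generation
        maybe_swap_weights()          # checked between token decode steps,
        versions.append(active_version)  # not at batch/request boundaries
    return versions

versions = decode(max_new_tokens=4, publish_at=2)
print(versions)  # → [0, 0, 1, 1]
```

A coarser-grained system would only run `maybe_swap_weights` between requests, so the whole sequence would finish on version 0 and every weight update would stall generation until in-flight requests drained.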

MoE Is Where Things Get Complicated

The survey identifies sparse Mixture-of-Experts models as the place where most libraries quietly fall apart. DeepSpeed ZeRO-3 — used by nearly a third of the libraries — doesn’t support Expert Parallelism. Every forward pass performs an AllGather across all experts, eliminating the entire computational advantage of sparsity.

Worse, there’s a correctness issue that importance sampling can’t fix. During inference, MoE routers make gating decisions in a specific floating-point rounding context. During training, the same model in a different framework makes slightly different decisions. Your gradient updates are computed assuming the wrong experts were active.

DeepSeek’s V3.2 addressed this by recording exact expert routing paths during inference and enforcing them in the training forward pass. No open-source library implements this yet.
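A toy example makes the failure mode and the fix concrete. This is not DeepSeek's implementation; the logit values are invented to show how nearly-tied router scores flip under slightly different floating-point rounding, and `forced` stands in for the recorded routing path.

```python
def top1_expert(logits, forced=None):
    if forced is not None:
        return forced            # enforce the routing path recorded at inference
    return max(range(len(logits)), key=lambda i: logits[i])

inference_logits = [0.31000002, 0.31000000, 0.10]  # inference-side rounding
training_logits  = [0.31000000, 0.31000001, 0.10]  # trainer-side rounding

recorded = top1_expert(inference_logits)             # record during inference
naive    = top1_expert(training_logits)              # recompute during training
enforced = top1_expert(training_logits, forced=recorded)

print(recorded, naive, enforced)  # → 0 1 0
```

Without enforcement, the training forward pass activates expert 1 while the sampled tokens actually came from expert 0, and the gradients are attributed to the wrong parameters.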

What This Means in Practice

For most practitioners fine-tuning dense models in the 7B–70B range, the practical recommendation from the survey is unsurprising: Ray for orchestration, FSDP2 or ZeRO-3 for training, bounded queue with depth 4–8, and LoRA with adapter-only sync (which reduces weight transfer from hundreds of milliseconds to near-zero).
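The adapter-only sync win is easy to see with back-of-envelope arithmetic. The model size (7B), LoRA rank (16), and matrix count (~200 projections of 4096×4096) below are assumed for illustration, not figures from the survey.

```python
BYTES_PER_PARAM = 2          # bf16

full_params = 7e9
full_bytes = full_params * BYTES_PER_PARAM

rank, d = 16, 4096
matrices = 200               # attention/MLP projections with LoRA attached
adapter_params = matrices * 2 * rank * d   # an A (r×d) and B (d×r) per matrix
adapter_bytes = adapter_params * BYTES_PER_PARAM

print(f"full sync:    {full_bytes / 1e9:.1f} GB")   # → full sync:    14.0 GB
print(f"adapter sync: {adapter_bytes / 1e6:.1f} MB") # → adapter sync: 52.4 MB
```

Shipping tens of megabytes instead of tens of gigabytes is what turns a weight sync from a visible pause into something the inference pool barely notices.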

For anyone working with frontier sparse models, the answer is harder. Megatron-backed libraries are essentially the only viable path, and even then the training-inference mismatch problem for MoE is largely unsolved.

The survey reads like a distributed systems paper more than an ML paper, which makes sense — the bottleneck was always throughput engineering, not algorithm design. Getting generation and training to genuinely overlap without correctness regressions is a hard coordination problem, and it’s interesting to watch 16 independent teams arrive at nearly identical architectures from different starting points.
