What 16 RL Libraries Independently Discovered About Keeping GPUs Busy
Source: huggingface
There is something satisfying about watching independent teams reach the same conclusion. The Hugging Face team surveyed 16 open-source RL training libraries and found that virtually all of them arrived at the same core architecture, often without coordinating.
The problem they were all solving: autoregressive generation is slow, and in synchronous RL training, your training GPUs sit idle while the inference cluster grinds through rollouts. For a 32B model, a single batch of 32K-token rollouts can take nearly four hours. The training step itself might take minutes. You are paying for GPU time to watch a progress bar.
The solution everyone landed on independently:
Inference Pool ──→ Rollout Buffer ──→ Training Pool
 (continuous)                          (continuous)
       ↑                                    │
       └──────── Async Weight Sync ─────────┘
Disaggregate generation from training, run them concurrently, and sync weights asynchronously. Simple in concept. Wildly varied in implementation details.
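The pattern can be sketched in a few lines. This is a toy stand-in, not any library's actual API: `generate_rollout` and the loss computation are hypothetical placeholders, and the bounded queue plays the role of the rollout buffer.

```python
# Minimal sketch of the disaggregated async pattern: an inference loop
# produces rollouts into a bounded buffer while a training loop consumes
# them and bumps the policy version (standing in for a weight sync).
import queue
import threading

policy_version = 0                 # shared weight version, bumped by training
buffer = queue.Queue(maxsize=8)    # bounded rollout buffer between the pools
stop = threading.Event()

def generate_rollout(version):
    # Placeholder: a real system would run batched LLM inference here.
    return {"tokens": [1, 2, 3], "version": version}

def inference_loop():
    while not stop.is_set():
        rollout = generate_rollout(policy_version)
        try:
            buffer.put(rollout, timeout=0.1)   # back-pressure if training lags
        except queue.Full:
            continue

def training_loop(steps):
    global policy_version
    for _ in range(steps):
        rollout = buffer.get()     # consume continuously, never idle
        # ... compute loss on `rollout`, apply optimizer step ...
        policy_version += 1        # new weights trigger an async sync

threading.Thread(target=inference_loop, daemon=True).start()
training_loop(steps=5)
stop.set()
print(policy_version)   # → 5
```

Both loops run concurrently; neither waits for the other except through the buffer's back-pressure, which is exactly the disaggregation the survey describes.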
Where the Interesting Differences Are
The survey breaks the design space into seven axes. The two I keep thinking about are weight sync protocol and staleness management.
On weight sync: most libraries use NCCL broadcast, with latency ranging from 500ms down to 20ms depending on whether you pack parameters into large buffers. The faster you sync, the less your in-flight rollouts drift from your current policy. But syncing too aggressively stalls generation. The dominant pattern — “soft pause” — drains in-flight requests naturally before updating weights rather than aborting mid-sequence.
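The soft-pause idea reduces to a small state machine. The sketch below is an illustrative simplification, not any library's server code: the point is that the server stops admitting new requests, drains in-flight ones to completion, and only then swaps weights.

```python
# Toy model of "soft pause" weight sync: drain in-flight generation
# requests naturally instead of aborting them mid-sequence.
class SoftPauseServer:
    def __init__(self):
        self.weights_version = 0
        self.in_flight = []
        self.paused = False

    def submit(self, request):
        if self.paused:
            return False           # held back until the sync completes
        self.in_flight.append(request)
        return True

    def drain(self):
        # Placeholder for decode loops running their sequences to EOS.
        done, self.in_flight = self.in_flight, []
        return done

    def sync_weights(self, new_version):
        self.paused = True                    # stop admitting new work
        finished = self.drain()               # drain, don't abort
        self.weights_version = new_version    # safe: nothing is mid-sequence
        self.paused = False
        return finished

server = SoftPauseServer()
server.submit("prompt-a")
server.submit("prompt-b")
finished = server.sync_weights(new_version=1)
print(finished, server.weights_version)   # → ['prompt-a', 'prompt-b'] 1
```

The tradeoff is visible in the code: the longer the drain takes, the more tokens are generated under the old weights, which is where the staleness problem below comes from.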
On staleness: tokens generated under an old policy are off-policy samples. Three approaches handle this:
- Version rejection — discard samples too old to trust
- Depth bounding — limit how deep your buffer can get architecturally
- Importance sampling correction — keep the samples, reweight the loss
Most production-grade libraries combine all three. PRIME-RL implements the full hybrid. Libraries targeting simplicity pick one and accept the tradeoff.
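The three strategies compose naturally, which is presumably why the hybrid is common. A toy sketch, with illustrative thresholds and a clipped sequence-level IS ratio that is an assumption on my part, not PRIME-RL's actual formulation:

```python
# Combining all three staleness strategies on a single buffer.
import math
from collections import deque

MAX_STALENESS = 2   # version rejection: drop samples older than this
MAX_DEPTH = 64      # depth bounding: hard cap on buffer size
CLIP = 5.0          # clip IS ratios for stability

buffer = deque(maxlen=MAX_DEPTH)   # depth bound enforced structurally

def admit(sample, current_version):
    if current_version - sample["version"] > MAX_STALENESS:
        return False               # version rejection: too old to trust
    buffer.append(sample)
    return True

def is_weight(sample, logp_current):
    # Importance sampling: reweight by pi_current / pi_behavior.
    ratio = math.exp(logp_current - sample["logp_behavior"])
    return min(ratio, CLIP)

admit({"version": 0, "logp_behavior": -10.0}, current_version=5)  # rejected
admit({"version": 4, "logp_behavior": -10.0}, current_version=5)  # kept
w = is_weight(buffer[0], logp_current=-9.5)
print(len(buffer), round(w, 3))   # → 1 1.649
```

Version rejection and depth bounding discard data; IS correction keeps it at the cost of gradient variance. Picking one, as the simpler libraries do, just means choosing which cost you pay.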
The LoRA Insight That Changes the Math
One underappreciated finding: LoRA with adapter-only weight sync is nearly standard now. Instead of broadcasting 100-500GB of full model weights, you broadcast ~50MB of adapter parameters. This transforms weight sync from a major bottleneck into a non-issue. Eight of the sixteen libraries already support adapter-only sync, and most of the rest are close.
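The arithmetic is easy to check back-of-envelope. The shapes below are illustrative assumptions (a 32B model in bf16, rank-16 LoRA on the four attention projections of 64 layers), not figures from the survey:

```python
# Rough size comparison: full-weight broadcast vs adapter-only sync.
params_full = 32e9
bytes_full = params_full * 2                 # bf16 = 2 bytes per param

hidden, rank, layers = 5120, 16, 64          # assumed model dimensions
adapted_mats_per_layer = 4                   # q, k, v, o projections
# Each LoRA pair is A (hidden x rank) plus B (rank x hidden).
params_lora = layers * adapted_mats_per_layer * 2 * hidden * rank
bytes_lora = params_lora * 2

print(f"full: {bytes_full/1e9:.0f} GB, adapters: {bytes_lora/1e6:.0f} MB")
# → full: 64 GB, adapters: 84 MB
```

Roughly three orders of magnitude less data on the wire, which is why adapter-only sync turns a major bottleneck into a non-issue.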
What Nobody Has Solved Yet
The survey is honest about open problems. The ones that stand out:
Process reward models introduce a new async bottleneck between generation and training. No library currently pipelines PRM scoring with training properly.
MoE training-inference mismatch is subtle and serious. vLLM and Megatron use slightly different floating-point rounding in their routing implementations. Same input, different experts activated. Importance sampling cannot fix a structural divergence like that — you need to record routing decisions during generation and enforce them during training (“Keep Routing”). Nobody implements this yet.
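The Keep Routing idea can be illustrated with a toy top-k router. Everything here is an assumption for illustration (the survey names the technique but not an implementation): record the expert indices chosen at generation time, then replay them during the training forward pass instead of re-running the numerically divergent router.

```python
# Sketch of "Keep Routing": replay recorded expert choices at train time.
def route(logits, k=2, recorded=None):
    if recorded is not None:
        return recorded            # training: replay inference's decision
    # Inference: pick top-k experts and record the decision.
    return sorted(range(len(logits)), key=lambda i: -logits[i])[:k]

# Generation time: router logits as the inference engine computed them.
gen_logits = [0.10, 0.90, 0.89, 0.30]
chosen = route(gen_logits)         # store this alongside the tokens

# Training time: a tiny floating-point difference flips the ranking...
train_logits = [0.10, 0.89, 0.90, 0.30]
assert route(train_logits) != chosen                     # structural mismatch
assert route(train_logits, recorded=chosen) == chosen    # replay fixes it
print(chosen)   # → [1, 2]
```

The mismatch is structural, not statistical: different experts run entirely different weights, so no loss-side reweighting can recover the generation-time computation.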
Agentic workloads with long multi-turn trajectories hit a compounding problem: an episode finishes only when its slowest agent does, so the completion-time distribution is the product of the per-agent distributions rather than a sum, and the 90th percentile grows with the number of agents. Episode-level buffer semantics, not token-level, are required. This is a mostly unsolved design space.
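The compounding is easy to see in a quick simulation. The lognormal latency model is an illustrative assumption; the mechanism is just that the episode's completion time is the max over agents, so its tail stretches as agents are added:

```python
# Simulate p90 episode completion time as the number of agents grows.
import random
import statistics

random.seed(0)

def p90_of_max(n_agents, trials=20000):
    samples = [
        max(random.lognormvariate(0, 1) for _ in range(n_agents))
        for _ in range(trials)
    ]
    return statistics.quantiles(samples, n=10)[-1]   # 90th percentile

p90_1, p90_4, p90_16 = p90_of_max(1), p90_of_max(4), p90_of_max(16)
assert p90_1 < p90_4 < p90_16   # tail latency compounds with agent count
print(round(p90_1, 1), round(p90_4, 1), round(p90_16, 1))
```

Token-level buffering cannot help here, because no training can start on the episode until the last agent reports in; hence the need for episode-level semantics.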
The Convergence Story
The real takeaway from this survey is not any single library — it is that the infrastructure space is maturing fast. The core async disaggregated pattern is essentially settled. Ray dominates orchestration for anything multi-node. Bounded async queues with IS correction are the production default.
The frontier is everything the converged baseline does not handle: process rewards, MoE correctness, partial rollout resumption for agentic work. If you are building RL training infrastructure right now, the architecture decisions are cleaner than they were a year ago. The hard problems have moved up the stack.