What 16 RL Libraries Independently Got Right About Async Training
Source: huggingface
There’s something interesting that happens when a hard enough engineering problem meets enough independent teams: they all end up in roughly the same place. That’s exactly what happened across the async RL training ecosystem, and Hugging Face’s survey of 16 open-source RL libraries is one of the better technical reads I’ve seen this year.
The core problem is deceptively simple: autoregressive generation is slow, and while the model is busy producing tokens, your training GPUs are sitting idle. At long context lengths this gets brutal — a single batch of 32K-token rollouts on a 32B model can take nearly four hours on a single H100. If your training loop is synchronous, that’s four hours of expensive hardware doing nothing.
The Pattern Everyone Converged On
Every library surveyed landed on the same fundamental solution: physically separate the inference GPUs from the training GPUs, let them run at their own pace, and decouple them with a rollout buffer.
Inference pool → [rollout buffer] → Training pool
      ↑                                  |
      └──────── async weight push ───────┘
Generation runs continuously. Training pulls from the buffer, computes gradients, and pushes updated weights back asynchronously. Both loops stay busy. Sixteen independent teams converging on the same design is about as strong an empirical signal as you get that this is the right abstraction for the problem.
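The decoupled-loops pattern is easy to see in miniature. Here's a minimal sketch using a bounded queue as the rollout buffer and threads standing in for the two GPU pools — all names (`inference_loop`, `train_step`, the rollout dict shape) are illustrative, not any surveyed library's API:

```python
import queue
import threading

# A bounded buffer: depth bounds staleness by construction.
buffer = queue.Queue(maxsize=8)
policy_version = 0

def inference_loop(stop):
    """Stand-in for the inference pool: generate rollouts continuously."""
    while not stop.is_set():
        rollout = {"tokens": [1, 2, 3], "version": policy_version}
        buffer.put(rollout)          # blocks when the buffer is full

def training_loop(stop, steps=3):
    """Stand-in for the training pool: consume, update, bump version."""
    global policy_version
    for _ in range(steps):
        batch = buffer.get()         # blocks only if generation lags
        # ...compute gradients on `batch`, push weights async...
        policy_version += 1          # models the async weight push
    stop.set()

stop = threading.Event()
t_gen = threading.Thread(target=inference_loop, args=(stop,), daemon=True)
t_train = threading.Thread(target=training_loop, args=(stop,))
t_gen.start(); t_train.start(); t_train.join()
```

Neither loop ever waits on the other except at the buffer boundary, which is the whole point: the expensive hardware on both sides stays busy.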
What differs is the implementation details — and those details matter a lot.
The Staleness Problem Is the Real Design Question
When you decouple generation from training, your rollouts are produced by a slightly older version of the policy. How stale is too stale? Libraries handle this three ways:
- Version rejection — drop samples older than a threshold. Simple, but wastes the compute spent generating them.
- Depth bounding — limit the buffer depth so staleness is bounded by construction.
- Importance sampling correction — reweight stale gradients by the likelihood ratio between old and current policy. Preserves throughput but increases gradient variance.
Production systems tend to combine all three. That’s telling — there’s no clean single answer, just a set of trade-offs you have to navigate based on your specific regime.
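Combined, the three mechanisms might look like this minimal sketch — the lag threshold, field names, and per-sample shape are all my illustrative assumptions, not a specific library's design:

```python
import math

MAX_VERSION_LAG = 4   # version-rejection threshold (illustrative)

def process_sample(sample, current_version, logprob_current):
    """Return an importance weight for a rollout sample, or None to drop it.

    1. Version rejection: drop samples beyond MAX_VERSION_LAG.
    2. Depth bounding happens upstream — a bounded rollout buffer means
       lag can never exceed roughly buffer_depth updates anyway.
    3. Importance-sampling correction: reweight by pi_new / pi_old,
       computed from stored behavior-policy log-probs.
    """
    lag = current_version - sample["version"]
    if lag > MAX_VERSION_LAG:
        return None   # too stale: wasted generation, but safe
    ratio = math.exp(logprob_current - sample["logprob_behavior"])
    return ratio      # multiply the per-token loss by this weight

fresh = {"version": 10, "logprob_behavior": -1.0}
stale = {"version": 3,  "logprob_behavior": -1.0}
w = process_sample(fresh, 10, -1.0)        # on-policy: ratio is 1.0
dropped = process_sample(stale, 10, -1.0)  # lag 7 > 4: rejected
```

The variance cost of (3) shows up when the ratio drifts far from 1.0, which is exactly why production systems also keep (1) and (2) as guardrails.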
Weight Sync Is Wilder Than I Expected
The range of approaches here surprised me. Most libraries do an NCCL broadcast on the order of 100–500ms between training steps. PipelineRL does something qualitatively different: it swaps weights between individual forward passes, taking roughly 1–10ms. That’s not an incremental improvement — it’s a different regime of staleness entirely.
For LoRA training, adapter-only sync shrinks this further to sub-millisecond transfers, since you’re only moving a few million parameters instead of billions. If you’re doing LoRA, this basically eliminates sync latency as a bottleneck.
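The gap between those regimes is easy to sanity-check with back-of-envelope arithmetic. In this sketch, bf16 weights (2 bytes/param) are standard, but the 200 GB/s effective broadcast bandwidth is my assumption, not a number from the survey:

```python
BYTES_PER_PARAM = 2     # bf16
BANDWIDTH = 200e9       # assumed effective NCCL bandwidth, bytes/sec

def sync_ms(n_params):
    """Idealized transfer time for a full weight sync, in milliseconds."""
    return n_params * BYTES_PER_PARAM / BANDWIDTH * 1e3

full_32b = sync_ms(32e9)   # full 32B model: ~320 ms, in the 100-500ms band
lora = sync_ms(40e6)       # ~40M adapter params: well under a millisecond
```

The point isn't the exact numbers — it's that adapter-only sync wins by three orders of magnitude purely on parameter count, before any clever engineering.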
The MoE Gap
This is the part of the survey I’d flag for anyone training frontier-scale models. Sparse MoE models (DeepSeek-V3, Qwen3-MoE, Mixtral) require Expert Parallelism to preserve the sparsity advantage. ZeRO-based libraries — which cover a lot of the popular tooling — will load MoE checkpoints fine but silently AllGather all experts on every forward pass. You get dense compute and communication costs with sparse model weights: the outputs are right and the loss curves look fine, so it’s a performance problem hiding behind correct-looking training.
Only Megatron-backed libraries and those using FSDP2 with Expert Parallelism actually handle this correctly. If you’re training a frontier MoE and using DeepSpeed ZeRO, it’s worth double-checking what’s actually happening.
There’s also a deeper issue the post raises around expert routing consistency — inference engines and training stacks can route tokens to different experts due to floating-point differences in the gating function. No current library implements the “Keep Routing” fix that would record routing decisions during sampling and enforce them during the training forward pass. It’s a systematic gap.
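The “Keep Routing” idea is conceptually simple even though nobody has shipped it: record the gate’s top-k expert picks at sampling time, then enforce them during the training forward pass. A toy pure-Python sketch — all names are mine, and a real implementation would do this per token inside each MoE layer:

```python
def gate_topk(logits, k=2):
    """Pick top-k expert indices from per-expert gate logits."""
    return sorted(range(len(logits)), key=lambda i: -logits[i])[:k]

def sample_step(gate_logits):
    """Inference side: record routing decisions alongside the rollout."""
    return [gate_topk(tok) for tok in gate_logits]

def train_step(gate_logits, recorded_routing):
    """Training side: enforce recorded routing, ignoring the local gate."""
    return recorded_routing

inference_logits = [[0.30, 0.10, 0.25]]  # inference engine's gate output
training_logits  = [[0.28, 0.12, 0.29]]  # tiny fp drift flips the top-1

recorded = sample_step(inference_logits)            # experts [0, 2]
enforced = train_step(training_logits, recorded)    # stays [0, 2]
free_run = [gate_topk(t) for t in training_logits]  # [2, 0] — diverged
```

Without enforcement, the training pass silently computes gradients through different experts than the ones that actually produced the sampled tokens — which is why the survey calls this a systematic gap.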
The Bigger Takeaway
What I find most useful about this survey is that it reframes async RL infrastructure as a general problem, not a GRPO-specific one. Process reward models, on-policy distillation, multi-agent pipelines — they all fit the same generation-scoring-training loop. Libraries that treat the scoring component as a pluggable interface will generalize cleanly; ones that hardcode a verifier call will require forking.
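What a pluggable scoring interface might look like, sketched with Python structural typing — the class and method names here are mine, not any surveyed library's API:

```python
from typing import Protocol

class Scorer(Protocol):
    """Anything with a score() method plugs into the training loop."""
    def score(self, prompt: str, completion: str) -> float: ...

class VerifierScorer:
    """Hardcoded-verifier style: exact-match reward."""
    def score(self, prompt: str, completion: str) -> float:
        return 1.0 if completion.strip() == "4" else 0.0

class LengthPenaltyScorer:
    """Stand-in for a process reward model or learned scorer."""
    def score(self, prompt: str, completion: str) -> float:
        return -float(len(completion))

def rl_step(scorer: Scorer, prompt: str, completion: str) -> float:
    """The loop only sees the interface, never the concrete scorer."""
    reward = scorer.score(prompt, completion)
    # ...advantage estimation and the gradient update would go here...
    return reward

r1 = rl_step(VerifierScorer(), "2+2=?", "4")       # verifier reward: 1.0
r2 = rl_step(LengthPenaltyScorer(), "2+2=?", "four")  # PRM-style: -4.0
```

Swapping the scorer changes the task; the generation and training machinery never notices. Libraries built this way get process reward models and distillation almost for free.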
The convergence story is the headline, but the gap analysis is where the actual work is.