
The Hidden Bottleneck in RL Training (And How 16 Libraries Are Solving It)

Source: Hugging Face

There is a paper on the HuggingFace blog right now that I keep coming back to. It surveys 16 open-source RL training libraries and distills what they all converge on. It reads less like ML research and more like distributed systems engineering, which is probably why it hooked me.

The Problem Is Not the Algorithm

The core issue is embarrassingly simple once you see it: autoregressive inference is slow. When you are doing RL training on a large language model, the model has to generate rollouts before you can train on them. A batch of 32K-token rollouts on a 32B model can block your training GPUs for hours. Utilization craters. You are paying for compute that sits idle.

Every library surveyed ends up in the same place to fix it: disaggregate inference and training onto separate GPU pools, connect them with a rollout buffer, and synchronize weights asynchronously. The devil is entirely in the implementation details.
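That shared shape can be sketched in a few lines. This is a minimal illustration of the pattern using Python threads and a bounded queue standing in for GPU pools and a rollout buffer; all names are hypothetical, not any surveyed library's API.

```python
import queue
import threading

# Bounded buffer: caps how far generation can run ahead of training.
rollout_buffer = queue.Queue(maxsize=8)

def inference_worker(n_rollouts):
    # Stands in for the generation GPU pool; streams rollouts into the buffer.
    for i in range(n_rollouts):
        rollout_buffer.put(f"rollout-{i}")
    rollout_buffer.put(None)  # sentinel: generation finished

def training_worker(results):
    # Stands in for the training GPU pool; consumes rollouts as they arrive.
    while True:
        item = rollout_buffer.get()
        if item is None:
            break
        results.append(item)  # stands in for a gradient step

results = []
producer = threading.Thread(target=inference_worker, args=(4,))
consumer = threading.Thread(target=training_worker, args=(results,))
producer.start(); consumer.start()
producer.join(); consumer.join()
print(results)  # all four rollouts consumed in order
```

The point of the bounded queue is that neither side blocks the other for long: generation fills the buffer while training drains it, and the real systems add asynchronous weight sync on top.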

Seven Ways Libraries Differ

The survey breaks down the design space along seven axes. A few that stood out to me:

Orchestration. Ray shows up in 8 of the 16 libraries. The actor model maps cleanly onto RL components — inference workers, training workers, reward models — and you get fault tolerance for multi-day runs for free. I have used Ray for much smaller things and it genuinely does make distributed scheduling feel manageable.

Weight synchronization. The range here is dramatic. Some libraries pause generation entirely until weights sync (simple, slow). Others do per-token swaps between forward passes with sub-millisecond interrupts (PipelineRL). LoRA changes the math significantly — syncing only adapters means ~50MB transfers instead of ~500GB, which turns a 500ms NCCL broadcast into something measured in microseconds.

Staleness. When your rollout buffer holds several batches and training keeps moving, the policy that generated the data is no longer the policy you are training. Libraries handle this with version rejection (discard stale samples), depth bounding (cap queue length), or importance sampling correction (reweight and keep them). Production systems are landing on hybrids of all three.
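A hybrid of the three strategies might look like the sketch below: a depth-bounded buffer that rejects samples too many policy versions old and tags survivors with an importance weight. Everything here is illustrative; `ratio_fn` stands in for a real per-token log-prob ratio computation.

```python
from collections import deque

MAX_DEPTH = 4        # depth bounding: cap on queued batches
MAX_VERSION_LAG = 2  # version rejection: discard anything older than this

# deque(maxlen=...) silently drops the oldest batch when full.
buffer = deque(maxlen=MAX_DEPTH)

def push(batch, behavior_version):
    buffer.append((behavior_version, batch))

def pop(current_version, ratio_fn):
    """Return (batch, importance_weight), or None if everything was stale."""
    while buffer:
        behavior_version, batch = buffer.popleft()
        if current_version - behavior_version > MAX_VERSION_LAG:
            continue  # version rejection: too stale, discard
        # Importance sampling correction: reweight by pi_current / pi_behavior.
        return batch, ratio_fn(batch)
    return None

push("batch-a", behavior_version=1)
push("batch-b", behavior_version=5)
result = pop(current_version=6, ratio_fn=lambda b: 1.0)
print(result)  # ("batch-b", 1.0): batch-a was rejected as too stale
```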

The Unsolved Problems Are Interesting

The section on open challenges is where the paper gets genuinely speculative in a useful way. Expert routing in MoE models is a real headache: inference and training frameworks can disagree on which expert handles which token due to floating-point differences, which silently corrupts your importance sampling ratios. The proposed fix — record routing decisions at sampling time and enforce them during training — is not implemented anywhere yet.
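Since the survey says the recorded-routing fix exists nowhere yet, the following is purely a hypothetical sketch of the idea, with a toy top-1 router over two experts showing how a floating-point wobble silently flips the routing decision:

```python
def route(logits, top_k=1):
    # Toy MoE router: pick the top_k expert indices by router logit.
    ranked = sorted(range(len(logits)), key=lambda i: -logits[i])
    return ranked[:top_k]

# Sampling time (inference framework): record the routing decision.
sampling_logits = [0.5000001, 0.5000000]
recorded = route(sampling_logits)   # expert 0 wins

# Training time (training framework): a tiny fp difference flips the order.
training_logits = [0.5000000, 0.5000001]
rerouted = route(training_logits)   # expert 1 wins -> silent mismatch

print(recorded, rerouted)  # [0] [1]

# The proposed fix: ignore the re-derived routes and force the recorded
# sampling-time decisions in the training forward pass, keeping the
# importance sampling ratios consistent with what was actually sampled.
enforced = recorded
```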

Process reward models (PRMs) also break the “cheap reward” assumption that makes async pipelines clean. Token-level scoring means your reward computation becomes a bottleneck, requiring its own async tier. Nobody has fully automated this pattern.
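What "its own async tier" might mean in miniature: a dedicated scorer pool sitting between generation and training, so expensive per-token scoring never blocks the rollout producers. This is a hypothetical sketch, with `prm_score` standing in for a real token-level reward model.

```python
from concurrent.futures import ThreadPoolExecutor

def prm_score(rollout):
    # Stand-in for token-level PRM scoring: one score per token.
    return [len(tok) / 10 for tok in rollout.split()]

rollouts = ["the answer is 42", "maybe 7"]

# Dedicated scoring tier: rollouts are scored concurrently, off the
# critical path of both the generation and training pools.
with ThreadPoolExecutor(max_workers=4) as scorer_pool:
    futures = [scorer_pool.submit(prm_score, r) for r in rollouts]
    scored = [f.result() for f in futures]

print(scored)  # per-token score lists, one per rollout
```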

Why This Matters Outside RL Research

The broader point the survey makes is that async RL infrastructure is not GRPO-specific. The generate → score → train loop is the same whether you are doing outcome-rewarded GRPO, on-policy distillation, or self-play. Libraries that hardcode a verifier call where a scoring component should be are going to have architectural debt.
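The anti-pattern and its fix are easy to show. In this hypothetical sketch the scorer is an injected callable rather than a hardcoded verifier call, so the same loop serves GRPO-style verification, a PRM, or a distillation teacher:

```python
from typing import Callable

def rl_step(generate: Callable[[str], str],
            score: Callable[[str, str], float],
            train: Callable[[str, str, float], None],
            prompt: str) -> float:
    # The generate -> score -> train loop; `score` is pluggable.
    rollout = generate(prompt)
    reward = score(prompt, rollout)   # verifier, PRM, teacher — anything
    train(prompt, rollout, reward)
    return reward

# Two interchangeable scorers under the same interface:
verifier = lambda prompt, rollout: 1.0 if "42" in rollout else 0.0
length_penalty = lambda prompt, rollout: -len(rollout) / 100

updates = []
reward = rl_step(
    generate=lambda p: p + " -> 42",            # stand-in for model sampling
    score=verifier,                             # swap in length_penalty freely
    train=lambda p, r, rew: updates.append(rew),
    prompt="answer:",
)
print(reward, updates)  # 1.0 [1.0]
```

Swapping `verifier` for `length_penalty` changes nothing else in the loop, which is the architectural point.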

For those of us not training 70B models but watching this space: the patterns being worked out here — disaggregated compute, async weight sync, staleness-aware buffers — are going to show up as primitives in whatever the next generation of ML tooling looks like.
