
Why Every RL Training Framework Independently Reinvented the Same Architecture

Source: huggingface

There is a particular kind of insight that only emerges when you compare many independent implementations of the same idea. Hugging Face just published a detailed survey of 16 open-source RL training libraries, and the headline finding is striking: all of them, independently, arrived at the same core architecture.

The problem they were all solving is concrete. In synchronous RL training, the inference step dominates wall-clock time by a wide margin. For a 32B model generating 32K-token rollouts, a single batch takes around 3.7 hours. During all of that, your training GPUs sit completely idle. This is not a minor inefficiency — it is the entire bottleneck.

The Convergent Solution

Every library surveyed independently converged on what the post calls the “disaggregated” topology: put inference on one pool of GPUs, training on another, connect them with a rollout buffer, and let both sides run continuously. The inference pool feeds rollouts into the queue. The training pool pulls from it, computes gradients, and periodically pushes updated weights back. Neither side blocks the other.
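The topology is easy to sketch in miniature. Below is a minimal, illustrative model of it using Python threads and a bounded queue standing in for the two GPU pools and the rollout buffer; all names (`inference_pool`, `training_pool`, `policy_version`) are invented for the sketch and are not any surveyed library's API.

```python
import queue
import threading

ROLLOUT_BUFFER = queue.Queue(maxsize=8)  # bounded buffer between the two pools
STOP = object()                          # sentinel to end the sketch

policy_version = 0                       # weights the inference side samples with
version_lock = threading.Lock()

def inference_pool(num_rollouts: int) -> None:
    """Stand-in for the inference side: generate rollouts continuously."""
    for i in range(num_rollouts):
        with version_lock:
            v = policy_version
        # Each rollout records which policy version produced it, so the
        # trainer can measure staleness later.
        ROLLOUT_BUFFER.put({"tokens": [i], "policy_version": v})
    ROLLOUT_BUFFER.put(STOP)

def training_pool(sync_every: int) -> list:
    """Stand-in for the training side: consume rollouts, push weights back."""
    global policy_version
    consumed = []
    while (item := ROLLOUT_BUFFER.get()) is not STOP:
        consumed.append(item)
        # ... compute gradients, optimizer step ...
        if len(consumed) % sync_every == 0:
            with version_lock:
                policy_version += 1  # "broadcast" updated weights to inference
    return consumed

producer = threading.Thread(target=inference_pool, args=(32,))
producer.start()
rollouts = training_pool(sync_every=8)
producer.join()
```

The key property is visible even at this scale: neither side waits for the other except at the buffer boundary, and the buffer's `maxsize` is what bounds how far the two sides can drift apart.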

This sounds obvious once stated, but the interesting part is the how — and where libraries diverge.

Orchestration: Ray dominates here, used by 8 of the 16 libraries. It solves the actor scheduling and fault tolerance problem well enough that most teams reached for it independently. A few libraries use native Python async primitives or HTTP microservices instead.

Weight sync: NCCL broadcast is the default for around 10 libraries, typically running in the 100–500ms range. verl gets this down to ~20ms using bucketed packing. One outlier, PipelineRL, syncs weights per transformer forward pass — acquiring and releasing a weight lock in a few milliseconds at a time.
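The bucketing idea is that one collective over a large flat buffer is far cheaper than hundreds of collectives over individual tensors. Here is a hedged sketch of that packing step using NumPy arrays in place of GPU tensors; this is illustrative only and not verl's actual implementation, and the function names are invented.

```python
import numpy as np

def pack_into_buckets(params, bucket_elems=1 << 18):
    """Greedily pack named parameter arrays into large flat buckets.

    One broadcast per bucket then replaces one broadcast per tensor,
    which is where the launch-overhead win comes from. Assumes a
    uniform dtype across parameters for simplicity.
    """
    buckets, layouts = [], []
    cur, meta, off = [], [], 0
    for name, arr in params.items():
        if off + arr.size > bucket_elems and cur:
            buckets.append(np.concatenate(cur))
            layouts.append(meta)
            cur, meta, off = [], [], 0
        meta.append((name, off, arr.shape))  # where this tensor lives in the bucket
        cur.append(arr.ravel())
        off += arr.size
    if cur:
        buckets.append(np.concatenate(cur))
        layouts.append(meta)
    return buckets, layouts

def unpack(buckets, layouts):
    """Receiver side: slice each flat bucket back into named tensors."""
    out = {}
    for flat, meta in zip(buckets, layouts):
        for name, offset, shape in meta:
            size = int(np.prod(shape))
            out[name] = flat[offset : offset + size].reshape(shape)
    return out
```

In a real system the flat buckets would be the buffers handed to NCCL broadcast; the layout metadata only needs to be exchanged once, since the parameter set is static.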

Staleness: When the training policy drifts ahead of the inference policy, you have a distribution mismatch. Libraries handle this three ways: drop stale samples outright, bound the buffer depth architecturally, or apply importance sampling correction. Production systems like PRIME-RL and AReaL combine depth bounding with optional IS correction.
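The three strategies compose naturally: a depth bound caps worst-case drift, and IS correction reweights whatever drift remains. A minimal sketch of both pieces (the function names, the staleness threshold, and the clip value are illustrative choices, not any library's defaults):

```python
import math

MAX_STALENESS = 2  # drop rollouts more than 2 policy versions behind

def staleness_filter(rollouts, current_version, max_staleness=MAX_STALENESS):
    """Keep only rollouts whose generating policy is close enough to current."""
    return [
        r for r in rollouts
        if current_version - r["policy_version"] <= max_staleness
    ]

def is_weight(logp_train, logp_gen, clip=2.0):
    """Per-token truncated importance weight pi_train(a|s) / pi_gen(a|s).

    Truncation bounds the variance that off-policy drift would
    otherwise introduce into the gradient estimate.
    """
    return min(math.exp(logp_train - logp_gen), clip)
```

The depth bound is architectural (it falls out of the buffer's maximum size), while the IS weight is applied per token inside the loss, which is why the two are easy to combine.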

The Gaps That Actually Matter

The more interesting section of the post is what none of these libraries handle well yet.

MoE (mixture-of-experts) support is sparse. With ZeRO-based training (DeepSpeed), every forward pass all-gathers the parameters of every expert, which completely negates the sparsity that makes MoE models efficient. Only the Megatron-backed libraries implement Expert Parallelism correctly, and LoRA support for MoE models is nearly nonexistent.

The DeepSeek v3.2 findings are worth reading carefully. There are two structural mismatches between inference and training that importance sampling cannot fix:

  • Expert routing inconsistency: inference and training routers diverge due to floating-point rounding, so the model that generated a rollout is subtly different from the model being trained.
  • Sampling mask mismatch: top-p/top-k sampling excludes tokens at generation time, but training sees the full vocabulary — breaking the IS identity at the token level.
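The second mismatch is easy to demonstrate numerically. Top-k (or top-p) sampling renormalizes the distribution over the surviving tokens, so the probability the sampler actually used differs from the full-softmax probability the trainer computes, even with bit-identical weights. A small self-contained illustration (the logit values are arbitrary):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.0, -3.0]
full = softmax(logits)  # what training sees: the full vocabulary

# Top-k with k=2: generation renormalizes over the surviving tokens only.
kept = sorted(range(len(logits)), key=lambda i: -logits[i])[:2]
gen = dict(zip(kept, softmax([logits[i] for i in kept])))

token = 0  # suppose token 0 was the one sampled
ratio = full[token] / gen[token]
# Same weights, same token, yet ratio != 1: the IS identity breaks at the
# token level unless the sampling mask is replayed at training time.
```

Because the truncated support always concentrates probability mass, the ratio is systematically below 1 for every kept token, so this is a bias rather than mere noise that averaging could wash out.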

DeepSeek’s solution (“Keep Routing” and “Keep Sampling Mask”) requires extending the data contract between inference and training to pass routing paths and sampling masks alongside token IDs and log probs. The post notes this is a breaking change to every library’s data flow.
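Concretely, the extended contract means each rollout step carries two extra fields. A sketch of what that record could look like; the field names and types here are illustrative, not any library's actual schema:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class RolloutSample:
    """One token step of the inference -> training data contract.

    The first three fields are roughly what libraries ship today; the
    last two are the extensions the DeepSeek v3.2 fixes require.
    """
    token_id: int
    logprob: float
    policy_version: int
    # "Keep Routing": expert indices chosen at generation time, one list
    # per MoE layer, replayed exactly during the training forward pass.
    routing_path: Optional[List[List[int]]] = None
    # "Keep Sampling Mask": which vocab ids survived top-p/top-k, so
    # training can renormalize over the same support.
    sampling_mask: Optional[List[int]] = None
```

Making the new fields optional lets dense, greedy-decoded workloads keep their current payload size, which is one way a library could stage the breaking change.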

Process reward models — where a separate model scores intermediate reasoning steps rather than just final outputs — also break the standard async pattern. Scoring is no longer cheap and fast. No current library automates this cleanly.

What This Means in Practice

If you are building or evaluating RL training infrastructure today, the convergence on Ray + NCCL + bounded queues gives you a reasonable default stack. But the survey makes clear that the next wave of challenges — long-horizon agentic trajectories, MoE training, process rewards, multi-agent self-play — will stress current designs in predictable ways.

The fact that 16 teams independently arrived at the same architecture is useful signal. It means the design is probably right. The fact that all of them have the same gaps is equally useful — it tells you exactly where the work is.
