
What 16 RL Libraries Independently Figured Out About Keeping GPUs Busy

Source: huggingface

There is a specific kind of satisfaction in watching independent teams, working in isolation, converge on the same solution. It usually means they found something true.

The HuggingFace team surveyed 16 open-source RL training libraries — OpenRLHF, verl, PRIME-RL, SkyRL, AReaL, and more — looking for patterns in how they handle async training. The finding is striking: nearly all of them independently built the same three-box architecture.

The Problem Is Not What You Think

Synchronous RL training is not compute-bound. It is waiting-bound. A 32B model generating 512 rollouts on a single H100 takes 3.7 hours — and during that entire time, your training GPUs are idle. The bottleneck is autoregressive generation: you cannot parallelize the token-by-token output of an LLM, and you cannot start a training step until all your rollouts are done.

This gets dramatically worse with chain-of-thought and agentic RL. Variable-length completions mean you are always waiting on the slowest sample in the batch. The “straggler problem” familiar from distributed systems shows up here with a vengeance — and with multi-agent setups, the 90th percentile completion time can be 25x the median instead of the usual 5x.
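To see why the straggler dominates, it helps to work the arithmetic: in a synchronous step, every worker waits for the slowest rollout, so the idle fraction is one minus the mean-to-max ratio of generation times. A minimal sketch, using an illustrative heavy-tailed distribution (the specific parameters are assumptions, not measurements from the post):

```python
import random

def batch_idle_fraction(times):
    """In synchronous RL the whole batch waits for the slowest rollout,
    so every finished worker idles until max(times) elapses."""
    busy = sum(times)
    total = max(times) * len(times)  # wall-clock capacity of all workers
    return 1 - busy / total

# Hypothetical per-rollout generation times (seconds); heavy-tailed to
# mimic variable-length chain-of-thought completions.
random.seed(0)
times = [random.lognormvariate(3.0, 1.0) for _ in range(512)]

print(f"median: {sorted(times)[len(times) // 2]:.1f}s, max: {max(times):.1f}s")
print(f"fraction of worker-time spent idle: {batch_idle_fraction(times):.0%}")
```

The heavier the tail, the closer the idle fraction gets to 1 — which is exactly why the 25x multi-agent tail is so much more painful than the usual 5x.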

The Convergent Architecture

Every library surveyed eventually landed on the same shape: separate inference and training GPU pools connected by a rollout buffer queue.

Inference Pool (vLLM/SGLang)
        ↓ rollouts
Rollout Buffer (bounded queue)
        ↓ samples
Training Pool (FSDP/Megatron)
        ↓ updated weights
Weight Sync (NCCL) → back to the Inference Pool

Inference generates continuously. Training consumes from the queue continuously. Weight sync happens in the background. Neither pool idles significantly.

The interesting variation is not in the overall shape — it is in the details: how deep is the buffer, how do you handle stale samples, what happens when a weight update arrives mid-sequence?
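The shape itself is just a bounded producer-consumer pipeline. A minimal sketch with plain Python threads standing in for the two pools and the background sync — all names here are illustrative, not any surveyed library's actual API:

```python
import queue
import threading

rollout_buffer = queue.Queue(maxsize=8)  # bounded: also caps staleness depth
policy_version = 0                        # bumped by the "weight sync" stand-in
lock = threading.Lock()
N_ROLLOUTS = 32

def inference_worker():
    """Generates continuously; blocks only when the buffer is full."""
    for i in range(N_ROLLOUTS):
        with lock:
            v = policy_version
        rollout_buffer.put({"rollout": i, "version": v})
    rollout_buffer.put(None)  # sentinel: generation done

def training_worker(consumed):
    """Consumes continuously; 'syncs weights' every 4 samples."""
    global policy_version
    while (sample := rollout_buffer.get()) is not None:
        consumed.append(sample)
        if len(consumed) % 4 == 0:
            with lock:
                policy_version += 1  # background weight sync stand-in

consumed = []
producer = threading.Thread(target=inference_worker)
consumer = threading.Thread(target=training_worker, args=(consumed,))
producer.start(); consumer.start()
producer.join(); consumer.join()

print(f"trained on {len(consumed)} rollouts, final policy version {policy_version}")
```

Note how the bounded `maxsize` does double duty: it applies backpressure to the producer and architecturally limits how far behind the buffered samples can fall — which is the buffer-depth question the next section turns on.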

The Seven Axes That Actually Matter

The post breaks down the design space into seven dimensions: orchestration primitive, buffer design, weight sync protocol, staleness management, partial rollout handling, LoRA support, and distributed backend.

A few stood out to me.

Weight sync speed matters less for LoRA than you think. Full-parameter sync for a 7B model takes 100-500ms over NCCL. But if you sync only the LoRA adapter deltas, that drops to sub-millisecond — roughly 50MB instead of 14GB. Thirteen of the sixteen libraries support LoRA, but only eight sync adapter weights alone. The rest are leaving significant performance on the table: with sub-millisecond syncs, the interrupt-granularity problem — deciding when it is safe to pause in-flight generation for an update — nearly disappears.
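The 14GB-vs-50MB gap falls straight out of a back-of-envelope count, assuming bf16 (2 bytes per parameter) and a rank-16 adapter on the attention projections. The layer counts and dimensions below are illustrative 7B-class numbers, not any specific model's config:

```python
BYTES = 2  # bf16
full_params = 7e9
full_sync_gb = full_params * BYTES / 1e9  # full-parameter payload

n_layers, hidden, rank = 32, 4096, 16
mats_per_layer = 4  # q, k, v, o projections
# Each adapted matrix adds A (hidden x rank) plus B (rank x hidden).
lora_params = n_layers * mats_per_layer * 2 * hidden * rank
lora_sync_mb = lora_params * BYTES / 1e6  # adapter-only payload

print(f"full sync: ~{full_sync_gb:.0f} GB, LoRA sync: ~{lora_sync_mb:.0f} MB")
```

With these assumptions the adapter payload lands in the tens of megabytes — a roughly 400x reduction in bytes on the wire, which is where the three-orders-of-magnitude latency drop comes from.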

Staleness is a policy question, not just an engineering one. When your rollout buffer is a few steps stale, you are training on samples generated by a slightly older model. Three approaches exist: reject stale samples (wastes compute), cap queue depth (simple architectural bound), or apply importance sampling corrections (preserves throughput but adds complexity). Production systems combine all three.
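Two of the three policies can be sketched in a few lines: a hard lag cutoff plus an importance-sampling correction on the survivors (the bounded buffer from the architecture diagram supplies the third). The importance ratio here is a single scalar per sample for illustration; real systems compute it per token from the old and new policies' log-probs:

```python
import math

MAX_LAG = 2  # reject anything more than 2 policy versions stale

def prepare_batch(samples, current_version):
    batch = []
    for s in samples:
        if current_version - s["version"] > MAX_LAG:
            continue  # too stale: reject (wastes this rollout's compute)
        # Importance-sampling correction: exp(logp_new - logp_old)
        weight = math.exp(s["logp_new"] - s["logp_old"])
        batch.append({**s, "is_weight": weight})
    return batch

samples = [
    {"version": 5, "logp_old": -1.2, "logp_new": -1.0},  # lag 0: kept
    {"version": 4, "logp_old": -0.8, "logp_new": -1.1},  # lag 1: kept
    {"version": 1, "logp_old": -0.5, "logp_new": -2.0},  # lag 4: dropped
]
batch = prepare_batch(samples, current_version=5)
print(len(batch), [round(s["is_weight"], 3) for s in batch])
```

The trade-off is visible even in this toy: rejection throws away finished work, while the correction keeps it at the cost of higher-variance gradients — hence production systems layering all three.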

MoE correctness is quietly broken in most frameworks. DeepSeek-style mixture-of-experts models have a subtle mismatch between inference and training. During inference, expert routing picks the top-2 experts by score. During training, floating-point rounding can select different experts — so your gradients update parameters that were never active during generation. Only Megatron-backed libraries handle this correctly via Expert Parallelism. ZeRO-based frameworks load MoE models fine but silently lose the sparsity benefit.
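The failure mode is easy to reproduce in miniature: two near-tied router scores plus a perturbation on the scale of floating-point rounding (as can arise from different kernels or reduction orders between the inference and training stacks) select different experts. The scores below are contrived to sit on that knife edge:

```python
def top2(scores):
    """Indices of the two highest-scoring experts."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:2]

# Same logical router output, seen through two numerical stacks that
# disagree by ~1e-9 on one near-tied score.
inference_scores = [0.31, 0.299999999, 0.300000000, 0.09]
training_scores  = [0.31, 0.300000001, 0.300000000, 0.09]

print("inference picks experts", top2(inference_scores))
print("training  picks experts", top2(training_scores))
```

When the two selections diverge, the training step computes gradients through an expert that never produced a single token of the rollout — which is why the mismatch is silent: nothing crashes, the loss still goes down, the updates are just aimed at the wrong parameters.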

Why This Is a Systems Problem Wearing an ML Hat

What I find most interesting about this survey is how it reframes RL training as a distributed systems design problem. The questions being asked — how do you bound queue depth, how do you handle in-flight work when state changes, how do you minimize sync latency without sacrificing correctness — are the same questions you would ask building any producer-consumer pipeline at scale.

The ML-specific wrinkle is that your “state” is a 30 billion parameter model and your sync operation involves broadcasting it across a cluster. But the architectural reasoning is the same.

Ray dominates orchestration across 8 of the 16 libraries because the actor model maps cleanly onto RL’s heterogeneous components: separate actors for inference workers, training workers, and reward computation. It handles scheduling and fault tolerance. The operational overhead is real but apparently worth it.

TRL’s planned async trainer looks like the synthesis of all these lessons: bounded queue with per-token version tracking, NCCL bucketing for weight sync, and a pluggable scoring interface so process rewards and distillation work without architectural changes.

If you are doing anything with RL fine-tuning at scale, the full post is worth your time. The design space is more nuanced than the headline architecture suggests, and the emerging problems around process reward models and multi-agent co-evolution are ones the community has not fully solved yet.
