
What Actually Breaks When You Train RL on a Production MoE Model

Source: huggingface

LinkedIn published a retrospective in January 2026 that does something rare: it documents what actually broke, with math and code, rather than just presenting the final results. The subject is agentic reinforcement learning training on GPT-OSS, OpenAI's open-weight model family, and the post reads like a debugging war story that happens to involve PPO, Mixture-of-Experts architectures, and FlashAttention internals.

The broader context matters. After DeepSeek-R1 demonstrated that RL-based post-training could dramatically improve reasoning capabilities, the open source ecosystem moved fast. Frameworks like verl and OpenRLHF emerged to make PPO and GRPO training accessible outside of hyperscaler infrastructure. LinkedIn wanted to go further than math reasoning benchmarks: train a model to use tools across multi-step agentic interactions. What they found is instructive for anyone working at this layer of the stack.

The Setup

GPT-OSS is a Mixture-of-Experts model with variants at 20B and 120B parameters. The training framework is verl, which separates the rollout phase (generating trajectories using an inference engine like vLLM or SGLang) from the training phase (computing PPO updates using FSDP). This decoupled architecture is standard for large-scale RL training: inference engines are optimized for throughput, while training engines are optimized for gradient computation.

The tasks LinkedIn used as benchmarks span single-turn and multi-turn agentic settings: GSM8K for grade school math, VerifyIf for instruction following with verifiable constraints, and ReTool for multi-turn coding with a live code execution tool. Hardware was 16 nodes of H200 GPUs, bf16 precision, with 8k prompt and 16k response context lengths. Three distinct problems had to be solved before any of this worked.

Problem One: MoE Makes PPO’s Fundamental Assumption False

PPO (Proximal Policy Optimization) works by comparing the probability of an action under the current policy versus the probability under the policy that generated the trajectory. This ratio, the importance weight, is expected to be close to 1 when you sample a trajectory and immediately update on it. The clipping mechanism in PPO is calibrated around this assumption.
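The clipped surrogate can be sketched per token (a minimal illustration, not verl's implementation; `eps` is the standard clip range, and the function returns the objective to be maximized):

```python
import math

def ppo_clip_surrogate(log_prob, old_log_prob, advantage, eps=0.2):
    """Per-token PPO clipped surrogate (to be maximized) -- a minimal sketch."""
    ratio = math.exp(log_prob - old_log_prob)          # importance weight
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)    # clamp to [1-eps, 1+eps]
    return min(ratio * advantage, clipped * advantage)

# Truly on-policy: the ratio is exactly 1, so clipping never triggers
ppo_clip_surrogate(-1.0, -1.0, advantage=2.0)   # -> 2.0
# Slightly off-policy: ratio = e^0.5 ~ 1.65 gets clamped to 1.2
ppo_clip_surrogate(-0.5, -1.0, advantage=2.0)   # ~ 2.4 (clipped)
```

The key point for what follows: any gap between `log_prob` and `old_log_prob`, whatever its origin, lands directly in this ratio.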

verl computes log-probabilities twice: once during rollout to get π_old, and once during the training update to get π_current. For a standard dense transformer, these match closely for on-policy training, up to kernel-level numerical noise. For a Mixture-of-Experts model, they can diverge sharply.

MoE routing is non-deterministic. Expert selection depends on top-k softmax over router logits, and even with the same input, routing can differ between two forward passes when there are numerical differences in the computation path. Different CUDA kernels, different memory layouts, or different precision states between the rollout and training passes produce different routing decisions and therefore different log-probabilities for the same tokens.
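A toy example makes the failure mode concrete (the numbers are illustrative, not GPT-OSS router values): when two experts are nearly tied, noise on the order of a bf16 rounding error is enough to flip a top-2 routing decision between two forward passes.

```python
import numpy as np

def top2_experts(router_logits):
    # Top-k expert selection, as in MoE routing (k = 2 here)
    return set(np.argsort(router_logits)[-2:].tolist())

# Router logits with two experts nearly tied (illustrative values)
logits_a = np.array([1.0, 0.52341, 0.52340, -0.3])
# The same input through a different kernel: tiny reduction-order noise
logits_b = logits_a + np.array([0.0, -2e-5, 2e-5, 0.0])

top2_experts(logits_a)   # {0, 1}
top2_experts(logits_b)   # {0, 2} -- a different expert, hence a different log-prob
```

Different experts means a different forward computation for that token, so the two log-prob passes disagree even though the policy weights are identical.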

The symptom is subtle: the importance ratio is not exactly 1, so PPO’s clipping triggers on gradients that should not be clipped. The fix LinkedIn implemented is direct:

if on_policy:
    old_log_prob = log_prob.detach()  # Force ratio = 1
else:
    old_log_prob = model_inputs["old_log_probs"]

For on-policy training, skip recomputing old log-probs entirely. Take the freshly computed log-probs and detach them. This forces the importance ratio to 1 by definition. It trades the theoretical elegance of recomputing old log-probs for numerical stability, and for MoE models the trade is worth it.

Problem Two: Kernels Don’t Agree

The second problem is what happens when your training stack and inference stack use different low-level implementations of the same operation.

vLLM uses a custom Triton kernel for attention during inference. verl’s training path uses FlashAttention-v2 with FSDP. Both are correct implementations of scaled dot-product attention, but they are not numerically identical. Differences in floating-point reduction order, tiling strategies, and kernel fusion produce token-level log-probability differences between inference-generated trajectories and training-computed log-probabilities.
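The root phenomenon is easy to demonstrate in isolation: floating-point addition is not associative, so summing the same values in a different order, which is exactly what different tiling and reduction strategies do, gives a different answer. A deterministic float32 example:

```python
import numpy as np

# 2**24 is the point where float32 can no longer represent n + 1 exactly
x = np.float32([2**24] + [1.0] * 100)

big_first = np.float32(0.0)
for v in x:                # large value first: every +1.0 is rounded away
    big_first = np.float32(big_first + v)

small_first = np.float32(0.0)
for v in x[::-1]:          # small values first: the 1.0s accumulate safely
    small_first = np.float32(small_first + v)

big_first    # 16777216.0
small_first  # 16777316.0 -- same values, different order, different sum
```

Scale this up to the long dot products inside attention and the per-token log-probabilities from two correct kernels will not match bit-for-bit.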

From the RL training perspective, this looks like off-policy data. The model sees trajectories that appear to have been generated by a slightly different policy, and the importance weights reflect this. Gradient norms explode, rewards stop improving, and the training appears to run while optimizing noise.

LinkedIn implemented sequence-level importance sampling as a partial mitigation, which stabilized gradient norms. But the underlying mismatch persisted and training still did not converge. This is a systemic problem in the space. verl, TRL, OpenRLHF, and similar frameworks all face the same tension: the inference engines that produce the best rollouts are not the same codebases as the training frameworks that compute gradients. Importance sampling can absorb small divergences, but it breaks down when divergences are structural.
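The idea behind sequence-level importance sampling, sketched minimally (the exact aggregation verl applies may differ, e.g. length normalization): compute one ratio per trajectory from the summed log-prob gap, so that opposite-signed per-token kernel discrepancies can cancel instead of each triggering its own clipping decision.

```python
import math

def token_ratios(new_logps, old_logps):
    # Token-level importance weights: one clipping decision per token
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]

def sequence_ratio(new_logps, old_logps):
    # Sequence-level importance weight: one ratio for the whole trajectory
    return math.exp(sum(new_logps) - sum(old_logps))

# Kernel noise pushes one token's log-prob up and another's down equally:
new, old = [-1.0, -2.0], [-1.5, -1.5]
token_ratios(new, old)     # [e^0.5, e^-0.5] -- both tokens look off-policy
sequence_ratio(new, old)   # 1.0 -- the trajectory as a whole does not
```

This absorbs symmetric noise, but a systematic bias in one direction, as a structurally different attention computation produces, does not cancel, which matches the observed failure to converge.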

Problem Three: The Architecture Feature Your Framework Doesn’t Know About

The root cause turned out to be something more specific: attention sinks.

Attention sinks are learnable scalar parameters, one per attention head, that participate in the softmax normalization of attention scores without contributing to the output computation. They were introduced to stabilize attention in streaming inference and sliding-window attention settings, where the model needs a mechanism to absorb probability mass that would otherwise make the softmax ill-conditioned.

The modified attention computation works like this:

# Standard attention (per head): Q, K are [seq, d], scores are [seq, seq]
scores = Q @ K.T / sqrt(d)
probs = softmax(scores, dim=-1)
output = probs @ V

# Attention with sinks: one learnable scalar logit per head, shared by
# every query position and appended as an extra softmax column
scores = Q @ K.T / sqrt(d)
combined = concat([scores, sink_param], dim=-1)  # sink_param broadcasts to [seq, 1]
probs = softmax(combined, dim=-1)                # [seq, seq + 1]
probs_content = probs[..., :-1]  # drop sink column; each row now sums to < 1
output = probs_content @ V       # the sink never touches the output

vLLM’s fork of FlashAttention supported attention sinks in its forward pass. verl’s training path used the official FlashAttention-v2, which had no sink support at all. This was not a minor numerical difference; it was a structural difference in the computation. The training backward pass was computing gradients through an attention operation that differed meaningfully from the inference forward pass that generated the trajectories.

The gradient computation for the sink parameter is non-trivial. Since the sink column is dropped before the output, the upstream gradient with respect to the sink probability is zero. But the sink still affects every other attention probability through the shared softmax denominator. LinkedIn derived the backward pass from scratch:

# General softmax gradient for the sink logit S_h (one per head, shared
# across query positions i). P_i,h is the sink probability at position i,
# and g_ij = ∂L/∂P_ij is the upstream gradient w.r.t. attention prob P_ij:
∂L/∂S_h = Σ_i P_i,h (g_i,h - Σ_j P_ij g_ij)

# The sink column is dropped before the output, so g_i,h = 0:
∂L/∂S_h = -Σ_i P_i,h Σ_j P_ij g_ij
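This sink gradient can be sanity-checked numerically with a tiny single-head example (a NumPy sketch using central finite differences; all shapes and values here are illustrative, not LinkedIn's kernel code):

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d = 4, 3
Q, K, V = (rng.standard_normal((seq, d)) for _ in range(3))
G_out = rng.standard_normal((seq, d))     # arbitrary upstream grad on output
sink = 0.7                                # scalar sink logit for this head

def attn_with_sink(s):
    scores = Q @ K.T / np.sqrt(d)
    combined = np.concatenate([scores, np.full((seq, 1), s)], axis=1)
    e = np.exp(combined - combined.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)
    return probs, probs[:, :-1] @ V       # sink column dropped from the output

probs, out = attn_with_sink(sink)
g = G_out @ V.T                           # g_ij = dL/dP_ij for L = sum(out * G_out)
# Analytic formula: dL/dS_h = -sum_i P_i,sink * sum_j P_ij * g_ij
analytic = -(probs[:, -1] * (probs[:, :-1] * g).sum(axis=1)).sum()

loss = lambda s: (attn_with_sink(s)[1] * G_out).sum()
eps = 1e-6
numeric = (loss(sink + eps) - loss(sink - eps)) / (2 * eps)
# analytic and numeric agree to finite-difference precision
```

The same check against a sink-free backward pass shows a systematic error, which is exactly what verl's FlashAttention-v2 path was silently computing.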

They adapted the forward pass from vLLM’s FlashAttention fork, implemented the backward pass, and integrated sequence-parallelism support for FlashAttention-v3. With this fix in place, the training curves changed dramatically: GSM8K converged significantly faster, VerifyIf showed steady reward improvement rather than collapse, and ReTool stabilized with improving validation accuracy across the board.

Memory: Two More Walls

Alongside the numerical correctness issues, LinkedIn hit two distinct memory problems worth noting.

The first was MoE expert materialization. HuggingFace Transformers has two codepaths for MoE forward passes: one for inference that materializes hidden states for all experts simultaneously, producing large intermediate tensors, and one for training that loops sequentially over experts with much lower memory usage. During FSDP training, verl’s log-probability computation triggered the inference path, which attempted to allocate 180 GB on a 139 GB GPU:

torch.OutOfMemoryError: Tried to allocate 180.00 GiB on 139.72 GiB GPU
Location: transformers/models/gpt_oss/modeling_gpt_oss.py:123
hidden_states = hidden_states.repeat(num_experts, 1)

The fix was patching the call to prefer the sequential training path.
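A simplified top-1 illustration of the two codepaths (not the actual Transformers code; the real `modeling_gpt_oss.py` materializes all experts via `repeat`, which is what the batched variant mimics):

```python
import numpy as np

def moe_batched(hidden, expert_w, route, gate):
    """Inference-style path: replicate hidden states for every expert at
    once -- the intermediate tensor is num_experts times the input."""
    E = expert_w.shape[0]
    tiled = np.repeat(hidden[None], E, axis=0)          # [E, T, d] blow-up
    outs = np.einsum('etd,edf->etf', tiled, expert_w)   # every expert, every token
    return gate[:, None] * outs[route, np.arange(len(route))]

def moe_sequential(hidden, expert_w, route, gate):
    """Training-style path: loop over experts, touching only the tokens
    routed to each -- only one expert's activations are live at a time."""
    out = np.zeros((hidden.shape[0], expert_w.shape[-1]))
    for e in range(expert_w.shape[0]):
        mask = route == e
        out[mask] = gate[mask, None] * (hidden[mask] @ expert_w[e])
    return out
```

Both paths produce identical outputs; only peak memory differs, which is why routing verl's log-prob computation onto the sequential path fixes the OOM without changing results.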

The second was sequence length. Agentic RL requires long contexts. With 8k prompts and 16k response windows, FSDP parameter sharding alone is insufficient because activation memory scales with sequence length. LinkedIn implemented sequence parallelism with FlashAttention-v3: sequences are partitioned across GPUs, non-attention layers process their chunks independently, and attention layers use all-to-all communication to reassemble full sequences before the attention computation. Peak activation memory per GPU scales inversely with the sequence parallelism degree, which is precisely what’s needed for scaling context length without proportionally scaling GPU count.
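The communication pattern can be mimicked in a few lines (a toy single-process simulation with a full-sequence softmax standing in for attention; real implementations overlap the all-to-all with compute and shard attention heads as well):

```python
import numpy as np

def layer_no_sp(x):
    h = np.tanh(x)                                   # position-wise op
    return np.exp(h) / np.exp(h).sum(axis=0)         # needs the full sequence

def layer_with_sp(x, sp_degree):
    chunks = np.array_split(x, sp_degree, axis=0)    # shard along the sequence dim
    chunks = [np.tanh(c) for c in chunks]            # each "rank" works alone:
                                                     # live activations are 1/sp_degree
    full = np.concatenate(chunks, axis=0)            # all-to-all: reassemble sequence
    return np.exp(full) / np.exp(full).sum(axis=0)   # full-sequence attention stand-in
```

The sharded and unsharded paths are numerically equivalent; only the per-rank activation footprint for the position-wise layers shrinks, by the sequence-parallelism degree.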

What This Retrospective Actually Says

The deeper lesson here is about the assumptions baked into RL training frameworks. verl, like most frameworks in this space, was built and validated primarily on dense transformer architectures. PPO’s on-policy assumption is well-understood for those. MoE routing breaks it in a way that’s hard to observe directly. Attention sinks break the training-inference symmetry in a way that’s invisible unless you examine attention kernel compatibility at a low level. These are not bugs in verl as a general framework; they are gaps between the framework’s assumptions and the architecture of the model being trained.

As the open source RL training ecosystem matures, more of these architecture-specific gaps will surface. The DeepSeek-R1 wave pushed the industry toward RL post-training, but the models being trained now are more diverse and more complex than the ones the frameworks were designed around. LinkedIn’s contribution here, beyond fixing their own training run, is documenting the failure modes in enough detail that other teams can diagnose them before burning H200-hours on non-converging training runs.

The FlashAttention backward pass contribution is particularly worth watching. Attention sinks are increasingly common in models designed for streaming or long-context inference, and any team training those models with verl will hit the same wall. Closing that gap in the official FlashAttention codebase benefits everyone building in this space.
