
The Clever Communication Trick Behind Million-Token LLM Training

Source: Hugging Face

Training a transformer on sequences longer than 32k tokens is where the pain begins. Attention is O(n²) in both memory and compute — and even with FlashAttention cutting memory to O(n), a single GPU eventually just runs out. You can’t data-parallel your way out of it either, since each GPU still has to run the full attention block.

The solution Snowflake AI Research landed on — and that Hugging Face just integrated across Accelerate, Transformers, and TRL — is called Ulysses Sequence Parallelism, and the core idea is genuinely elegant.

Heads, Not Rings

The key insight: instead of splitting attention computation by passing KV chunks around a ring of GPUs (Ring Attention), Ulysses splits by attention heads. Each GPU starts out owning a slice of the sequence dimension; an all-to-all collective then redistributes the activations so that each GPU holds all sequence positions for a subset of heads. It computes full attention for those heads locally, then a second all-to-all restores the sequence-sharded layout.

Two collective operations. That’s it.

The communication cost per GPU is O(n·d/P), where P is the number of GPUs in the sequence-parallel group. Ring Attention’s per-GPU cost is O(n·d) — P times more data moved across the wire. At scale, that gap matters enormously.
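The reshuffle can be simulated in a single process. The sketch below (toy sizes, NumPy instead of a real `torch.distributed.all_to_all`) shows the core data movement: each rank starts with a sequence shard of shape `[n/P, h, d]` and, after the exchange, holds the full sequence for `h/P` heads. Note each rank only sends `(P-1)/P` of its `n/P × h × d` shard, which is where the `O(n·d/P)` per-GPU cost comes from.

```python
import numpy as np

n, h, d, P = 8, 4, 2, 2  # seq length, heads, head dim, GPUs in the SP group

x = np.arange(n * h * d, dtype=np.float32).reshape(n, h, d)  # full activations
# Each rank begins with a contiguous slice of the sequence dimension.
shards = [x[r * (n // P):(r + 1) * (n // P)] for r in range(P)]  # [n/P, h, d]

def all_to_all_heads(shards):
    """Rank r sends head-group g of its shard to rank g; afterwards each
    rank holds every sequence position for h/P heads."""
    hp = h // P
    return [
        np.concatenate([s[:, r * hp:(r + 1) * hp] for s in shards], axis=0)
        for r in range(P)
    ]

head_shards = all_to_all_heads(shards)
# Each rank can now run full attention locally on its head subset,
# then a second all-to-all restores the sequence-sharded layout.
assert head_shards[0].shape == (n, h // P, d)
assert np.array_equal(head_shards[0], x[:, :h // P])  # all positions, heads 0-1
```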

What the Numbers Actually Look Like

The Qwen3-4B benchmarks on an H100 80GB are worth pausing on:

Sequence Length    Throughput
8K (baseline)      3,633 tok/sec
32K (SP=4)         ~7,600 tok/sec
64K (SP=4)         13,396 tok/sec

A 3.7x throughput gain at 64K sequences isn’t just a memory story — it’s a compute efficiency story. Once sequences get long enough, the quadratic attention term dominates everything else, and distributing it stops being a trade-off and starts being a strict win.
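A rough back-of-the-envelope calculation illustrates why the quadratic term takes over. Using illustrative constants (the hidden size below is hypothetical, not Qwen3-4B's actual dimensions, and the FLOP formulas are simplified), attention's share of per-layer compute grows from a minority at 8K tokens to a clear majority at 64K:

```python
d_model = 2560  # hypothetical hidden size, for illustration only
shares = {}
for n in (8_192, 65_536):
    attn = 2 * n * n * d_model        # attention-score FLOPs: quadratic in n
    mlp = 16 * n * d_model * d_model  # MLP FLOPs: linear in n
    shares[n] = attn / (attn + mlp)
# shares[8192] ≈ 0.29, shares[65536] ≈ 0.76
```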

The Gotchas Worth Knowing

There are a few sharp edges if you implement this yourself:

Loss aggregation is non-obvious. Since different ranks compute loss on different sequence positions with different numbers of valid tokens, you can’t just average. You need a token-weighted mean across SP ranks or your training signal will be skewed.
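A minimal sketch of the difference (made-up per-rank numbers): averaging per-rank mean losses over-weights ranks that happen to hold few valid tokens, while summing losses and token counts across the SP group before dividing does not.

```python
import numpy as np

# Hypothetical per-rank totals: sum of token losses, count of valid tokens.
rank_loss_sums = np.array([120.0, 30.0, 90.0, 60.0])
rank_token_counts = np.array([100, 20, 60, 40])

# Naive: average the per-rank means. Rank 1's 20 tokens count as much as
# rank 0's 100 tokens, skewing the signal.
naive = np.mean(rank_loss_sums / rank_token_counts)        # 1.425

# Token-weighted: total loss over total valid tokens across SP ranks
# (in real training, via an all-reduce of both sums).
weighted = rank_loss_sums.sum() / rank_token_counts.sum()  # 300/220 ≈ 1.364
```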

Position IDs, not 4D attention masks. At 128k tokens, a full 4D attention mask is roughly a terabyte. Use position_ids for causal masking instead.
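The arithmetic behind that number, with assumed values (batch 1, 16 heads, fp32; the real head count depends on the model):

```python
seq = 128 * 1024
heads, bytes_per_element = 16, 4  # assumed head count, fp32

# Materialized 4D mask: [batch, heads, seq, seq]
mask_bytes = 1 * heads * seq * seq * bytes_per_element  # ≈ 1.1e12 bytes, ~1.1 TB

# position_ids: one int64 per token
pos_bytes = seq * 8  # 1,048,576 bytes, ~1 MiB
```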

Sequence length must divide evenly by sp_size. Set pad_to_multiple_of=sp_size and move on.
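If your collator doesn't expose that option, the padding logic is trivial to write yourself. A hypothetical helper (the function name and pad id are illustrative, not a library API):

```python
def pad_to_multiple(ids, sp_size, pad_id=0):
    """Right-pad a list of token ids so its length divides sp_size evenly."""
    remainder = len(ids) % sp_size
    if remainder:
        ids = ids + [pad_id] * (sp_size - remainder)
    return ids

padded = pad_to_multiple(list(range(10)), sp_size=4)  # 10 -> 12 tokens
```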

Accessible Now

What makes this worth writing about isn’t that the technique is new — DeepSpeed has had Ulysses for a while. It’s that it’s now wired into the training stack most people actually use:

from accelerate.utils import ParallelismConfig, DeepSpeedSequenceParallelConfig

parallelism_config = ParallelismConfig(
    sp_backend="deepspeed",  # use the DeepSpeed Ulysses backend
    sp_size=4,               # 4-way sequence-parallel group
)

Three lines and your Trainer or SFTTrainer run handles loss aggregation, dataloader wrapping, and gradient scaling automatically.

Long-context training used to mean carefully orchestrated custom infrastructure. Now it’s a config option. That’s the real story.
