Splitting Attention Across GPUs: How Ulysses Makes Million-Token Training Tractable
Source: huggingface
Training a transformer on a million-token context sounds like a hardware problem. It is, but it’s also an algorithm problem — and Ulysses Sequence Parallelism is a clean solution to both at once.
The fundamental issue is that attention scales O(n²) in compute and memory with sequence length. On a single 80GB H100, you hit a wall somewhere around 8K tokens when training a 4B parameter model under ZeRO-3. Anything longer and you’re OOM. This rules out entire categories of useful training: processing full documents, long codebases, extended reasoning chains.
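A quick back-of-envelope calculation shows why the wall arrives so fast. This sketch computes the memory needed to materialize one layer's attention score matrix in bf16 (the head count of 32 is illustrative, and FlashAttention's tiling avoids storing this matrix in full, but the quadratic compute remains):

```python
# Memory for one layer's n x n attention score matrix, batch=1, bf16 (2 bytes).
# Illustrative only: FlashAttention tiles this so it is never fully materialized,
# but the O(n^2) compute cost is unchanged.
def score_matrix_gib(seq_len: int, num_heads: int, bytes_per_el: int = 2) -> float:
    return num_heads * seq_len * seq_len * bytes_per_el / 2**30

for n in (8_192, 65_536, 1_048_576):
    print(f"{n:>9} tokens: {score_matrix_gib(n, num_heads=32):,.1f} GiB")
```

At 8K tokens this is already 4 GiB per layer; at a million tokens it would be tens of terabytes, which is why the algorithm has to change, not just the hardware.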
The Head Locality Insight
Ulysses works by exploiting a property of multi-head attention that’s easy to overlook: attention heads are fully independent of each other. You don’t need all heads on the same GPU — you just need all sequence positions for the heads you’re computing.
Here’s the trick:
- Shard the input sequence across P GPUs (each GPU holds `seq_len / P` tokens)
- Compute Q, K, V projections locally on each GPU's chunk
- Run an all-to-all collective to redistribute: now each GPU holds all sequence positions but only `num_heads / P` heads
- Compute full attention for those heads using FlashAttention
- Another all-to-all to reverse the redistribution
- Output projection back on the local sequence chunk
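The two reshuffles are easiest to see as pure index permutations. Here is a single-process NumPy simulation of the steps above, with axis 0 standing in for the GPU rank (a real implementation would use a distributed all-to-all collective, and the Q/K/V projections are omitted):

```python
import numpy as np

# Toy sizes: P "GPUs", seq positions, heads, head dim.
P, seq, h, d = 4, 8, 4, 2
s_local, h_local = seq // P, h // P
rng = np.random.default_rng(0)
x = rng.standard_normal((P, s_local, h, d))   # sequence-sharded Q=K=V (toy)

# All-to-all #1: sequence-sharded -> head-sharded.
y = (x.reshape(P, s_local, P, h_local, d)     # split heads into P groups
       .transpose(2, 0, 1, 3, 4)              # head-group index becomes the GPU index
       .reshape(P, seq, h_local, d))          # each GPU: full sequence, h/P heads

def attention(q, k, v):                       # q, k, v: (seq, heads, d)
    s = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return np.einsum('hqk,khd->qhd', w, v)

# Each "GPU" computes full attention over its head slice, independently.
sharded = np.concatenate([attention(y[g], y[g], y[g]) for g in range(P)], axis=1)

# Reference: a single device holding the whole sequence and all heads.
full = x.reshape(seq, h, d)
assert np.allclose(sharded, attention(full, full, full))  # head independence

# All-to-all #2 is the exact inverse permutation, restoring sequence shards.
z = (y.reshape(P, P, s_local, h_local, d)
       .transpose(1, 2, 0, 3, 4)
       .reshape(P, s_local, h, d))
assert np.allclose(z, x)
```

The assertions confirm the two claims the algorithm rests on: per-head attention over redistributed shards matches single-device attention exactly, and the second all-to-all undoes the first.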
Two all-to-all operations per attention layer. That’s the entire overhead. Communication volume is O(n·d/P) per GPU — P times cheaper than Ring Attention, which does P-1 sequential point-to-point transfers across the ring.
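The scaling comparison can be sketched with throwaway arithmetic. Constant factors (number of Q/K/V tensors per collective, dtype size) are deliberately dropped, and the hidden size is illustrative rather than taken from the post:

```python
# Asymptotic per-GPU communication volume per attention layer (scaling sketch;
# constant factors omitted, so only the ratio is meaningful).
def ulysses_volume(n, d, P):
    return n * d / P            # two all-to-alls, each moving O(n*d/P)

def ring_volume(n, d, P):
    return (P - 1) * n * d / P  # P-1 ring steps, each passing an O(n*d/P) K/V block

n, d, P = 65_536, 2_560, 4      # d=2560 is illustrative, not from the post
print(ring_volume(n, d, P) / ulysses_volume(n, d, P))  # -> 3.0, i.e. P-1 (~P for large P)
```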
The Numbers Are Hard to Argue With
On 4× H100s training Qwen3-4B, the results from the HuggingFace post are striking:
| Setup | Sequence Length | Throughput |
|---|---|---|
| Baseline (1 GPU) | 8K | 3,633 tok/s |
| SP=4 | 32K | 7,733 tok/s |
| SP=4 | 64K | 13,396 tok/s |
That 3.7x throughput improvement at 64K tokens isn’t magic — it’s because attention compute grows quadratically but communication grows linearly, so longer sequences increasingly amortize the all-to-all overhead. The longer your sequences, the better Ulysses looks relative to alternatives.
And critically, the loss curves match. A/B testing against a standard data-parallel baseline showed mean absolute loss differences of 0.005. The trick to getting this right is scaling gradient accumulation steps with SP degree (GAS = SP) so the effective token count per optimizer step stays constant.
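The GAS = SP bookkeeping is worth spelling out. In an SP group, all sp_size GPUs cooperate on one sequence, so the data-parallel width shrinks by sp_size; scaling gradient accumulation by the same factor restores the token budget. A minimal sketch with illustrative numbers (not taken from the post):

```python
# Why GAS must scale with SP: tokens reaching the optimizer per step.
def tokens_per_optimizer_step(n_gpus, seq_len, micro_bs, gas, sp):
    dp = n_gpus // sp                  # GPUs in one SP group share one sequence
    return dp * micro_bs * seq_len * gas

baseline = tokens_per_optimizer_step(n_gpus=4, seq_len=8_192, micro_bs=1, gas=1, sp=1)
with_sp  = tokens_per_optimizer_step(n_gpus=4, seq_len=8_192, micro_bs=1, gas=4, sp=4)
assert baseline == with_sp             # GAS = SP keeps the token count constant
```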
What This Looks Like in Practice
The HuggingFace integration is what makes this actually usable rather than a research curiosity. With the transformers Trainer:
```python
parallelism_config = ParallelismConfig(
    sp_backend="deepspeed",
    sp_size=4,
    sp_handler=DeepSpeedSequenceParallelConfig(
        sp_attn_implementation="flash_attention_2",
    ),
)
```
The Trainer handles sequence sharding, loss aggregation across ranks, and batch accounting automatically. The main constraints to keep in mind: `seq_len % sp_size == 0`, `num_heads >= sp_size`, and you'll want Flash Attention 2 or 3 (not SDPA).
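Those constraints are cheap to verify before launching a multi-GPU job. A hypothetical pre-flight helper (not part of the transformers API) might look like:

```python
# Hypothetical pre-flight check mirroring the constraints above;
# not a transformers/DeepSpeed API, just a sanity-check sketch.
def check_sp_config(seq_len: int, num_heads: int, sp_size: int) -> None:
    if seq_len % sp_size != 0:
        raise ValueError(f"seq_len ({seq_len}) must be divisible by sp_size ({sp_size})")
    if num_heads < sp_size:
        raise ValueError(f"need num_heads ({num_heads}) >= sp_size ({sp_size})")

check_sp_config(seq_len=65_536, num_heads=32, sp_size=4)  # passes silently
```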
Worth Paying Attention To
Sequence parallelism has existed in research for a while, but accessible implementations with verified loss equivalence and clean HuggingFace integration change who can actually use it. If you’re training models where context length is a real constraint — document understanding, long-form generation, RAG with many passages — Ulysses is now the pragmatic choice if you have NVLink or InfiniBand interconnects.
The one caveat: it currently requires DeepSpeed as the backend. If you’re on FSDP2, Ring Attention is still your path. But for DeepSpeed users, there’s not much reason to leave 3-4x throughput gains on the table.