Training transformers on long sequences has always been an exercise in pain management. Attention is O(n²) in compute, and while FlashAttention brought memory down to O(n), the compute still bites you hard past 32K tokens. Scale to a million tokens and a single GPU simply cannot hold it.
The standard answer has been sequence parallelism, and there are two main approaches worth knowing about: Ring Attention and Ulysses. Hugging Face just published a deep technical walkthrough of Ulysses as part of announcing its integration into Accelerate, Transformers, and TRL. It is worth understanding why Ulysses is interesting at the communication level before reaching for the API.
The Redistribution Insight
The naive approach to sequence parallelism is to shard the sequence across GPUs and then pass key/value chunks around in a ring so every GPU eventually sees the full context. That works, but each GPU ends up doing O(n·d) total communication — proportional to the full sequence length, not the shard.
Ulysses does something cleverer. Each GPU starts with its local sequence chunk and computes Q, K, V projections. Then a single all-to-all operation transposes the distribution: instead of each GPU holding all heads for a sequence chunk, each GPU now holds all sequence positions for a subset of heads. Local attention runs on that. A second all-to-all reverses the layout.
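The head/sequence transposition is easier to see as a shape transformation. Here is a single-process sketch that simulates the all-to-all with reshapes; the sizes (P ranks, n tokens, h heads of dimension d_h) are made up for illustration, and a real implementation would use `torch.distributed.all_to_all` across ranks instead of a transpose.

```python
import numpy as np

# Hypothetical sizes: P ranks, n-token sequence, h heads of dim d_h.
P, n, h, d_h = 4, 16, 8, 2

# Before: rank p holds ALL h heads for its local n/P-token sequence chunk.
before = np.arange(n * h * d_h).reshape(P, n // P, h, d_h)

# The all-to-all sends head-group q to rank q and gathers the matching
# sequence chunks from every rank. Simulated here: split the head axis
# into P groups, then swap the group axis with the rank axis.
chunks = before.reshape(P, n // P, P, h // P, d_h)
after = chunks.transpose(2, 0, 1, 3, 4).reshape(P, n, h // P, d_h)

# After: rank q holds h/P heads for ALL n sequence positions,
# so it can run ordinary full-context attention on those heads.
print(before.shape)  # (4, 4, 8, 2): per-rank (seq_chunk, all_heads, d_h)
print(after.shape)   # (4, 16, 2, 2): per-rank (full_seq, head_subset, d_h)
```

The second all-to-all in the layer is just this transformation run in reverse on the attention output.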
The result is O(n·d/P) communication per GPU — P times less than Ring Attention. The all-to-all also saturates full bisection bandwidth rather than doing P-1 sequential hops, which matters on NVLink interconnects.
The constraint is that num_heads >= sp_size. You cannot split more ways than you have heads. For modern models this is rarely a problem, but it is something to watch.
What the Benchmarks Show
Snowflake’s numbers with Qwen3-4B on 4x H100s are concrete. Running SP=4 with DeepSpeed ZeRO-3 and Flash Attention 2:
- At 8K tokens, overhead versus baseline is essentially zero (~300 tokens/s difference)
- At 64K tokens, throughput is 3.7x higher than single-GPU baseline at 8K
- SP=4 enables 96K-token training at 66 GB peak memory per GPU; 128K OOMs
The throughput gain above baseline is not magic — it is the quadratic attention term dominating at longer sequences, meaning more compute per communication dollar spent.
Loss curve equivalence between DP=4 and SP=4 (with matched gradient accumulation steps) showed a mean absolute difference of 0.005. They are training the same objective.
Using It in Practice
The Accelerate integration is clean. You configure a ParallelismConfig with sp_backend="deepspeed" and sp_size, then pass it to Accelerator or TrainingArguments. The Trainer handles sequence sharding in the dataloader, loss aggregation across SP ranks, and effective batch-size accounting.
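Wired together, the setup looks roughly like this. The parameter names follow the post's description, but this is a sketch, not a verified snippet; check the Accelerate docs for your installed version before copying, and note that `sp_size=4` here assumes a 4-GPU launch (and a model with at least 4 attention heads, per the constraint above).

```python
from accelerate import Accelerator
from accelerate.parallelism_config import ParallelismConfig

# Sequence parallelism over 4 ranks via the DeepSpeed (Ulysses) backend,
# as described in the post. Launch with `accelerate launch` on 4 GPUs.
pc = ParallelismConfig(sp_backend="deepspeed", sp_size=4)
accelerator = Accelerator(parallelism_config=pc)
```

With the Trainer, the same `pc` goes into TrainingArguments instead, and the dataloader-side sequence sharding happens for you.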
One gotcha: with custom training loops you need to manually aggregate the loss weighted by token count per rank. Sequences are not always evenly padded, so a naive mean across ranks will skew your gradient. The post includes the full pattern for this.
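The shape of that aggregation is simple once you see it: reduce loss *sums* and real token counts separately, then divide. A minimal sketch of the idea (the helper name is mine, not the post's API; see the post for the full pattern):

```python
import torch
import torch.distributed as dist

def aggregate_sp_loss(local_loss_sum: torch.Tensor,
                      local_num_tokens: torch.Tensor) -> torch.Tensor:
    """Token-weighted mean loss across sequence-parallel ranks.

    Each rank contributes its *summed* (not averaged) loss and its count of
    real, unpadded tokens, so ranks holding more tokens weigh proportionally
    more. A naive mean of per-rank means would weight all ranks equally.
    """
    totals = torch.stack([local_loss_sum, local_num_tokens.float()])
    dist.all_reduce(totals, op=dist.ReduceOp.SUM)  # sum over the SP group
    return totals[0] / totals[1]                   # global per-token mean
```

In a custom loop you would call this each step before logging or backprop-weighting, with `local_loss_sum` computed under `reduction="sum"`.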
For 2D parallelism (combining SP and data parallelism), the formula is dp_replicate × dp_shard × sp_size = num_processes. On 4 GPUs you can do sp=2, dp_shard=2 for a balance of sequence length and data throughput, or sp=4, dp_shard=1 when you need maximum context headroom.
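The constraint is just that every GPU sits in exactly one (replica, shard, sequence-group) cell, so the product must equal the world size. A quick enumeration makes the trade-off space on 4 GPUs concrete (helper name is mine, for illustration):

```python
def valid_layouts(num_processes: int) -> list[tuple[int, int, int]]:
    """All (dp_replicate, dp_shard, sp_size) triples whose product
    equals the number of processes, per the post's 2D-parallelism rule."""
    return [
        (dp_replicate, dp_shard, sp)
        for dp_replicate in range(1, num_processes + 1)
        for dp_shard in range(1, num_processes + 1)
        for sp in range(1, num_processes + 1)
        if dp_replicate * dp_shard * sp == num_processes
    ]

print(valid_layouts(4))
# includes (1, 2, 2) and (1, 1, 4): the sp=2/dp_shard=2 and sp=4 layouts above
```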
Worth Knowing
Ring Attention remains relevant — it has no head-count constraint and works with FSDP2 rather than DeepSpeed. If your model has a small number of heads or you are on an FSDP2 stack, that may be the better fit. The Hugging Face post includes a direct comparison table.
For anyone building infrastructure to fine-tune models on long documents, codebases, or multi-turn conversation logs, Ulysses in Accelerate is now the lowest-friction path to doing it without rewriting your training loop from scratch.