Head Parallelism and the Communication Math That Makes Long-Context Training Scale
Source: huggingface
The baseline problem is straightforward: attention scales quadratically with sequence length. A 128K-token context does not just need 16x more compute than an 8K context; it needs 256x more in the attention layers alone. Memory does not scale much better. The KV activations for a single forward pass at 128K tokens on a large model will exceed the VRAM of any single GPU in common use today.
The Hugging Face blog post on Ulysses sequence parallelism, originally published on March 9, 2026, walks through how to apply DeepSpeed-Ulysses via Accelerate, Transformers, and TRL. Looking back at it now, the integration work is genuinely useful, but the more interesting story is what Ulysses is actually doing under the hood, why it was designed the way it was, and what that design costs you.
The Landscape Before Ulysses
Sequence parallelism as a concept predates DeepSpeed-Ulysses by several years. Megatron-LM introduced a form of sequence parallelism in 2022 (Korthikanti et al.) that partitioned non-tensor-parallel operations, specifically LayerNorm and Dropout, across the sequence dimension. This was designed as a complement to Megatron’s tensor parallelism, not a standalone solution: it reduced activation memory but did not address the core attention scaling problem.
DeepSpeed-Ulysses (Jacobs et al., 2023) attacked the attention computation directly. Both Ulysses and Ring Attention start from a sequence-sharded layout, but the design premise differs: where Ring Attention keeps the sequence partitioned through attention and exchanges KV blocks peer to peer around a ring, Ulysses redistributes attention heads across devices. Each GPU starts with a shard of the full sequence, projects its local tokens into Q, K, and V, then participates in an all-to-all collective that rearranges the data so that each GPU now holds all sequence positions but only for a subset of heads. Local attention runs on that full-sequence, partial-head view, and a second all-to-all restores the sequence-sharded layout for the rest of the layer.
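The layout change is easier to see on a toy example. The sketch below simulates both exchanges in a single process with plain Python lists. The function names are made up for illustration, no real collective is called, and it assumes the sequence length and head count divide evenly by the SP degree.

```python
# Single-process simulation of the Ulysses layout change (no GPUs, no torch).
# Each shard is a nested list indexed [token][head]; entries are (token, head) tags.

def shard_by_sequence(seq_len, num_heads, sp_size):
    """Rank r starts with tokens [r*chunk, (r+1)*chunk) for ALL heads."""
    chunk = seq_len // sp_size
    return [
        [[(t, h) for h in range(num_heads)]
         for t in range(r * chunk, (r + 1) * chunk)]
        for r in range(sp_size)
    ]

def all_to_all_seq_to_head(seq_shards, num_heads):
    """After the exchange, rank r holds ALL tokens for heads [r*hpr, (r+1)*hpr)."""
    sp_size = len(seq_shards)
    hpr = num_heads // sp_size  # heads per rank
    head_shards = []
    for dst in range(sp_size):
        rows = []
        for src in range(sp_size):  # dst receives one slice from every source
            for token_row in seq_shards[src]:
                rows.append(token_row[dst * hpr:(dst + 1) * hpr])
        head_shards.append(rows)
    return head_shards

def all_to_all_head_to_seq(head_shards, seq_len):
    """Inverse exchange: each rank gets its token shard back with every head."""
    sp_size = len(head_shards)
    chunk = seq_len // sp_size
    seq_shards = []
    for dst in range(sp_size):
        rows = []
        for t in range(dst * chunk, (dst + 1) * chunk):
            row = []
            for src in range(sp_size):  # gather this token's heads from all ranks
                row.extend(head_shards[src][t])
            rows.append(row)
        seq_shards.append(rows)
    return seq_shards

seq_shards = shard_by_sequence(seq_len=8, num_heads=4, sp_size=2)
head_shards = all_to_all_seq_to_head(seq_shards, num_heads=4)
restored = all_to_all_head_to_seq(head_shards, seq_len=8)

print(len(seq_shards[0]), len(seq_shards[0][0]))    # 4 tokens x 4 heads per rank
print(len(head_shards[0]), len(head_shards[0][0]))  # 8 tokens x 2 heads per rank
print(restored == seq_shards)                       # True
```

Between the two exchanges, every rank sees the full sequence for its heads, which is why an unmodified attention kernel can run locally.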
The Communication Arithmetic
The efficiency advantage comes down to communication volume. Let n be the sequence length, d the hidden dimension, and P the number of GPUs. In Ring Attention, each GPU passes its KV block to the next device in a ring, accumulating partial attention results over P-1 sequential hops. Total communication per GPU per layer is O(n·d): every token’s key and value eventually traverses every device.
Ulysses uses two all-to-all collectives per attention layer. In an all-to-all among P participants, a rank holding a payload of size N sends N/P of it to each destination. Each rank’s payload here is its local sequence shard, which is O(n·d/P), so total communication per GPU is O(n·d/P). That is a factor of P less data movement than the ring approach. More importantly, an all-to-all on a modern NVLink fabric can use the full bisection bandwidth: all links are active simultaneously. On an 8-GPU H100 NVLink node, where each GPU has 900 GB/s of NVLink bandwidth, ring-style sequential hops keep each GPU talking to one neighbor at a time. The all-to-all spreads traffic across every peer at once.
This gap narrows on inter-node InfiniBand, where full bisection bandwidth is not guaranteed and topology matters. But for the common case of training within a single NVLink node, the all-to-all approach wins consistently.
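To put rough numbers on the asymptotics, here is a simplified element count for one attention layer's forward pass. It is a back-of-the-envelope sketch, not a profile: it assumes standard multi-head attention (K and V as wide as Q), ignores the backward pass and any compute overlap, and the function names are illustrative.

```python
def ring_elems_sent_per_gpu(n, d, p):
    """Ring: each GPU forwards a K block and a V block of n/p tokens on p-1 hops."""
    return 2 * (n // p) * d * (p - 1)

def ulysses_elems_sent_per_gpu(n, d, p):
    """Ulysses: Q, K, V leave in the first all-to-all, the attention output in
    the second. Each rank keeps 1/p of its local data and sends the rest."""
    local = (n // p) * d  # elements in one local [n/p, d] tensor
    return 4 * local * (p - 1) / p

n, d = 128 * 1024, 4096  # 128K tokens; the hidden size is an arbitrary example
for p in (2, 4, 8):
    ring = ring_elems_sent_per_gpu(n, d, p)
    ulysses = ulysses_elems_sent_per_gpu(n, d, p)
    print(f"P={p}: ring/ulysses = {ring / ulysses:.1f}")
```

Under this particular count the ratio works out to P/2 rather than exactly P, so the constant factors matter, but the trend is the same: the ring volume per GPU stays roughly flat as P grows while the all-to-all volume shrinks.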
The Head Count Constraint
Ulysses carries one hard constraint that Ring Attention does not: the number of attention heads must be at least as large as the SP degree (and, to split evenly, divisible by it). SP=4 requires at least 4 attention heads. On the query side this is rarely binding. Qwen3-4B, used in the Hugging Face benchmarks, has 32 query heads; SP=4 assigns 8 heads per GPU.
Grouped-query attention (GQA) complicates this. GQA models have fewer KV heads than Q heads. If a model has 32 Q heads but only 8 KV heads, SP=4 assigns 8 Q heads and 2 KV heads per GPU, which works. SP=8 assigns 4 Q heads and exactly 1 KV head per GPU, which still works but leaves no headroom. Multi-query attention (MQA), with a single KV head, breaks for any SP > 1. Check your model’s config before setting sp_size; the binding constraint in the head-to-rank assignment is the smaller of the Q and KV head counts.
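A small pre-flight check captures the rule. This is a sketch with a hypothetical helper name, and it assumes a strict even split of both Q and KV heads. Real implementations may relax this (for example by replicating KV heads), so treat a failure here as a prompt to read the backend's documentation rather than a final verdict. In Transformers configs, the two counts are typically exposed as num_attention_heads and num_key_value_heads.

```python
def heads_per_rank(num_q_heads, num_kv_heads, sp_size):
    """Return (Q heads, KV heads) per SP rank, or raise if they do not split evenly."""
    for name, count in (("Q", num_q_heads), ("KV", num_kv_heads)):
        if count < sp_size:
            raise ValueError(f"{count} {name} head(s) cannot cover sp_size={sp_size}")
        if count % sp_size != 0:
            raise ValueError(f"{count} {name} heads do not divide evenly by {sp_size}")
    return num_q_heads // sp_size, num_kv_heads // sp_size

print(heads_per_rank(32, 8, sp_size=4))  # (8, 2)
print(heads_per_rank(32, 8, sp_size=8))  # (4, 1)

try:
    heads_per_rank(32, 1, sp_size=2)     # MQA: a single KV head
except ValueError as err:
    print("rejected:", err)
```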
What the Benchmarks Show
The Hugging Face benchmarks run Qwen3-4B on 4x H100 80GB GPUs. At 8K tokens, Ulysses with SP=4 uses essentially the same memory as a single-GPU baseline (22.8 GB vs 22.4 GB). The overhead is minimal and the benefit at short sequences is also minimal, roughly 8% faster throughput. You are not getting much from SP=4 at sequences that fit on one GPU anyway.
The numbers change as sequence length grows:
| Sequence Length | Peak Memory per GPU (SP=4) |
|---|---|
| 8K | 22.8 GB |
| 32K | 35.0 GB |
| 64K | 50.5 GB |
| 96K | 66.0 GB |
| 128K | OOM |
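Memory grows close to linearly from 8K to 96K, which is what you would hope for: from 32K onward, each additional 32K tokens costs about 15.5 GB per GPU. The OOM at 128K is expected. The limit that matters is per GPU, and extending that trend puts 128K at roughly 81.5 GB, just over the 80 GB each H100 provides; in aggregate, the 96K run already uses 264 GB of the 4 × 80 GB = 320 GB total. Throughput at 64K tokens reaches 13,396 tokens/s versus 3,633 tokens/s on a single GPU at 8K. That 3.7x figure is slightly misleading taken in isolation, because the two runs use different sequence lengths and the single-GPU baseline cannot run 64K at all. Enabling the run is the feature, and the throughput follows from actual parallel computation.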
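The 128K result can be sanity-checked with a linear extrapolation from the table. This is a rough sketch: it assumes the table reports per-GPU peaks and that the last measured slope continues, which real allocators do not promise.

```python
# Peak memory per GPU in GB, keyed by sequence length in K tokens (from the table).
measured_gb = {8: 22.8, 32: 35.0, 64: 50.5, 96: 66.0}
GPU_CAPACITY_GB = 80.0  # H100 80GB

def gb_per_k_tokens(lo, hi):
    """Slope between two measured points, in GB per 1K tokens."""
    return (measured_gb[hi] - measured_gb[lo]) / (hi - lo)

slope = gb_per_k_tokens(64, 96)
predicted_128k = measured_gb[96] + slope * (128 - 96)

print(f"{predicted_128k:.1f} GB predicted at 128K")  # 81.5 GB predicted at 128K
print("fits" if predicted_128k <= GPU_CAPACITY_GB else "OOM expected")
```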
Loss Aggregation Is Not Trivial
One part of the implementation that deserves more attention than it typically gets: cross-entropy loss computation is not embarrassingly parallel under sequence sharding. Each GPU sees different tokens, and because documents in a batch can have varying lengths, the distribution of non-padding tokens across ranks is uneven. A naive mean across ranks produces wrong loss values whenever token counts differ between ranks.
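The correct approach collects both the per-rank loss and the non-padding token count, then computes a token-weighted average:
```python
import torch
import torch.distributed.nn.functional  # differentiable collectives

# `loss` is this rank's mean loss over its local non-padding tokens;
# `good_tokens` is the number of those tokens on this rank.
losses_per_rank = torch.distributed.nn.functional.all_gather(loss, group=sp_group)
good_tokens_per_rank = torch.distributed.nn.functional.all_gather(good_tokens, group=sp_group)

# Weight each rank's mean by its token count; skip ranks with no valid tokens.
total_loss = sum(
    losses_per_rank[i] * good_tokens_per_rank[i]
    for i in range(sp_size)
    if good_tokens_per_rank[i] > 0
)
loss = total_loss / max(sum(good_tokens_per_rank), 1)
```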
This matters for gradient correctness. The Hugging Face article verifies loss equivalence over 20 training steps against a non-SP baseline, reporting a mean absolute difference of 0.0054 and a canonical NLL difference of 0.000004. The residual is from floating-point reordering under distributed reduction, not a systematic error. When using the raw Accelerate path rather than the Trainer, this aggregation is your responsibility to implement.
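A toy calculation shows how far the naive mean can drift. The numbers are invented for illustration, and plain floats stand in for the tensors that a real run would gather across ranks.

```python
def naive_mean(losses):
    """Unweighted mean of per-rank mean losses (wrong when token counts differ)."""
    return sum(losses) / len(losses)

def token_weighted_mean(losses, token_counts):
    """Weight each rank's mean loss by its number of non-padding tokens."""
    total = sum(l * c for l, c in zip(losses, token_counts) if c > 0)
    return total / max(sum(token_counts), 1)

# Two SP ranks: rank 0 holds 300 real tokens, rank 1 holds 100 (the rest is padding).
losses = [2.0, 4.0]
token_counts = [300, 100]
print(naive_mean(losses))                         # 3.0
print(token_weighted_mean(losses, token_counts))  # 2.5

# A rank with only padding can report NaN; the count filter keeps it out of the sum.
print(token_weighted_mean([2.0, float("nan")], [4, 0]))  # 2.0
```

The weighted result matches what a single GPU would compute over all 400 tokens, which is the property the equivalence check against a non-SP baseline relies on.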
The Integration
The practical surface in Accelerate is a ParallelismConfig object:
```python
from accelerate import Accelerator
from accelerate.utils import ParallelismConfig, DeepSpeedSequenceParallelConfig

parallelism_config = ParallelismConfig(
    sp_backend="deepspeed",
    sp_size=4,
    dp_shard_size=1,
    sp_handler=DeepSpeedSequenceParallelConfig(
        sp_seq_length_is_variable=True,
        sp_attn_implementation="flash_attention_2",
    ),
)

accelerator = Accelerator(parallelism_config=parallelism_config)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
```
The Transformers Trainer and TRL SFTTrainer accept the same config via TrainingArguments. When using the Trainer, dataloader wrapping through UlyssesSPDataLoaderAdapter, gradient accumulation scaling, and loss aggregation all happen automatically. The requirements are deepspeed>=0.18.1, accelerate>=1.12, and transformers>=5.0.
A few configuration details are worth getting right up front. Set pad_to_multiple_of=sp_size to guarantee sequence length divisibility. Use position_ids rather than attention masks: a dense attention mask is O(n²) memory, position IDs are O(n), and at 96K tokens the difference is substantial. Set PYTORCH_ALLOC_CONF=expandable_segments:True (older PyTorch releases read PYTORCH_CUDA_ALLOC_CONF) to reduce allocator fragmentation during variable-length training. Flash Attention 2 is the recommended attention backend for Ampere GPUs; Flash Attention 3 for Hopper.
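Two of these points are easy to quantify. The sketch below uses hypothetical helper names and assumes a dense [n, n] boolean mask at one byte per element and int64 position IDs, for a single sequence.

```python
def padded_length(seq_len, multiple):
    """Round seq_len up to the next multiple (what pad_to_multiple_of achieves)."""
    return -(-seq_len // multiple) * multiple

def dense_mask_bytes(seq_len, bytes_per_elem=1):
    """Memory for one dense [seq_len, seq_len] mask."""
    return seq_len * seq_len * bytes_per_elem

def position_ids_bytes(seq_len, bytes_per_elem=8):
    """Memory for one [seq_len] vector of int64 position IDs."""
    return seq_len * bytes_per_elem

print(padded_length(98_305, 4))                   # 98308
n = 96 * 1024
print(dense_mask_bytes(n) / 2**30, "GiB mask")    # 9.0 GiB mask
print(position_ids_bytes(n) / 2**10, "KiB ids")   # 768.0 KiB ids
```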
Balancing SP and DP
With N GPUs, you allocate them to sequence parallelism, data parallelism, or some combination, with the SP and DP degrees multiplying to N. SP=4, DP=1 on 4 GPUs enables the longest context but processes only one batch replica at a time. SP=2, DP=2 roughly halves the maximum context but processes two batch replicas in parallel, which can raise throughput on shorter sequences. The Accelerate config accepts both sp_size and dp_shard_size explicitly.
For fine-tuning on long documents where fitting the context is the bottleneck, maximizing SP makes sense. For instruction tuning with shorter mixed-length data, a 2D configuration typically yields better overall throughput. There is no universal answer; it depends on your data distribution.
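Enumerating the candidate layouts for a given node is mechanical. This sketch uses a hypothetical helper and assumes the SP degree must divide both the GPU count and the smaller of the Q and KV head counts.

```python
def sp_dp_layouts(num_gpus, min_head_count):
    """List (sp_size, dp_size) pairs with sp_size * dp_size == num_gpus
    where sp_size also divides the limiting head count."""
    return [
        (sp, num_gpus // sp)
        for sp in range(1, num_gpus + 1)
        if num_gpus % sp == 0 and min_head_count % sp == 0
    ]

print(sp_dp_layouts(num_gpus=4, min_head_count=8))  # [(1, 4), (2, 2), (4, 1)]
print(sp_dp_layouts(num_gpus=8, min_head_count=4))  # [(1, 8), (2, 4), (4, 2)]
```

Picking among the surviving pairs is still an empirical question: the largest SP degree buys context length, the largest DP degree buys batch parallelism.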
Where This Fits in the Parallelism Stack
Serious long-context training today stacks multiple parallelism strategies. DeepSpeed ZeRO Stage 3 shards optimizer states, gradients, and parameters across the data-parallel dimension. Ulysses SP handles the sequence dimension and activation memory. Tensor parallelism shards weight matrices. Pipeline parallelism splits the model across stages.
Ulysses and ZeRO Stage 3 compose directly. The combination matters because SP alone does not help with parameter memory: every SP rank still holds a full copy of the weights. A 70B model in bf16 requires roughly 140 GB for parameters alone, which does not fit on an 80 GB H100 regardless of SP degree. ZeRO Stage 3 handles parameter sharding; SP handles activation memory and attention compute. The YAML config in Accelerate accepts both zero_stage: 3 and parallelism_config fields in the same file.
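The parameter arithmetic is worth spelling out. This is a deliberately narrow sketch: it counts bf16 weights only and ignores gradients, optimizer states, activations, and the temporary buffers ZeRO Stage 3 uses when it gathers parameters for a layer. The function names and the shard_degree argument are illustrative.

```python
def param_memory_gb(num_params, bytes_per_param=2):
    """Weight memory in GB (10^9 bytes); bf16 uses 2 bytes per parameter."""
    return num_params * bytes_per_param / 1e9

def sharded_param_memory_gb(num_params, shard_degree, bytes_per_param=2):
    """Per-GPU weight memory when parameters are split evenly over shard_degree ranks."""
    return param_memory_gb(num_params, bytes_per_param) / shard_degree

print(param_memory_gb(70e9))             # 140.0 -> replicated: exceeds 80 GB per GPU
print(sharded_param_memory_gb(70e9, 4))  # 35.0  -> sharded over 4 ranks
```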
The original DeepSpeed-Ulysses implementation required writing distributed training loops by hand. Having it behind a parallelism_config argument lowers the barrier substantially. The benchmark numbers confirm the algorithm’s properties survive the integration layers intact, which is not guaranteed when abstractions get added. The communication algebra that makes Ulysses efficient on NVLink hardware is unchanged; the work is in making it composable with the rest of the training stack.