Folding Tensor and Sequence Parallelism for Memory-Efficient Transformer Training & Inference

Anna Golubeva; Quentin Anthony; Vasu Shyam

arxiv: 2604.26294 · v1 · submitted 2026-04-29 · 💻 cs.CL · cs.DC

Pith reviewed 2026-05-07 13:31 UTC · model grok-4.3

keywords tensor parallelism · sequence parallelism · memory-efficient training · transformer models · attention layers · gated MLPs · parallel execution schedules

The pith

TSP folds tensor and sequence parallelism onto one device axis so each rank holds both a weight shard and a sequence shard.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces tensor and sequence parallelism (TSP), a strategy that places two common sharding methods on the same set of devices instead of giving each its own mesh dimension. Tensor parallelism normally splits model weights across devices to cut parameter memory, while sequence parallelism splits input tokens to cut activation memory. By giving every rank one shard of each, TSP lowers both kinds of memory along a single axis. The authors supply concrete execution schedules: for attention, devices broadcast parameter shards and exchange keys and values along the sequence to rebuild context; for gated MLPs, weights rotate in a ring while outputs accumulate locally. They analyze the resulting communication costs theoretically and benchmark the approach against TP, SP, and TP+SP, targeting settings where memory is tight or contexts are long.
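To make the memory argument concrete, the following is a minimal, idealized sketch of how many parameters and sequence positions one rank holds under each layout. It is not taken from the paper: the function name `per_rank_footprint`, the `PerRank` record, and the 131072-token sequence length are illustrative assumptions, and the model ignores optimizer state, replicated layers, and communication buffers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerRank:
    params: float  # parameters resident on one rank
    tokens: float  # sequence positions whose activations live on one rank


def per_rank_footprint(strategy, total_params, seq_len, n_ranks, tp_degree=None):
    """Idealized per-rank counts (assumption: perfect, evenly divisible sharding)."""
    if strategy == "TP":      # weights sharded, sequence replicated
        return PerRank(total_params / n_ranks, seq_len)
    if strategy == "SP":      # sequence sharded, weights replicated
        return PerRank(total_params, seq_len / n_ranks)
    if strategy == "TP+SP":   # two mesh axes: tp_degree * sp_degree == n_ranks
        if not tp_degree or n_ranks % tp_degree:
            raise ValueError("tp_degree must divide n_ranks")
        sp_degree = n_ranks // tp_degree
        return PerRank(total_params / tp_degree, seq_len / sp_degree)
    if strategy == "TSP":     # one axis shards both weights and sequence
        return PerRank(total_params / n_ranks, seq_len / n_ranks)
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical setting: 7B parameters, 131072 tokens, 8 ranks.
for name in ("TP", "SP", "TP+SP", "TSP"):
    print(name, per_rank_footprint(name, 7e9, 131072, 8, tp_degree=4))
```

Under these assumptions, TP+SP on 8 ranks must split the 8 between its two axes, whereas TSP divides both quantities by the full rank count. What the sketch leaves out is the extra communication TSP pays for that.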

Core claim

TSP assigns each rank both a weight shard and a sequence shard along one device axis. For attention, ranks iterate over broadcast parameter shards and reconstruct context through a sequence-wise key/value exchange. For gated MLPs, weight shards circulate in a ring while partial outputs accumulate locally. By sharding both weights and activations across the same devices, TSP trades additional communication volume for reduced memory overhead and supplies runtime schedules plus theoretical analysis for its use in long-context and memory-constrained training and inference.
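The gated-MLP schedule can be illustrated with a single-process NumPy simulation. This is a sketch under stated assumptions, not the authors' implementation: it assumes a SiLU-gated MLP with weights split along the intermediate dimension, contiguous sequence shards, and a Python list standing in for the ring of devices. The names `ring_gated_mlp` and `held` are hypothetical.

```python
import numpy as np


def silu(z):
    return z / (1.0 + np.exp(-z))


def gated_mlp(x, w_gate, w_up, w_down):
    """Unsharded reference: (silu(x W_gate) * (x W_up)) W_down."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down


def ring_gated_mlp(x_shards, w_gate, w_up, w_down):
    """Simulate n ranks, each holding one sequence shard and one weight shard."""
    n = len(x_shards)
    # Rank r initially owns slice r of the intermediate dimension.
    gate = np.split(w_gate, n, axis=1)
    up = np.split(w_up, n, axis=1)
    down = np.split(w_down, n, axis=0)
    held = list(range(n))  # held[r] = index of the weight shard now on rank r
    out = [np.zeros((x.shape[0], w_down.shape[1])) for x in x_shards]
    for _ in range(n):
        for r in range(n):
            k = held[r]
            # Partial output from this weight shard accumulates locally.
            out[r] += (silu(x_shards[r] @ gate[k]) * (x_shards[r] @ up[k])) @ down[k]
        # Ring step: each rank receives the shard held by its left neighbour.
        held = [held[(r - 1) % n] for r in range(n)]
    return out


rng = np.random.default_rng(0)
n_ranks, seq, hidden, inter = 4, 8, 6, 12
x = rng.standard_normal((seq, hidden))
wg = rng.standard_normal((hidden, inter))
wu = rng.standard_normal((hidden, inter))
wd = rng.standard_normal((inter, hidden))
sharded = np.concatenate(ring_gated_mlp(np.split(x, n_ranks), wg, wu, wd))
assert np.allclose(sharded, gated_mlp(x, wg, wu, wd))
```

The check passes because the gated activation acts elementwise on the intermediate dimension, so the full output is a sum of per-shard contributions. Each rank therefore only needs to see every weight shard once, in any order.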

What carries the argument

The TSP folding strategy that places both tensor-parallel weight shards and sequence-parallel token shards on the same device axis, together with its attention and gated-MLP execution schedules.
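The attention schedule can be sketched the same way. The simulation below rests on simplifying assumptions that are not from the paper: one attention head per weight shard, contiguous rather than zigzag sequence shards, and a plain concatenation standing in for the key/value exchange, whose actual communication primitive the summary does not specify. The function names are hypothetical.

```python
import numpy as np


def causal_attention(q, k, v, q_offset):
    """Softmax attention for queries at global positions q_offset, q_offset+1, ..."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    q_pos = q_offset + np.arange(q.shape[0])[:, None]
    k_pos = np.arange(k.shape[0])[None, :]
    scores = np.where(k_pos <= q_pos, scores, -np.inf)  # causal mask
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return (weights / weights.sum(axis=-1, keepdims=True)) @ v


def tsp_attention(x_shards, wq, wk, wv, wo):
    """wq/wk/wv: per-rank (hidden, head_dim) shards; wo: per-rank (head_dim, hidden)."""
    n = len(x_shards)
    local_len = x_shards[0].shape[0]
    out = [np.zeros((local_len, wo[0].shape[1])) for _ in range(n)]
    for owner in range(n):
        # "Broadcast": every rank applies the owner's projection shards locally.
        q = [x @ wq[owner] for x in x_shards]
        # Sequence-wise K/V exchange, modelled here as a concatenation.
        k = np.concatenate([x @ wk[owner] for x in x_shards])
        v = np.concatenate([x @ wv[owner] for x in x_shards])
        for r in range(n):
            ctx = causal_attention(q[r], k, v, q_offset=r * local_len)
            out[r] += ctx @ wo[owner]  # accumulate this head shard's output
    return out


rng = np.random.default_rng(0)
n, seq, hidden, head_dim = 4, 8, 6, 3
x = rng.standard_normal((seq, hidden))
wq = [rng.standard_normal((hidden, head_dim)) for _ in range(n)]
wk = [rng.standard_normal((hidden, head_dim)) for _ in range(n)]
wv = [rng.standard_normal((hidden, head_dim)) for _ in range(n)]
wo = [rng.standard_normal((head_dim, hidden)) for _ in range(n)]
# Reference: the same heads evaluated over the full, unsharded sequence.
reference = sum(causal_attention(x @ wq[h], x @ wk[h], x @ wv[h], 0) @ wo[h] for h in range(n))
sharded = np.concatenate(tsp_attention(np.split(x, n), wq, wk, wv, wo))
assert np.allclose(sharded, reference)
```

In this simplified model, each rank only ever materializes queries for its own tokens, while keys and values for the current head shard are the quantities that must cross the device axis.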

Load-bearing premise

The extra communication introduced by folding the two schemes onto one axis stays practical and does not cancel out the memory savings on real hardware.

What would settle it

A benchmark run on target hardware where TSP's end-to-end training or inference time exceeds that of TP+SP by enough to erase any net memory benefit for the same model size and batch.

Figures

Figures reproduced from arXiv: 2604.26294 by Anna Golubeva, Quentin Anthony, Vasu Shyam.

**Figure 1.** Per-GPU theoretical memory breakdown by parallelism strategy for a 7B-parameter dense transformer model with …

**Figure 2.** Weights and sequence sharding under TP, SP, TP+SP, and TSP. Note that TP+SP uses two orthogonal axes to split the …

**Figure 3.** TSP Attention block design.

**Figure 4.** TSP MLP block design. Each rank stores a zigzag-partitioned sequence shard (Brandon et al., 2023) and a shard of the attention projections. The shards for WQ, WK, WV, and WO are packed into a single buffer and broadcast one weight-owning rank at a time. After receiving the shard for iteration r, every rank applies it to its local tokens to produce the queries, keys, and values for that head shard. Causal a…

**Figure 5.** Forward-pass theoretical communication volume (a) and per-GPU memory (b) as a function of sequence length, for the …

**Figure 6.** Forward-pass theoretical communication volume (a) and per-GPU memory (b) as a function of total parameter count

**Figure 7.** Ratio of theoretical TSP to TP forward communication

**Figure 8.** The architecture of the Zyphra pretraining cluster. Each node contains 8 MI300X GPUs interconnected with Infinity …

**Figure 9.** Per-GPU peak memory as a function of sequence …

**Figure 10.** Forward-pass throughput (tokens/s) versus sequence length for TSP and matched TP+SP baselines at folded degrees …

**Figure 11.** Forward-pass throughput (tokens/s) versus sequence length at degrees …

**Figure 12.** Forward+backward throughput (tokens/s) versus micro-batch size at degrees …

Original abstract

We present tensor and sequence parallelism (TSP), a parallel execution strategy that folds tensor parallelism and sequence parallelism onto a single device axis. In conventional multi-dimensional parallelism layouts, tensor parallelism (TP) shards model weights while sequence parallelism (SP) shards tokens, reducing per-device parameter or activation memory, respectively. Traditionally, each scheme is assigned its own mesh dimension. TSP instead assigns each rank both a weight shard and a sequence shard, reducing both parameter and activation memory along the same device axis. We implement this design with two runtime schedules. For attention, ranks iterate over broadcast parameter shards and reconstruct context through a sequence-wise key/value exchange. For gated MLPs, weight shards circulate in a ring while partial outputs accumulate locally. By sharding both weights and activations across the same devices, TSP trades additional communication volume for reduced memory overhead. We provide a theoretical communication and memory analysis, describe our implementation of TSP attention and gated MLP blocks, and benchmark TSP against TP, SP, and TP+SP. These results position TSP as a hardware-aware alternative for long-context and memory-constrained model training, and as a viable axis of parallelism in concert with existing parallelism schemes such as pipeline and expert parallelism for dense and mixture-of-expert models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

TSP folds tensor and sequence parallelism onto one device axis to cut both weight and activation memory, with custom schedules for attention and gated MLPs, but the net win depends on whether the added communication stays tolerable.

The main contribution here is a parallelism layout that assigns each rank both a weight shard and a sequence shard on the same set of devices. This differs from the usual approach of giving tensor parallelism and sequence parallelism their own mesh dimensions. The result is lower per-device memory for parameters and activations at the cost of more data movement within the attention and MLP blocks.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Tensor and Sequence Parallelism (TSP), a parallel execution strategy that folds tensor parallelism (TP) and sequence parallelism (SP) onto a single device axis. Each rank is assigned both a weight shard and a sequence shard, reducing per-device parameter and activation memory along the same axis. Two runtime schedules are described: for attention, ranks iterate over broadcast parameter shards and perform sequence-wise key/value exchanges to reconstruct context; for gated MLPs, weight shards circulate in a ring while partial outputs accumulate locally. The work supplies a theoretical communication and memory analysis, implementation details for the attention and gated MLP blocks, and benchmarks comparing TSP to TP, SP, and TP+SP. TSP is positioned as a hardware-aware alternative for long-context and memory-constrained training that can be combined with pipeline and expert parallelism for dense and MoE models.

Significance. If the reported benchmarks and analysis confirm that the additional communication volume (sequence-wise KV exchanges and ring circulations) remains practical relative to the memory savings on real interconnects, TSP could provide a useful new dimension for parallelism in Transformer training and inference. It directly targets the tension between parameter/activation memory and sequence length in large models, offering a way to shard both weights and activations without requiring separate mesh dimensions. The explicit schedules for attention and gated MLPs, together with the theoretical cost analysis, strengthen the contribution and its potential integration with existing schemes.

major comments (2)

[theoretical communication and memory analysis] The central viability claim—that folding TP and SP onto one axis yields a net practical win—depends on the communication overhead not offsetting memory reductions. The theoretical analysis should include explicit scaling equations for the extra all-to-all or ring traffic (e.g., volume as a function of hidden size, sequence length, and number of ranks) and direct comparison to the per-device memory savings; without these, it is difficult to evaluate whether the trade-off remains favorable outside the tested regimes.
[benchmarks] The benchmarks section must report not only memory usage but also end-to-end iteration time or throughput on representative GPU clusters, including scaling behavior with sequence length and model size. If the added communication dominates iteration time, the positioning of TSP as a hardware-aware alternative is undermined even if memory numbers improve.

minor comments (2)

[TSP attention schedule] Clarify the exact communication primitives used (e.g., all-to-all vs. ring) in the attention schedule and whether they assume specific interconnect topologies.
[introduction] The abstract states that TSP can be used 'in concert with' pipeline and expert parallelism; a brief discussion or diagram showing the combined mesh layout would help readers understand the integration.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of TSP's potential. We address each major comment below, agreeing that additional detail will strengthen the manuscript, and outline the revisions we will make.

Referee: The central viability claim—that folding TP and SP onto one axis yields a net practical win—depends on the communication overhead not offsetting memory reductions. The theoretical analysis should include explicit scaling equations for the extra all-to-all or ring traffic (e.g., volume as a function of hidden size, sequence length, and number of ranks) and direct comparison to the per-device memory savings; without these, it is difficult to evaluate whether the trade-off remains favorable outside the tested regimes.

Authors: We agree that explicit scaling equations and direct comparisons would make the viability analysis more rigorous and easier to evaluate across regimes. The manuscript already derives communication volume and memory footprint for TSP versus TP, SP, and TP+SP, but we will revise the theoretical analysis section to add closed-form expressions for the additional TSP traffic (KV exchanges for attention and ring circulations for gated MLPs) as functions of hidden size H, sequence length S, and rank count N, together with side-by-side formulas showing net memory reduction per device. These additions will be placed immediately before the benchmark results.

Revision: yes

Referee: The benchmarks section must report not only memory usage but also end-to-end iteration time or throughput on representative GPU clusters, including scaling behavior with sequence length and model size. If the added communication dominates iteration time, the positioning of TSP as a hardware-aware alternative is undermined even if memory numbers improve.

Authors: We concur that end-to-end timing and scaling curves are essential to substantiate the hardware-aware claim. Our current benchmarks already measure memory and compare TSP to the three baselines, but we will expand the experimental section to include wall-clock iteration times and tokens-per-second throughput on multi-GPU clusters. We will add plots showing how these metrics scale with sequence length (up to the longest contexts tested) and model size, allowing readers to see precisely when the extra communication is amortized by the memory savings.

Revision: yes

Circularity Check

0 steps flagged

No circularity: engineering proposal with independent analysis and benchmarks

The paper describes TSP as a practical folding of TP and SP onto one axis, supplies runtime schedules for attention and gated MLPs, and reports theoretical communication/memory analysis plus empirical benchmarks against TP, SP, and TP+SP. No equations derive a result that is definitionally equivalent to its inputs, no parameters are fitted to a subset and then relabeled as predictions, and no load-bearing claims rest on self-citations or uniqueness theorems imported from prior author work. The central viability claim (memory savings outweigh added communication) is presented as an empirical trade-off to be evaluated on hardware, not a self-referential derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The contribution is a proposed parallelism layout and runtime schedules; it introduces no new mathematical axioms, fitted constants, or postulated physical entities beyond standard assumptions of distributed transformer training.
