AutoSP: Unlocking Long-Context LLM Training Via Compiler-Based Sequence Parallelism

Ahan Gupta; Masahiro Tanaka; Minjia Zhang; Neel Dani; Olatunji Ruwase; Zhihao Wang

arXiv: 2604.27089 · v1 · submitted 2026-04-29 · cs.LG · cs.DC · cs.PF

Pith reviewed 2026-05-07 10:07 UTC · model grok-4.3

keywords: sequence parallelism, long-context LLM, compiler-based optimization, activation checkpointing, training parallelism, scalability, automated optimization

The pith

Compiler automation of sequence parallelism enables up to 2.7 times longer context training for large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that an automated compiler can handle the complex task of adding sequence parallelism and long-context activation checkpointing to LLM training pipelines. This automation eliminates the need for manual rewriting of training code, which currently requires deep expertise in composing different parallelism strategies. A sympathetic reader would care because it promises to make training on very long sequences practical and accessible without sacrificing speed. The authors demonstrate this by achieving up to 2.7 times longer contexts on NVIDIA hardware and 2.5 times on AMD hardware compared to manual baselines.

Core claim

AutoSP compiles the model and automatically applies sequence parallelism together with long-context aware activation checkpointing to existing training setups. This approach allows LLM training to scale to much longer input sequences while preserving the performance of the original pipeline.

What carries the argument

AutoSP, the compiler that identifies opportunities for sequence parallelism and activation checkpointing and integrates them into the training process.

Load-bearing premise

The compiler is able to correctly and automatically compose sequence parallelism with existing training pipelines without introducing bugs or needing manual model-specific fixes.

What would settle it

Running a training job with the compiler on a known long-context task and observing either incorrect loss values or runtime errors that do not occur in the baseline would disprove the central claim.
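
A minimal sketch of the check described above, assuming per-step training losses have been logged for both the baseline run and the compiled run. The function name `losses_match`, the tolerances, and the loss values are hypothetical illustrations and do not come from the paper.

```python
import math

def losses_match(baseline, candidate, rel_tol=1e-2, abs_tol=1e-3):
    """Compare two per-step loss curves.

    Returns (True, None) if every step agrees within tolerance, otherwise
    (False, step) for the first step that diverges. Non-finite candidate
    losses (NaN/inf) always count as a divergence.
    """
    if len(baseline) != len(candidate):
        # A run that crashed early has fewer logged steps.
        return False, min(len(baseline), len(candidate))
    for step, (b, c) in enumerate(zip(baseline, candidate)):
        if not math.isfinite(c):
            return False, step
        if not math.isclose(b, c, rel_tol=rel_tol, abs_tol=abs_tol):
            return False, step
    return True, None

# Hypothetical loss logs, for illustration only.
baseline_losses = [2.310, 2.104, 1.987, 1.902]
compiled_losses = [2.311, 2.103, 1.988, 1.901]
diverged_losses = [2.311, 2.103, 2.450, float("nan")]

print(losses_match(baseline_losses, compiled_losses))
print(losses_match(baseline_losses, diverged_losses))
```

The tolerances are a judgment call: sequence-parallel runs reorder floating-point reductions, so bitwise equality is usually too strict, while a loose tolerance can hide real bugs.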

Figures

Figures reproduced from arXiv: 2604.27089 by Ahan Gupta, Masahiro Tanaka, Minjia Zhang, Neel Dani, Olatunji Ruwase, Zhihao Wang.

**Figure 1.** DeepSpeed-Ulysses with 3-SP groups. alltoall operators toggle the layout of activations at attention-layer boundaries. Linear projections operate on the partial sequence length, while attention layers operate on a subset of the heads. First, tokens are sharded across the sequence dimension, with different devices operating on different parts of the input sequence. Next, linear projections form multiple …
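
The layout toggle in this caption can be illustrated with a pure-Python simulation. This is a sketch under stated assumptions, not DeepSpeed-Ulysses or AutoSP code: devices are modelled as nested lists, each activation element is a `(token, head)` label, and the helper names are hypothetical.

```python
SEQ_LEN, NUM_HEADS, WORLD = 6, 3, 3  # toy sizes; WORLD plays the role of the SP group size

def shard_by_sequence(seq_len, num_heads, world):
    """Device r holds a contiguous slice of tokens, with every head."""
    chunk = seq_len // world
    return [
        [[(t, h) for h in range(num_heads)] for t in range(r * chunk, (r + 1) * chunk)]
        for r in range(world)
    ]

def all_to_all_seq_to_head(shards, num_heads):
    """Before attention: each device ends up with all tokens for its own heads."""
    world = len(shards)
    heads_per_dev = num_heads // world
    out = []
    for dst in range(world):
        lo, hi = dst * heads_per_dev, (dst + 1) * heads_per_dev
        rows = []
        for src in range(world):  # receiving in rank order keeps tokens sorted
            for token_row in shards[src]:
                rows.append(token_row[lo:hi])
        out.append(rows)
    return out

def all_to_all_head_to_seq(shards, seq_len):
    """After attention: each device gets back its token slice with every head."""
    world = len(shards)
    chunk = seq_len // world
    out = []
    for dst in range(world):
        rows = []
        for t in range(dst * chunk, (dst + 1) * chunk):
            row = []
            for src in range(world):  # concatenating in rank order restores head order
                row.extend(shards[src][t])
            rows.append(row)
        out.append(rows)
    return out

seq_sharded = shard_by_sequence(SEQ_LEN, NUM_HEADS, WORLD)
head_sharded = all_to_all_seq_to_head(seq_sharded, NUM_HEADS)
restored = all_to_all_head_to_seq(head_sharded, SEQ_LEN)

print(seq_sharded[0])   # 2 tokens x 3 heads on device 0
print(head_sharded[0])  # 6 tokens x 1 head on device 0
```

The round trip returning the original layout is the property that lets linear projections run on a sequence slice while attention sees the full sequence for a subset of heads.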

**Figure 2.** A sample neural network compiled using PyTorch-2.0. We illustrate the lowering to Torch-IR and Aten-IR that occurs within Dynamo and the extra operators inserted during the lowering process. PyTorch-2.0 compiler. PyTorch-2.0 (Ansel et al., 2024) is a just-in-time deep-learning compiler that targets training workloads. It comprises two components, Dynamo and Inductor, each with a series of compiler passe…

**Figure 3.** An overview of AutoSP. AutoSP enables an automated approach to scale input context

**Figure 4.** A comparison of PyTorch-2.0 vs. AutoSP’s AC-passes on a sample code snippet. The red line is the additional edge from source to node in PyTorch-2.0’s AC-pass to enforce no rematerialization of compute-heavy operators. AutoSP’s AC-pass, in removing this constraint, reduces memory consumption at negligible cost to throughput. Why PyTorch-2.0’s AC solution is insufficient. PyTorch-2.0’s AC-pass is effective …
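
The save-versus-recompute trade-off behind this caption can be shown on a toy chain of scalar layers. The sketch below is generic segment-wise activation checkpointing. It is neither PyTorch-2.0's AC-pass nor AutoSP's, and the layer function, segment size, and counters are assumptions chosen for illustration.

```python
import math

def layer(x, w):
    return math.tanh(w * x)

def layer_grad_x(x, w):
    # d/dx tanh(w * x) = w * (1 - tanh(w * x)^2)
    return w * (1.0 - math.tanh(w * x) ** 2)

def backward_save_all(x, weights):
    """Save every layer input in the forward pass; nothing is recomputed."""
    inputs = []
    for w in weights:
        inputs.append(x)
        x = layer(x, w)
    grad = 1.0
    for xi, w in zip(reversed(inputs), reversed(weights)):
        grad *= layer_grad_x(xi, w)
    return grad, len(inputs), 0  # (d out / d x, saved activations, recomputed layers)

def backward_checkpointed(x, weights, segment):
    """Save only each segment's input; rebuild the rest during backward."""
    checkpoints = []
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints.append(x)
        x = layer(x, w)
    grad, recomputed, peak = 1.0, 0, len(checkpoints)
    for s in reversed(range(len(checkpoints))):
        seg_weights = weights[s * segment:(s + 1) * segment]
        xi = checkpoints[s]
        seg_inputs = []
        for w in seg_weights:  # rematerialize this segment's layer inputs
            seg_inputs.append(xi)
            xi = layer(xi, w)
            recomputed += 1
        # Simple upper bound: all checkpoints plus one live segment.
        peak = max(peak, len(checkpoints) + len(seg_inputs))
        for xin, w in zip(reversed(seg_inputs), reversed(seg_weights)):
            grad *= layer_grad_x(xin, w)
    return grad, peak, recomputed

weights = [0.5 + 0.05 * i for i in range(16)]
full = backward_save_all(0.3, weights)
ckpt = backward_checkpointed(0.3, weights, segment=4)
print("saved activations:", full[1], "vs peak with checkpointing:", ckpt[1])
print("recomputed layers:", ckpt[2], "gradients agree:", math.isclose(full[0], ckpt[0]))
```

In this toy every layer costs the same to recompute, so it does not capture the caption's point that compute-heavy operators are treated differently. It only shows that recomputation trades extra forward work for fewer stored activations without changing the gradient.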

**Figure 5.** Maximum sequence length prior to OOM across various model sizes. AutoSP increases the trainability of all model sizes. (Plot labels recovered from extraction: ZeRO-3, RingAttn, DS-Ulysses, AutoSP; 8K, 24K, 90K; Time (s); OOM.)

**Figure 7.** Comparing the max sequence length prior to OOM across different hardware. AutoSP enables training on longer sequences on NVIDIA superchips and AMD hardware. Runtime Performance. For different techniques, we identify the impact on training iteration times at various sequence lengths in

**Figure 8.** Average execution time of various Llama 3.2 model sizes at different sequence lengths on NVIDIA and AMD hardware. AutoSP matches the performance of hand-written baselines and supports longer sequence training. In this section, we run additional studies to demonstrate the effectiveness of AutoSP, as well as to ascertain the impact of each optimization on different components of LLM training. We compare…

**Figure 9.** Breakdown of Attention and MLP operator memory consumption and forward iteration times. AutoSP reduces the activation memory of Attention and MLP operators with a marginal performance difference. Breakdown analysis. We break down the impact of AutoSP’s optimizations in

Original abstract

Large-language-models (LLMs) demonstrate enormous utility in long-context tasks which require processing prompts that consist of tens to hundreds of thousands of tokens. However, existing LLM training libraries do not provide easy to use abstractions to optimize for long-context training, instead focusing on optimizations for models with large parameter counts through ZeRO-3/FSDP, Tensor and Pipeline parallelism. This forces users to rewrite LLM training libraries to incorporate compositions of various complex long-context optimizations, such as sequence-parallelism, to training pipelines; a process that requires in-depth expertise, reducing developer productivity. To tackle these challenges, we introduce AutoSP: the first automated solution to automatically optimize LLM training for longer-contexts. AutoSP compiles models and applies a targeted set of optimizations: automated sequence parallelism, and long-context aware activation-checkpointing, to drastically enhance LLM trainability at negligible cost to throughput. Our evaluation demonstrates AutoSP's capability on both NVIDIA and AMD hardware, increasing training contexts by upto 2.7$\times$ and 2.5$\times$ respectively over competitive hand-written baseline at negligible cost to runtime performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AutoSP automates sequence parallelism for long-context LLMs via a compiler, but the correctness of its composition with ZeRO-3/FSDP and other parallelisms rests on thin evidence.

AutoSP is a compiler pass that tries to insert sequence parallelism and long-context activation checkpointing into existing LLM training code without forcing users to rewrite their pipelines by hand. The main result is that it lets training contexts grow by up to 2.7× on NVIDIA and up to 2.5× on AMD hardware while keeping throughput close to the hand-tuned baselines. That automation claim is the part worth paying attention to, because most current long-context work still requires manual sharding tweaks on top of ZeRO-3, FSDP, tensor, and pipeline parallelism.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces AutoSP, a compiler-based system that automatically rewrites LLM training code to insert sequence parallelism and long-context-aware activation checkpointing. It claims this enables training with substantially longer contexts (up to 2.7× on NVIDIA hardware and 2.5× on AMD hardware) compared to competitive hand-written baselines while incurring negligible runtime overhead, by composing with existing strategies such as ZeRO-3/FSDP, tensor parallelism, and pipeline parallelism without manual intervention.

Significance. If the compiler passes are shown to preserve semantics when composing sequence parallelism with sharded data-parallel and other parallelism strategies, the work would meaningfully lower the barrier to long-context LLM training by automating what currently requires deep systems expertise. The automation of checkpointing and parallelism composition is a practical strength, but the manuscript supplies no machine-checked proofs, reproducible artifacts, or falsifiable verification steps to support the correctness claim.

major comments (3)

[Abstract] The headline performance claims (2.7× / 2.5× context increases at negligible throughput cost) rest on an unverified assumption that the compiler correctly composes sequence parallelism with ZeRO-3/FSDP, tensor, and pipeline parallelism; no details are given on how activation sharding, attention masks, or gradient synchronization are handled for long sequences.
[Section 3, Compiler passes] The targeted optimizations for automated sequence parallelism and long-context checkpointing are described at a high level only; the manuscript provides neither the transformation rules nor any equivalence argument showing that the rewrites preserve training semantics when interleaved with existing sharding strategies.
[Section 4, Evaluation] No ablation studies, model-quality metrics (e.g., perplexity or downstream task performance), or error analysis are reported to confirm that the automatically generated long-context training runs remain stable and produce equivalent models to the hand-written baselines.

minor comments (2)

[Abstract] 'upto' should be written as 'up to'.
[Section 3] The manuscript would benefit from a clear statement of the supported model architectures and any limitations on the automatic composition (e.g., which attention variants or checkpointing policies are handled).

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive feedback, which identifies key areas where additional detail and evaluation would strengthen the manuscript. We address each major comment below and indicate the planned revisions.

Referee: [Abstract] The headline performance claims (2.7× / 2.5× context increases at negligible throughput cost) rest on an unverified assumption that the compiler correctly composes sequence parallelism with ZeRO-3/FSDP, tensor, and pipeline parallelism; no details are given on how activation sharding, attention masks, or gradient synchronization are handled for long sequences.

Authors: We agree that the abstract would benefit from more detail on composition. In the revised manuscript we will expand the abstract with a concise description of how AutoSP handles activation sharding, attention masks, and gradient synchronization when composing sequence parallelism with ZeRO-3/FSDP, tensor parallelism, and pipeline parallelism. These mechanisms will be elaborated in the updated Section 3.

Revision: yes

Referee: [Section 3, Compiler passes] The targeted optimizations for automated sequence parallelism and long-context checkpointing are described at a high level only; the manuscript provides neither the transformation rules nor any equivalence argument showing that the rewrites preserve training semantics when interleaved with existing sharding strategies.

Authors: We accept that Section 3 is currently high-level. We will add the concrete transformation rules for the sequence-parallelism and checkpointing passes, together with an informal equivalence argument that explains how semantics are preserved under interleaving with sharding strategies. We do not supply machine-checked proofs, as this is a practical compiler implementation rather than a formally verified system.

Revision: partial

Referee: [Section 4, Evaluation] No ablation studies, model-quality metrics (e.g., perplexity or downstream task performance), or error analysis are reported to confirm that the automatically generated long-context training runs remain stable and produce equivalent models to the hand-written baselines.

Authors: We will revise Section 4 to include ablation studies isolating the contribution of each optimization, report perplexity on validation sets for both AutoSP-generated and hand-written runs, and add an error analysis addressing training stability and any observed differences. These additions will directly compare model quality and stability.

Revision: yes
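
The perplexity comparison mentioned in this exchange reduces to a small calculation. The sketch below assumes per-token negative log-likelihoods (natural log) are available from a validation pass of each run. The function names and the NLL values are hypothetical and are not results from the paper.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean per-token negative log-likelihood)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))

def relative_gap(reference, candidate):
    """Relative difference of the candidate against the reference."""
    return abs(candidate - reference) / reference

# Hypothetical per-token NLLs from the same validation set.
handwritten_nlls = [2.10, 1.95, 2.30, 2.05]
autosp_nlls = [2.12, 1.94, 2.31, 2.04]

ppl_ref = perplexity(handwritten_nlls)
ppl_new = perplexity(autosp_nlls)
print(f"hand-written: {ppl_ref:.3f}  compiled: {ppl_new:.3f}  "
      f"relative gap: {relative_gap(ppl_ref, ppl_new):.4f}")
```

What gap counts as "equivalent" is a choice the evaluation would need to state. Run-to-run variance across random seeds is a common yardstick.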

Standing simulated objections (not resolved)

Provision of machine-checked proofs or formal verification steps for semantic preservation of the compiler transformations.

Circularity Check

0 steps flagged

No significant circularity in AutoSP systems description

The paper presents a compiler-based engineering system (AutoSP) for automating sequence parallelism and activation checkpointing in LLM training. It contains no equations, mathematical derivations, fitted parameters, or closed-form results that could reduce to their own inputs by construction. Claims rest on implementation details and empirical throughput measurements rather than any self-referential chain. No self-citations are used to justify uniqueness theorems, ansatzes, or load-bearing premises. This is a standard self-contained systems paper with no detectable circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that a compiler can safely and automatically insert sequence parallelism and context-aware checkpointing into arbitrary LLM training graphs without semantic changes or performance regressions.

axioms (2)

domain assumption: Compiler transformations preserve model semantics and training correctness when applying sequence parallelism and activation checkpointing. Invoked implicitly when claiming automatic optimization works for general models.
domain assumption: Existing training frameworks (ZeRO-3/FSDP, tensor/pipeline parallelism) can be composed with the new sequence-parallelism pass without conflicts. Required for the 'compositions of various complex long-context optimizations' to be automated.

pith-pipeline@v0.9.0 · 5518 in / 1194 out tokens · 67596 ms · 2026-05-07T10:07:31.559642+00:00 · methodology
