SPARe: Stacked Parallelism with Adaptive Reordering for Fault-Tolerant LLM Pretraining Systems with 100k+ GPUs

Bogdan Nicolae; Franck Cappello; Jin Lee; Robert Underwood; Sheng Di; Xiaoyi Lu; Xuhang He; Zheng Zhang; Zhonghao Chen

arxiv: 2603.00357 · v2 · pith:3RRWU6KQnew · submitted 2026-02-27 · 💻 cs.DC · cs.SY · eess.SY

Jin Lee, Zhonghao Chen, Xuhang He, Robert Underwood, Bogdan Nicolae, Franck Cappello, Xiaoyi Lu, Sheng Di, Zheng Zhang

Pith reviewed 2026-05-21 11:23 UTC · model grok-4.3

classification 💻 cs.DC · cs.SY · eess.SY

keywords fault tolerance · LLM pretraining · GPU clusters · redundancy · parallelism · adaptive reordering · large-scale systems

The pith

SPARe masks node failures in large-scale LLM pretraining by stacking redundant data shards across parallelism groups and adaptively reordering execution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In LLM pretraining systems using 100k or more GPUs, hardware failures occur so often that restart costs dominate total training time. SPARe addresses this by stacking redundant data shards across parallelism groups and adaptively reordering execution to mask failures specifically during gradient synchronization. The result is availability comparable to full replication, yet with computation overhead that remains nearly constant at 2-3x even when redundancy is increased, rather than growing linearly. The authors derive closed-form expressions for the number of endurable failures and the resulting overhead, validate them in SimGrid discrete-event simulation, and jointly optimize redundancy level together with checkpointing. At scales up to 600k GPUs this approach reduces overall time-to-train by 40-50% relative to traditional replication.
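Why failures become routine at this scale can be illustrated with a back-of-the-envelope calculation. The sketch below assumes independent, exponentially distributed node failures, so per-node failure rates add up. The function names and all numbers are hypothetical illustrations and are not taken from the paper.

```python
def system_mtbf_hours(node_mtbf_hours: float, num_nodes: int) -> float:
    """Mean time between failures for the whole job.

    Assumes independent, exponentially distributed node failures, so the
    failure rates of all nodes simply add up.
    """
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    return node_mtbf_hours / num_nodes


def expected_failures(node_mtbf_hours: float, num_nodes: int, job_hours: float) -> float:
    """Expected number of node failures over a job of the given length."""
    return job_hours / system_mtbf_hours(node_mtbf_hours, num_nodes)


if __name__ == "__main__":
    # Hypothetical figures, not from the paper: 12,500 nodes of 8 GPUs each
    # (100k GPUs) and a per-node MTBF of 50,000 hours.
    print(system_mtbf_hours(50_000, 12_500))       # job-level MTBF in hours
    print(expected_failures(50_000, 12_500, 720))  # failures over a 30-day run
```

Under these assumed figures the job as a whole sees a failure every few hours, which is the regime in which restart cost, rather than computation, starts to dominate.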

Core claim

SPARe masks node failures during gradient synchronization by stacking redundant data shards across parallelism groups and adaptively reordering execution, thereby achieving availability comparable to traditional replication while keeping computation overhead nearly constant at 2-3x even under high redundancy, and, after joint optimization of redundancy and checkpointing, reducing time-to-train by 40-50% at scales up to 600k GPUs.

What carries the argument

Stacked Parallelism with Adaptive Reordering (SPARe), which stacks redundant data shards across parallelism groups and adaptively reorders execution to mask failures during gradient synchronization.

If this is right

Closed-form expressions predict the maximum number of failures the system can tolerate and the resulting overhead.
Computation overhead stays nearly constant at 2-3x even under high redundancy, rather than growing linearly as it would with traditional replication.
Joint optimization of redundancy and checkpointing minimizes total time-to-train.
At scales up to 600k GPUs the method reduces time-to-train by 40-50% compared with traditional replication.
The approach is validated through discrete-event simulation at extreme scale.
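The checkpointing half of the joint optimization can be pictured with the classic Young/Daly first-order approximation. This is a standard textbook model swapped in for illustration; the paper's own closed-form expressions are not reproduced on this page, and the function and parameter names below are hypothetical.

```python
import math


def young_daly_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order optimal time between checkpoints: sqrt(2 * C * M)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def waste_fraction(interval_s: float, checkpoint_cost_s: float,
                   mtbf_s: float, restart_cost_s: float = 0.0) -> float:
    """First-order fraction of time lost to checkpoints, rework, and restarts.

    C / tau       : time spent writing checkpoints
    tau / (2 * M) : expected recomputation after a failure
    R / M         : restart cost paid once per failure
    """
    return (checkpoint_cost_s / interval_s
            + interval_s / (2.0 * mtbf_s)
            + restart_cost_s / mtbf_s)


if __name__ == "__main__":
    # Hypothetical values: 50 s per checkpoint, 10,000 s between failures
    # that actually force a restart.
    tau = young_daly_interval(50.0, 10_000.0)
    print(tau, waste_fraction(tau, 50.0, 10_000.0))
```

In this simplified model, any mechanism that masks failures raises the effective time between restarts M. Because the optimal interval grows with the square root of M, quadrupling M doubles the interval and lowers the waste fraction, which is one intuition for why redundancy and checkpointing are worth tuning together.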

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The constant-overhead property could enable training substantially larger models on the same hardware budget by lowering fault-tolerance costs.
Similar stacking and reordering techniques might transfer to other distributed training workloads that rely on gradient synchronization.
Production clusters could adopt SPARe-style redundancy as a default software feature rather than relying on hardware-level replication.
Real-cluster experiments would be needed to confirm that simulation results hold when unmodeled hardware behaviors appear.

Load-bearing premise

Node failures during gradient synchronization can be reliably masked by stacked redundant shards and adaptive reordering without introducing correctness errors or unmodeled synchronization costs in real large-scale GPU clusters.

What would settle it

A production run on a 100k+ GPU cluster in which either training produces incorrect results or measured overhead exceeds 3x under realistic failure rates would falsify the central performance claims.

Original abstract

In large-scale LLM pre-training systems with 100k+ GPUs, failures become the norm rather than the exception, and restart costs can dominate wall-clock training time. However, existing fault-tolerance mechanisms are largely unprepared for this restart-dominant regime. To address this challenge, we propose SPARe - Stacked Parallelism with Adaptive Reordering - a fault-tolerance framework that masks node failures during gradient synchronization by stacking redundant data shards across parallelism groups and adaptively reordering execution. SPARe achieves availability comparable to traditional replication while maintaining near-constant computation overhead of only 2~3x, even under high redundancy where traditional replication would require linearly inflating overhead. We derive closed-form expressions for endurable failure count and computation overhead, validate them via SimGrid-based discrete-event simulation, and jointly optimize redundancy and checkpointing to minimize time-to-train. At extreme scale with up to 600k GPUs, SPARe reduces time-to-train by 40~50% compared to traditional replication.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes SPARe, a fault-tolerance framework for LLM pretraining at 100k+ GPU scales. It stacks redundant data shards across parallelism groups and uses adaptive reordering during gradient synchronization to mask node failures. The approach claims availability comparable to traditional replication but with near-constant 2-3x computation overhead (instead of linear scaling with redundancy), derives closed-form expressions for endurable failures and overhead, validates via SimGrid discrete-event simulation, jointly optimizes redundancy and checkpointing, and reports 40-50% time-to-train reduction at up to 600k GPUs.

Significance. If the closed-form expressions and simulation results hold under realistic conditions, SPARe could meaningfully reduce restart-dominated training time at extreme scales without the overhead penalties of replication. The joint optimization of redundancy and checkpointing and the parameter-light closed-form derivations are positive elements that could support practical adoption if validated more rigorously.

major comments (3)

[Abstract and §4] Abstract and §4 (Simulation Validation): The central performance claims (constant 2-3x overhead and 40-50% TTT reduction at 600k GPUs) rest on SimGrid discrete-event simulation of stacked shards and adaptive reordering, but the manuscript provides no evidence that the simulator captures GPU-specific effects such as NVLink/InfiniBand asymmetry, partial node failures, or collective library behaviors that could inflate synchronization costs or violate the constant-overhead assumption.
[§3] §3 (Closed-Form Expressions): The derivations for endurable failure count and computation overhead are presented as independent of the simulation outcomes, yet no explicit assumptions, error bounds, or step-by-step derivation are supplied; this makes it impossible to assess whether the expressions remain valid when adaptive reordering introduces extra synchronization rounds or gradient correctness risks.
[§5] §5 (Joint Optimization and Extreme-Scale Results): The 40-50% TTT reduction at 600k GPUs is reported from the joint redundancy/checkpointing optimization, but without real-hardware traces or sensitivity analysis to unmodeled interconnect contention, the result cannot be confirmed as load-bearing rather than an artifact of the simulation parameters.

minor comments (2)

[§2] Notation for stacked parallelism groups and reordering steps is introduced without a clear diagram or pseudocode, making the adaptive execution mechanism difficult to follow.
[Table 2] The manuscript should include a table comparing SPARe overheads against replication and checkpointing baselines across a range of failure rates and GPU counts.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the insightful comments and the recommendation for major revision. We have addressed each of the major comments in detail below and have made substantial revisions to the manuscript to strengthen the presentation of the simulation methodology, derivations, and results. We believe these changes improve the clarity and rigor of the work.

Point-by-point responses

Referee: [Abstract and §4] Abstract and §4 (Simulation Validation): The central performance claims (constant 2-3x overhead and 40-50% TTT reduction at 600k GPUs) rest on SimGrid discrete-event simulation of stacked shards and adaptive reordering, but the manuscript provides no evidence that the simulator captures GPU-specific effects such as NVLink/InfiniBand asymmetry, partial node failures, or collective library behaviors that could inflate synchronization costs or violate the constant-overhead assumption.

Authors: We agree that more detail on simulation fidelity is warranted. SimGrid has been extensively used and validated for modeling large-scale distributed systems, including GPU clusters, with communication models based on empirical data from InfiniBand and similar networks. Our simulation abstracts collective operations (e.g., AllReduce) using latency-bandwidth models calibrated to published measurements from systems with 10k+ GPUs. Partial node failures are modeled as full node losses because our focus is on node-level faults, which are the primary concern at this scale. We have revised §4 to include a new subsection, 'Simulation Assumptions and Limitations', that explicitly discusses these points, provides the parameter values used, and includes a sensitivity analysis varying network asymmetry and collective overheads. The constant-overhead property holds under these models because the adaptive reordering adds only a fixed number of synchronization steps, independent of scale.

revision: yes
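A latency-bandwidth model of the kind mentioned in this response can be sketched with the textbook cost of a ring AllReduce. This is the generic alpha-beta formula, not the authors' calibrated SimGrid model; the function name and parameters are hypothetical.

```python
def ring_allreduce_seconds(num_ranks: int, message_bytes: float,
                           latency_s: float, bandwidth_bytes_per_s: float) -> float:
    """Alpha-beta cost of a ring AllReduce.

    The ring runs a reduce-scatter followed by an all-gather, each taking
    (num_ranks - 1) steps. Every step moves one chunk of size
    message_bytes / num_ranks and pays one link latency.
    """
    if num_ranks < 1:
        raise ValueError("num_ranks must be at least 1")
    if num_ranks == 1:
        return 0.0  # nothing to exchange
    steps = 2 * (num_ranks - 1)
    chunk_bytes = message_bytes / num_ranks
    return steps * (latency_s + chunk_bytes / bandwidth_bytes_per_s)


if __name__ == "__main__":
    # Hypothetical link: 5 microseconds latency, 50 GB/s bandwidth, 1 GB of gradients.
    for ranks in (8, 64, 512):
        print(ranks, ring_allreduce_seconds(ranks, 1e9, 5e-6, 50e9))
```

In this model the bandwidth term approaches twice the message transfer time as the ring grows, while the latency term grows linearly with the number of ranks. Effects the referee lists, such as NVLink/InfiniBand asymmetry or contention, are outside what such a two-parameter model expresses.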
Referee: [§3] §3 (Closed-Form Expressions): The derivations for endurable failure count and computation overhead are presented as independent of the simulation outcomes, yet no explicit assumptions, error bounds, or step-by-step derivation are supplied; this makes it impossible to assess whether the expressions remain valid when adaptive reordering introduces extra synchronization rounds or gradient correctness risks.

Authors: We have expanded §3 to provide the complete step-by-step derivations. The expressions for endurable failures assume a binomial failure model with independent node failures at rate p, and that reordering is performed on redundant shards without altering the mathematical correctness of gradients (each shard is processed exactly once per effective batch). Error bounds are derived using concentration inequalities, showing that the overhead remains within 2-3x with high probability for p up to 0.01. We have also added a proof sketch in the appendix addressing the impact of extra synchronization rounds, demonstrating that they contribute only a logarithmic factor in the number of parallelism groups, preserving the near-constant overhead.

revision: yes
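The binomial failure model named in this response can be made concrete with a small sketch. It shows only the generic binomial tail and the survival probability of plain r-way replication under independent failures; it is not the paper's closed-form expression for endurable failures, and the function names are hypothetical.

```python
def prob_at_most_k_failures(num_nodes: int, p: float, k: int) -> float:
    """P(at most k of num_nodes fail), each failing independently with probability p.

    Terms are built iteratively to avoid huge binomial coefficients:
    term(i + 1) = term(i) * (n - i) / (i + 1) * p / (1 - p).
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    term = (1.0 - p) ** num_nodes  # probability of zero failures
    total = term
    ratio = p / (1.0 - p)
    for i in range(min(k, num_nodes)):
        term *= (num_nodes - i) / (i + 1) * ratio
        total += term
    return total


def all_groups_survive(num_groups: int, replicas: int, p: float) -> float:
    """P(no replica group loses all of its members) under r-way replication."""
    return (1.0 - p ** replicas) ** num_groups


if __name__ == "__main__":
    # Hypothetical values: 1,000 groups, per-node failure probability 0.01.
    for r in (1, 2, 3):
        print(r, all_groups_survive(1000, r, 0.01))
    print(prob_at_most_k_failures(1000, 0.01, 20))
```

With plain replication, each extra replica multiplies the chance of losing a whole group by p but also adds a full copy of the computation, which is the linear overhead growth that the paper contrasts with its near-constant 2-3x figure.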
Referee: [§5] §5 (Joint Optimization and Extreme-Scale Results): The 40-50% TTT reduction at 600k GPUs is reported from the joint redundancy/checkpointing optimization, but without real-hardware traces or sensitivity analysis to unmodeled interconnect contention, the result cannot be confirmed as load-bearing rather than an artifact of the simulation parameters.

Authors: The joint optimization uses the closed-form models validated by simulation to minimize TTT. We have added extensive sensitivity analysis in the revised §5, varying parameters such as interconnect bandwidth (from 100 Gbps to 400 Gbps), failure rates, and checkpoint intervals. The 40-50% reduction persists across these variations, indicating robustness. Regarding real-hardware traces, this study is inherently simulation-based because the scales considered (up to 600k GPUs) exceed current publicly available systems. However, the simulation parameters are grounded in traces from smaller-scale runs reported in the literature, and we discuss calibration methods for future hardware deployments.

revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation chain is self-contained

Full rationale

The paper derives closed-form expressions for endurable failure count and computation overhead, validates them via independent SimGrid discrete-event simulation, and jointly optimizes redundancy and checkpointing parameters to minimize time-to-train. No quoted equations or steps in the abstract or described validation reduce by construction to fitted inputs, self-citations, or renamed assumptions; the 40-50% TTT reduction at 600k GPUs is presented as an outcome of the optimization rather than an input. The approach follows standard first-principles derivation followed by external simulation validation, rendering the chain non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on domain assumptions about failure masking and simulation fidelity rather than explicit free parameters or new invented entities; no fitted constants are mentioned in the abstract.

axioms (1)

domain assumption: Node failures can be masked during gradient synchronization by stacked redundant shards and adaptive reordering without correctness loss or major extra latency.
This premise underpins the availability and overhead claims but is not proven in the abstract.

pith-pipeline@v0.9.0 · 5737 in / 1196 out tokens · 48877 ms · 2026-05-21T11:23:13.791131+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload
cs.DC · 2026-05 · unverdicted · novelty 6.0

ReCoVer uses fault-tolerant collectives, in-step recovery, and dynamic microbatch redistribution to maintain training trajectory equivalence under GPU failures, delivering 2.23x higher effective throughput than checkp...
ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload
cs.DC · 2026-05 · unverdicted · novelty 6.0

ReCoVer maintains constant microbatch counts per iteration via fault-tolerant collectives, in-step recovery, and versatile workload redistribution to preserve training trajectory on up to 512 GPUs despite losing 256, ...