pith. machine review for the scientific record.

arxiv: 2605.11215 · v1 · submitted 2026-05-11 · 💻 cs.DC · cs.AI

Recognition: 2 theorem links

ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 01:55 UTC · model grok-4.3

classification 💻 cs.DC cs.AI
keywords resilient training · fault tolerance · LLM pre-training · GPU clusters · distributed systems · collective communication · workload redistribution

The pith

ReCoVer keeps the per-iteration gradient distribution identical to failure-free LLM pre-training by holding microbatch count constant after any GPU losses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ReCoVer to handle routine hardware faults during large-scale LLM pre-training without letting the optimization path drift. It enforces one core rule: every iteration must still consume exactly the same number of microbatches as a perfect run. It achieves this by isolating faults in collectives, recovering partial progress inside the step, and reassigning the remaining work to surviving GPUs. This keeps the stochastic properties of the gradient unchanged, unlike checkpoint-restart methods that can alter the trajectory. A reader would care because the result is measurably more tokens processed per GPU-hour when failures occur repeatedly.
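To make the invariant concrete, here is a minimal sketch of a quota-redistribution step. The function name, the 48-survivor example, and the even-split policy are illustrative assumptions, not the paper's actual policy; only the microbatch total of 8192 echoes the run described in the paper's figures.

```python
def redistribute_quotas(total_microbatches: int, survivors: list[int]) -> dict[int, int]:
    """Assign per-replica microbatch quotas so the per-iteration total stays
    constant no matter how many replicas have failed (hypothetical policy:
    even split, with the remainder spread over the first survivors)."""
    n = len(survivors)
    if n == 0:
        raise RuntimeError("no surviving replicas; invariant cannot be upheld")
    base, rem = divmod(total_microbatches, n)
    return {rank: base + (1 if i < rem else 0) for i, rank in enumerate(survivors)}

# Example: 8192 microbatches per iteration, 48 of 64 replicas surviving.
quotas = redistribute_quotas(8192, survivors=list(range(48)))
assert sum(quotas.values()) == 8192  # the invariant: per-iteration work is fixed
```

Whatever split the real policy uses, the assertion at the end is the whole point: the per-iteration total never changes, only its partition across survivors.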

Core claim

ReCoVer upholds the invariant that each iteration processes a fixed number of microbatches regardless of which GPUs fail. This is realized through three decoupled layers: fault-tolerant collectives that prevent error propagation, in-step recovery that salvages intra-iteration work, and a versatile-workload policy that redistributes microbatch quotas to survivors. The design works as a drop-in layer for both 3D parallelism and HSDP. On runs of up to 512 GPUs with 256 GPUs lost across the job, the system matches the failure-free loss curve while delivering 2.23 times the effective throughput of checkpoint-restart baselines and 74.9 percent more tokens at 234 GPU-hours.

What carries the argument

The constant-microbatch invariant per iteration, maintained by fault-tolerant collectives, in-step recovery, and dynamic redistribution of microbatch quotas to survivors.
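A control-flow sketch of what the fault-tolerant collective layer does. `CommFailure`, `all_reduce`, and `shrink` are hypothetical stand-ins for ULFM-style primitives (a typed error plus a communicator shrink over survivors), not a real library API; `FakeComm` exists only so the sketch runs.

```python
class CommFailure(Exception):
    """Stands in for a ULFM-style typed error raised on rank loss (hypothetical)."""

def guarded_all_reduce(comm, grad_buffer):
    """Sketch of a fault-tolerant all-reduce: a failure never aborts the job;
    the communicator is repaired over surviving ranks and the reduction retried."""
    while True:
        try:
            return comm.all_reduce(grad_buffer)
        except CommFailure:
            comm = comm.shrink()  # rebuild the communicator over survivors only

class FakeComm:
    """Toy communicator for the sketch: fails once, succeeds after a shrink."""
    def __init__(self, healthy=False):
        self.healthy = healthy
    def all_reduce(self, buf):
        if not self.healthy:
            raise CommFailure("rank loss detected during collective")
        return sum(buf)
    def shrink(self):
        return FakeComm(healthy=True)

assert guarded_all_reduce(FakeComm(), [1.0, 2.0, 3.0]) == 6.0
```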

Load-bearing premise

Keeping the microbatch count fixed across surviving GPUs produces gradients whose statistical properties remain identical to a failure-free run and do not accumulate bias or divergence over long training.
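In the plain data-parallel averaging case, this premise reduces to a partition-invariance fact that is easy to check: averaging the same G microbatch gradients gives the same result however they are split across workers, exact up to floating-point reduction order. A minimal check with synthetic gradients (the shapes and worker counts are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
G, D = 128, 1000                     # microbatches per iteration, parameter count
grads = rng.normal(size=(G, D))      # synthetic per-microbatch gradients

def global_grad(partition):
    """Average the same G gradients under an arbitrary worker partition:
    each worker sums its quota, an all-reduce sums the workers, divide by G."""
    return sum(grads[idx].sum(axis=0) for idx in partition) / G

eight_workers = np.array_split(np.arange(G), 8)  # failure-free: 8 workers x 16
five_workers = np.array_split(np.arange(G), 5)   # after failures: 5 survivors

assert np.allclose(global_grad(eight_workers), global_grad(five_workers))
```

What this check does not cover is the data-pipeline side: both runs must compute gradients from the same samples, which is exactly the referee's first major comment below.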

What would settle it

Run an identical pre-training job once with ReCoVer under injected failures and once without failures; if the loss curves or downstream metrics diverge beyond normal run-to-run variance, the stochastic-equivalence claim is false.
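A hedged sketch of how that comparison could be scored, assuming per-iteration loss logs from the injected-failure run and several failure-free seeds; the z-threshold and the seed-envelope construction are illustrative choices, not a prescribed protocol:

```python
import numpy as np

def divergence_fraction(loss_recover, loss_refs, z=3.0):
    """loss_recover: (T,) losses from the run with injected failures.
    loss_refs: (S, T) losses from S failure-free seeds (run-to-run variance).
    Returns the fraction of iterations falling outside the seed envelope."""
    mu = loss_refs.mean(axis=0)
    sigma = loss_refs.std(axis=0, ddof=1)
    outside = np.abs(loss_recover - mu) > z * np.maximum(sigma, 1e-8)
    return float(outside.mean())

# Under stochastic equivalence this fraction should stay near the threshold's
# false-alarm rate; a large value falsifies the claim.
```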

Figures

Figures reproduced from arXiv: 2605.11215 by Avinash Maurya, Bogdan Nicolae, Franck Cappello, Hui Zhou, Paul Hovland, Ruijie Zhang, Sheng Di, Zhengyang Wang, Zheng Zhang, Ziyue Liu.

Figure 1. Comparison of a classical synchronous iteration (left) and …

Figure 2. Flowchart of a RECOVER iteration and how its three-layer protocol interacts. The bottom layer checks communicator health before each all-reduce; upon failure, it repairs the communicator over surviving ranks and either early-returns or performs a guarded reduction, ensuring no fatal errors. Algorithm 1 (a RECOVER iteration) requires a policy P, a replica role ρ, a policy-assigned per-role workload P(ρ), and a target total workload …

Figure 3. RECOVER bottom layer: why ULFM is a good foundation. User-Level Failure Mitigation (ULFM) [19] extends MPI with the minimal semantics these primitives need: no MPI call blocks indefinitely after a failure; it either succeeds or returns a typed error. Unlike NCCL or conventional MPI, where any rank loss aborts the job, ULFM exposes failures at the communicator level and leaves the recovery path to the application …

Figure 4. RECOVER middle layer: why fine-grained recovery matters at scale. A single LLM pre-training iteration spans many microbatches, yielding global batches of hundreds to thousands of millions of tokens that scale with total training tokens [41]. This regime is not only viable but desirable, as it amortizes latency-bound cross-replica communication and improves GPU utilization. Consequently, one iteration may last …

Figure 5. Versatile workload across two failures. (i) Pre-failure: all replicas are major. (ii) First failure …

Figure 6. RECOVER integration with model parallelism. A realistic LLM pre-training system distributes one replica across many devices, each holding a distinct shard of parameters and gradients. RECOVER naturally lifts onto this setting: every intra-replica rank fires the cross-replica all-reduce and runs the protocol in lockstep. Therefore, RECOVER is agnostic to replica internals and versatile across parallelism schemes …

Figure 7. Trajectory preservation under 256 GPU losses on a 512-GPU 3D-parallelism run. (a) The RECOVER-3D training loss curve matches the failure-free NCCL reference, with no spikes or measurable deviation. (b) RECOVER-3D improves per-GPU utilization across the failures thanks to the versatile workload, thus surpassing the failure-free reference. Grad-accum factor $G_{\text{init}} = 128$ given a global batch of $B = W_{\text{init}} \cdot G_{\text{init}} = 8192$ microbatches …

Figure 8. Cost comparison between RECOVER-3D and restart-from-checkpoint. (a) Effective throughput across successive failures; RECOVER-3D keeps increasing and widens the gap over the baseline as the growing per-GPU workload amortizes the cross-replica all-reduce cost. (b) Cumulative training progress in tokens vs. GPU-hours; RECOVER-3D processes 74.9% more tokens at 234 GPU-hours. (c) Single-failure raw wall-clock breakdown …

Figure 9. Trajectory preservation under 256 GPU losses on a 512-GPU HSDP run. (a) The RECOVER-HSDP training loss curve matches the failure-free NCCL reference throughout the run, with no spikes or measurable deviation. (b) The corresponding effective throughput: RECOVER-HSDP matches NCCL closely, and surpasses it after successive failures as RECOVER increases per-iteration workload and improves per-GPU utilization. …

Figure 10. Cost comparison between RECOVER-HSDP and restart-from-checkpoint. (a) Effective throughput across successive failures; RECOVER-HSDP is consistently higher than the checkpoint baseline. (b) Cumulative training progress in tokens vs. GPU-hours; RECOVER-HSDP processes 47.4% more tokens at 1338 GPU-hours. (c) Single-failure raw wall-clock breakdown swept over checkpoint interval N. RECOVER wins even at the baseline's …
Original abstract

Pre-training large language models on massive GPU clusters has made hardware faults routine rather than rare, driving the need for resilient training systems. Yet existing frameworks either focus on specific parallelism schemes or risk drifting away from a failure-free training trajectory. We propose ReCoVer, a resilient LLM pre-training system that upholds a single invariant: each iteration keeps the number of microbatches constant, ensuring per-iteration gradients remain stochastically equivalent to a failure-free run. The framework is organized as three decoupled protocol layers: (1) Fault-tolerant collectives that isolate faults from propagating across replicas; (2) in-step fine-grained recovery that preserves intra-iteration progress and prevents gradient corruption; (3) versatile-workload policy that dynamically redistributes microbatch quotas across the survivors. The design is parallelism-agnostic, integrating directly with both 3D parallelism and Hybrid Sharded Data Parallel (HSDP) as a drop-in substrate. We evaluate our implementation on end-to-end pre-training tasks for up to 512 GPUs, ReCoVer successfully preserves the training trajectory from a failure-free reference despite of 256 GPUs lost spread across the run. For comparison with checkpoint-and-restart baselines, ReCoVer demonstrates $2.23\times$ higher effective throughput after successive failures. This advantage results in ReCoVer processing 74.9% more tokens at 234 GPU-hours, with the gap widening as the training prolongs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents ReCoVer, a resilient LLM pre-training system that upholds the invariant of keeping the number of microbatches per iteration constant to ensure per-iteration gradients remain stochastically equivalent to a failure-free run. It consists of three layers: fault-tolerant collectives to isolate faults, in-step fine-grained recovery to preserve intra-iteration progress, and a versatile-workload policy for dynamic microbatch quota redistribution to survivors. The design is parallelism-agnostic and integrates with 3D parallelism and HSDP. End-to-end evaluations on up to 512 GPUs with up to 256 failures across the run show preservation of the training trajectory from a failure-free reference, 2.23× higher effective throughput than checkpoint-and-restart baselines, and 74.9% more tokens processed at 234 GPU-hours.

Significance. If the invariant on gradient equivalence holds under redistribution and the empirical results prove robust, ReCoVer addresses a critical practical challenge in large-scale distributed training where hardware faults are routine. The reported throughput gains and trajectory preservation could enable more efficient utilization of massive GPU clusters without sacrificing model quality, with the parallelism-agnostic drop-in design adding practical value.

major comments (2)
  1. [Abstract and Evaluation] The central claim that constant microbatch count ensures stochastically equivalent gradients (and thus a preserved training trajectory despite 256 GPU losses) is load-bearing, yet the versatile-workload policy's dynamic redistribution of quotas to survivors risks altering data selection or ordering. Without explicit mechanisms such as synchronized data sampling or checkpointed data-loader state, small discrepancies could accumulate over long training runs, undermining the equivalence invariant. The reported 2.23× throughput and 74.9% more tokens depend on this not occurring.
  2. [Evaluation] The end-to-end results on 512 GPUs report a 2.23× effective throughput gain and trajectory preservation after successive failures, but lack details on the failure-injection method, how equivalence is quantified (e.g., loss-curve comparisons or gradient statistics), and the statistical significance of the gains. This makes verification of the performance claims difficult, yet these results are load-bearing for the main empirical contribution.
minor comments (2)
  1. [Abstract] 'despite of 256 GPUs lost' is grammatically incorrect and should read 'despite 256 GPUs being lost' or 'despite the loss of 256 GPUs'.
  2. The paper would benefit from a dedicated limitations section discussing edge cases, such as faults occurring mid-iteration or interactions with specific data loaders.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the importance of the equivalence invariant and experimental details. We address each point below and will revise the manuscript to incorporate clarifications and additional information.

Point-by-point responses
  1. Referee: [Abstract and Evaluation] The central claim that constant microbatch count ensures stochastically equivalent gradients (and thus a preserved training trajectory despite 256 GPU losses) is load-bearing, yet the versatile-workload policy's dynamic redistribution of quotas to survivors risks altering data selection or ordering. Without explicit mechanisms such as synchronized data sampling or checkpointed data-loader state, small discrepancies could accumulate over long training runs, undermining the equivalence invariant. The reported 2.23× throughput and 74.9% more tokens depend on this not occurring.

    Authors: We agree that explicit mechanisms are essential to uphold the invariant and that the current manuscript would benefit from a clearer description. The versatile-workload policy redistributes microbatch quotas while preserving global data ordering through a synchronized data loader that maintains a shared iteration state and checkpoints the sampling position at the beginning of each iteration across survivors. This ensures data selection and ordering remain identical to the failure-free case. We will expand the System Design and Evaluation sections with pseudocode and a dedicated paragraph detailing this synchronization to make the safeguards explicit. revision: yes

  2. Referee: [Evaluation] The end-to-end results on 512 GPUs report a 2.23× effective throughput gain and trajectory preservation after successive failures, but lack details on the failure-injection method, how equivalence is quantified (e.g., loss-curve comparisons or gradient statistics), and the statistical significance of the gains. This makes verification of the performance claims difficult, yet these results are load-bearing for the main empirical contribution.

    Authors: We acknowledge that the Evaluation section would be strengthened by additional methodological details for reproducibility. In the revised version we will add: (1) failure injection via random process termination at specified iteration boundaries during the run; (2) equivalence quantification through side-by-side loss curves, per-iteration gradient norm comparisons, and final model perplexity; and (3) statistical significance via three independent runs with reported means and standard deviations. These elements will be integrated into the Evaluation section and the experimental setup description. revision: yes
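A minimal sketch of the safeguard described in response 1, with illustrative names throughout: a shared, checkpointed cursor into a fixed global sample order means redistribution changes only which survivor consumes a microbatch, never which microbatches the iteration consumes.

```python
class SynchronizedLoader:
    """Hypothetical sketch of a synchronized data loader: `order` is the seeded
    global permutation of sample indices; `cursor` is the shared iteration
    state checkpointed at each iteration boundary across survivors."""
    def __init__(self, order, microbatch_size):
        self.order = order
        self.mb = microbatch_size
        self.cursor = 0  # checkpoint this value at every iteration boundary

    def iteration(self, quotas):
        """Yield (rank, microbatch) by global position. Because quotas always
        sum to the same per-iteration total, every iteration consumes the same
        contiguous block of `order` regardless of which replicas failed."""
        start = self.cursor
        for rank, quota in sorted(quotas.items()):
            for _ in range(quota):
                yield rank, self.order[start:start + self.mb]
                start += self.mb
        self.cursor = start  # advance only once the full iteration is assigned
```

And a sketch of the injection method described in response 2, assuming all ranks share a seed so they agree on the victim; `os.kill` with SIGKILL models abrupt process loss at an iteration boundary, and the schedule and victim choice are illustrative:

```python
import os
import random
import signal

def maybe_inject_failure(iteration, rank, schedule, live_ranks, seed=1234):
    """At each scheduled iteration boundary, a pseudo-randomly chosen rank
    terminates its own process, modeling an abrupt GPU/host loss."""
    if iteration in schedule:
        rng = random.Random(seed * 1_000_003 + iteration)
        victim = rng.choice(sorted(live_ranks))  # every rank computes the same victim
        if rank == victim:
            os.kill(os.getpid(), signal.SIGKILL)
```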

Circularity Check

0 steps flagged

No circularity; design invariant and empirical results are independent

full rationale

The paper's core claim rests on a stated design invariant (constant microbatches per iteration) that is upheld by the three protocol layers and then validated through direct end-to-end experiments comparing against checkpoint-restart baselines on up to 512 GPUs. The paper presents no equations, fitted parameters, or self-citations that would reduce the reported throughput gains, token counts, or trajectory preservation to their inputs by construction. The stochastic-equivalence statement is an explicit assumption of the workload policy rather than a derived result that loops back on itself. This is a standard systems paper whose performance numbers are externally falsifiable via the described runs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The design rests on standard assumptions about fault detection and communication primitives in distributed GPU systems; no free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption Hardware faults can be detected and isolated without corrupting ongoing gradient computation when using specialized collectives.
    Invoked in the description of fault-tolerant collectives and in-step recovery layers.
  • domain assumption Redistributing microbatch quotas across surviving GPUs preserves the stochastic properties of the original training trajectory.
    Central to the versatile-workload policy and the equivalence invariant.

pith-pipeline@v0.9.0 · 5591 in / 1525 out tokens · 25831 ms · 2026-05-13T01:55:57.685577+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 4 internal anchors

[1] W. Bland, A. Bouteiller, T. Herault, G. Bosilca, and J. Dongarra. Post-failure recovery of MPI communication capability: Design and rationale. The International Journal of High Performance Computing Applications, 27(3):244–254, 2013.

[2] W. Bland, H. Lu, S. Seo, and P. Balaji. Lessons learned implementing user-level failure mitigation in MPICH. In 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pages 1123–1126. IEEE, 2015.

[3] A. Bouteiller, G. Bosilca, and J. J. Dongarra. Plan B: Interruption of ongoing MPI operations to support failure recovery. In Proceedings of the 22nd European MPI Users' Group Meeting, pages 1–9, 2015.

[4] S. Dash, I. R. Lyngaas, J. Yin, X. Wang, R. Egele, J. A. Ellis, M. Maiterth, G. Cong, F. Wang, and P. Balaprakash. Optimizing distributed training on Frontier for large language models. In ISC High Performance 2024 Research Paper Proceedings (39th International Conference), pages 1–11. Prometeus GmbH, 2024.

[5] A. Eisenman, K. K. Matam, S. Ingram, D. Mudigere, R. Krishnamoorthi, K. Nair, M. Smelyanskiy, and M. Annavaram. Check-N-Run: A checkpointing system for training deep learning recommendation models. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22), pages 929–943, 2022.

[6] S. Gandhi, M. Zhao, A. Skiadopoulos, and C. Kozyrakis. Recycle: Resilient training of large DNNs using pipeline adaptation. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, pages 211–228, 2024.

[7] M. Gooding. xAI targets one million GPUs for Colossus supercomputer in Memphis, 2024.

[8] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

[9] S. Hasan. Scaling Llama4 training to 100k, 2026.

[10] Q. Hu, Z. Ye, Z. Wang, G. Wang, M. Zhang, Q. Chen, P. Sun, D. Lin, X. Wang, Y. Luo, et al. Characterization of large language model development in the datacenter. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), pages 709–729, 2024.

[11] Y. Huang, Y. Cheng, A. Bapna, O. Firat, D. Chen, M. Chen, H. Lee, J. Ngiam, Q. V. Le, Y. Wu, et al. GPipe: Efficient training of giant neural networks using pipeline parallelism. Advances in Neural Information Processing Systems, 32, 2019.

[12] I. Jang, Z. Yang, Z. Zhang, X. Jin, and M. Chowdhury. Oobleck: Resilient distributed training of large models using pipeline templates. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 382–395, 2023.

[13] M. Jeon, S. Venkataraman, A. Phanishayee, J. Qian, W. Xiao, and F. Yang. Analysis of large-scale multi-tenant GPU clusters for DNN training workloads. In 2019 USENIX Annual Technical Conference (USENIX ATC 19), pages 947–960, 2019.

[14] Z. Jiang, H. Lin, Y. Zhong, Q. Huang, Y. Chen, Z. Zhang, Y. Peng, X. Li, C. Xie, S. Nong, et al. MegaScale: Scaling large language model training to more than 10,000 GPUs. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), pages 745–760, 2024.

[15] A. Kokolis, M. Kuchnik, J. Hoffman, A. Kumar, P. Malani, F. Ma, Z. DeVito, S. Sengupta, K. Saladi, and C.-J. Wu. Revisiting reliability in large-scale machine learning research clusters. In 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 1259–1274. IEEE, 2025.

[16] I. Laguna, D. F. Richards, T. Gamblin, M. Schulz, and B. R. de Supinski. Evaluating user-level fault tolerance for MPI applications. In Proceedings of the 21st European MPI Users' Group Meeting, pages 57–62, 2014.

[17] J. Lee, Z. Chen, X. He, R. Underwood, B. Nicolae, F. Cappello, X. Lu, S. Di, and Z. Zhang. Spare: Stacked parallelism with adaptive reordering for fault-tolerant LLM pretraining systems with 100k+ GPUs. arXiv preprint arXiv:2603.00357, 2026.

[18] J. Li, G. Bosilca, A. Bouteiller, and B. Nicolae. Elastic deep learning through resilient collective operations. In Proceedings of the SC'23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, pages 44–50, 2023.

[19] N. Losada, P. González, M. J. Martín, G. Bosilca, A. Bouteiller, and K. Teranishi. Fault tolerance of MPI applications in exascale systems: The ULFM solution. Future Generation Computer Systems, 106:467–481, 2020.

[20] A. Maurya, M. M. Rafique, F. Cappello, and B. Nicolae. DataStates-LLM: Scalable checkpointing for transformer models using composable state providers. arXiv preprint arXiv:2601.16956, 2026.

[21] A. Maurya, R. Underwood, M. M. Rafique, F. Cappello, and B. Nicolae. DataStates-LLM: Lazy asynchronous checkpointing for large language models. In Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, pages 227–239, 2024.

[22] J. Mohan, A. Phanishayee, and V. Chidambaram. CheckFreq: Frequent, fine-grained DNN checkpointing. In 19th USENIX Conference on File and Storage Technologies (FAST 21), pages 203–216, 2021.

[23] D. Narayanan, A. Harlap, A. Phanishayee, V. Seshadri, N. R. Devanur, G. R. Ganger, P. B. Gibbons, and M. Zaharia. PipeDream: Generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 1–15, 2019.

[24] D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro, et al. Efficient large-scale language model training on GPU clusters using Megatron-LM. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–15, 2021.

[25] B. Nicolae, A. Moody, E. Gonsiorowski, K. Mohror, and F. Cappello. VeloC: Towards high performance adaptive asynchronous checkpointing at large scale. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 911–920. IEEE, 2019.

[26] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.

[27] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.

[28] S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He. ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020.

[29] O. Salpekar, R. Varma, K. Yu, V. Ivanov, Y. Wang, A. Sharif, M. Si, S. Xu, F. Tian, S. Zheng, et al. Training LLMs with fault tolerant HSDP on 100,000 GPUs. arXiv preprint arXiv:2602.00277, 2026.

[30] A. Sergeev and M. Del Balso. Horovod: Fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799, 2018.

[31] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.

[32] N. Tazi, F. Mom, H. Zhao, P. Nguyen, M. Mekkouri, L. Werra, and T. Wolf. The ultra-scale playbook: Training LLMs on GPU clusters, 2025. URL: https://huggingface.co/spaces/nanotron/ultrascaleplaybook.

[33] J. Thorpe, P. Zhao, J. Eyolfson, Y. Qiao, Z. Jia, M. Zhang, R. Netravali, and G. H. Xu. Bamboo: Making preemptible instances resilient for affordable training of large DNNs. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), pages 497–513, 2023.

[34] B. Wan, M. Han, Y. Sheng, Y. Peng, H. Lin, M. Zhang, Z. Lai, M. Yu, J. Zhang, Z. Song, et al. ByteCheckpoint: A unified checkpointing system for large foundation model development. In 22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25), pages 559–578, 2025.

[35] B. Wan, G. Liu, Z. Song, J. Wang, Y. Zhang, G. Sheng, S. Wang, H. Wei, C. Wang, W. Lou, et al. Robust LLM training infrastructure at ByteDance. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, pages 186–203, 2025.

[36] W. Wang, N. Yu, S. Xiong, and Z. Liu. Reliable and resilient collective communication library for LLM training and serving. arXiv preprint arXiv:2512.25059, 2025.

[37] Z. Wang, Z. Jia, S. Zheng, Z. Zhang, X. Fu, T. E. Ng, and Y. Wang. Gemini: Fast failure recovery in distributed training with in-memory checkpoints. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 364–381, 2023.

[38] Z. Wang, Z. Liu, R. Zhang, A. Maurya, P. Hovland, B. Nicolae, F. Cappello, and Z. Zhang. Boost: Bottleneck-optimized scalable training framework for low-rank large language models. arXiv preprint arXiv:2512.12131, 2025.

[39] Y. Xiong, Y. Jiang, Z. Yang, L. Qu, G. Zhao, S. Liu, D. Zhong, B. Pinzur, J. Zhang, Y. Wang, et al. SuperBench: Improving cloud AI infrastructure reliability with proactive validation. In 2024 USENIX Annual Technical Conference (USENIX ATC 24), pages 835–850, 2024.

[40] H. Zhang, D. Morwani, N. Vyas, J. Wu, D. Zou, U. Ghai, D. Foster, and S. Kakade. How does critical batch size scale in pre-training? arXiv preprint arXiv:2410.21676, 2024.

[41] Y. Zhao, A. Gu, R. Varma, L. Luo, C.-C. Huang, M. Xu, L. Wright, H. Shojanazeri, M. Ott, S. Shleifer, et al. PyTorch FSDP: Experiences on scaling Fully Sharded Data Parallel. arXiv preprint arXiv:2304.11277, 2023.