
arxiv: 2607.01646 · v1 · pith:QUUXD2QVnew · submitted 2026-07-02 · 💻 cs.LG · cs.DC

DeadPool: Resilient LLM Training with Hot-Swapping via Zero-Overhead Checkpoint

Pith reviewed 2026-07-03 17:24 UTC · model grok-4.3

classification 💻 cs.LG cs.DC
keywords LLM training · fault tolerance · checkpointing · hot-swapping · distributed training · GPU failures · resilient systems

The pith

DeadPool lets LLM training jobs survive permanent node failures by hot-swapping spares using zero-overhead in-memory checkpoints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DeadPool, a fault-tolerance system for large-scale LLM training that replaces failed nodes with spares at runtime instead of restarting the entire job. It does this with in-memory checkpoints that provide spatial redundancy, are written entirely off the critical path, and overlap completely with normal computation, together with a protocol that rebuilds the communicator group on the fly. This matters because training runs that occupy tens of thousands of GPUs for months hit hardware and software failures often enough that restart costs become material, and prior methods force a choice between slowing down failure-free execution and accepting long recovery times. If the approach works, training throughput is unchanged until a failure occurs, and training then resumes after the swap with only minimal recomputation. The evaluation reports that both properties hold at up to 512 NVIDIA A100 GPUs and models of up to 65B parameters, with recovery completing in under 40 seconds.

Core claim

DeadPool restores LLM training via hot-swapping of failed nodes with spare nodes without terminating the job, enabled by an off-critical-path in-memory checkpointing mechanism that supplies spatial redundancy and a communicator reconstruction protocol that operates at runtime; the checkpointing overlaps fully with computation so that error-free execution incurs zero overhead, and upon permanent failure the system rebuilds memory states from the in-memory copies with only minimal recomputation.
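
To make the overlap idea concrete, here is a minimal Python sketch of ping-pong double buffering, the scheme named in the Figure 5 caption. It is not the paper's implementation: the class `PingPongCheckpointer`, the `train` loop, and the use of a Python thread as the offload engine are illustrative assumptions, and the synchronous list copy only stands in for a device-to-host staging step.

```python
import threading


class PingPongCheckpointer:
    """Two host-side slots; writes alternate so one complete copy always exists."""

    def __init__(self):
        self._buffers = [None, None]  # each slot holds (step, state_copy)
        self._valid = None            # index of the newest fully written slot
        self._lock = threading.Lock()
        self._worker = None

    def wait(self):
        # Block until the in-flight offload (if any) has finished.
        if self._worker is not None:
            self._worker.join()

    def offload_async(self, step, state):
        self.wait()  # at most one write in flight
        with self._lock:
            target = 0 if self._valid != 0 else 1  # never overwrite the valid slot
        snapshot = list(state)  # assumption: stand-in for a staging copy

        def write():
            self._buffers[target] = (step, snapshot)
            with self._lock:
                self._valid = target  # publish only after the write completes

        self._worker = threading.Thread(target=write)
        self._worker.start()

    def latest(self):
        with self._lock:
            return None if self._valid is None else self._buffers[self._valid]


def train(num_steps, ckpt):
    state = [0.0, 0.0]
    for step in range(1, num_steps + 1):
        state = [w + 1.0 for w in state]  # stand-in for forward/backward/update
        ckpt.offload_async(step, state)   # background write overlaps the next step
    ckpt.wait()
    return state
```

Because the writer only ever targets the slot that is not currently marked valid, a failure in the middle of an offload still leaves the previous step's copy intact. Whether the staging copy itself can be hidden on real hardware is exactly what the paper's overhead measurements are meant to establish.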

What carries the argument

Off-critical-path in-memory checkpointing for spatial redundancy together with the runtime communicator reconstruction protocol that performs the hot-swap.
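
Spatial redundancy can be pictured as each node keeping a copy of some other node's state in host memory. The abstract does not say where DeadPool places its replicas, so the ring policy below (node i's copy lives on node i+1) and all function names are assumptions chosen only to illustrate the idea.

```python
def replica_holder(node, num_nodes):
    """Assumed ring policy: node i's replica is held by node (i + 1) mod N."""
    return (node + 1) % num_nodes


def place_replicas(states):
    """states[i] is node i's training state; returns what each node holds for a peer."""
    n = len(states)
    if n < 2:
        raise ValueError("spatial redundancy needs at least two nodes")
    held = [None] * n
    for node, state in enumerate(states):
        held[replica_holder(node, n)] = (node, list(state))
    return held


def recover(failed_node, held):
    """Fetch the failed node's state from the surviving peer that holds its replica."""
    owner, state = held[replica_holder(failed_node, len(held))]
    if owner != failed_node:
        raise RuntimeError("replica table is inconsistent")
    return state
```

With four nodes, the state of node 3 is held by node 0, so losing node 3 alone leaves its state recoverable. Under this particular policy, losing two adjacent nodes at once would not be survivable, which is one reason the actual placement policy matters.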

If this is right

  • LLM training jobs continue after a node failure instead of requiring a full restart.
  • Error-free runs incur no measurable slowdown from the added fault-tolerance machinery.
  • Recovery from permanent node loss completes in under 40 seconds even at hundreds of GPUs.
  • The same checkpoint data that enables fast recovery also avoids recomputing large portions of prior work.
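
The recomputation point can be put into rough numbers. The sketch below assumes a failure lands uniformly at random inside a checkpoint interval, so the expected replay is half an interval. That model and the function names are assumptions, and every input other than the paper's 40-second recovery bound is hypothetical.

```python
def expected_replay_seconds(iter_seconds, ckpt_interval_iters):
    """Expected recomputation when a failure is uniform within a checkpoint interval."""
    return iter_seconds * ckpt_interval_iters / 2.0


def expected_downtime_seconds(iter_seconds, ckpt_interval_iters, recovery_seconds):
    """Recovery latency plus the expected work that must be replayed."""
    return recovery_seconds + expected_replay_seconds(iter_seconds, ckpt_interval_iters)


# Hypothetical job: 10 s iterations, a checkpoint every 100 iterations,
# and a 600 s full restart.
conventional = expected_downtime_seconds(10.0, 100, 600.0)

# Per-iteration in-memory checkpoints with the reported 40 s recovery bound.
hot_swap = expected_downtime_seconds(10.0, 1, 40.0)
```

Under these made-up inputs the conventional path loses 1100 seconds per failure and the hot-swap path 45 seconds. The point is the shape of the comparison, not the specific figures.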

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The design could extend to other collective-communication workloads that already maintain spare capacity.
  • If spare-node availability proves unreliable in practice, the recovery guarantee would need an alternative fallback path.
  • The zero-overhead property could be verified by comparing iteration times with and without the checkpointing threads active on identical hardware.
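
The verification suggested in the last bullet could be scripted along the following lines. This is a sketch, not the paper's methodology: it assumes per-iteration wall-clock times have already been collected on identical hardware, and the 1% tolerance is an arbitrary placeholder for the measurement noise floor.

```python
import statistics


def relative_overhead(baseline_times, checkpoint_times):
    """Relative change in median iteration time when checkpointing is enabled."""
    base = statistics.median(baseline_times)
    with_ckpt = statistics.median(checkpoint_times)
    return (with_ckpt - base) / base


def within_noise(baseline_times, checkpoint_times, tolerance=0.01):
    """True if the overhead is indistinguishable from zero at the given tolerance."""
    return abs(relative_overhead(baseline_times, checkpoint_times)) <= tolerance
```

Medians are used instead of means so that a few straggler iterations do not dominate the comparison. A real study would also report variance across repeated runs.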

Load-bearing premise

The checkpointing work can be placed completely off the critical path so that it truly adds zero cost to normal execution, and spare nodes plus the reconstruction protocol can be assumed available without creating new failure modes or added latency.

What would settle it

A measurement that shows the in-memory checkpointing step measurably slows the forward or backward pass on the same hardware, or a failure scenario where the hot-swap recovery exceeds 40 seconds at the 512-GPU scale.

Figures

Figures reproduced from arXiv: 2607.01646 by Haotian Xie, Junlin Chen, Lishan Yang, Mingkai Zheng, Zhao Zhang.

Figure 1: Conceptual landscape of existing fault-tolerance strate…
Figure 2: System overview of DeadPool. The normal path (top) runs a two-phase offload pipeline overlapped with each training step. The failure path (bottom) detects faults, replaces the failed node, and reconstructs state from in-memory replicas.
Figure 3: Timeline comparison of conventional checkpoint re…
Figure 4: Illustration of the asynchronous offload pipeline across…
Figure 5: Ping-pong double buffering across iterations. Two…
Figure 8: Per-iteration training time for 7B, 21B, and 65B models…
Figure 9: Per-iteration communication breakdown of D…
Figure 10: Breakdown of recovery latency across model scales…
Figure 11: Breakdown of recovery latency and end-to-end recov…
Original abstract

State-of-the-art large language model (LLM) training takes tens of thousands of graphics processing units (GPUs) for months and encounters failures across the software and hardware stack. Existing fault-tolerance mechanisms either impose non-trivial overhead during failure-free execution or suffer from prolonged recovery latency, particularly under scenarios where a small subset of compute nodes experience permanent failures. We present DeadPool to simultaneously address both optimization objectives. DeadPool incorporates a fault-tolerance mechanism that restores LLM training via hot-swapping, namely by replacing failed nodes with spare nodes without terminating the complete job. The hot-swapping of DeadPool is enabled by two ideas: First, it exploits an off-critical-path in-memory checkpointing mechanism for spatial redundancy. Second, it introduces a communicator reconstruction protocol that replaces failed nodes with spare nodes at runtime. DeadPool efficiently overlaps the in-memory checkpointing with computation, thus introducing zero overhead during error-free execution. Upon permanent node failures, DeadPool can rebuild memory states with minimal recomputation by leveraging in-memory checkpoints. We evaluate DeadPool across scales (up to 512 NVIDIA A100 GPUs) and LLMs (up to 65B parameters), and observe zero checkpoint overhead with hot-swapping recovery completing in under 40 seconds. These results show that DeadPool simultaneously achieves both zero-overhead error-free execution and extremely low recovery cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The manuscript presents DeadPool, a fault-tolerance mechanism for large-scale LLM training. It achieves zero overhead during error-free execution via an off-critical-path in-memory checkpointing scheme that overlaps with computation, and enables hot-swapping recovery upon permanent node failures through a communicator reconstruction protocol that replaces failed nodes with spares while leveraging the checkpoints for minimal recomputation. The system is evaluated at scales up to 512 NVIDIA A100 GPUs and models up to 65B parameters, with reported results of zero checkpoint overhead and recovery completing in under 40 seconds.

Significance. If the reported measurements hold, the work is significant because it simultaneously eliminates the failure-free overhead that typically accompanies checkpointing and delivers extremely low recovery latency, addressing a central practical barrier to reliable long-running distributed training jobs. The direct empirical support at substantial scale (512 GPUs, 65B parameters) constitutes independent evidence for the overlap and reconstruction mechanisms.

minor comments (1)
  1. Abstract: the largest-scale configuration (exact model size, number of nodes, and failure injection details) could be stated more precisely to allow readers to immediately assess the scope of the zero-overhead and sub-40 s claims.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of DeadPool, the recognition of its significance for large-scale LLM training, and the recommendation for minor revision. No major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity

full rationale

This is a systems paper presenting a fault-tolerance mechanism for LLM training. It contains no mathematical derivations, equations, fitted parameters, or predictions derived from models. All claims rest on described design choices (off-critical-path checkpointing, communicator reconstruction) and direct empirical measurements (zero overhead observed, <40s recovery on up to 512 GPUs). No self-citation chains, self-definitional steps, or renamings of known results appear. The evaluation constitutes independent evidence outside any internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Based on the abstract alone, the central claim rests on domain assumptions about cluster hardware availability and the feasibility of off-path checkpointing; no free parameters or new invented entities beyond the described protocol are identifiable from the provided text.

axioms (1)
  • Domain assumption: spare compute nodes are available and the network allows runtime communicator reconstruction without job termination.
    Required for the hot-swapping mechanism to function as described.
invented entities (1)
  • Communicator reconstruction protocol (no independent evidence)
    Purpose: enables replacement of failed nodes at runtime while preserving training state via in-memory checkpoints.
    New protocol introduced to support hot-swapping.
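
The bookkeeping side of a hot-swap can be sketched as a rank-table rewrite: every rank that lived on the failed node is reassigned to a spare, and all other ranks keep their identity. The abstract does not describe the actual protocol, so the function `rebuild_rank_map`, the dictionary representation, and the "take the first spare" policy are assumptions made for illustration only.

```python
def rebuild_rank_map(rank_to_node, failed_node, spare_nodes):
    """Return (new rank table, remaining spares) after swapping out one failed node.

    rank_to_node maps a global rank to the name of the node hosting it.
    """
    if failed_node not in rank_to_node.values():
        raise ValueError(f"{failed_node!r} hosts no ranks")
    if not spare_nodes:
        # The axiom above: without a spare, hot-swapping cannot proceed.
        raise RuntimeError("no spare node available")
    replacement = spare_nodes[0]
    new_map = {
        rank: (replacement if node == failed_node else node)
        for rank, node in rank_to_node.items()
    }
    return new_map, list(spare_nodes[1:])
```

Keeping rank numbers stable means the surviving processes do not need to change their view of the parallelism layout. The explicit error when no spare exists marks the point where a fallback path, such as a conventional restart, would have to take over.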


