TierCheck: Tiered Checkpointing for Fault Tolerance in Large Language Model Training

Feng Jiang; Lichen Pan; Nannan Zhao; Patrick P. C. Lee; Shujie Han; Xiaonan Zhao; Xiao Zhang; Zhijie Huang

arxiv: 2605.17821 · v1 · pith:ULXCJOW2new · submitted 2026-05-18 · 💻 cs.DC · cs.AI

Shujie Han, Feng Jiang, Patrick P. C. Lee, Xiao Zhang, Zhijie Huang, Nannan Zhao, Xiaonan Zhao, Lichen Pan

Pith reviewed 2026-05-20 01:25 UTC · model grok-4.3

classification 💻 cs.DC cs.AI

keywords checkpointing, fault tolerance, large language models, tiered storage, distributed training, recovery, LLM training

The pith

TierCheck uses a three-tier checkpointing design to deliver end-to-end checkpointing times under 10 seconds and low training overhead for large language models of up to 40 billion parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TierCheck, a system that places checkpoints according to the failure types seen in distributed LLM training. Lightweight differential checkpoints stay in local and peer memory for quick localized recovery, while full base checkpoints move asynchronously to remote storage for durability. The design is claimed to keep strict consistency across these tiers without pausing the training run. The reported evaluations indicate support for frequent checkpointing and fast cluster-aware restoration with little training slowdown.

Core claim

TierCheck aligns storage tiers with failure heterogeneity by keeping lightweight differential checkpoints in local and peer memory for fast localized recovery and asynchronously migrating heavyweight base checkpoints to remote persistent storage, all while preserving strict global consistency across tiers without stalling the training process or introducing recovery errors.

What carries the argument

The three-tier design that stores lightweight differential checkpoints in local and peer memory for rapid recovery and moves base checkpoints asynchronously to remote storage while enforcing global consistency.

If this is right

Training runs can checkpoint at high frequency, with end-to-end checkpointing times under 10 seconds.
Localized failures can be recovered from memory without touching remote storage.
Overall training overhead from checkpointing stays low even at cluster scale.
Cluster-aware restoration becomes possible after widespread outages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The tiered placement could apply to other long-running distributed workloads that face mixed failure rates.
Asynchronous migration may lower storage costs in environments where remote capacity is priced differently from memory.
Further scaling tests beyond 40 billion parameters would show whether memory-tier capacity becomes a bottleneck.
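
The capacity question in the last point can be made concrete with back-of-the-envelope arithmetic. The sketch below is an illustration only: the figure of 12 bytes per parameter (fp32 master weights plus two fp32 Adam moments) and the rank count of 128 are assumptions chosen for the example, not numbers taken from the paper, and differential checkpoints would add to the total.

```python
# Rough estimate of base-checkpoint size. The 12 bytes/parameter default
# (fp32 weights + two fp32 Adam moments) is an assumption for illustration.
GIB = 2 ** 30


def checkpoint_bytes(num_params: int, bytes_per_param: int = 12) -> int:
    """Total bytes of one full (base) checkpoint."""
    return num_params * bytes_per_param


def per_rank_gib(num_params: int, num_ranks: int, bytes_per_param: int = 12) -> float:
    """GiB each rank must hold if the checkpoint is sharded evenly."""
    return checkpoint_bytes(num_params, bytes_per_param) / num_ranks / GIB


if __name__ == "__main__":
    total = checkpoint_bytes(40_000_000_000)
    print(f"total: {total / 1e9:.0f} GB")
    print(f"per rank (128 ranks, hypothetical): {per_rank_gib(40_000_000_000, 128):.2f} GiB")
```

Under these assumptions a 40-billion-parameter base checkpoint is about 480 GB in total. Keeping a peer copy in memory roughly doubles the per-rank footprint, which is why the memory tier is the natural place to look for a scaling limit.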

Load-bearing premise

The system can keep strict global consistency across memory tiers and remote storage during asynchronous migration without stalling training or creating errors on recovery.
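
One common way to meet this kind of premise is to snapshot the state, write it in the background, and publish it atomically only once it is durable. The sketch below shows that pattern; it is not TierCheck's implementation. The function name `migrate_base_async`, the JSON file layout, and the use of a local directory as a stand-in for remote storage are all hypothetical.

```python
import copy
import json
import os
import tempfile
import threading


def migrate_base_async(state: dict, version: int, remote_dir: str) -> threading.Thread:
    """Persist a base checkpoint in the background (illustrative sketch)."""
    # Snapshot first, so later training updates cannot leak into the checkpoint.
    snapshot = copy.deepcopy(state)

    def _write() -> None:
        final_path = os.path.join(remote_dir, f"base_v{version}.json")
        tmp_path = final_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(snapshot, f)
            f.flush()
            os.fsync(f.fileno())  # make the bytes durable before publishing
        # Atomic rename: readers see either no file or a complete file.
        os.replace(tmp_path, final_path)

    worker = threading.Thread(target=_write)
    worker.start()
    return worker


if __name__ == "__main__":
    target = tempfile.mkdtemp()
    model_state = {"w": [1.0, 2.0], "step": 5}
    worker = migrate_base_async(model_state, 5, target)
    model_state["w"][0] = 99.0  # training continues while the write is in flight
    worker.join()
    print(os.listdir(target))
```

In this sketch the only foreground cost is the snapshot copy. A failure during the background write leaves at most a `.tmp` file behind, which a recovery path can safely ignore.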

What would settle it

A run on a 40-billion-parameter model in which a failure during asynchronous checkpoint migration produces an inconsistent state or a recovery error that the system cannot resolve.

Figures

Figures reproduced from arXiv: 2605.17821 by Feng Jiang, Lichen Pan, Nannan Zhao, Patrick P. C. Lee, Shujie Han, Xiaonan Zhao, Xiao Zhang, Zhijie Huang.

**Figure 1.** Architectural overview of TierCheck. Adjacent text captured from the source: "… after a failure. Its key techniques include: (i) global consensus on latest checkpoint, which identifies the latest globally recoverable version across ranks; (ii) cluster-aware checkpoint loading, which maps checkpoint shards to the current cluster topology and avoids unnecessary data movement; and (iii) fused multi-step differential checkpoint replaying, which recon…"
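
Technique (i) can be pictured as a set intersection across ranks. The sketch below makes one assumption: that each rank can report the set of checkpoint versions it is able to restore. The function name and the data layout are hypothetical; the paper's actual consensus protocol is not shown on this page.

```python
from typing import Dict, Optional, Set


def latest_globally_recoverable(versions_by_rank: Dict[int, Set[int]]) -> Optional[int]:
    """Return the newest version that every rank can restore, or None."""
    if not versions_by_rank:
        return None
    # A version is globally recoverable only if all ranks can restore it.
    common = set.intersection(*versions_by_rank.values())
    return max(common) if common else None


if __name__ == "__main__":
    reports = {0: {3, 4, 5}, 1: {4, 5, 6}, 2: {2, 4}}
    print(latest_globally_recoverable(reports))
```

Rank 1 holds version 6, but neither of the other ranks does, so the cluster has to settle on an older version that all three share.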

**Figure 2.** Adaptive cross-tier transmission without stalling foreground training. Adjacent text captured from the source (§3.3, Checkpoint Retrieval): "Design goal. The retrieval path must prioritize both recovery speed and state consistency. Upon failure, TierCheck must identify the latest globally consistent state before resuming from the fastest surviving storage tier. To quantify checkpoint freshness, TierCheck defines a checkpoint version by its corres…"

**Figure 3.** Availability of base-checkpoint anchors across different storage tiers under heterogeneous failures. Adjacent text captured from the source: "Local-anchor recovery (S1 = 1). This category covers states (1, 0, 0), (1, 1, 0), (1, 0, 1), and (1, 1, 1). When a software failure terminates training while the host node remains healthy, TierCheck restores the base checkpoint directly from the local volatile copy, bypassing slower tiers entirely. The …"
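
The state triples in this excerpt can be read as availability flags per tier. The sketch below assumes the order (S1, S2, S3) corresponds to local memory, peer memory, and remote storage, and that recovery falls through to the next tier when a faster one is lost. Only the local-anchor case is confirmed by the excerpt; the peer and remote labels are hypothetical.

```python
from typing import Optional, Tuple

# Assumed tier order for (S1, S2, S3): fastest first.
TIERS = ("local", "peer", "remote")


def anchor_tier(state: Tuple[int, int, int]) -> Optional[str]:
    """Pick the fastest tier that still holds a base-checkpoint anchor."""
    for tier, available in zip(TIERS, state):
        if available:
            return tier
    return None  # no surviving anchor: the base checkpoint is lost


if __name__ == "__main__":
    for state in [(1, 0, 0), (1, 1, 1), (0, 1, 1), (0, 0, 1), (0, 0, 0)]:
        print(state, "->", anchor_tier(state))
```

All four states with S1 = 1 resolve to the local copy, which matches the category listed in the excerpt.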

**Figure 5.** Comparison of naive and watermark-driven global reclamation. Adjacent text captured from the source: "… newest locally visible state is incorrect under heterogeneous failures, as local recency provides no guarantee of cluster-wide redundancy. Reclamation must thus be governed by global recoverability rather than per-node progress. Watermark-driven global reclamation. TierCheck resolves this synchronization challenge by deferring garbage collectio…"
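
The contrast between per-node and global reclamation can be sketched in a few lines. The excerpt is truncated before it defines the watermark, so the definition used here (the minimum, across ranks, of the newest version each rank reports as recoverable) is an assumption, and both function names are hypothetical.

```python
from typing import Dict, Iterable, List


def global_watermark(newest_recoverable_by_rank: Dict[int, int]) -> int:
    """Assumed watermark: the newest version that every rank has reached."""
    return min(newest_recoverable_by_rank.values())


def reclaimable(local_versions: Iterable[int], watermark: int) -> List[int]:
    """Versions strictly older than the watermark can be garbage-collected."""
    return sorted(v for v in local_versions if v < watermark)


if __name__ == "__main__":
    progress = {0: 7, 1: 5, 2: 6}
    held_by_rank0 = [3, 4, 5, 6, 7]
    # Naive policy: rank 0 trusts its own progress and drops everything below 7.
    print("naive:", reclaimable(held_by_rank0, progress[0]))
    # Watermark policy: keep whatever the slowest rank may still need.
    print("watermark:", reclaimable(held_by_rank0, global_watermark(progress)))
```

Under the naive policy rank 0 would discard version 5, the newest state that rank 1 can still recover. The watermark policy keeps it.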

**Figure 4.** Comparison of native and fused multi-step differential checkpoint replaying. Adjacent text captured from the source: "To eliminate these overheads, TierCheck introduces fused multi-step differential checkpoint replaying. Rather than replaying the entire historical chain in a single monolithic pass, TierCheck streams the recovered differential checkpoints and replays them in bounded batches. Each replay batch contains N consecutive differential ch…"
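
A toy version of batched, fused replay is sketched below. It assumes each differential checkpoint is a plain additive delta per parameter, which is a simplification made for illustration; the paper's differentials most likely encode optimizer updates, and its fused operator is not reproduced here. The names `batched` and `replay_fused` are hypothetical.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List

State = Dict[str, float]


def batched(items: Iterable[State], n: int) -> Iterator[List[State]]:
    """Yield consecutive batches of at most n items from a stream."""
    it = iter(items)
    while batch := list(islice(it, n)):
        yield batch


def replay_fused(base: State, diffs: Iterable[State], n: int) -> State:
    """Replay additive differentials onto a base state, n at a time."""
    state = dict(base)
    for batch in batched(diffs, n):
        # Fuse the batch into one combined delta per parameter...
        fused: State = {}
        for diff in batch:
            for name, delta in diff.items():
                fused[name] = fused.get(name, 0.0) + delta
        # ...then touch the full state only once per batch.
        for name, delta in fused.items():
            state[name] = state.get(name, 0.0) + delta
    return state


if __name__ == "__main__":
    base = {"w": 1.0, "b": 0.0}
    diffs = [{"w": 0.5}, {"w": 0.25, "b": 1.0}, {"b": -0.5}]
    print(replay_fused(base, diffs, n=2))
```

Because the batch size bounds how many differentials are held at once, memory use during recovery stays fixed however long the chain is, while each batch still costs only one pass over the state.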

**Figure 6.** (Exp#1) Average training time per iteration. Plot text captured with this caption: legend TierCheck, Gemini, DataStates-LLM, CheckFreq; groups GPT2 20B and BERT 20B; axis "Checkpointing time (s)", 0 to 30; bar labels 3.1, 3.2, 16.7, 12.9, 13.8, 11.5, 19.6, 20.0.

**Figure 7.** (Exp#2) Checkpointing time. Plot text captured with this caption: groups GPT2 20B and BERT 20B; axis "Ckpt. frequency (iterations)", 0 to 30; bar labels 5, 5, 11, 12, 11, 11, 21, 22.

**Figure 9.** (Exp#5) Scalability of different model sizes. Plot text captured with this caption: parallelism labels "DP=16 (ZeRO-3) DP=4 PP=4 DP=4 TP=4 PP=4 TP=4 DP=4 TP/PP=2"; axis "Training time (s)", 0 to 3; legend No Ckpt., TierCheck.

**Figure 13.** (Exp#9) Storage overhead. Adjacent text captured from the source: "… it 14.4× slower than TierCheck. This performance gain is attributed to the fused operator's ability to consolidate multiple optimizer updates into a single pass, thereby mitigating redundant memory I/O and cross-rank communication synchronizations. (Exp#8) Convergence accuracy. To verify algorithmic correctness without the prohibitive cost of 40B-scale end-to-end training, we …"

Original abstract

Large Language Model (LLM) training is frequently interrupted by a heterogeneous spectrum of failures, from common GPU crashes to catastrophic cluster-wide outages. Existing checkpointing systems rely on a monolithic, single-tier storage backend, forcing a trade-off between state-saving overhead and recovery speed. We propose TierCheck, a cluster-aware tiered checkpointing system that aligns storage placement with failure heterogeneity. TierCheck adopts a three-tier design that maintains lightweight differential checkpoints in local and peer memory for fast localized recovery, while asynchronously migrating heavyweight base checkpoints to remote persistent storage. It also ensures strict global consistency across tiers without stalling training, and achieves fast cluster-aware checkpoint restoration during recovery. Evaluations on models up to 40 billion parameters show that TierCheck achieves low training overhead, reduces end-to-end checkpointing time to under 10s, and supports high-frequency checkpointing, ultimately striking an optimal balance between low-overhead persistence and fast recovery.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

TierCheck applies tiered checkpointing to LLM training with local diffs and async remote bases, but the consistency story during migration needs more proof.

The main thing to know is that this paper describes a three-tier checkpointing system meant to cut wasted compute when large LLM training jobs hit failures. It keeps lightweight differential checkpoints in local and peer memory for quick local recovery and ships full base checkpoints to remote storage in the background. The claim is that this keeps training overhead low, gets end-to-end checkpoint times under 10 seconds, and still supports frequent saves on models up to 40 billion parameters.

Referee Report

2 major / 2 minor

Summary. The paper introduces TierCheck, a tiered checkpointing system for fault tolerance in large language model training. It uses a three-tier design that keeps lightweight differential checkpoints in local and peer memory for fast localized recovery while asynchronously migrating heavyweight base checkpoints to remote persistent storage. The system claims to enforce strict global consistency across tiers without stalling training and to enable fast cluster-aware recovery. Evaluations on models up to 40 billion parameters report low training overhead, end-to-end checkpointing times under 10 seconds, and support for high-frequency checkpointing.

Significance. If the consistency guarantees and recovery performance hold under realistic failure models, TierCheck would address a practical bottleneck in large-scale distributed LLM training by aligning storage tiers with heterogeneous failure patterns. The tiered approach offers a concrete way to trade off persistence cost against recovery latency, which is relevant to current production training systems.

major comments (2)

[Abstract and §3 (Design)] The central claim that TierCheck 'ensures strict global consistency across tiers without stalling training' while performing asynchronous base-checkpoint migration is load-bearing, yet no explicit protocol (versioning, migration log, two-phase commit, or atomic hand-off) is described that would guarantee a recovering node can always obtain a consistent base+differential pair if a fault occurs mid-migration. Without this mechanism the 'strict global consistency' guarantee cannot be verified.
[§5 (Evaluation)] The reported results (models up to 40B parameters, checkpointing time <10 s, low overhead) are presented without details on the failure models injected, how global consistency was checked during or after recovery, the cluster size, or the exact experimental methodology. These omissions prevent assessment of whether the numbers support the consistency and performance claims.

minor comments (2)

[§3] Add a diagram in §3 that explicitly shows the state transitions and consistency invariants during asynchronous migration.
Clarify the notation for differential checkpoints versus base checkpoints in the text and any pseudocode.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and indicate the revisions we will make to strengthen the presentation of our consistency protocol and evaluation details.

Referee: Abstract and §3 (Design): The central claim that TierCheck 'ensures strict global consistency across tiers without stalling training' while performing asynchronous base-checkpoint migration is load-bearing, yet no explicit protocol (versioning, migration log, two-phase commit, or atomic hand-off) is described that would guarantee a recovering node can always obtain a consistent base+differential pair if a fault occurs mid-migration. Without this mechanism the 'strict global consistency' guarantee cannot be verified.

Authors: We agree that the manuscript would benefit from an explicit description of the consistency mechanism. In the revised version we will expand §3 with a new subsection that details the versioning scheme (each differential is immutably bound to a base-checkpoint version identifier), the migration log that records completion of asynchronous base persistence, and the recovery rule that selects the latest version pair for which both components are present. This protocol guarantees a consistent base+differential pair without requiring training to stall, as the hand-off occurs only after the base has been durably written. revision: yes
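
The recovery rule outlined in this response can be sketched as a lookup against the migration log. This follows the rebuttal's wording, not code from the paper. The `Differential` record, the function name, and the simplification to a single differential per recovery point (in place of a chain) are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class Differential:
    step: int          # training step the differential captures
    base_version: int  # immutable binding to the base it applies to


def select_recovery_pair(
    durable_bases: Set[int], diffs: List[Differential]
) -> Optional[Tuple[int, Optional[int]]]:
    """Pick the latest (base_version, diff_step) with both parts present."""
    # Only differentials whose base is logged as durably written are usable.
    usable = [d for d in diffs if d.base_version in durable_bases]
    if usable:
        best = max(usable, key=lambda d: d.step)
        return best.base_version, best.step
    if durable_bases:
        return max(durable_bases), None  # fall back to a bare base checkpoint
    return None


if __name__ == "__main__":
    log = {1, 2}  # base 3 is still migrating, so it is absent from the log
    diffs = [Differential(10, 1), Differential(20, 2), Differential(30, 3)]
    print(select_recovery_pair(log, diffs))
```

The differential at step 30 is newer, but its base has not been logged as durable, so the rule falls back to the pair that is complete.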
Referee: §5 (Evaluation): The reported results (models up to 40B parameters, checkpointing time <10 s, low overhead) are presented without details on the failure models injected, how global consistency was checked during or after recovery, the cluster size, or the exact experimental methodology. These omissions prevent assessment of whether the numbers support the consistency and performance claims.

Authors: We acknowledge the need for greater methodological transparency. The revised §5 will include a dedicated experimental-setup subsection that specifies the injected failure models (single-GPU crashes, node failures, and simulated cluster-wide outages), the consistency-checking procedure (post-recovery state hashing against a golden checkpoint plus cross-tier version validation), the cluster sizes used (up to 128 GPUs), and the precise measurement methodology for overhead and end-to-end times. These additions will allow readers to evaluate the reported numbers against the claimed guarantees. revision: yes
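
The post-recovery hashing check mentioned in this response could look like the sketch below. It is a minimal illustration that assumes the state fits in a JSON-serializable dictionary; a real check would hash tensor bytes shard by shard. The function name `state_digest` is hypothetical.

```python
import hashlib
import json


def state_digest(state: dict) -> str:
    """SHA-256 over a canonical (key-sorted) JSON encoding of the state."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


if __name__ == "__main__":
    golden = {"step": 5, "w": [1.0, 2.0]}
    recovered = {"w": [1.0, 2.0], "step": 5}  # same content, different key order
    corrupted = {"w": [1.0, 2.5], "step": 5}
    print(state_digest(recovered) == state_digest(golden))
    print(state_digest(corrupted) == state_digest(golden))
```

Sorting keys before hashing keeps the digest independent of dictionary ordering, so only a genuine difference in content makes the comparison fail.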

Circularity Check

0 steps flagged

No circularity: system design with external evaluations, not a derivation chain

This is an engineering system paper proposing TierCheck, a three-tier checkpointing design for LLM training that places lightweight differentials in local/peer memory and migrates base checkpoints asynchronously to remote storage while claiming strict global consistency. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text or abstract. Claims rest on reported evaluations with models up to 40B parameters rather than any reduction to the paper's own inputs by construction. The work is self-contained against external benchmarks and contains no load-bearing self-citations, ansatzes, or uniqueness theorems that collapse the central argument. Score 0 is the appropriate finding per the guidelines for non-derivational system proposals.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about failure heterogeneity and the feasibility of maintaining consistency during asynchronous tier migration; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)

domain assumption: LLM training failures form a heterogeneous spectrum from common GPU crashes to catastrophic cluster-wide outages that can be aligned with different storage tiers.
The three-tier design is motivated by and depends on this characterization of failures.
domain assumption: Asynchronous migration of base checkpoints to remote storage can occur without violating strict global consistency or stalling training.
This is required for the claimed low-overhead persistence and fast recovery.

pith-pipeline@v0.9.0 · 5711 in / 1365 out tokens · 44843 ms · 2026-05-20T01:25:28.802925+00:00 · methodology

