TierCheck: Tiered Checkpointing for Fault Tolerance in Large Language Model Training
Pith reviewed 2026-05-20 01:25 UTC · model grok-4.3
The pith
TierCheck uses a three-tier checkpointing design to deliver end-to-end checkpointing times under 10 seconds, fast recovery, and low training overhead for large language models of up to 40 billion parameters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TierCheck aligns storage tiers with failure heterogeneity. It keeps lightweight differential checkpoints in local and peer memory for fast localized recovery and asynchronously migrates heavyweight base checkpoints to remote persistent storage. Throughout, it claims to preserve strict global consistency across tiers without stalling training or introducing recovery errors.
What carries the argument
The three-tier design that stores lightweight differential checkpoints in local and peer memory for rapid recovery and moves base checkpoints asynchronously to remote storage while enforcing global consistency.
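To make the tier alignment concrete, here is a minimal sketch of how a recovery path might pick a tier from the scope of a failure. The names `FailureScope`, `Tier`, `SURVIVING_TIERS`, and `pick_recovery_tier` are hypothetical, and the mapping from failure scope to surviving tiers is an assumption drawn from the abstract's description, not the paper's actual policy.

```python
from enum import Enum


class FailureScope(Enum):
    GPU_CRASH = "gpu_crash"            # assumed: host memory on the node survives
    NODE_FAILURE = "node_failure"      # assumed: local memory lost, peers survive
    CLUSTER_OUTAGE = "cluster_outage"  # assumed: all volatile memory lost


class Tier(Enum):
    LOCAL_MEMORY = 1
    PEER_MEMORY = 2
    REMOTE_STORAGE = 3


# Hypothetical mapping: tiers that may still hold data after each failure,
# ordered from fastest to slowest.
SURVIVING_TIERS = {
    FailureScope.GPU_CRASH: [Tier.LOCAL_MEMORY, Tier.PEER_MEMORY, Tier.REMOTE_STORAGE],
    FailureScope.NODE_FAILURE: [Tier.PEER_MEMORY, Tier.REMOTE_STORAGE],
    FailureScope.CLUSTER_OUTAGE: [Tier.REMOTE_STORAGE],
}


def pick_recovery_tier(scope, available):
    """Return the fastest surviving tier that actually holds a checkpoint."""
    for tier in SURVIVING_TIERS[scope]:
        if tier in available:
            return tier
    raise RuntimeError(f"no usable checkpoint tier for {scope.value}")
```

Under these assumptions, a GPU crash is served from local memory when a copy exists there, while a cluster-wide outage always falls through to remote storage.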
If this is right
- Training runs can checkpoint at high frequency with end-to-end times under 10 seconds.
- Localized failures can be recovered from memory without touching remote storage.
- Overall training overhead from checkpointing stays low even at cluster scale.
- Cluster-aware restoration becomes possible after widespread outages.
Where Pith is reading between the lines
- The tiered placement could apply to other long-running distributed workloads that face mixed failure rates.
- Asynchronous migration may lower storage costs in environments where remote capacity is priced differently from memory.
- Further scaling tests beyond 40 billion parameters would show whether memory-tier capacity becomes a bottleneck.
Load-bearing premise
The system can keep strict global consistency across memory tiers and remote storage during asynchronous migration without stalling training or creating errors on recovery.
What would settle it
A run on a 40-billion-parameter model in which a failure during asynchronous checkpoint migration produces an inconsistent state or a recovery error that the system cannot resolve.
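One way such a run could be checked is to hash the recovered training state and compare it with the hash of a known-good reference, in the spirit of the "state hashing against a golden checkpoint" mentioned in the simulated rebuttal. The sketch below is illustrative only: `state_digest` and `recovered_state_matches` are hypothetical helpers, and the state is assumed to be a mapping from tensor names to raw bytes rather than the paper's real checkpoint format.

```python
import hashlib


def state_digest(state):
    """Hash a mapping of tensor name -> raw bytes in deterministic key order."""
    h = hashlib.sha256()
    for name in sorted(state):
        encoded = name.encode("utf-8")
        # Length prefixes keep (name, payload) boundaries unambiguous.
        h.update(len(encoded).to_bytes(8, "big"))
        h.update(encoded)
        h.update(len(state[name]).to_bytes(8, "big"))
        h.update(state[name])
    return h.hexdigest()


def recovered_state_matches(recovered, golden):
    """True when the recovered state is byte-identical to the golden one."""
    return state_digest(recovered) == state_digest(golden)
```

A mismatch after a fault injected mid-migration would be the kind of evidence this section describes. A match only shows that this particular run recovered correctly, not that the protocol is correct in general.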
Original abstract
Large Language Model (LLM) training is frequently interrupted by a heterogeneous spectrum of failures, from common GPU crashes to catastrophic cluster-wide outages. Existing checkpointing systems rely on a monolithic, single-tier storage backend, forcing a trade-off between state-saving overhead and recovery speed. We propose TierCheck, a cluster-aware tiered checkpointing system that aligns storage placement with failure heterogeneity. TierCheck adopts a three-tier design that maintains lightweight differential checkpoints in local and peer memory for fast localized recovery, while asynchronously migrating heavyweight base checkpoints to remote persistent storage. It also ensures strict global consistency across tiers without stalling training, and achieves fast cluster-aware checkpoint restoration during recovery. Evaluations on models up to 40 billion parameters show that TierCheck achieves low training overhead, reduces end-to-end checkpointing time to under 10s, and supports high-frequency checkpointing, ultimately striking an optimal balance between low-overhead persistence and fast recovery.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TierCheck, a tiered checkpointing system for fault tolerance in large language model training. It uses a three-tier design that keeps lightweight differential checkpoints in local and peer memory for fast localized recovery while asynchronously migrating heavyweight base checkpoints to remote persistent storage. The system claims to enforce strict global consistency across tiers without stalling training and to enable fast cluster-aware recovery. Evaluations on models up to 40 billion parameters report low training overhead, end-to-end checkpointing times under 10 seconds, and support for high-frequency checkpointing.
Significance. If the consistency guarantees and recovery performance hold under realistic failure models, TierCheck would address a practical bottleneck in large-scale distributed LLM training by aligning storage tiers with heterogeneous failure patterns. The tiered approach offers a concrete way to trade off persistence cost against recovery latency, which is relevant to current production training systems.
major comments (2)
- [Abstract and §3 (Design)] The central claim that TierCheck 'ensures strict global consistency across tiers without stalling training' while performing asynchronous base-checkpoint migration is load-bearing, yet no explicit protocol (versioning, migration log, two-phase commit, or atomic hand-off) is described that would guarantee a recovering node can always obtain a consistent base+differential pair if a fault occurs mid-migration. Without this mechanism, the 'strict global consistency' guarantee cannot be verified.
- [§5 (Evaluation)] The reported results (models up to 40B parameters, checkpointing time <10 s, low overhead) are presented without details on the failure models injected, how global consistency was checked during or after recovery, the cluster size, or the exact experimental methodology. These omissions prevent assessment of whether the numbers support the consistency and performance claims.
minor comments (2)
- [§3] Add a diagram in §3 that explicitly shows the state transitions and consistency invariants during asynchronous migration.
- Clarify the notation for differential checkpoints versus base checkpoints in the text and any pseudocode.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major comment below and indicate the revisions we will make to strengthen the presentation of our consistency protocol and evaluation details.
- Referee: Abstract and §3 (Design): The central claim that TierCheck 'ensures strict global consistency across tiers without stalling training' while performing asynchronous base-checkpoint migration is load-bearing, yet no explicit protocol (versioning, migration log, two-phase commit, or atomic hand-off) is described that would guarantee a recovering node can always obtain a consistent base+differential pair if a fault occurs mid-migration. Without this mechanism the 'strict global consistency' guarantee cannot be verified.
  Authors: We agree that the manuscript would benefit from an explicit description of the consistency mechanism. In the revised version we will expand §3 with a new subsection that details the versioning scheme (each differential is immutably bound to a base-checkpoint version identifier), the migration log that records completion of asynchronous base persistence, and the recovery rule that selects the latest version pair for which both components are present. This protocol guarantees a consistent base+differential pair without requiring training to stall, as the hand-off occurs only after the base has been durably written.
  Revision: yes
- Referee: §5 (Evaluation): The reported results (models up to 40B parameters, checkpointing time <10 s, low overhead) are presented without details on the failure models injected, how global consistency was checked during or after recovery, the cluster size, or the exact experimental methodology. These omissions prevent assessment of whether the numbers support the consistency and performance claims.
  Authors: We acknowledge the need for greater methodological transparency. The revised §5 will include a dedicated experimental-setup subsection that specifies the injected failure models (single-GPU crashes, node failures, and simulated cluster-wide outages), the consistency-checking procedure (post-recovery state hashing against a golden checkpoint plus cross-tier version validation), the cluster sizes used (up to 128 GPUs), and the precise measurement methodology for overhead and end-to-end times. These additions will allow readers to evaluate the reported numbers against the claimed guarantees.
  Revision: yes
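The versioning scheme, migration log, and recovery rule outlined in the first response can be sketched roughly as follows. This is an illustrative reading of the rebuttal text, not the authors' protocol: `Differential`, `CheckpointCatalog`, and all method names are hypothetical, the migration log is modeled as a plain in-memory set, and "latest" is assumed to mean the highest base version, then the highest training step.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Differential:
    base_version: int  # immutable binding to one base-checkpoint version
    step: int          # training step the differential captures


@dataclass
class CheckpointCatalog:
    # Stand-in for the migration log: versions whose base is durably written.
    durable_bases: set = field(default_factory=set)
    differentials: list = field(default_factory=list)

    def log_base_durable(self, version):
        """Record that the base for `version` finished migrating."""
        self.durable_bases.add(version)

    def add_differential(self, diff):
        self.differentials.append(diff)

    def select_recovery_pair(self):
        """Return (base_version, differential) for the latest complete pair.

        Differentials bound to a base that is still mid-migration are
        skipped, so a fault during migration falls back to an older pair.
        Returns None when no complete pair exists.
        """
        usable = [d for d in self.differentials
                  if d.base_version in self.durable_bases]
        if not usable:
            return None
        best = max(usable, key=lambda d: (d.base_version, d.step))
        return best.base_version, best
```

In this sketch, a differential bound to base version 2 is ignored until `log_base_durable(2)` has been called, which mirrors the statement that the hand-off occurs only after the base has been durably written.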
Circularity Check
No circularity: system design with external evaluations, not a derivation chain
This is an engineering system paper proposing TierCheck, a three-tier checkpointing design for LLM training that places lightweight differentials in local/peer memory and migrates base checkpoints asynchronously to remote storage while claiming strict global consistency. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text or abstract. Claims rest on reported evaluations with models up to 40B parameters rather than any reduction to the paper's own inputs by construction. The work is self-contained against external benchmarks and contains no load-bearing self-citations, ansatzes, or uniqueness theorems that collapse the central argument. Score 0 is the appropriate finding per the guidelines for non-derivational system proposals.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: LLM training failures form a heterogeneous spectrum from common GPU crashes to catastrophic cluster-wide outages that can be aligned with different storage tiers.
- domain assumption: Asynchronous migration of base checkpoints to remote storage can occur without violating strict global consistency or stalling training.