arxiv: 2605.17821 · v1 · pith:ULXCJOW2new · submitted 2026-05-18 · 💻 cs.DC · cs.AI

TierCheck: Tiered Checkpointing for Fault Tolerance in Large Language Model Training

Pith reviewed 2026-05-20 01:25 UTC · model grok-4.3

classification 💻 cs.DC cs.AI
keywords checkpointing · fault tolerance · large language models · tiered storage · distributed training · recovery · LLM training

The pith

TierCheck uses a three-tier checkpointing design to cut end-to-end checkpointing time to under 10 seconds while keeping recovery fast and training overhead low, for large language models of up to 40 billion parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TierCheck, a system that places checkpoints in storage tiers matched to the kind of failure they protect against in distributed LLM training. Lightweight differential checkpoints stay in local and peer memory for quick recovery from localized failures, while full base checkpoints move asynchronously to remote storage for durability. The design is claimed to keep strict consistency across these tiers without pausing the training run. The reported evaluations indicate that the approach supports frequent checkpointing and fast cluster-aware recovery while adding little training slowdown.

Core claim

TierCheck aligns storage tiers with failure heterogeneity by keeping lightweight differential checkpoints in local and peer memory for fast localized recovery and asynchronously migrating heavyweight base checkpoints to remote persistent storage, all while preserving strict global consistency across tiers without stalling the training process or introducing recovery errors.

What carries the argument

The three-tier design that stores lightweight differential checkpoints in local and peer memory for rapid recovery and moves base checkpoints asynchronously to remote storage while enforcing global consistency.

If this is right

  • Training runs can checkpoint at high frequency, with end-to-end checkpointing times under 10 seconds.
  • Localized failures can be recovered from memory without touching remote storage.
  • Overall training overhead from checkpointing stays low even at cluster scale.
  • Cluster-aware restoration becomes possible after widespread outages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The tiered placement could apply to other long-running distributed workloads that face mixed failure rates.
  • Asynchronous migration may lower storage costs in environments where remote capacity is priced differently from memory.
  • Further scaling tests beyond 40 billion parameters would show whether memory-tier capacity becomes a bottleneck.

Load-bearing premise

The system can keep strict global consistency across memory tiers and remote storage during asynchronous migration without stalling training or creating errors on recovery.

What would settle it

A run on a 40-billion-parameter model in which a failure during asynchronous checkpoint migration produces an inconsistent state or a recovery error that the system cannot resolve.

Figures

Figures reproduced from arXiv: 2605.17821 by Feng Jiang, Lichen Pan, Nannan Zhao, Patrick P. C. Lee, Shujie Han, Xiaonan Zhao, Xiao Zhang, Zhijie Huang.

Figure 1. Architectural overview of TierCheck.
Excerpt: … after a failure. Its key techniques include: (i) global consensus on latest checkpoint, which identifies the latest globally recoverable version across ranks; (ii) cluster-aware checkpoint loading, which maps checkpoint shards to the current cluster topology and avoids unnecessary data movement; and (iii) fused multi-step differential checkpoint replaying, which recon…
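The "global consensus on latest checkpoint" step in the excerpt above can be pictured with a minimal sketch. It assumes each rank reports the set of checkpoint versions it can still recover; the function name `latest_global_version` and the dictionary layout are hypothetical and are not taken from the paper.

```python
from typing import Dict, Optional, Set


def latest_global_version(recoverable: Dict[int, Set[int]]) -> Optional[int]:
    """Return the newest version that every rank can recover, or None.

    Assumption (not from the paper): `recoverable` maps a rank id to the
    set of checkpoint versions that rank can still restore.
    """
    if not recoverable:
        return None
    # A version is globally recoverable only if all ranks hold it.
    common = set.intersection(*recoverable.values())
    return max(common) if common else None


# Hypothetical reports from three ranks after a failure.
reports = {0: {10, 15, 20}, 1: {10, 15}, 2: {15, 20, 25}}
consensus = latest_global_version(reports)
```

Here rank 1 never reached version 20, so the sketch falls back to version 15, the newest version all three ranks share.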
Figure 2. Adaptive cross-tier transmission without stalling foreground training.
Excerpt: 3.3 Checkpoint Retrieval. Design goal. The retrieval path must prioritize both recovery speed and state consistency. Upon failure, TierCheck must identify the latest globally consistent state before resuming from the fastest surviving storage tier. To quantify checkpoint freshness, TierCheck defines a checkpoint version by its corres…
Figure 3. Availability of base-checkpoint anchors across different storage tiers under heterogeneous failures.
Excerpt: • Local-anchor recovery (S1 = 1). This category covers states (1, 0, 0), (1, 1, 0), (1, 0, 1), and (1, 1, 1). When a software failure terminates training while the host node remains healthy, TierCheck restores the base checkpoint directly from the local volatile copy, bypassing slower tiers entirely. The …
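One way to read the availability states in the excerpt above is as a fastest-first lookup. The sketch below assumes the tuple is ordered (S1, S2, S3) = (local memory, peer memory, remote storage); only the local case (S1 = 1) is spelled out in the excerpt, so the meaning of the other two positions and the helper name `pick_anchor_tier` are assumptions.

```python
from typing import Optional, Tuple

# Assumed tier order, fastest first. Only S1 = local is confirmed by the excerpt.
TIERS = ("local", "peer", "remote")


def pick_anchor_tier(state: Tuple[int, int, int]) -> Optional[str]:
    """Return the fastest tier that still holds a base checkpoint, if any."""
    for available, tier in zip(state, TIERS):
        if available:
            return tier
    return None  # no surviving base checkpoint in any tier


# Every state with S1 = 1 resolves to the local copy, bypassing slower tiers.
local_states = [(1, 0, 0), (1, 1, 0), (1, 0, 1), (1, 1, 1)]
chosen = [pick_anchor_tier(s) for s in local_states]
```

Under these assumptions, all four states listed for local-anchor recovery select the local copy, which matches the excerpt's description of skipping the slower tiers.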
Figure 5. Comparison of naive and watermark-driven global reclamation.
Excerpt: … newest locally visible state is incorrect under heterogeneous failures, as local recency provides no guarantee of cluster-wide redundancy. Reclamation must thus be governed by global recoverability rather than per-node progress. Watermark-driven global reclamation. TierCheck resolves this synchronization challenge by deferring garbage collectio…
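The contrast between per-node progress and global recoverability can be illustrated with a small sketch. It assumes the watermark is the minimum, over all nodes, of the newest version each node reports as safely redundant; that definition, and the names `global_watermark` and `reclaimable`, are illustrative assumptions rather than the paper's protocol.

```python
from typing import Dict, List


def global_watermark(newest_safe: Dict[str, int]) -> int:
    """Assumed definition: the slowest node's newest safely redundant version."""
    return min(newest_safe.values())


def reclaimable(local_versions: List[int], watermark: int) -> List[int]:
    """Versions strictly older than the watermark may be garbage collected."""
    return [v for v in local_versions if v < watermark]


# Hypothetical cluster: node1 lags behind the others.
newest_safe = {"node0": 30, "node1": 20, "node2": 25}
wm = global_watermark(newest_safe)

# A naive policy on node0 would keep only version 30 and drop 10 and 20.
# The watermark policy keeps version 20 because node1 still depends on it.
to_free = reclaimable([10, 20, 30], wm)
```

The design choice shown here is that a node never frees a version merely because it holds something newer locally; it waits until the whole cluster has moved past that version.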
Figure 4. Comparison of native and fused multi-step differential checkpoint replaying.
Excerpt: To eliminate these overheads, TierCheck introduces fused multi-step differential checkpoint replaying. Rather than replaying the entire historical chain in a single monolithic pass, TierCheck streams the recovered differential checkpoints and replays them in bounded batches. Each replay batch contains N consecutive differential ch…
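The batching idea in the excerpt above can be sketched as follows. This is a deliberately simplified model: it treats each differential checkpoint as an additive delta on a flat list of floats, which is an assumption made for illustration (the paper's differentials may instead be replayed through the optimizer). The names `batched` and `fused_replay` are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched(items: Iterable[List[float]], n: int) -> Iterator[List[List[float]]]:
    """Yield consecutive groups of at most n items from a stream."""
    it = iter(items)
    while True:
        batch = list(islice(it, n))
        if not batch:
            return
        yield batch


def fused_replay(base: List[float], diffs: Iterable[List[float]], n: int) -> List[float]:
    """Replay differentials in bounded batches of n, one state update per batch.

    Assumption: differentials are additive deltas, so a batch can be fused
    by summing its deltas element-wise before touching the state.
    """
    state = list(base)
    for batch in batched(diffs, n):
        fused = [sum(column) for column in zip(*batch)]
        state = [s + d for s, d in zip(state, fused)]
    return state


base = [1.0, 2.0]
diffs = [[0.5, 0.0], [0.5, 1.0], [1.0, 1.0]]
restored = fused_replay(base, diffs, n=2)
```

With n=2 the three differentials are applied in two state updates instead of three, and only one batch is held in memory at a time, which is the property the bounded-batch description points at.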
Figure 6. (Exp#1) Average training time per iteration.
Plot text captured alongside the caption: legend TierCheck, Gemini, DataStates-LLM, CheckFreq; groups GPT2 20B and BERT 20B; axis "Checkpointing time (s)"; bar labels 3.1, 3.2, 16.7, 12.9, 13.8, 11.5, 19.6, 20.0 (label-to-bar mapping not recoverable).
Figure 7. (Exp#2) Checkpointing time.
Plot text captured alongside the caption: groups GPT2 20B and BERT 20B; axis "Ckpt. frequency (iterations)"; bar labels 5, 5, 11, 12, 11, 11, 21, 22 (label-to-bar mapping not recoverable).
Figure 9. (Exp#5) Scalability of different model sizes.
Plot text captured alongside the caption: configurations DP=16 (ZeRO-3); DP=4 PP=4; DP=4 TP=4; PP=4 TP=4; DP=4 TP/PP=2; axis "Training time (s)"; legend No Ckpt., TierCheck.
Figure 13. (Exp#9) Storage overhead.
Excerpt: … it 14.4× slower than TierCheck. This performance gain is attributed to the fused operator's ability to consolidate multiple optimizer updates into a single pass, thereby mitigating redundant memory I/O and cross-rank communication synchronizations. (Exp#8) Convergence accuracy. To verify algorithmic correctness without the prohibitive cost of 40B-scale end-to-end training, we …
Original abstract

Large Language Model (LLM) training is frequently interrupted by a heterogeneous spectrum of failures, from common GPU crashes to catastrophic cluster-wide outages. Existing checkpointing systems rely on a monolithic, single-tier storage backend, forcing a trade-off between state-saving overhead and recovery speed. We propose TierCheck, a cluster-aware tiered checkpointing system that aligns storage placement with failure heterogeneity. TierCheck adopts a three-tier design that maintains lightweight differential checkpoints in local and peer memory for fast localized recovery, while asynchronously migrating heavyweight base checkpoints to remote persistent storage. It also ensures strict global consistency across tiers without stalling training, and achieves fast cluster-aware checkpoint restoration during recovery. Evaluations on models up to 40 billion parameters show that TierCheck achieves low training overhead, reduces end-to-end checkpointing time to under 10s, and supports high-frequency checkpointing, ultimately striking an optimal balance between low-overhead persistence and fast recovery.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces TierCheck, a tiered checkpointing system for fault tolerance in large language model training. It uses a three-tier design that keeps lightweight differential checkpoints in local and peer memory for fast localized recovery while asynchronously migrating heavyweight base checkpoints to remote persistent storage. The system claims to enforce strict global consistency across tiers without stalling training and to enable fast cluster-aware recovery. Evaluations on models up to 40 billion parameters report low training overhead, end-to-end checkpointing times under 10 seconds, and support for high-frequency checkpointing.

Significance. If the consistency guarantees and recovery performance hold under realistic failure models, TierCheck would address a practical bottleneck in large-scale distributed LLM training by aligning storage tiers with heterogeneous failure patterns. The tiered approach offers a concrete way to trade off persistence cost against recovery latency, which is relevant to current production training systems.

major comments (2)
  1. Abstract and §3 (Design): The central claim that TierCheck 'ensures strict global consistency across tiers without stalling training' while performing asynchronous base-checkpoint migration is load-bearing, yet no explicit protocol (versioning, migration log, two-phase commit, or atomic hand-off) is described that would guarantee a recovering node can always obtain a consistent base+differential pair if a fault occurs mid-migration. Without this mechanism the 'strict global consistency' guarantee cannot be verified.
  2. §5 (Evaluation): The reported results (models up to 40B parameters, checkpointing time <10 s, low overhead) are presented without details on the failure models injected, how global consistency was checked during or after recovery, the cluster size, or the exact experimental methodology. These omissions prevent assessment of whether the numbers support the consistency and performance claims.
minor comments (2)
  1. Add a diagram in §3 that explicitly shows the state transitions and consistency invariants during asynchronous migration.
  2. Clarify the notation for differential checkpoints versus base checkpoints in the text and any pseudocode.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and indicate the revisions we will make to strengthen the presentation of our consistency protocol and evaluation details.

Point-by-point responses
  1. Referee: Abstract and §3 (Design): The central claim that TierCheck 'ensures strict global consistency across tiers without stalling training' while performing asynchronous base-checkpoint migration is load-bearing, yet no explicit protocol (versioning, migration log, two-phase commit, or atomic hand-off) is described that would guarantee a recovering node can always obtain a consistent base+differential pair if a fault occurs mid-migration. Without this mechanism the 'strict global consistency' guarantee cannot be verified.

    Authors: We agree that the manuscript would benefit from an explicit description of the consistency mechanism. In the revised version we will expand §3 with a new subsection that details the versioning scheme (each differential is immutably bound to a base-checkpoint version identifier), the migration log that records completion of asynchronous base persistence, and the recovery rule that selects the latest version pair for which both components are present. This protocol guarantees a consistent base+differential pair without requiring training to stall, as the hand-off occurs only after the base has been durably written. revision: yes
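The recovery rule described in this response can be sketched in a few lines. The sketch assumes the migration log is reduced to a set of base versions whose persistence has completed, and that each differential carries the base version it is bound to; the `Differential` class and `select_recovery_pair` function are hypothetical names, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set, Tuple


@dataclass(frozen=True)
class Differential:
    base_version: int  # base checkpoint this differential is immutably bound to
    step: int          # training iteration the differential captures


def select_recovery_pair(
    durable_bases: Set[int], diffs: Iterable[Differential]
) -> Optional[Tuple[int, int]]:
    """Pick the newest (base_version, step) pair with both components present.

    Assumption: `durable_bases` stands in for the migration log and lists
    only base versions whose asynchronous write has completed.
    """
    usable = [d for d in diffs if d.base_version in durable_bases]
    if not usable:
        return None
    newest = max(usable, key=lambda d: d.step)
    return (newest.base_version, newest.step)


# Hypothetical fault while base 300 is still migrating (absent from the log).
durable_bases = {100, 200}
diffs = [Differential(200, 250), Differential(300, 310), Differential(300, 320)]
pair = select_recovery_pair(durable_bases, diffs)
```

In this scenario the differentials bound to base 300 are skipped, because their base was never recorded as durable, and recovery resumes from base 200 at step 250.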

  2. Referee: §5 (Evaluation): The reported results (models up to 40B parameters, checkpointing time <10 s, low overhead) are presented without details on the failure models injected, how global consistency was checked during or after recovery, the cluster size, or the exact experimental methodology. These omissions prevent assessment of whether the numbers support the consistency and performance claims.

    Authors: We acknowledge the need for greater methodological transparency. The revised §5 will include a dedicated experimental-setup subsection that specifies the injected failure models (single-GPU crashes, node failures, and simulated cluster-wide outages), the consistency-checking procedure (post-recovery state hashing against a golden checkpoint plus cross-tier version validation), the cluster sizes used (up to 128 GPUs), and the precise measurement methodology for overhead and end-to-end times. These additions will allow readers to evaluate the reported numbers against the claimed guarantees. revision: yes
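A post-recovery hash comparison of the kind mentioned in this response might look like the sketch below. It assumes model state can be flattened to a mapping from tensor name to raw bytes; the `state_digest` helper and the sample tensor names are illustrative assumptions.

```python
import hashlib
from typing import Dict


def state_digest(state: Dict[str, bytes]) -> str:
    """Hash a flattened model state in a key-order-independent way.

    Assumption: tensors have already been serialized to bytes.
    """
    h = hashlib.sha256()
    for name in sorted(state):
        h.update(name.encode("utf-8"))
        # Length prefix keeps adjacent entries from blurring together.
        h.update(len(state[name]).to_bytes(8, "big"))
        h.update(state[name])
    return h.hexdigest()


# Hypothetical golden checkpoint and a recovered state with another key order.
golden = {"layer0.weight": b"\x00\x01", "layer0.bias": b"\x02"}
recovered = {"layer0.bias": b"\x02", "layer0.weight": b"\x00\x01"}
matches = state_digest(golden) == state_digest(recovered)

# A single flipped byte should change the digest.
corrupted = {"layer0.weight": b"\x00\x02", "layer0.bias": b"\x02"}
differs = state_digest(golden) != state_digest(corrupted)
```

Sorting the keys before hashing means the check does not depend on the order in which shards were reloaded, which matters when recovery maps shards onto a changed cluster topology.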

Circularity Check

0 steps flagged

No circularity: system design with external evaluations, not a derivation chain

This is an engineering system paper proposing TierCheck, a three-tier checkpointing design for LLM training that places lightweight differentials in local/peer memory and migrates base checkpoints asynchronously to remote storage while claiming strict global consistency. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text or abstract. Claims rest on reported evaluations with models up to 40B parameters rather than any reduction to the paper's own inputs by construction. The work is self-contained against external benchmarks and contains no load-bearing self-citations, ansatzes, or uniqueness theorems that collapse the central argument. Score 0 is the appropriate finding per the guidelines for non-derivational system proposals.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about failure heterogeneity and the feasibility of maintaining consistency during asynchronous tier migration; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)
  • Domain assumption: LLM training failures form a heterogeneous spectrum, from common GPU crashes to catastrophic cluster-wide outages, that can be aligned with different storage tiers.
    The three-tier design is motivated by and depends on this characterization of failures.
  • Domain assumption: Asynchronous migration of base checkpoints to remote storage can occur without violating strict global consistency or stalling training.
    This is required for the claimed low-overhead persistence and fast recovery.
