arxiv: 2506.15461 · v2 · submitted 2025-06-18 · 💻 cs.DC · cs.LG

All is Not Lost: LLM Recovery without Checkpoints

Pith reviewed 2026-05-19 09:10 UTC · model grok-4.3

classification 💻 cs.DC cs.LG
keywords LLM training · fault tolerance · pipeline parallelism · checkpoint-free recovery · decentralized training · node failures · weighted averaging

The pith

Weighted averaging of neighboring pipeline stages recovers LLM training from node failures without checkpoints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that in distributed LLM training prone to node churn, a failed pipeline stage can be replaced by a weighted average of its nearest neighbors to continue without full restarts. This CheckFree approach avoids the ongoing communication and storage costs of periodic checkpoints as well as the constant extra compute of redundancy. An extension, CheckFree+, adds out-of-order execution so that edge stages can be recovered by copying from neighbors, with only small extra storage needed for embedding layers. Experiments across LLaMa models from 124M to 1.5B parameters show faster wall-clock convergence at 5-10% failure rates, reaching up to 12% improvement over redundant computation. The result matters because it lowers the practical cost of training on decentralized or spot-instance resources where failures are routine.

Core claim

CheckFree substitutes a failing stage by weighted averaging of the closest neighboring stages and requires no additional computation or storage. CheckFree+ extends the method with out-of-order pipeline execution to tolerate crashes of the first and last stages by mimicking their behavior through neighbors and copying for (de-)embedding layers at relatively small storage cost. Evaluations on LLaMa models of sizes from 124M to 1.5B with varying failure frequencies show that at low and medium failure rates of 5-10%, both methods outperform checkpointing and redundant computation in convergence wall-clock time, achieving up to 12% improvement over redundant computation.
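The recovery rule in this claim can be read as a dispatch on the position of the failed stage. The sketch below is illustrative only and is not taken from the authors' code: the function name `plan_recovery`, the returned tuple layout, and the 0-indexed stage numbering are all assumptions made for this example.

```python
def plan_recovery(failed: int, num_stages: int, allow_edges: bool = True):
    """Return (strategy, source_stages) for a failed pipeline stage.

    Hypothetical helper: stages are numbered 0..num_stages-1.
    allow_edges=False models plain CheckFree, which only handles
    intermediate stages; allow_edges=True models CheckFree+.
    """
    if num_stages < 2:
        raise ValueError("a pipeline needs at least two stages")
    if not 0 <= failed < num_stages:
        raise ValueError("failed stage index out of range")

    first, last = 0, num_stages - 1
    if failed not in (first, last):
        # Intermediate stage: rebuild from both adjacent stages.
        return ("average", (failed - 1, failed + 1))
    if not allow_edges:
        # Plain CheckFree cannot recover the first or last stage.
        return ("unrecoverable", ())
    # CheckFree+: copy the single adjacent stage.
    neighbour = 1 if failed == first else last - 1
    return ("copy", (neighbour,))


print(plan_recovery(2, 6))  # ('average', (1, 3))
print(plan_recovery(0, 6))  # ('copy', (1,))
```

The sketch leaves out the (de-)embedding layers, which the abstract says CheckFree+ recovers from copies kept in the neighboring stages.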

What carries the argument

Weighted averaging of parameters from adjacent pipeline stages, which substitutes for a failed stage while carrying forward the training signal without added overhead.
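As a minimal sketch of the operation, the function below averages two stages parameter by parameter. It assumes the two neighbors share an architecture so that parameter names and sizes line up. The name `average_stages`, the dict-of-lists representation, and the default weights of 0.5 are placeholders for illustration; this summary does not state how the paper chooses its weights.

```python
def average_stages(prev_params, next_params, w_prev=0.5, w_next=0.5):
    """Weighted average of two stages' parameters, name by name.

    Each argument maps a parameter name to a flat list of floats.
    The weights are caller-chosen placeholders, normalised by their sum.
    """
    total = w_prev + w_next
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    if prev_params.keys() != next_params.keys():
        raise ValueError("stages must expose the same parameter names")

    recovered = {}
    for name, prev_values in prev_params.items():
        next_values = next_params[name]
        if len(prev_values) != len(next_values):
            raise ValueError(f"size mismatch for {name!r}")
        recovered[name] = [
            (w_prev * p + w_next * n) / total
            for p, n in zip(prev_values, next_values)
        ]
    return recovered


# Toy values, not real model weights.
stage_before = {"mlp.weight": [1.0, 2.0]}
stage_after = {"mlp.weight": [3.0, 6.0]}
print(average_stages(stage_before, stage_after))  # {'mlp.weight': [2.0, 4.0]}
```

A real implementation would operate on framework tensors rather than Python lists; the lists keep the example dependency-free.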

If this is right

  • CheckFree adds no extra computation or storage during failure-free operation.
  • Wall-clock time to convergence improves by up to 12% over redundant computation at 5-10% failure rates.
  • The method applies across model sizes from 124 million to 1.5 billion parameters.
  • Edge-stage failures become tolerable with only minor storage for embedding layers via out-of-order execution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The technique could make training on highly variable spot instances or volunteer compute networks more economically viable by removing baseline overhead.
  • If the averaging effect generalizes, similar neighbor-based substitution might apply to other forms of model parallelism beyond strict pipelines.
  • Adaptive weighting that accounts for stage position or observed gradient magnitude could be tested as a refinement.
  • Direct comparison on non-LLaMa transformer variants or different tasks would clarify how broadly the convergence benefit holds.
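One way to prototype the adaptive-weighting idea in the list above is to turn a per-stage score, such as a gradient norm, into a pair of normalised weights. The sketch is hypothetical: the helper names, the choice of an L2 norm as the score, and the rule that a larger score earns a larger weight are assumptions for illustration, not the paper's method.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))


def normalized_weights(score_prev, score_next, eps=1e-12):
    """Turn two non-negative scores into weights that sum to 1.

    Falls back to equal weights when both scores are (near) zero.
    """
    if score_prev < 0 or score_next < 0:
        raise ValueError("scores must be non-negative")
    total = score_prev + score_next
    if total < eps:
        return 0.5, 0.5
    return score_prev / total, score_next / total


# Toy gradients for the two neighbours of a failed stage.
w_prev, w_next = normalized_weights(l2_norm([3.0, 4.0]), l2_norm([0.0, 15.0]))
print(w_prev, w_next)  # 0.25 0.75
```

Whether a larger gradient norm should raise or lower a neighbor's weight is itself something such an experiment would need to test.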

Load-bearing premise

Averaging neighboring stages preserves enough training signal and convergence behavior to match or exceed checkpoint recovery without introducing bias for the tested model sizes and failure patterns.

What would settle it

Repeated recoveries via neighbor averaging at 5-10% failure rates on 124M-1.5B LLaMa models that result in slower convergence to target loss or final performance worse than checkpoint-based runs.

Abstract

Training LLMs on decentralized nodes or on-spot instances lowers the training cost and enables model democratization. The inevitable challenge here is the transient churn of nodes due to failures and the operator's scheduling policies, leading to the loss of parts of the model (some layers). The conventional approaches to recovering from failures are either checkpointing, where a copy of the entire model is periodically sent to additional storage, or redundant computation. These approaches yield significant communication and/or computation overhead even in non-failure cases and scale poorly in settings with large models. In this paper we propose CheckFree, an efficient recovery method where a failing stage is substituted by a weighted average of the closest neighboring stages. In contrast to the state of the art, CheckFree requires no additional computation or storage. However, because of the nature of averaging neighboring stages, it can only recover failures of intermediate stages. We further extend our method to CheckFree+ with out-of-order pipeline execution to tolerate crashes of the first and last stages. Thanks to out-of-order pipelining, the behavior of the first and last stages is mimicked by their neighboring ones, which allows CheckFree+ to recover them by copying the neighboring stages. To recover the (de-)embedding layers, CheckFree+ copies those layers in the neighboring stages, which requires relatively small storage overhead. We extensively evaluate our method on LLaMa models of sizes from 124M to 1.5B with varying failure frequencies. In the case of low and medium failure rates (5-10%), CheckFree and CheckFree+ outperform both checkpointing and redundant computation in terms of convergence wall-clock time, achieving up to 12% improvement over redundant computation. Both of our proposals can be run via our code available at: https://github.com/gensyn-ai/CheckFree

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CheckFree, a checkpoint-free recovery technique for transient node failures in pipeline-parallel LLM training on decentralized or spot instances. Failed intermediate pipeline stages are replaced by weighted averaging of neighboring stages; CheckFree+ adds out-of-order execution and neighbor copying to also tolerate first- and last-stage failures. On LLaMa models from 124M to 1.5B parameters, the paper reports faster wall-clock convergence than both checkpointing and redundant computation at 5-10% failure rates, with up to 12% improvement over redundant computation. CheckFree adds no extra compute or storage in the failure-free case; CheckFree+ adds a small storage cost for the (de-)embedding layers.

Significance. If final model quality is shown to match checkpoint baselines, the approach would meaningfully lower the overhead of fault tolerance for large-model training on unreliable hardware, directly addressing a practical barrier to decentralized LLM training. The empirical evaluation across model scales and failure frequencies provides concrete evidence for the wall-clock claims; the open-source implementation is a further strength.

major comments (2)
  1. [Evaluation] Evaluation section (and abstract): the central claim that CheckFree/CheckFree+ reaches convergence faster than baselines rests on wall-clock time to a target loss, yet the manuscript does not report side-by-side final loss, perplexity, or downstream-task metrics comparing CheckFree runs against the checkpoint baseline at the same effective step count. Without this, the reported 12% improvement cannot be unambiguously attributed to superior recovery rather than differences in the converged model quality.
  2. [Method and Experiments] §4 (Method) and §5 (Experiments): the weighted-averaging substitution (and neighbor copying in CheckFree+) is presented as preserving training signal, but no analysis or ablation quantifies its effect on gradient flow across layer boundaries or under non-uniform failure patterns; this assumption is load-bearing for the claim that convergence behavior remains comparable.
minor comments (2)
  1. [Abstract] Abstract: the statement that CheckFree 'requires no additional computation or storage' should be qualified to note the small storage overhead for (de-)embedding layers in CheckFree+.
  2. [Experiments] The paper would benefit from explicit reporting of the number of random seeds, variance across runs, and the precise definition of 'convergence' (e.g., target loss threshold or fixed step count) used for the wall-clock measurements.
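The reporting asked for in the second minor comment can be made concrete with a short helper. This is a sketch under assumptions: it defines convergence as the first logged wall-clock time at which the loss is at or below a target, and the names `time_to_target` and `summarize` and the input layout are hypothetical. The numbers in the usage lines are toy values, not results from the paper.

```python
from statistics import mean, stdev


def time_to_target(times, losses, target):
    """Return the first wall-clock time at which loss <= target, or None."""
    for t, loss in zip(times, losses):
        if loss <= target:
            return t
    return None


def summarize(runs, target):
    """Aggregate time-to-target over seeds.

    runs maps a seed to a (times, losses) pair of equal-length lists.
    Seeds that never reach the target are listed separately rather
    than being silently dropped from the statistics.
    """
    reached, missed = [], []
    for seed, (times, losses) in runs.items():
        t = time_to_target(times, losses, target)
        if t is None:
            missed.append(seed)
        else:
            reached.append(t)
    return {
        "n_seeds": len(runs),
        "mean_time": mean(reached) if reached else None,
        # Sample standard deviation needs at least two runs.
        "std_time": stdev(reached) if len(reached) > 1 else None,
        "seeds_missed": missed,
    }


toy_runs = {
    0: ([1, 2, 3], [3.0, 2.0, 1.0]),
    1: ([1, 2, 3], [3.0, 2.5, 1.9]),
    2: ([1, 2, 3], [3.0, 2.9, 2.8]),
}
print(summarize(toy_runs, target=2.0))
```

Listing the seeds that miss the target matters here because dropping them would bias the mean time downward.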

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below, indicating where we agree that revisions are warranted and how we will strengthen the manuscript.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section (and abstract): the central claim that CheckFree/CheckFree+ reaches convergence faster than baselines rests on wall-clock time to a target loss, yet the manuscript does not report side-by-side final loss, perplexity, or downstream-task metrics comparing CheckFree runs against the checkpoint baseline at the same effective step count. Without this, the reported 12% improvement cannot be unambiguously attributed to superior recovery rather than differences in the converged model quality.

    Authors: We agree that reporting final metrics at equivalent effective step counts would remove any ambiguity. Our current evaluation trains every method to the identical target loss and reports wall-clock time to that point. In the revised manuscript we will add a table comparing final loss and perplexity for CheckFree, CheckFree+, checkpointing, and redundancy after the same number of effective training steps (i.e., steps that actually update model parameters). Where space permits we will also include a downstream-task result (e.g., a GLUE subset) for the 1.5B model. These additions will confirm that the observed speed-up stems from lower recovery overhead rather than differences in final model quality.
    Revision: yes

  2. Referee: [Method and Experiments] §4 (Method) and §5 (Experiments): the weighted-averaging substitution (and neighbor copying in CheckFree+) is presented as preserving training signal, but no analysis or ablation quantifies its effect on gradient flow across layer boundaries or under non-uniform failure patterns; this assumption is load-bearing for the claim that convergence behavior remains comparable.

    Authors: We acknowledge that an explicit quantification of the transient effect on gradient flow would be valuable. In the revised version we will add an ablation that varies the averaging weights, records the resulting per-layer gradient norms during recovery, and plots the impact on subsequent convergence. We will also extend the failure model to non-uniform patterns (e.g., failures concentrated on early or late stages) and report the corresponding wall-clock and final-loss results. These experiments will directly test the robustness of the substitution mechanism.
    Revision: yes
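The non-uniform failure patterns mentioned in the second response could be generated with a small sampler like the one below. It is a sketch, not the paper's failure model: treating the failure rate as a per-step probability with at most one failure per step is a simplifying assumption, and the name `sample_failures` and its parameters are hypothetical.

```python
import random


def sample_failures(num_stages, num_steps, failure_rate, stage_weights=None, seed=0):
    """Return a list of (step, stage) failure events.

    Assumption: at each step one failure occurs with probability
    failure_rate. stage_weights biases which stage fails; None
    means every stage is equally likely.
    """
    if stage_weights is None:
        stage_weights = [1.0] * num_stages
    if len(stage_weights) != num_stages:
        raise ValueError("need one weight per stage")

    rng = random.Random(seed)  # private generator keeps runs reproducible
    events = []
    for step in range(num_steps):
        if rng.random() < failure_rate:
            stage = rng.choices(range(num_stages), weights=stage_weights, k=1)[0]
            events.append((step, stage))
    return events


# Failures concentrated on the two earliest of six stages (toy weights).
early_heavy = sample_failures(6, 1000, 0.05, stage_weights=[4, 4, 1, 1, 1, 1], seed=1)
print(len(early_heavy), early_heavy[:3])
```

Fixing the seed lets every recovery method be compared against the same sequence of failures.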

Circularity Check

0 steps flagged

No significant circularity: empirical method and measured results

full rationale

The paper describes a recovery technique (weighted neighbor averaging for intermediate stages, out-of-order execution plus copying for edge stages) and reports wall-clock convergence improvements from direct experiments on 124M–1.5B LLaMA models at 5–10% failure rates. No equations, derivations, or fitted parameters are presented whose outputs are then relabeled as predictions. Claims rest on side-by-side runtime measurements against checkpointing and redundancy baselines rather than any self-definitional loop, self-citation load-bearing premise, or ansatz smuggled through prior work. The central performance numbers are therefore independent experimental outcomes, not reductions to the method's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that pipeline stages are sufficiently interchangeable via averaging to maintain training progress; no new free parameters, invented entities, or ad-hoc axioms beyond standard distributed training assumptions are introduced.

axioms (1)
  • domain assumption: Neighboring stages in a pipeline-parallel setup can substitute for a failed stage via weighted averaging without substantially altering convergence dynamics.
    This premise underpins the recovery mechanism for intermediate stages and is invoked to justify performance parity or gains versus checkpointing.
