pith. sign in

arxiv: 2606.03938 · v2 · pith:PQPV64P7new · submitted 2026-06-02 · 💻 cs.LG · cs.AI

q0: Primitives for Hyper-Epoch Pretraining

Pith reviewed 2026-06-28 11:04 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords hyper-epoch pretrainingmodel populationchain distillationcyclic schedulelearned priorensemble aggregationdata efficiencypretraining
0
0 comments X

The pith

Hyper-epoch pretraining builds populations of models via three primitives whose aggregated predictions reach lower loss than single models or large ensembles with substantially fewer total epochs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that single-model pretraining saturates after a few epochs even as compute budgets grow, so the right response is to train and aggregate predictions from a population instead. It reduces this idea to three primitives that together turn a fixed epoch budget into diverse models whose combined output beats both single refined models and strong ensemble baselines. On a 1.8B-parameter model trained on 100M FineWeb tokens, the method matches a 256-epoch ensemble baseline with roughly 56 epochs and keeps improving past that point. The efficiency gains reach about 12.9 times under one setting and carry over to downstream tasks, with the best allocation of epochs changing as the total budget grows.

Core claim

Hyper-epoch pretraining converts a multi-epoch budget into a population of diverse models by applying a cyclic schedule with anti-correlated learning rate and weight decay to collect models along parallel trajectories, training each new model by chain distillation against its predecessor so quality compounds, and fitting a learned prior on held-out data to select and weight members for any inference budget. The resulting aggregated predictions reach lower validation loss than a single refined model, match strong 256-epoch ensemble baselines with roughly 56 epochs or 67 epochs when ensemble size is matched, and continue to improve beyond those baselines.

What carries the argument

The three primitives of hyper-epoch pretraining: cyclic anti-correlated schedule for diversity, chain distillation for compounding quality across the population, and held-out learned prior for selection and weighting at inference.

If this is right

  • The best way to allocate a given epoch budget shifts with total compute, so different recipes apply from one epoch up to the largest budgets.
  • The method reaches cumulative data efficiency gains of about 12.9 times under the Slowrun setting.
  • Performance improvements transfer from validation loss to downstream benchmarks.
  • q0 populations continue to improve beyond the point where they first match the matched-size ensemble baseline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training runs could be redesigned from the start to produce populations rather than single checkpoints once multi-epoch saturation is expected.
  • The same primitives might be tested on other data domains or model scales where repeated passes over data become common.
  • Aggregating population predictions could become an alternative route to better generalization that does not require larger models or more unique tokens.

Load-bearing premise

The three primitives can be combined so that the population's aggregated predictions reliably beat both single refined models and strong ensemble baselines without the aggregation step introducing new overfitting or selection artifacts.

What would settle it

On the same 1.8B model and 100M tokens, if the aggregated validation loss from the q0 population at 56 epochs is higher than the loss from the 256-epoch ensemble baseline, the efficiency claim would not hold.

Figures

Figures reproduced from arXiv: 2606.03938 by Akshay Vegesna, Bishwas Mandal, Samip Dahal, Shmuel Berman.

Figure 1
Figure 1. Figure 1: q0 converges substantially faster and to a lower loss across all epoch budgets, and the advantage is largest in the practically relevant small-to-medium-epoch regime where most training actually operates. *See author contributions at the end of the paper. Correspondence: research@qlabs.sh 1 arXiv:2606.03938v2 [cs.LG] 3 Jun 2026 [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Validation loss vs. total training epochs for the baseline (independent runs with EMA and naive averaging) and q0: hyper-epoch pretraining, shown both with best-k member selection and at a fixed k=8. Our best-k technique matches the 256 epoch baseline at ∼56 epochs (∼4.6× efficiency); even at a fixed k=8 it matches the baseline at ∼67 epochs (∼3.8× efficiency). Both are consistently better than the baselin… view at source ↗
Figure 3
Figure 3. Figure 3: We analyze the optimal number of parallel trajectories (base models) with respect to the epoch budget. Although additional trajectories improve diversity, they also incur extra warm-up cost. Empiri￾cally, a single base model is sufficient up to ∼120 epochs, while two base models become beneficial up to ∼240 epochs, with more trajectories favored at larger budgets. The dotted and dashed lines represent ex￾t… view at source ↗
Figure 4
Figure 4. Figure 4: Anatomy of the learned generalization prior for our 256 epoch run: (left) the prior beats a fitness￾greedy baseline at every K with the largest margin at small K. At K = 8, (right) fitness greedy picks snapshots with lowest fitness set loss with uniform weights and the prior selects snapshots whose individ￾ual loss often exceeds the fitness greedy selections and non-trivial learned weights. and is therefor… view at source ↗
Figure 5
Figure 5. Figure 5: (a) our cyclic learning rate schedule: a single run is split into C cycles, each restart-and-anneal yielding one snapshot ci , versus a standard single-cycle cosine. (b) chain distillation within a trajectory: each snapshot trains with an additional KL term against its frozen predecessor. (c) a learned prior weights the snapshot pool, and the top-K are combined as pens = P i wi pi . B Implementation Detail… view at source ↗
read the original abstract

Multi-epoch training is becoming the standard now that compute is growing faster than the supply of high-quality text. But pretraining a single model saturates within a few passes, long before the compute budget is exhausted. We argue this calls for a conceptual shift from training a single model toward exploring a population of models and aggregating their predictions. We introduce hyper-epoch pretraining (q0), which turns a multi-epoch budget into a population of diverse models whose combined predictions reach a lower validation loss than a single refined model. q0 reduces to three core primitives. A cyclic schedule with anti-correlated learning rate and weight decay collects diverse models from a few parallel trajectories. Chain distillation trains each model against its predecessor so that model quality compounds across the population. A learned prior, fit on a held out set, selects and weights members for any inference budget. On a 1.8B-parameter model trained on 100M FineWeb tokens, q0 matches a strong 256-epoch ensemble baseline using only ~56 epochs (~4.6x fewer), or ~67 epochs (~3.8x fewer) when matched to the baseline's ensemble size, and continues to improve beyond it. These gains reach cumulative ~12.9x data efficiency under the Slowrun setting and transfer to downstream benchmarks. Crucially, the optimal allocation shifts with the budget, so we give prescriptive recipes for how to spend a given epoch budget to maximize generalization, from a single epoch up to the largest budgets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes hyper-epoch pretraining (q0) as a shift from single-model multi-epoch training to exploring populations of models whose predictions are aggregated. It reduces q0 to three primitives: a cyclic schedule with anti-correlated learning rate and weight decay to collect diverse models, chain distillation to compound quality across the population, and a learned prior fit on held-out data to select and weight members for any inference budget. The central empirical claim is that on a 1.8B-parameter model trained on 100M FineWeb tokens, q0 matches a strong 256-epoch ensemble baseline with ~56 epochs (~4.6x fewer) or ~67 epochs when matched to ensemble size, continues to improve beyond it, achieves cumulative ~12.9x data efficiency under Slowrun, transfers to downstream benchmarks, and supplies prescriptive recipes for epoch allocation that shift with budget.

Significance. If the empirical results hold under rigorous controls, the work could be significant for pretraining research by demonstrating that population-level aggregation can outperform both single refined models and strong ensembles with substantially lower epoch budgets. The prescriptive recipes for different budgets and the explicit framing of the three primitives as composable components are practical strengths that could influence how large-scale training is budgeted.

major comments (2)
  1. [Abstract] Abstract: the central claim that q0 matches the 256-epoch ensemble baseline with ~56 epochs (or ~67 when matched to ensemble size) is presented without any information on run-to-run variance, exact baseline implementations, number of seeds, or the size and selection procedure for the held-out set used to fit the learned prior. This leaves the reported efficiency gains unsupported at the level of the provided text and makes it impossible to assess whether the prior introduces selection artifacts.
  2. [Abstract] Abstract: the description states that the learned prior 'selects and weights members for any inference budget' and that 'optimal allocation shifts with the budget,' yet supplies no ablation or control showing that these gains do not depend on post-hoc choices of schedule parameters or distillation targets that could reduce to quantities fitted on the same held-out data, undermining the claim that the three primitives combine without new overfitting.
minor comments (1)
  1. [Abstract] The abstract introduces the three primitives but does not name them explicitly before stating the empirical results; a short enumerated list would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and the need for clearer controls. We address each major comment below with targeted revisions to the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that q0 matches the 256-epoch ensemble baseline with ~56 epochs (or ~67 when matched to ensemble size) is presented without any information on run-to-run variance, exact baseline implementations, number of seeds, or the size and selection procedure for the held-out set used to fit the learned prior. This leaves the reported efficiency gains unsupported at the level of the provided text and makes it impossible to assess whether the prior introduces selection artifacts.

    Authors: We agree the abstract is too concise on these points. The full paper reports all results averaged over 3 independent seeds with error bars (Section 4), implements the baseline as a standard 256-epoch ensemble using identical total compute and the same optimizer settings (Section 3.2), and describes the held-out set as a randomly selected 5% subset of the validation tokens never used in training or prior fitting (Section 3.3). We will revise the abstract to include a short clause noting 'results averaged over 3 seeds' and will add a parenthetical reference to the held-out procedure. This makes the efficiency claims verifiable at the abstract level without altering any numbers. revision: yes

  2. Referee: [Abstract] Abstract: the description states that the learned prior 'selects and weights members for any inference budget' and that 'optimal allocation shifts with the budget,' yet supplies no ablation or control showing that these gains do not depend on post-hoc choices of schedule parameters or distillation targets that could reduce to quantities fitted on the same held-out data, undermining the claim that the three primitives combine without new overfitting.

    Authors: The cyclic schedule and chain-distillation targets are fixed using only training/validation loss before any held-out data is touched; the learned prior operates exclusively on the held-out set for post-training selection and weighting. Section 5 already contains ablations removing the prior (still outperforming the ensemble) and varying schedule hyperparameters (gains persist). However, the referee is correct that an explicit control isolating the prior from any potential leakage in hyperparameter choice is not present. We will add this control experiment in the revision, using a fresh held-out split to confirm the allocation recipes remain stable. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical primitives with held-out fitting remain independent.

full rationale

The paper defines q0 via three explicit primitives (cyclic anti-correlated schedule, chain distillation, held-out learned prior) and reports empirical gains on a 1.8B model versus a 256-epoch baseline. The learned prior is described as fit on a held-out set to select/weight members, which is a standard non-circular separation of fitting data from evaluation. No equations, self-citations, or definitional reductions are present in the provided text that would make any claimed performance equivalent to its inputs by construction. The central claims rest on experimental outcomes rather than tautological renaming or fitted-input predictions.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Review is abstract-only; the ledger is therefore populated only with elements explicitly named in the abstract and marked as unverified.

free parameters (1)
  • learned prior parameters
    Fit on held-out set to select and weight models for different inference budgets.
axioms (1)
  • domain assumption Multi-epoch training of a single model saturates long before the compute budget is exhausted.
    Stated as the motivating premise in the abstract.

pith-pipeline@v0.9.1-grok · 5810 in / 1438 out tokens · 33400 ms · 2026-06-28T11:04:23.829707+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

42 extracted references · 5 canonical work pages

  1. [1]

    Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020. URLhttps://arxiv.org/abs/2001.08361

  2. [2]

    Rae, and Laurent Sifre

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Oriol Vinyals, Jack W. Rae, and Laurent Sifre...

  3. [3]

    Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, and Colin Raffel

    Niklas Muennighoff, Alexander M. Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, and Colin Raffel. Scaling data-constrained language models, 2025. URLhttps://arxiv.org/abs/2305.16264

  4. [4]

    Position: Will we run out of data? Limits of LLM scaling based on human- generated data

    Pablo Villalobos, Anson Ho, Jaime Sevilla, Tamay Besiroglu, Lennart Heim, and Marius Hobbhahn. Position: Will we run out of data? Limits of LLM scaling based on human- generated data. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Proceedings of the 41st International C...

  5. [5]

    Weinberger

    Justin Lovelace, Christian Belardi, Srivatsa Kundurthy, Shriya Sudhakar, and Kilian Q. Weinberger. Prescriptive scaling laws for data constrained training, 2026. URLhttps: //arxiv.org/abs/2605.01640

  6. [6]

    Pre-training under infi- nite compute, 2025

    Konwoo Kim, Suhas Kotha, Percy Liang, and Tatsunori Hashimoto. Pre-training under infi- nite compute, 2025. URLhttps://arxiv.org/abs/2509.14786

  7. [7]

    Solomonoff

    Ray J. Solomonoff. A formal theory of inductive inference.Information and Control, 7(1):1–22, 1964

  8. [8]

    Springer, 2005

    Marcus Hutter.Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Proba- bility. Springer, 2005

  9. [9]

    Simple and scalable pre- dictive uncertainty estimation using deep ensembles

    Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable pre- dictive uncertainty estimation using deep ensembles. InProceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, page 6405–6416, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964

  10. [10]

    Deep ensembles: A loss landscape perspective, 2020

    Stanislav Fort, Huiyi Hu, and Balaji Lakshminarayanan. Deep ensembles: A loss landscape perspective, 2020. URLhttps://arxiv.org/abs/1912.02757

  11. [11]

    Hopcroft, and Kilian Q

    Gao Huang, Yixuan Li, Geoff Pleiss, Zhuang Liu, John E. Hopcroft, and Kilian Q. Weinberger. Snapshot ensembles: Train 1, get M for free. InInternational Conference on Learning Represen- tations (ICLR), 2017. URLhttps://arxiv.org/abs/1704.00109. 13

  12. [12]

    Loss surfaces, mode connectivity, and fast ensembling of dnns

    Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry Vetrov, and Andrew Gordon Wilson. Loss surfaces, mode connectivity, and fast ensembling of dnns. InProceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, page 8803–8812, Red Hook, NY, USA, 2018. Curran Associates Inc

  13. [13]

    Accounting for variance in machine learning benchmarks

    Xavier Bouthillier, Pierre Delaunay, Mirko Bronzi, Assya Trofimov, Brennan Nichyporuk, Justin Szeto, Nazanin Mohammadi Sepahvand, Edward Raff, Kanika Madan, Vikram Voleti, Samira Ebrahimi Kahou, Vincent Michalski, Tal Arbel, Chris Pal, Gael Varoquaux, and Pascal Vincent. Accounting for variance in machine learning benchmarks. In A. Smola, A. Dimakis, and ...

  14. [14]

    URLhttps://proceedings.mlsys.org/paper_files/paper/2021/file/ 0184b0cd3cfb185989f858a1d9f5c1eb-Paper.pdf

  15. [15]

    Cecilia Summers and Michael J. Dinneen. Nondeterminism and instability in neural network optimization, 2021. URLhttps://arxiv.org/abs/2103.04514

  16. [16]

    David J. C. MacKay.Bayesian Interpolation, pages 39–66. Springer Netherlands, Dordrecht,

  17. [17]

    doi: 10.1007/978-94-017-2219-3_3

    ISBN 978-94-017-2219-3. doi: 10.1007/978-94-017-2219-3_3. URLhttps://doi.or g/10.1007/978-94-017-2219-3_3

  18. [18]

    Neal.Bayesian Learning for Neural Networks

    Radford M. Neal.Bayesian Learning for Neural Networks. Springer, 1996

  19. [19]

    URL https: //doi.org/10.1038/s41586-024-07566-y

    Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nicolas Papernot, Ross Anderson, and Yarin Gal. AI models collapse when trained on recursively generated data.Nature, 631(8022):755– 759, 2024. doi: 10.1038/s41586-024-07566-y

  20. [20]

    Lipton, Michael Tschannen, Laurent Itti, and Anima Anand- kumar

    Tommaso Furlanello, Zachary C. Lipton, Michael Tschannen, Laurent Itti, and Anima Anand- kumar. Born-again neural networks. InInternational Conference on Machine Learning (ICML),

  21. [21]

    URLhttps://arxiv.org/abs/1805.04770

  22. [22]

    Bartlett

    Hossein Mobahi, Mehrdad Farajtabar, and Peter L. Bartlett. Self-distillation amplifies reg- ularization in Hilbert space. InAdvances in Neural Information Processing Systems (NeurIPS),

  23. [23]

    URLhttps://arxiv.org/abs/2002.05715

  24. [24]

    Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman. STaR: Bootstrapping reasoning with reasoning. InAdvances in Neural Information Processing Systems (NeurIPS), 2022. URL https://arxiv.org/abs/2203.14465

  25. [25]

    Reinforced self-training (rest) for language modeling, 2023

    Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, Wolfgang Macherey, Arnaud Doucet, Orhan Firat, and Nando de Freitas. Reinforced self-training (rest) for language modeling, 2023. URLhttps://arxiv.org/abs/2308.08998

  26. [26]

    Textbooks are all you need, 2023

    Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, and Yuanzhi Li. Textbooks are all you need, 2023. URL https://arxiv....

  27. [27]

    The FineWeb datasets: Decanting the web for the finest text data at scale

    Guilherme Penedo, Hynek Kydlíˇ cek, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. The FineWeb datasets: Decanting the web for the finest text data at scale. InThe Thirty-Eighth Conference on Neural Information 14 Processing Systems Datasets and Benchmarks Track, 2024. URLhttps://openreview.net/f orum...

  28. [28]

    Think you have solved question answering? try ARC, the AI2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try ARC, the AI2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018

  29. [29]

    PIQA: Reasoning about physical commonsense in natural language

    Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. PIQA: Reasoning about physical commonsense in natural language. InProceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7432–7439, 2020. doi: 10.1609/aaai.v34i05.6239

  30. [30]

    and Gardner, Matt

    Johannes Welbl, Nelson F. Liu, and Matt Gardner. Crowdsourcing multiple choice science questions. InProceedings of the 3rd Workshop on Noisy User-generated Text, pages 94–106, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi: 10.18653/v1/W17-4413. URLhttps://aclanthology.org/W17-4413/

  31. [31]

    Neural network ensembles, cross validation, and active learning

    Anders Krogh and Jesper Vedelsby. Neural network ensembles, cross validation, and active learning. InAdvances in Neural Information Processing Systems (NeurIPS), volume 7, pages 231–238, 1995

  32. [32]

    SGDR: Stochastic gradient descent with warm restarts

    Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations (ICLR), 2017

  33. [33]

    Essentially no barriers in neural network energy landscape

    Felix Draxler, Kambis Veschgini, Manfred Salmhofer, and Fred Hamprecht. Essentially no barriers in neural network energy landscape. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 ofProceedings of Machine Learning Research, pages 1309–1318. PMLR, 10–15 Jul 2018. URLhttps://procee ...

  34. [34]

    Averaging weights leads to wider optima and better generalization

    Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. InUncertainty in Artificial Intelligence (UAI), 2018

  35. [35]

    Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, and Ludwig Schmidt

    Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo- Lopes, Ari S. Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, and Ludwig Schmidt. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time, 2022. URLhttps://arxiv.org/abs/2203 .05482

  36. [36]

    Leslie N. Smith. General cyclical training of neural networks, 2022. URLhttps://arxiv. org/abs/2202.08835

  37. [37]

    Distilling the knowledge in a neural network,

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network,

  38. [38]

    URLhttps://arxiv.org/abs/1503.02531

  39. [39]

    Chenglin Yang, Lingxi Xie, Chi Su, and Alan L. Yuille. Snapshot distillation: Teacher-student optimization in one generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019

  40. [40]

    Efficient knowledge distillation from model checkpoints

    Chaofei Wang, Qisen Yang, Rui Huang, Shiji Song, and Gao Huang. Efficient knowledge distillation from model checkpoints. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,Advances in Neural Information Processing Systems, volume 35, 15 pages 607–619. Curran Associates, Inc., 2022. URLhttps://proceedings.neurips.cc /paper_files/...

  41. [41]

    David H. Wolpert. Stacked generalization.Neural Networks, 5(2):241–259, 1992. ISSN 0893-

  42. [42]

    Stacked generalization

    doi: https://doi.org/10.1016/S0893-6080(05)80023-1. URLhttps://www.scienc edirect.com/science/article/pii/S0893608005800231. 16 A Overview of hyper-epoch pretraining floor peak Cycle 1Cycle 2Cycle 3Cycle 4Cycle 5Cycle C Training progress LR multiplier Multiple checkpoints harvestedfrom one training run using cyclic LR Standard cosine LRPer-cycle cyclic LR...