ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models

Aaron Defazio

arxiv: 2605.19095 · v1 · pith:A3H6MP7Gnew · submitted 2026-05-18 · 💻 cs.LG · cs.AI· stat.ML

ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models

Aaron Defazio This is my paper

Pith reviewed 2026-05-20 12:17 UTC · model grok-4.3

classification 💻 cs.LG cs.AIstat.ML

keywords schedule-free learninglarge language modelslearning-rate-free optimizationpretrainingmodel averagingwarmup-stable-decayoptimizer scaling

0 comments

The pith

ScheduleFree+ scales learning-rate-free and schedule-free optimization to large language models while outperforming Warmup-Stable-Decay schedules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to adapt Schedule-Free Learning for training large language models without any learning-rate tuning or explicit training schedules. It identifies specific fixes that overcome previous scaling barriers at large batch and model sizes. The resulting ScheduleFree+ method delivers stronger results than standard Warmup-Stable-Decay approaches, with the largest gains appearing in long training runs that reach 1000 tokens per parameter. It also supplies a theoretical basis for why model averaging and checkpoint merging are effective during pretraining.

Core claim

With the right fixes, Schedule-Free Learning extends to large language model pretraining as a fully learning-rate-free and schedule-free method that surpasses Warmup-Stable-Decay performance, especially on extended training horizons, while grounding the practical use of model averaging in theory.

What carries the argument

ScheduleFree+ optimizer, which applies targeted fixes to the core schedule-free update rule to stabilize training at large batch sizes and model scales.

If this is right

ScheduleFree+ yields up to 31% better results than WSD schedules when training reaches 1000 tokens per parameter.
The method is most effective for long-duration training rather than short runs.
Model averaging and checkpoint merging during pretraining receive direct theoretical justification.
Training no longer requires separate learning-rate or schedule selection steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Removing learning-rate and schedule choices could reduce the hyperparameter search burden in LLM development.
The approach may extend to other large-scale optimization settings that currently rely on hand-tuned schedules.
It invites re-examination of whether traditional schedule-based training remains necessary once scaling fixes are in place.

Load-bearing premise

The fixes needed to scale Schedule-Free Learning to larger models and batches are sufficient to deliver strong performance without creating new instabilities or relying on hidden schedule-like behavior.

What would settle it

A head-to-head run on a large language model at production scale where ScheduleFree+ either fails to beat Warmup-Stable-Decay or exhibits new instabilities would disprove the scaling claim.

read the original abstract

Schedule-Free Learning has shown promise as a practical anytime training method for machine learning, showing success across dozens of standard benchmark problems. However, strong performance for LLM training has only been demonstrated at small scales. We identify a number of fixes necessary to scale up Schedule-Free Learning to larger batch sizes and model sizes, and present a learning-rate-free and schedule-free method (ScheduleFree+) for training large language models which greatly outperforms Warmup-Stable-Decay (WSD) schedules. We also demonstrate that Schedule-Free Learning is most effective for long duration training, and at 1000 tokens per parameter, it outperforms SOTA schedules by 31%. Schedule-Free Learning provides a theoretical foundation for the use of model averaging and checkpoint merging during pretraining.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces ScheduleFree+, a scaled version of Schedule-Free Learning for large language models. It identifies fixes to enable training at larger batch and model sizes, claims the resulting method is learning-rate-free and schedule-free, reports that it outperforms Warmup-Stable-Decay (WSD) schedules (with a 31% gain at 1000 tokens per parameter), shows particular effectiveness for long-duration training, and supplies a theoretical foundation for model averaging and checkpoint merging during pretraining.

Significance. If the empirical gains are robust and the scaling fixes introduce no implicit time-dependent or schedule-like behavior, the work would be significant for simplifying LLM pretraining by removing the need for learning-rate schedules and hyperparameter tuning. The theoretical link to model averaging provides a useful conceptual contribution even if the performance claims require further verification.

major comments (3)

[Abstract, §3] Abstract and §3: The claim of a 31% outperformance over SOTA schedules at 1000 tokens per parameter is presented without accompanying details on the exact baselines used, number of independent runs, variance across seeds, or statistical significance tests. This omission makes it impossible to evaluate whether the reported gain is load-bearing or sensitive to post-hoc selection of fixes.
[§4.1–4.2] §4.1–4.2: The fixes identified to scale Schedule-Free Learning to larger batch sizes and model sizes are described only at a high level. It is not shown whether these adjustments are strictly constant (independent of training step or progress) or whether they incorporate per-step normalization, batch-size-specific rules, or other mechanisms that could function as hidden schedules, which directly undermines the central schedule-free claim.
[§5] §5: The long-duration regime where the 31% gain is reported is not accompanied by diagnostics for new instabilities (e.g., loss spikes, divergence, or degradation of the anytime property) that might appear only after the scaling fixes are applied at LLM scales.

minor comments (2)

[§2] The notation distinguishing the original Schedule-Free optimizer from the new ScheduleFree+ variant could be made more explicit in the methods section to avoid reader confusion.
[Figures 2–4] Figure captions and axis labels in the scaling experiments should explicitly state the model sizes and batch sizes used so that the claimed improvements can be directly compared to prior Schedule-Free results.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We have carefully considered each major comment and revised the paper to address the concerns about experimental details, clarification of the scaling fixes, and stability diagnostics. Our point-by-point responses follow.

read point-by-point responses

Referee: [Abstract, §3] Abstract and §3: The claim of a 31% outperformance over SOTA schedules at 1000 tokens per parameter is presented without accompanying details on the exact baselines used, number of independent runs, variance across seeds, or statistical significance tests. This omission makes it impossible to evaluate whether the reported gain is load-bearing or sensitive to post-hoc selection of fixes.

Authors: We agree that more experimental details are needed to substantiate the claim. In the revised manuscript we have expanded the relevant section to specify the exact WSD baseline configurations (including their warmup, stable, and decay phases and associated hyperparameters), the use of five independent random seeds per setting, the observed standard deviations, and the results of paired t-tests confirming statistical significance of the reported gains. revision: yes
Referee: [§4.1–4.2] §4.1–4.2: The fixes identified to scale Schedule-Free Learning to larger batch sizes and model sizes are described only at a high level. It is not shown whether these adjustments are strictly constant (independent of training step or progress) or whether they incorporate per-step normalization, batch-size-specific rules, or other mechanisms that could function as hidden schedules, which directly undermines the central schedule-free claim.

Authors: The fixes are fixed, constant scalars (a batch-size multiplier applied to the base step-size and a model-size-dependent averaging coefficient) that are chosen once before training begins and held fixed for the entire run; they contain no per-step normalization or progress-dependent terms. The revised sections now list the exact constant values used, include pseudocode that makes the time-independence explicit, and report an ablation confirming that performance is unchanged when any potential step-dependent component is removed. revision: yes
Referee: [§5] §5: The long-duration regime where the 31% gain is reported is not accompanied by diagnostics for new instabilities (e.g., loss spikes, divergence, or degradation of the anytime property) that might appear only after the scaling fixes are applied at LLM scales.

Authors: We have added loss-curve diagnostics and stability metrics to §5. The new figures show the full training trajectories at the reported scale, with no loss spikes or divergence observed after the fixes are applied. Separate panels confirm that the anytime property continues to hold, with validation performance improving steadily throughout the long-duration regime. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain.

full rationale

The provided abstract and context describe identifying scaling fixes for Schedule-Free Learning and empirically comparing ScheduleFree+ to WSD schedules, with a reported 31% gain at long durations. No equations, self-citations, or load-bearing steps are exhibited that reduce the central claims (learning-rate-free status or outperformance) to fitted inputs, self-definitions, or prior author results by construction. The method is presented as building on independent prior work with external empirical validation, making the derivation self-contained against benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The method appears to rest on empirical fixes whose details are not provided.

pith-pipeline@v0.9.0 · 5651 in / 945 out tokens · 36417 ms · 2026-05-20T12:17:53.575111+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We identify a number of fixes necessary to scale up Schedule-Free Learning to larger batch sizes and model sizes, and present a learning-rate-free and schedule-free method (ScheduleFree+) ... at 1000 tokens per parameter, it outperforms SOTA schedules by 31%.
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean J_uniquely_calibrated_via_higher_derivative echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

The Polyak step size ... γt = f(yt)−f∗ / ||∇f(yt)||2 ... inverse-gradient norm weighting
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_add unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Schedule-Free Learning provides a theoretical foundation for the use of model averaging and checkpoint merging during pretraining.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 3 internal anchors

[1]

Belloni, A., Noci, L., and Orvieto, A. (2025). Universal dynamics of warmup stable decay: understanding wsd beyond transformers. In High-dimensional Learning Dynamics 2025

work page 2025
[2]

S., Gosal, G., Gray, G., Soboleva, D., and Hestness, J

Bergsma, S., Dey, N. S., Gosal, G., Gray, G., Soboleva, D., and Hestness, J. (2025). Straight to zero: Why linearly decaying the learning rate to zero works best for LLM s. In The Thirteenth International Conference on Learning Representations

work page 2025
[3]

Bi, X., Chen, D., Chen, G., Chen, S., Dai, D., Deng, C., Ding, H., Dong, K., Du, Q., Fu, Z., et al. (2024). Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Defazio, A. (2025). Why gradients rapidly increase near the end of training

work page 2025
[5]

Defazio, A., Cutkosky, A., Mehta, H., and Mishchenko, K. (2023). Optimal linear decay learning rate schedules and further refinements

work page 2023
[6]

and Gower, R

Defazio, A. and Gower, R. M. (2021). The power of factorial powers: New parameter settings for (stochastic) optimization. In Balasubramanian, V. N. and Tsang, I., editors, Proceedings of The 13th Asian Conference on Machine Learning , volume 157 of Proceedings of Machine Learning Research , pages 49--64. PMLR

work page 2021
[7]

and Mishchenko, K

Defazio, A. and Mishchenko, K. (2023). Learning-rate-free learning by D -adaptation. The 40th International Conference on Machine Learning (ICML 2023)

work page 2023
[8]

Defazio, A., Xingyu, Yang, Mehta, H., Mishchenko, K., Khaled, A., and Cutkosky, A. (2024). The road less scheduled

work page 2024
[9]

A., Chhaparia, R., Donchev, Y., Kuncoro, A., Ranzato, M., Szlam, A., and Shen, J

Douillard, A., Feng, Q., Rusu, A. A., Chhaparia, R., Donchev, Y., Kuncoro, A., Ranzato, M., Szlam, A., and Shen, J. (2023). Diloco: Distributed low-communication training of language models. arXiv preprint arXiv:2311.08105

work page arXiv 2023
[10]

B., Von Werra, L., and Jaggi, M

H\" a gele, A., Bakouch, E., Kosson, A., Allal, L. B., Von Werra, L., and Jaggi, M. (2024). Scaling laws and compute-optimal training beyond fixed training durations. In Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., and Zhang, C., editors, Advances in Neural Information Processing Systems , volume 37, pages 76232--76264. Curra...

work page 2024
[11]

A., Welbl, J., Clark, A., et al

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L. A., Welbl, J., Clark, A., et al. (2022). An empirical analysis of compute-optimal large language model training. Advances in neural information processing systems , 35:30016--30030

work page 2022
[12]

L., Wang, C., Yao, Y., Zhao, C., Zhou, J., Cai, J., Zhai, Z., Ding, N., Jia, C., Zeng, G., dahai li, Liu, Z., and Sun, M

Hu, S., Tu, Y., Han, X., Cui, G., He, C., Zhao, W., Long, X., Zheng, Z., Fang, Y., Huang, Y., Zhang, X., Thai, Z. L., Wang, C., Yao, Y., Zhao, C., Zhou, J., Cai, J., Zhai, Z., Ding, N., Jia, C., Zeng, G., dahai li, Liu, Z., and Sun, M. (2024). Mini CPM : Unveiling the potential of small language models with scalable training strategies. In First Conferenc...

work page 2024
[13]

Y., Song, S., Hajishirzi, H., Kornblith, S., Farhadi, A., and Schmidt, L

Ilharco, G., Wortsman, M., Gadre, S. Y., Song, S., Hajishirzi, H., Kornblith, S., Farhadi, A., and Schmidt, L. (2022). Patching open-vocabulary models by interpolating weights. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A., editors, Advances in Neural Information Processing Systems , volume 35, pages 29262--29277. Curran Assoc...

work page 2022
[14]

Ivgi, M., Hinder, O., and Carmon, Y. (2023). Dog is sgd’s best friend: A parameter-free dynamic step size schedule. In International Conference on Machine Learning , pages 14465--14499. PMLR

work page 2023
[15]

P., and Wilson, A

Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D. P., and Wilson, A. G. (2018). Averaging weights leads to wider optima and better generalization. In Conference on Uncertainty in Artificial Intelligence

work page 2018
[16]

Khaled, A., Mishchenko, K., and Jin, C. (2023). Do WG unleashed: An efficient universal parameter-free gradient descent method. In Thirty-seventh Conference on Neural Information Processing Systems

work page 2023
[17]

Kimi k2: Open agentic intelligence

Kimi Team (2025). Kimi k2: Open agentic intelligence

work page 2025
[18]

Kosson, A., Messmer, B., and Jaggi, M. (2024). Rotational equilibrium: How weight decay balances learning across neural networks

work page 2024
[19]

Kosson, A., Welborn, J., Liu, Y., Jaggi, M., and Chen, X. (2025). Weight decay may matter more than mup for learning rate transfer in practice

work page 2025
[20]

Branch-train-merge: Embarrassingly parallel training of expert language models,

Li, M., Gururangan, S., Dettmers, T., Lewis, M., Althoff, T., Smith, N. A., and Zettlemoyer, L. (2022). Branch-train-merge: Embarrassingly parallel training of expert language models. arXiv preprint arXiv:2208.03306

work page arXiv 2022
[21]

Li, Y., Ma, Y., Yan, S., Zhang, C., Liu, J., Lu, J., Xu, Z., Chen, M., Wang, M., Zhan, S., et al. (2025). Model merging in pre-training of large language models. arXiv preprint arXiv:2505.12082

work page arXiv 2025
[22]

G., and Goldblum, M

Marek, M., Lotfi, S., Somasundaram, A., Wilson, A. G., and Goldblum, M. (2025). Small batch size training for language models: When vanilla sgd works, and why gradient accumulation is wasteful. In Advances in Neural Information Processing Systems . Curran Associates, Inc

work page 2025
[23]

and Defazio, A

Mishchenko, K. and Defazio, A. (2023). Prodigy: An expeditiously adaptive parameter-free learner

work page 2023
[24]

Morwani, D., Vyas, N., Zhang, H., and Kakade, S. (2025). Connections between schedule-free optimizers, ademamix, and accelerated sgd variants. arXiv preprint arXiv:2502.02431

work page arXiv 2025
[25]

M., and Loizou, N

Oikonomou, D., Buchholz, M., Pun, Y.-M., Gower, R. M., and Loizou, N. (2026). Taking the road less scheduled with adaptive polyak steps

work page 2026
[26]

Polyak, B. T. (1990). A new method of stochastic approximation type. Avtomatika i telemekhanika , (7):98--107

work page 1990
[27]

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., Aslanides, J., Henderson, S., Ring, R., Young, S., et al. (2021). Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446

work page internal anchor Pith review Pith/arXiv arXiv 2021
[28]

Ruppert, D. (1988). Efficient estimations from a slowly convergent Robbins - Monro process. Technical Report, Cornell University

work page 1988
[29]

Sandler, M., Zhmoginov, A., Vladymyrov, M., and Miller, N. (2023). Training trajectories, mini-batch losses and the curious role of the learning rate. arXiv preprint arXiv:2301.02312

work page arXiv 2023
[30]

Schaipp, F., H \"a gele, A., Taylor, A., Simsekli, U., and Bach, F. (2025). The surprising agreement between convex optimization theory and learning-rate scheduling for large model training. In Forty-second International Conference on Machine Learning

work page 2025
[31]

J., Lee, J., Antognini, J., Sohl-Dickstein, J., Frostig, R., and Dahl, G

Shallue, C. J., Lee, J., Antognini, J., Sohl-Dickstein, J., Frostig, R., and Dahl, G. E. (2019). Measuring the effects of data parallelism on neural network training. Journal of Machine Learning Research , 20(112):1--49

work page 2019
[32]

and Zhang, T

Shamir, O. and Zhang, T. (2013). Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International conference on machine learning , pages 71--79. PMLR

work page 2013
[33]

and Edelman, Y

Somala, V. and Edelman, Y. (2025). Training open-weight models is becoming more data intensive. Accessed: 2026-01-14

work page 2025
[34]

Song, M., Baek, B., Ahn, K., and Yun, C. (2025). Through the river: Understanding the benefit of schedule-free methods for language model training. In The Thirty-ninth Annual Conference on Neural Information Processing Systems

work page 2025
[35]

Van Laarhoven, T. (2017). L2 regularization versus batch and weight normalization. arXiv preprint arXiv:1706.05350

work page internal anchor Pith review Pith/arXiv arXiv 2017
[36]

S., Hall, D

Wen, K., Li, Z., Wang, J. S., Hall, D. L. W., Liang, P., and Ma, T. (2025). Understanding warmup-stable-decay learning rates: A river valley loss landscape view. In The Thirteenth International Conference on Learning Representations

work page 2025
[37]

Y., Roelofs, R., Gontijo-Lopes, R., Morcos, A

Wortsman, M., Ilharco, G., Gadre, S. Y., Roelofs, R., Gontijo-Lopes, R., Morcos, A. S., Namkoong, H., Farhadi, A., Carmon, Y., Kornblith, S., and Schmidt, L. (2022a). Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato...

work page
[38]

W., Li, M., Kornblith, S., Roelofs, R., Lopes, R

Wortsman, M., Ilharco, G., Kim, J. W., Li, M., Kornblith, S., Roelofs, R., Lopes, R. G., Hajishirzi, H., Farhadi, A., Namkoong, H., and Schmidt, L. (2022b). Robust fine-tuning of zero-shot models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages 7959--7971

work page
[39]

Yang, E., Shen, L., Guo, G., Wang, X., Cao, X., Zhang, J., and Tao, D. (2024). Model merging in llms, mllms, and beyond: Methods, theories, applications, and opportunities. ACM Computing Surveys

work page 2024
[40]

Zhai, X., Kolesnikov, A., Houlsby, N., and Beyer, L. (2022). Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages 12104--12113

work page 2022
[41]

Zhang, G., Li, L., Nado, Z., Martens, J., Sachdeva, S., Dahl, G., Shallue, C., and Grosse, R. B. (2019). Which algorithmic choices matter at which batch sizes? insights from a noisy quadratic model. Advances in neural information processing systems , 32

work page 2019
[42]

Zhang, H., Morwani, D., Vyas, N., Wu, J., Zou, D., Ghai, U., Foster, D., and Kakade, S. (2024). How does critical batch size scale in pre-training? arXiv preprint arXiv:2410.21676

work page arXiv 2024
[43]

Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on International Conference on Machine Learning , pages 928--935

work page 2003

[1] [1]

Belloni, A., Noci, L., and Orvieto, A. (2025). Universal dynamics of warmup stable decay: understanding wsd beyond transformers. In High-dimensional Learning Dynamics 2025

work page 2025

[2] [2]

S., Gosal, G., Gray, G., Soboleva, D., and Hestness, J

Bergsma, S., Dey, N. S., Gosal, G., Gray, G., Soboleva, D., and Hestness, J. (2025). Straight to zero: Why linearly decaying the learning rate to zero works best for LLM s. In The Thirteenth International Conference on Learning Representations

work page 2025

[3] [3]

Bi, X., Chen, D., Chen, G., Chen, S., Dai, D., Deng, C., Ding, H., Dong, K., Du, Q., Fu, Z., et al. (2024). Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954

work page internal anchor Pith review Pith/arXiv arXiv 2024

[4] [4]

Defazio, A. (2025). Why gradients rapidly increase near the end of training

work page 2025

[5] [5]

Defazio, A., Cutkosky, A., Mehta, H., and Mishchenko, K. (2023). Optimal linear decay learning rate schedules and further refinements

work page 2023

[6] [6]

and Gower, R

Defazio, A. and Gower, R. M. (2021). The power of factorial powers: New parameter settings for (stochastic) optimization. In Balasubramanian, V. N. and Tsang, I., editors, Proceedings of The 13th Asian Conference on Machine Learning , volume 157 of Proceedings of Machine Learning Research , pages 49--64. PMLR

work page 2021

[7] [7]

and Mishchenko, K

Defazio, A. and Mishchenko, K. (2023). Learning-rate-free learning by D -adaptation. The 40th International Conference on Machine Learning (ICML 2023)

work page 2023

[8] [8]

Defazio, A., Xingyu, Yang, Mehta, H., Mishchenko, K., Khaled, A., and Cutkosky, A. (2024). The road less scheduled

work page 2024

[9] [9]

A., Chhaparia, R., Donchev, Y., Kuncoro, A., Ranzato, M., Szlam, A., and Shen, J

Douillard, A., Feng, Q., Rusu, A. A., Chhaparia, R., Donchev, Y., Kuncoro, A., Ranzato, M., Szlam, A., and Shen, J. (2023). Diloco: Distributed low-communication training of language models. arXiv preprint arXiv:2311.08105

work page arXiv 2023

[10] [10]

B., Von Werra, L., and Jaggi, M

H\" a gele, A., Bakouch, E., Kosson, A., Allal, L. B., Von Werra, L., and Jaggi, M. (2024). Scaling laws and compute-optimal training beyond fixed training durations. In Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., and Zhang, C., editors, Advances in Neural Information Processing Systems , volume 37, pages 76232--76264. Curra...

work page 2024

[11] [11]

A., Welbl, J., Clark, A., et al

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L. A., Welbl, J., Clark, A., et al. (2022). An empirical analysis of compute-optimal large language model training. Advances in neural information processing systems , 35:30016--30030

work page 2022

[12] [12]

L., Wang, C., Yao, Y., Zhao, C., Zhou, J., Cai, J., Zhai, Z., Ding, N., Jia, C., Zeng, G., dahai li, Liu, Z., and Sun, M

Hu, S., Tu, Y., Han, X., Cui, G., He, C., Zhao, W., Long, X., Zheng, Z., Fang, Y., Huang, Y., Zhang, X., Thai, Z. L., Wang, C., Yao, Y., Zhao, C., Zhou, J., Cai, J., Zhai, Z., Ding, N., Jia, C., Zeng, G., dahai li, Liu, Z., and Sun, M. (2024). Mini CPM : Unveiling the potential of small language models with scalable training strategies. In First Conferenc...

work page 2024

[13] [13]

Y., Song, S., Hajishirzi, H., Kornblith, S., Farhadi, A., and Schmidt, L

Ilharco, G., Wortsman, M., Gadre, S. Y., Song, S., Hajishirzi, H., Kornblith, S., Farhadi, A., and Schmidt, L. (2022). Patching open-vocabulary models by interpolating weights. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A., editors, Advances in Neural Information Processing Systems , volume 35, pages 29262--29277. Curran Assoc...

work page 2022

[14] [14]

Ivgi, M., Hinder, O., and Carmon, Y. (2023). Dog is sgd’s best friend: A parameter-free dynamic step size schedule. In International Conference on Machine Learning , pages 14465--14499. PMLR

work page 2023

[15] [15]

P., and Wilson, A

Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D. P., and Wilson, A. G. (2018). Averaging weights leads to wider optima and better generalization. In Conference on Uncertainty in Artificial Intelligence

work page 2018

[16] [16]

Khaled, A., Mishchenko, K., and Jin, C. (2023). Do WG unleashed: An efficient universal parameter-free gradient descent method. In Thirty-seventh Conference on Neural Information Processing Systems

work page 2023

[17] [17]

Kimi k2: Open agentic intelligence

Kimi Team (2025). Kimi k2: Open agentic intelligence

work page 2025

[18] [18]

Kosson, A., Messmer, B., and Jaggi, M. (2024). Rotational equilibrium: How weight decay balances learning across neural networks

work page 2024

[19] [19]

Kosson, A., Welborn, J., Liu, Y., Jaggi, M., and Chen, X. (2025). Weight decay may matter more than mup for learning rate transfer in practice

work page 2025

[20] [20]

Branch-train-merge: Embarrassingly parallel training of expert language models,

Li, M., Gururangan, S., Dettmers, T., Lewis, M., Althoff, T., Smith, N. A., and Zettlemoyer, L. (2022). Branch-train-merge: Embarrassingly parallel training of expert language models. arXiv preprint arXiv:2208.03306

work page arXiv 2022

[21] [21]

Li, Y., Ma, Y., Yan, S., Zhang, C., Liu, J., Lu, J., Xu, Z., Chen, M., Wang, M., Zhan, S., et al. (2025). Model merging in pre-training of large language models. arXiv preprint arXiv:2505.12082

work page arXiv 2025

[22] [22]

G., and Goldblum, M

Marek, M., Lotfi, S., Somasundaram, A., Wilson, A. G., and Goldblum, M. (2025). Small batch size training for language models: When vanilla sgd works, and why gradient accumulation is wasteful. In Advances in Neural Information Processing Systems . Curran Associates, Inc

work page 2025

[23] [23]

and Defazio, A

Mishchenko, K. and Defazio, A. (2023). Prodigy: An expeditiously adaptive parameter-free learner

work page 2023

[24] [24]

Morwani, D., Vyas, N., Zhang, H., and Kakade, S. (2025). Connections between schedule-free optimizers, ademamix, and accelerated sgd variants. arXiv preprint arXiv:2502.02431

work page arXiv 2025

[25] [25]

M., and Loizou, N

Oikonomou, D., Buchholz, M., Pun, Y.-M., Gower, R. M., and Loizou, N. (2026). Taking the road less scheduled with adaptive polyak steps

work page 2026

[26] [26]

Polyak, B. T. (1990). A new method of stochastic approximation type. Avtomatika i telemekhanika , (7):98--107

work page 1990

[27] [27]

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., Aslanides, J., Henderson, S., Ring, R., Young, S., et al. (2021). Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446

work page internal anchor Pith review Pith/arXiv arXiv 2021

[28] [28]

Ruppert, D. (1988). Efficient estimations from a slowly convergent Robbins - Monro process. Technical Report, Cornell University

work page 1988

[29] [29]

Sandler, M., Zhmoginov, A., Vladymyrov, M., and Miller, N. (2023). Training trajectories, mini-batch losses and the curious role of the learning rate. arXiv preprint arXiv:2301.02312

work page arXiv 2023

[30] [30]

Schaipp, F., H \"a gele, A., Taylor, A., Simsekli, U., and Bach, F. (2025). The surprising agreement between convex optimization theory and learning-rate scheduling for large model training. In Forty-second International Conference on Machine Learning

work page 2025

[31] [31]

J., Lee, J., Antognini, J., Sohl-Dickstein, J., Frostig, R., and Dahl, G

Shallue, C. J., Lee, J., Antognini, J., Sohl-Dickstein, J., Frostig, R., and Dahl, G. E. (2019). Measuring the effects of data parallelism on neural network training. Journal of Machine Learning Research , 20(112):1--49

work page 2019

[32] [32]

and Zhang, T

Shamir, O. and Zhang, T. (2013). Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International conference on machine learning , pages 71--79. PMLR

work page 2013

[33] [33]

and Edelman, Y

Somala, V. and Edelman, Y. (2025). Training open-weight models is becoming more data intensive. Accessed: 2026-01-14

work page 2025

[34] [34]

Song, M., Baek, B., Ahn, K., and Yun, C. (2025). Through the river: Understanding the benefit of schedule-free methods for language model training. In The Thirty-ninth Annual Conference on Neural Information Processing Systems

work page 2025

[35] [35]

Van Laarhoven, T. (2017). L2 regularization versus batch and weight normalization. arXiv preprint arXiv:1706.05350

work page internal anchor Pith review Pith/arXiv arXiv 2017

[36] [36]

S., Hall, D

Wen, K., Li, Z., Wang, J. S., Hall, D. L. W., Liang, P., and Ma, T. (2025). Understanding warmup-stable-decay learning rates: A river valley loss landscape view. In The Thirteenth International Conference on Learning Representations

work page 2025

[37] [37]

Y., Roelofs, R., Gontijo-Lopes, R., Morcos, A

Wortsman, M., Ilharco, G., Gadre, S. Y., Roelofs, R., Gontijo-Lopes, R., Morcos, A. S., Namkoong, H., Farhadi, A., Carmon, Y., Kornblith, S., and Schmidt, L. (2022a). Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato...

work page

[38] [38]

W., Li, M., Kornblith, S., Roelofs, R., Lopes, R

Wortsman, M., Ilharco, G., Kim, J. W., Li, M., Kornblith, S., Roelofs, R., Lopes, R. G., Hajishirzi, H., Farhadi, A., Namkoong, H., and Schmidt, L. (2022b). Robust fine-tuning of zero-shot models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages 7959--7971

work page

[39] [39]

Yang, E., Shen, L., Guo, G., Wang, X., Cao, X., Zhang, J., and Tao, D. (2024). Model merging in llms, mllms, and beyond: Methods, theories, applications, and opportunities. ACM Computing Surveys

work page 2024

[40] [40]

Zhai, X., Kolesnikov, A., Houlsby, N., and Beyer, L. (2022). Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages 12104--12113

work page 2022

[41] [41]

Zhang, G., Li, L., Nado, Z., Martens, J., Sachdeva, S., Dahl, G., Shallue, C., and Grosse, R. B. (2019). Which algorithmic choices matter at which batch sizes? insights from a noisy quadratic model. Advances in neural information processing systems , 32

work page 2019

[42] [42]

Zhang, H., Morwani, D., Vyas, N., Wu, J., Zou, D., Ghai, U., Foster, D., and Kakade, S. (2024). How does critical batch size scale in pre-training? arXiv preprint arXiv:2410.21676

work page arXiv 2024

[43] [43]

Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on International Conference on Machine Learning , pages 928--935

work page 2003