ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models
Pith reviewed 2026-05-20 12:17 UTC · model grok-4.3
The pith
ScheduleFree+ scales learning-rate-free and schedule-free optimization to large language models while outperforming Warmup-Stable-Decay schedules.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
With the right fixes, Schedule-Free Learning extends to large language model pretraining as a fully learning-rate-free and schedule-free method that surpasses Warmup-Stable-Decay performance, especially on extended training horizons, while grounding the practical use of model averaging in theory.
What carries the argument
ScheduleFree+ optimizer, which applies targeted fixes to the core schedule-free update rule to stabilize training at large batch sizes and model scales.
If this is right
- ScheduleFree+ yields up to 31% better results than WSD schedules when training reaches 1000 tokens per parameter.
- The method is most effective for long-duration training rather than short runs.
- Model averaging and checkpoint merging during pretraining receive direct theoretical justification.
- Training no longer requires separate learning-rate or schedule selection steps.
Where Pith is reading between the lines
- Removing learning-rate and schedule choices could reduce the hyperparameter search burden in LLM development.
- The approach may extend to other large-scale optimization settings that currently rely on hand-tuned schedules.
- It invites re-examination of whether traditional schedule-based training remains necessary once scaling fixes are in place.
Load-bearing premise
The fixes needed to scale Schedule-Free Learning to larger models and batches are sufficient to deliver strong performance without creating new instabilities or relying on hidden schedule-like behavior.
What would settle it
A head-to-head run on a large language model at production scale where ScheduleFree+ either fails to beat Warmup-Stable-Decay or exhibits new instabilities would disprove the scaling claim.
read the original abstract
Schedule-Free Learning has shown promise as a practical anytime training method for machine learning, showing success across dozens of standard benchmark problems. However, strong performance for LLM training has only been demonstrated at small scales. We identify a number of fixes necessary to scale up Schedule-Free Learning to larger batch sizes and model sizes, and present a learning-rate-free and schedule-free method (ScheduleFree+) for training large language models which greatly outperforms Warmup-Stable-Decay (WSD) schedules. We also demonstrate that Schedule-Free Learning is most effective for long duration training, and at 1000 tokens per parameter, it outperforms SOTA schedules by 31%. Schedule-Free Learning provides a theoretical foundation for the use of model averaging and checkpoint merging during pretraining.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ScheduleFree+, a scaled version of Schedule-Free Learning for large language models. It identifies fixes to enable training at larger batch and model sizes, claims the resulting method is learning-rate-free and schedule-free, reports that it outperforms Warmup-Stable-Decay (WSD) schedules (with a 31% gain at 1000 tokens per parameter), shows particular effectiveness for long-duration training, and supplies a theoretical foundation for model averaging and checkpoint merging during pretraining.
Significance. If the empirical gains are robust and the scaling fixes introduce no implicit time-dependent or schedule-like behavior, the work would be significant for simplifying LLM pretraining by removing the need for learning-rate schedules and hyperparameter tuning. The theoretical link to model averaging provides a useful conceptual contribution even if the performance claims require further verification.
major comments (3)
- [Abstract, §3] Abstract and §3: The claim of a 31% outperformance over SOTA schedules at 1000 tokens per parameter is presented without accompanying details on the exact baselines used, number of independent runs, variance across seeds, or statistical significance tests. This omission makes it impossible to evaluate whether the reported gain is load-bearing or sensitive to post-hoc selection of fixes.
- [§4.1–4.2] §4.1–4.2: The fixes identified to scale Schedule-Free Learning to larger batch sizes and model sizes are described only at a high level. It is not shown whether these adjustments are strictly constant (independent of training step or progress) or whether they incorporate per-step normalization, batch-size-specific rules, or other mechanisms that could function as hidden schedules, which directly undermines the central schedule-free claim.
- [§5] §5: The long-duration regime where the 31% gain is reported is not accompanied by diagnostics for new instabilities (e.g., loss spikes, divergence, or degradation of the anytime property) that might appear only after the scaling fixes are applied at LLM scales.
minor comments (2)
- [§2] The notation distinguishing the original Schedule-Free optimizer from the new ScheduleFree+ variant could be made more explicit in the methods section to avoid reader confusion.
- [Figures 2–4] Figure captions and axis labels in the scaling experiments should explicitly state the model sizes and batch sizes used so that the claimed improvements can be directly compared to prior Schedule-Free results.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We have carefully considered each major comment and revised the paper to address the concerns about experimental details, clarification of the scaling fixes, and stability diagnostics. Our point-by-point responses follow.
read point-by-point responses
-
Referee: [Abstract, §3] Abstract and §3: The claim of a 31% outperformance over SOTA schedules at 1000 tokens per parameter is presented without accompanying details on the exact baselines used, number of independent runs, variance across seeds, or statistical significance tests. This omission makes it impossible to evaluate whether the reported gain is load-bearing or sensitive to post-hoc selection of fixes.
Authors: We agree that more experimental details are needed to substantiate the claim. In the revised manuscript we have expanded the relevant section to specify the exact WSD baseline configurations (including their warmup, stable, and decay phases and associated hyperparameters), the use of five independent random seeds per setting, the observed standard deviations, and the results of paired t-tests confirming statistical significance of the reported gains. revision: yes
-
Referee: [§4.1–4.2] §4.1–4.2: The fixes identified to scale Schedule-Free Learning to larger batch sizes and model sizes are described only at a high level. It is not shown whether these adjustments are strictly constant (independent of training step or progress) or whether they incorporate per-step normalization, batch-size-specific rules, or other mechanisms that could function as hidden schedules, which directly undermines the central schedule-free claim.
Authors: The fixes are fixed, constant scalars (a batch-size multiplier applied to the base step-size and a model-size-dependent averaging coefficient) that are chosen once before training begins and held fixed for the entire run; they contain no per-step normalization or progress-dependent terms. The revised sections now list the exact constant values used, include pseudocode that makes the time-independence explicit, and report an ablation confirming that performance is unchanged when any potential step-dependent component is removed. revision: yes
-
Referee: [§5] §5: The long-duration regime where the 31% gain is reported is not accompanied by diagnostics for new instabilities (e.g., loss spikes, divergence, or degradation of the anytime property) that might appear only after the scaling fixes are applied at LLM scales.
Authors: We have added loss-curve diagnostics and stability metrics to §5. The new figures show the full training trajectories at the reported scale, with no loss spikes or divergence observed after the fixes are applied. Separate panels confirm that the anytime property continues to hold, with validation performance improving steadily throughout the long-duration regime. revision: yes
Circularity Check
No significant circularity detected in derivation chain.
full rationale
The provided abstract and context describe identifying scaling fixes for Schedule-Free Learning and empirically comparing ScheduleFree+ to WSD schedules, with a reported 31% gain at long durations. No equations, self-citations, or load-bearing steps are exhibited that reduce the central claims (learning-rate-free status or outperformance) to fitted inputs, self-definitions, or prior author results by construction. The method is presented as building on independent prior work with external empirical validation, making the derivation self-contained against benchmarks.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We identify a number of fixes necessary to scale up Schedule-Free Learning to larger batch sizes and model sizes, and present a learning-rate-free and schedule-free method (ScheduleFree+) ... at 1000 tokens per parameter, it outperforms SOTA schedules by 31%.
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.leanJ_uniquely_calibrated_via_higher_derivative echoes?
echoesECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
The Polyak step size ... γt = f(yt)−f∗ / ||∇f(yt)||2 ... inverse-gradient norm weighting
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.leanembed_add unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Schedule-Free Learning provides a theoretical foundation for the use of model averaging and checkpoint merging during pretraining.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Belloni, A., Noci, L., and Orvieto, A. (2025). Universal dynamics of warmup stable decay: understanding wsd beyond transformers. In High-dimensional Learning Dynamics 2025
work page 2025
-
[2]
S., Gosal, G., Gray, G., Soboleva, D., and Hestness, J
Bergsma, S., Dey, N. S., Gosal, G., Gray, G., Soboleva, D., and Hestness, J. (2025). Straight to zero: Why linearly decaying the learning rate to zero works best for LLM s. In The Thirteenth International Conference on Learning Representations
work page 2025
-
[3]
Bi, X., Chen, D., Chen, G., Chen, S., Dai, D., Deng, C., Ding, H., Dong, K., Du, Q., Fu, Z., et al. (2024). Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[4]
Defazio, A. (2025). Why gradients rapidly increase near the end of training
work page 2025
-
[5]
Defazio, A., Cutkosky, A., Mehta, H., and Mishchenko, K. (2023). Optimal linear decay learning rate schedules and further refinements
work page 2023
-
[6]
Defazio, A. and Gower, R. M. (2021). The power of factorial powers: New parameter settings for (stochastic) optimization. In Balasubramanian, V. N. and Tsang, I., editors, Proceedings of The 13th Asian Conference on Machine Learning , volume 157 of Proceedings of Machine Learning Research , pages 49--64. PMLR
work page 2021
-
[7]
Defazio, A. and Mishchenko, K. (2023). Learning-rate-free learning by D -adaptation. The 40th International Conference on Machine Learning (ICML 2023)
work page 2023
-
[8]
Defazio, A., Xingyu, Yang, Mehta, H., Mishchenko, K., Khaled, A., and Cutkosky, A. (2024). The road less scheduled
work page 2024
-
[9]
A., Chhaparia, R., Donchev, Y., Kuncoro, A., Ranzato, M., Szlam, A., and Shen, J
Douillard, A., Feng, Q., Rusu, A. A., Chhaparia, R., Donchev, Y., Kuncoro, A., Ranzato, M., Szlam, A., and Shen, J. (2023). Diloco: Distributed low-communication training of language models. arXiv preprint arXiv:2311.08105
-
[10]
B., Von Werra, L., and Jaggi, M
H\" a gele, A., Bakouch, E., Kosson, A., Allal, L. B., Von Werra, L., and Jaggi, M. (2024). Scaling laws and compute-optimal training beyond fixed training durations. In Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., and Zhang, C., editors, Advances in Neural Information Processing Systems , volume 37, pages 76232--76264. Curra...
work page 2024
-
[11]
A., Welbl, J., Clark, A., et al
Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L. A., Welbl, J., Clark, A., et al. (2022). An empirical analysis of compute-optimal large language model training. Advances in neural information processing systems , 35:30016--30030
work page 2022
-
[12]
Hu, S., Tu, Y., Han, X., Cui, G., He, C., Zhao, W., Long, X., Zheng, Z., Fang, Y., Huang, Y., Zhang, X., Thai, Z. L., Wang, C., Yao, Y., Zhao, C., Zhou, J., Cai, J., Zhai, Z., Ding, N., Jia, C., Zeng, G., dahai li, Liu, Z., and Sun, M. (2024). Mini CPM : Unveiling the potential of small language models with scalable training strategies. In First Conferenc...
work page 2024
-
[13]
Y., Song, S., Hajishirzi, H., Kornblith, S., Farhadi, A., and Schmidt, L
Ilharco, G., Wortsman, M., Gadre, S. Y., Song, S., Hajishirzi, H., Kornblith, S., Farhadi, A., and Schmidt, L. (2022). Patching open-vocabulary models by interpolating weights. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A., editors, Advances in Neural Information Processing Systems , volume 35, pages 29262--29277. Curran Assoc...
work page 2022
-
[14]
Ivgi, M., Hinder, O., and Carmon, Y. (2023). Dog is sgd’s best friend: A parameter-free dynamic step size schedule. In International Conference on Machine Learning , pages 14465--14499. PMLR
work page 2023
-
[15]
Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D. P., and Wilson, A. G. (2018). Averaging weights leads to wider optima and better generalization. In Conference on Uncertainty in Artificial Intelligence
work page 2018
-
[16]
Khaled, A., Mishchenko, K., and Jin, C. (2023). Do WG unleashed: An efficient universal parameter-free gradient descent method. In Thirty-seventh Conference on Neural Information Processing Systems
work page 2023
-
[17]
Kimi k2: Open agentic intelligence
Kimi Team (2025). Kimi k2: Open agentic intelligence
work page 2025
-
[18]
Kosson, A., Messmer, B., and Jaggi, M. (2024). Rotational equilibrium: How weight decay balances learning across neural networks
work page 2024
-
[19]
Kosson, A., Welborn, J., Liu, Y., Jaggi, M., and Chen, X. (2025). Weight decay may matter more than mup for learning rate transfer in practice
work page 2025
-
[20]
Branch-train-merge: Embarrassingly parallel training of expert language models,
Li, M., Gururangan, S., Dettmers, T., Lewis, M., Althoff, T., Smith, N. A., and Zettlemoyer, L. (2022). Branch-train-merge: Embarrassingly parallel training of expert language models. arXiv preprint arXiv:2208.03306
- [21]
-
[22]
Marek, M., Lotfi, S., Somasundaram, A., Wilson, A. G., and Goldblum, M. (2025). Small batch size training for language models: When vanilla sgd works, and why gradient accumulation is wasteful. In Advances in Neural Information Processing Systems . Curran Associates, Inc
work page 2025
-
[23]
Mishchenko, K. and Defazio, A. (2023). Prodigy: An expeditiously adaptive parameter-free learner
work page 2023
- [24]
-
[25]
Oikonomou, D., Buchholz, M., Pun, Y.-M., Gower, R. M., and Loizou, N. (2026). Taking the road less scheduled with adaptive polyak steps
work page 2026
-
[26]
Polyak, B. T. (1990). A new method of stochastic approximation type. Avtomatika i telemekhanika , (7):98--107
work page 1990
-
[27]
Scaling Language Models: Methods, Analysis & Insights from Training Gopher
Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., Aslanides, J., Henderson, S., Ring, R., Young, S., et al. (2021). Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[28]
Ruppert, D. (1988). Efficient estimations from a slowly convergent Robbins - Monro process. Technical Report, Cornell University
work page 1988
- [29]
-
[30]
Schaipp, F., H \"a gele, A., Taylor, A., Simsekli, U., and Bach, F. (2025). The surprising agreement between convex optimization theory and learning-rate scheduling for large model training. In Forty-second International Conference on Machine Learning
work page 2025
-
[31]
J., Lee, J., Antognini, J., Sohl-Dickstein, J., Frostig, R., and Dahl, G
Shallue, C. J., Lee, J., Antognini, J., Sohl-Dickstein, J., Frostig, R., and Dahl, G. E. (2019). Measuring the effects of data parallelism on neural network training. Journal of Machine Learning Research , 20(112):1--49
work page 2019
-
[32]
Shamir, O. and Zhang, T. (2013). Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International conference on machine learning , pages 71--79. PMLR
work page 2013
-
[33]
Somala, V. and Edelman, Y. (2025). Training open-weight models is becoming more data intensive. Accessed: 2026-01-14
work page 2025
-
[34]
Song, M., Baek, B., Ahn, K., and Yun, C. (2025). Through the river: Understanding the benefit of schedule-free methods for language model training. In The Thirty-ninth Annual Conference on Neural Information Processing Systems
work page 2025
-
[35]
Van Laarhoven, T. (2017). L2 regularization versus batch and weight normalization. arXiv preprint arXiv:1706.05350
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[36]
Wen, K., Li, Z., Wang, J. S., Hall, D. L. W., Liang, P., and Ma, T. (2025). Understanding warmup-stable-decay learning rates: A river valley loss landscape view. In The Thirteenth International Conference on Learning Representations
work page 2025
-
[37]
Y., Roelofs, R., Gontijo-Lopes, R., Morcos, A
Wortsman, M., Ilharco, G., Gadre, S. Y., Roelofs, R., Gontijo-Lopes, R., Morcos, A. S., Namkoong, H., Farhadi, A., Carmon, Y., Kornblith, S., and Schmidt, L. (2022a). Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato...
-
[38]
W., Li, M., Kornblith, S., Roelofs, R., Lopes, R
Wortsman, M., Ilharco, G., Kim, J. W., Li, M., Kornblith, S., Roelofs, R., Lopes, R. G., Hajishirzi, H., Farhadi, A., Namkoong, H., and Schmidt, L. (2022b). Robust fine-tuning of zero-shot models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages 7959--7971
-
[39]
Yang, E., Shen, L., Guo, G., Wang, X., Cao, X., Zhang, J., and Tao, D. (2024). Model merging in llms, mllms, and beyond: Methods, theories, applications, and opportunities. ACM Computing Surveys
work page 2024
-
[40]
Zhai, X., Kolesnikov, A., Houlsby, N., and Beyer, L. (2022). Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages 12104--12113
work page 2022
-
[41]
Zhang, G., Li, L., Nado, Z., Martens, J., Sachdeva, S., Dahl, G., Shallue, C., and Grosse, R. B. (2019). Which algorithmic choices matter at which batch sizes? insights from a noisy quadratic model. Advances in neural information processing systems , 32
work page 2019
- [42]
-
[43]
Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on International Conference on Machine Learning , pages 928--935
work page 2003
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.