Recognition: no theorem link
Theory of Optimal Learning Rate Schedules and Scaling Laws for a Random Feature Model
Pith reviewed 2026-05-16 06:56 UTC · model grok-4.3
The pith
Optimal learning rate schedules in a random feature model split into easy-phase polynomial decay and hard-phase warmup-stable-decay depending on the task.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For a power-law random feature model, the optimal SGD learning rate schedule η_T^*(t) takes a polynomial form η_T^*(t) ≃ T^{-ξ} (1-t/T)^δ in the easy phase and resembles a warmup-stable-decay schedule in the hard phase where annealing occurs over a vanishing fraction of steps. The exponents ξ and δ are determined by the feature spectrum and task difficulty. Joint optimization with batch size and momentum schedules yields further improvements in scaling.
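As a visual aid, the two schedule families named in the claim can be sketched numerically (the exponent values and the annealing fraction below are illustrative placeholders, not values derived in the paper):

```python
import numpy as np

def easy_phase_lr(t, T, xi=0.5, delta=1.0):
    # Easy-phase form from the core claim: eta(t) ~ T^{-xi} (1 - t/T)^delta.
    # The exponents xi and delta used here are illustrative placeholders.
    return T ** (-xi) * (1.0 - t / T) ** delta

def hard_phase_lr(t, T, eta0=0.1, anneal_frac=0.05, delta=1.0):
    # Hard-phase warmup-stable-decay-like form: constant eta0, then
    # polynomial annealing over the final anneal_frac of steps. The paper
    # predicts this fraction vanishes as T grows; it is fixed here only
    # for illustration.
    t_s = (1.0 - anneal_frac) * T  # annealing start step
    if t < t_s:
        return eta0
    return eta0 * ((1.0 - t / T) / (1.0 - t_s / T)) ** delta

T = 1000
steps = np.arange(T)
easy = np.array([easy_phase_lr(t, T) for t in steps])
hard = np.array([hard_phase_lr(t, T) for t in steps])
```

The easy-phase curve decays throughout training and its overall scale shrinks with the horizon T; the hard-phase curve stays flat until the final annealing window.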
What carries the argument
The solvable dynamics of the power-law random feature model under SGD, obtained via optimal control theory applied to the loss evolution.
Load-bearing premise
The power-law random feature model with the given eigenvalue spectrum and quadratic loss captures the essential optimization dynamics of real deep networks under SGD.
What would settle it
Measure the optimal constant learning rate for ResNet training on CIFAR-5M across increasing training horizons with sufficient annealing; if it remains independent of horizon length the hard-phase prediction holds, but if it decreases the claim is falsified.
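The proposed protocol can be sketched on a toy problem (a noisy power-law quadratic stands in for ResNet/CIFAR-5M here; the spectrum, noise model, and LR grid are all assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def final_loss(eta, T, d=50, noise=0.1):
    # Toy stand-in for the proposed falsification experiment: SGD on a
    # noisy quadratic with a power-law spectrum, returning the loss after
    # T steps at constant learning rate eta. The real test would train
    # ResNets on CIFAR-5M; this only illustrates the protocol of locating
    # the optimal constant LR at each horizon.
    lam = np.arange(1, d + 1, dtype=float) ** -1.5  # assumed power-law spectrum
    w = np.ones(d)
    for _ in range(T):
        grad = lam * w + noise * np.sqrt(lam) * rng.standard_normal(d)
        w -= eta * grad
    return float(np.sum(lam * w ** 2))

# Locate the loss-minimizing constant LR at two horizons; a horizon-
# independent optimum is the hard-phase signature, while an optimum
# that shrinks with T would falsify the hard-phase prediction.
etas = np.geomspace(1e-2, 1.0, 12)
best = {T: etas[int(np.argmin([final_loss(e, T) for e in etas]))]
        for T in (200, 800)}
```

The decisive quantity is whether `best[200]` and `best[800]` coincide; with annealing (rather than constant LRs) the same comparison applies to the optimal base LR.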
Figures
Original abstract
Setting the learning rate (LR) for a deep learning model is a critical part of successful training. Choosing LRs is often done empirically with trial and error. In this work, we explore a solvable model of optimal LR schedules for a power-law random feature model trained with stochastic gradient descent (SGD). We consider the optimal schedule $\eta_T^\star(t)$ where $t$ is the current iterate and $T$ is the training horizon. This schedule is computed both as a numerical optimization problem and also analytically using optimal control theory. Our analysis reveals two regimes which we term the easy phase and hard phase. In the easy phase the optimal schedule is a polynomial decay $\eta_T^\star(t) \simeq T^{-\xi} (1-t/T)^{\delta}$ where $\xi$ and $\delta$ depend on the properties of the features and task. In the hard phase, the optimal schedule resembles warmup-stable-decay with constant initial LR and annealing performed over a vanishing fraction of training steps. We investigate joint optimization of LR and batch size and find that batch ramps can improve the wall-clock time in the easy phase. Beyond SGD, we derive optimal schedules for the momentum parameter $\beta(t)$ and show that it improves the loss-scaling exponent in the hard phase. We compare our optimal schedule to various benchmarks, including (1) optimal constant learning rates $\eta_T(t) \sim T^{-\xi}$ and (2) optimal power laws $\eta_T(t) \sim T^{-\xi} t^{-\chi}$, finding that our schedule achieves better rates than either of these. Our theory suggests that LR transfer across training horizon depends on the structure of the model and task. For ResNet image classification on CIFAR-5M, the learning curves exhibit hard-phase behavior where optimal base LRs are constant under sufficient annealing. GPT-2 style transformers trained in language modeling exhibit easy-phase behavior where optimal LRs shift even under annealing.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript develops a theory of optimal learning rate schedules for a power-law random feature model trained with SGD. Using both numerical optimization and optimal control theory, it identifies two regimes: an easy phase where the optimal schedule takes the form η_T^*(t) ≃ T^{-ξ} (1-t/T)^δ with ξ, δ depending on feature and task properties, and a hard phase resembling warmup-stable-decay with constant initial LR and annealing over a vanishing fraction of steps. The derived schedules are shown to outperform optimal constant and power-law baselines; extensions to joint LR-batch size optimization and time-dependent momentum are derived, with suggestive comparisons to ResNet training on CIFAR-5M (hard-phase behavior) and GPT-2 transformers (easy-phase behavior).
Significance. If the derivations hold, the work supplies an analytically tractable setting in which optimal LR schedules and their scaling with horizon T can be obtained exactly, including explicit phase distinctions and outperformance over standard baselines. The combination of optimal-control analysis with numerical verification, together with the extensions to batch-size ramps and momentum, constitutes a clear strength for understanding scaling laws within this solvable model. The mapping to real architectures is presented as suggestive rather than rigorous.
major comments (2)
- [§3] Optimal control derivation: the reduction from the continuous-time dynamics to the explicit polynomial form η_T^*(t) ≃ T^{-ξ} (1-t/T)^δ in the easy phase requires the explicit Hamiltonian, adjoint equations, and boundary conditions; without these, the dependence of ξ and δ on the power-law eigenvalue spectrum remains opaque and the claim of an analytical solution cannot be verified.
- [§5] Hard-phase analysis: the statement that annealing occurs over a vanishing fraction of steps as T → ∞ is load-bearing for the warmup-stable-decay characterization, yet the scaling of the annealing interval with T is not derived explicitly from the optimal-control problem; a concrete asymptotic calculation is needed to confirm the limit.
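For readers unfamiliar with the machinery the major comments request, a generic finite-horizon Pontryagin setup for this kind of problem can be sketched as follows (the notation — per-mode errors $c_k$, power-law eigenvalues $\lambda_k$, noise scales $\sigma_k$, adjoints $p_k$ — is assumed here for illustration, not taken from the paper):

```latex
% Schematic finite-horizon LR control problem (notation assumed, not the paper's)
\min_{\eta(\cdot)}\; L(T),
\qquad
\dot{c}_k = -\eta(t)\,\lambda_k c_k + \eta(t)^2 \sigma_k^2,
\qquad
L(t) = \sum_k \lambda_k c_k(t).
% Pontryagin: Hamiltonian, adjoint dynamics, terminal and stationarity conditions
H = \sum_k p_k \left( -\eta \lambda_k c_k + \eta^2 \sigma_k^2 \right),
\qquad
\dot{p}_k = -\frac{\partial H}{\partial c_k} = \eta(t)\,\lambda_k p_k,
\qquad
p_k(T) = \lambda_k,
\qquad
\left.\frac{\partial H}{\partial \eta}\right|_{\eta^\star} = 0
\;\Rightarrow\;
\eta^\star(t) = \frac{\sum_k p_k \lambda_k c_k}{2 \sum_k p_k \sigma_k^2}.
```

The referee's point is that the paper should display its own versions of these objects so the spectrum dependence of ξ and δ can be traced line by line.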
minor comments (2)
- The abstract and introduction use η_T^*(t) without first defining the horizon T; a brief parenthetical definition would improve readability.
- Figure captions for the ResNet and GPT-2 comparisons should state the precise metric (e.g., test loss or accuracy) and the number of independent runs used to generate the curves.
Simulated Author's Rebuttal
We thank the referee for the careful reading, positive assessment, and constructive suggestions. We address each major comment below and will revise the manuscript accordingly to improve clarity and rigor.
Point-by-point responses
- Referee: [§3] Optimal control derivation: the reduction from the continuous-time dynamics to the explicit polynomial form η_T^*(t) ≃ T^{-ξ} (1-t/T)^δ in the easy phase requires the explicit Hamiltonian, adjoint equations, and boundary conditions; without these, the dependence of ξ and δ on the power-law eigenvalue spectrum remains opaque and the claim of an analytical solution cannot be verified.
  Authors: We agree that the explicit optimal-control steps would make the derivation more transparent. In the revised manuscript we will add the Hamiltonian, adjoint equations, and boundary conditions to §3 (with a short appendix if needed), explicitly tracing how the power-law eigenvalue spectrum determines the exponents ξ and δ in the polynomial schedule. This will allow direct verification of the analytical solution. revision: yes
- Referee: [§5] Hard-phase analysis: the statement that annealing occurs over a vanishing fraction of steps as T → ∞ is load-bearing for the warmup-stable-decay characterization, yet the scaling of the annealing interval with T is not derived explicitly from the optimal-control problem; a concrete asymptotic calculation is needed to confirm the limit.
  Authors: We thank the referee for highlighting this point. We will insert a concrete asymptotic calculation in §5 that derives the scaling of the annealing interval with T from the optimal-control problem, confirming that the interval vanishes as T → ∞ (specifically as T^{-α} for α > 0 determined by the model parameters). This will rigorously support the warmup-stable-decay characterization. revision: yes
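The asymptotics at issue appear, from a fragment of the paper's appendix, to take the following form (the exponents a and b parametrize the feature and task power laws in the paper's notation; the expressions below are reconstructed from a garbled extraction and should be checked against the original):

```latex
\eta^\star(t) =
\begin{cases}
1, & t < t_s, \\[4pt]
\left( \dfrac{1 - t/T}{1 - t_s/T} \right)^{\delta}, & t \ge t_s,
\end{cases}
\qquad \delta = 2b - 1,
\qquad T - t_s \sim T^{\,1 - (b-a)/(2b-1)} .
```

The annealed fraction is then $(T - t_s)/T \sim T^{-(b-a)/(2b-1)}$, which vanishes as $T \to \infty$ whenever $b > a$, consistent with the warmup-stable-decay characterization.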
Circularity Check
Derivation self-contained via optimal control on model dynamics
Full rationale
The central claims derive optimal LR schedules analytically via optimal control theory applied to the SGD dynamics of the power-law random feature model with quadratic loss and assumed eigenvalue spectrum. The easy/hard phase distinction and explicit forms (polynomial decay or warmup-stable-decay) follow directly from solving the resulting control problem; numerical optimization cross-checks are performed inside the same model. No load-bearing step reduces to a fitted parameter renamed as prediction, a self-citation chain, or an ansatz smuggled from prior work by the same authors. The mapping to ResNet/transformer behavior is presented as suggestive only and does not support the theoretical results.
Axiom & Free-Parameter Ledger
free parameters (1)
- exponents ξ and δ
axioms (2)
- Domain assumption: the random feature model with power-law eigenvalues and quadratic loss admits an exact description of SGD dynamics.
- Standard math: optimal control theory yields the globally optimal schedule for the given finite-horizon objective.