pith. machine review for the scientific record.

arXiv: 2602.04774 · v2 · submitted 2026-02-04 · ❄️ cond-mat.dis-nn · cs.LG · stat.ML

Recognition: no theorem link

Theory of Optimal Learning Rate Schedules and Scaling Laws for a Random Feature Model

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 06:56 UTC · model grok-4.3

classification ❄️ cond-mat.dis-nn · cs.LG · stat.ML
keywords optimal learning rate · random feature model · scaling laws · SGD optimization · easy phase · hard phase · power law spectrum

The pith

Optimal learning rate schedules in a random feature model split into easy-phase polynomial decay and hard-phase warmup-stable-decay depending on the task.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives the exact optimal learning rate schedule for training a power-law random feature model with SGD over a fixed horizon T, by formulating training as an optimal control problem and solving it both analytically and numerically. In the easy phase the schedule decays polynomially, η_T^*(t) ≃ T^{-ξ}(1-t/T)^δ, while in the hard phase it stays constant for most of training before annealing sharply at the end. These predictions explain why some models, like transformers, benefit from rescaling the base learning rate with training length, while others, like ResNets, do not, provided sufficient annealing is applied.

Core claim

For a power-law random feature model, the optimal SGD learning rate schedule η_T^*(t) takes a polynomial form η_T^*(t) ≃ T^{-ξ} (1-t/T)^δ in the easy phase and resembles a warmup-stable-decay schedule in the hard phase where annealing occurs over a vanishing fraction of steps. The exponents ξ and δ are determined by the feature spectrum and task difficulty. Joint optimization with batch size and momentum schedules yields further improvements in scaling.
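
To make the two shapes concrete, here is a minimal Python sketch of both schedule families. The exponents ξ and δ, the peak rate, and the annealing fraction are illustrative placeholders; in the paper they are fixed by the feature spectrum and task, and the hard-phase annealing fraction shrinks to zero as T grows.

```python
import numpy as np

def easy_phase_schedule(T, xi=0.5, delta=1.0):
    """Easy phase: eta_T(t) ~ T^{-xi} * (1 - t/T)^delta.
    xi and delta are illustrative; the paper derives them from the
    feature spectrum and task exponents."""
    t = np.arange(T)
    return T ** (-xi) * (1.0 - t / T) ** delta

def hard_phase_schedule(T, eta_max=0.1, anneal_frac=0.05, delta=1.0):
    """Hard phase: constant eta_max for t < t_s, then a sharp anneal.
    In the paper, 1 - t_s/T -> 0 as T -> infinity; the fixed
    anneal_frac here is only for illustration."""
    t = np.arange(T)
    ts = int((1.0 - anneal_frac) * T)
    eta = np.full(T, eta_max)
    tail = t >= ts
    eta[tail] = eta_max * ((T - t[tail]) / (T - ts)) ** delta
    return eta

if __name__ == "__main__":
    for T in (256, 1024):
        print(T, easy_phase_schedule(T)[:3], hard_phase_schedule(T)[-3:])
```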

What carries the argument

The solvable dynamics of the power-law random feature model under SGD, obtained via optimal control theory applied to the loss evolution.
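
Schematically, the control problem has the generic finite-horizon Pontryagin structure below; this is a sketch, not the paper's exact mean-field system. Here x(t) stands for the per-mode error statistics, η(t) is the control, and p(t) is the adjoint (costate):

```latex
\begin{aligned}
  &\min_{\eta(\cdot)}\; \mathcal{L}\bigl(x(T)\bigr)
   \quad\text{subject to}\quad \dot{x}(t) = f\bigl(x(t),\eta(t)\bigr),\\
  &H(x,p,\eta) = p^{\top} f(x,\eta),\qquad
   \dot{p}(t) = -\partial_{x} H,\qquad p(T) = \nabla\mathcal{L}\bigl(x(T)\bigr),\\
  &\eta^{\star}(t) = \operatorname*{arg\,min}_{\eta}\, H\bigl(x(t),p(t),\eta\bigr).
\end{aligned}
```

Solving the coupled state-adjoint system with these boundary conditions is the step that would pin down ξ and δ in the easy phase.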

Load-bearing premise

The power-law random feature model with the given eigenvalue spectrum and quadratic loss captures the essential optimization dynamics of real deep networks under SGD.

What would settle it

Measure the optimal base learning rate for ResNet training on CIFAR-5M across increasing training horizons, with sufficient annealing applied in each run. If the optimal base rate remains independent of horizon length, the hard-phase prediction holds; if it decreases with the horizon, the claim is falsified.
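
In code, the proposed test is a simple sweep. `train_resnet_cifar5m` below is a hypothetical stand-in for the actual training run, not an API from the paper; it is assumed to return final test loss after T steps at base LR η with an annealing tail.

```python
import numpy as np

def optimal_base_lr(T, etas, train_resnet_cifar5m):
    """Return the base LR minimizing final test loss at horizon T."""
    losses = [train_resnet_cifar5m(T=T, eta=eta, anneal=True) for eta in etas]
    return etas[int(np.argmin(losses))]

def check_hard_phase(train_resnet_cifar5m,
                     horizons=(2_000, 8_000, 32_000),
                     etas=tuple(np.logspace(-3, 0, 13))):
    # Hard-phase prediction: the argmin LR is flat across horizons;
    # a systematic decrease with T would instead falsify the claim.
    return {T: optimal_base_lr(T, list(etas), train_resnet_cifar5m)
            for T in horizons}
```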

Figures

Figures reproduced from arXiv: 2602.04774 by Blake Bordelon, Francesco Mori.

Figure 1
Figure 1. SGD learning rates do not automatically transfer over training horizons T. This motivates theory that can identify not only how to scale η with T, but also how to set the entire learning rate schedule η(t) with T. (a) The loss of a deep ResNet trained on CIFAR-5M. (b) Test loss of a random feature model trained with SGD as a function of fixed learning rate η. The optimal learning rate shifts leftwards in t… view at source ↗
Figure 2
Figure 2. Comparison of optimal learning rate schedules in the hard (top row, b > a) and easy (bottom row, b < a) phases. (a, d) Profile of the optimal learning rate η_T^*(t). In the hard phase (a), the schedule maintains a constant maximum value for t < t_s followed by a rapid annealing phase, where the annealing fraction 1 − t_s/T vanishes as T → ∞ (see… view at source ↗
Figure 3
Figure 3. Decomposition of the excess loss L_t − σ² into bias and variance components. (a) In the easy phase (b = 2, a = 3.5), the optimal schedule minimizes bias and variance simultaneously throughout the training trajectory. (b) In the hard phase (b = 5, a = 3.5), the schedule minimizes the bias for the majority of the training time (t < t_s) where the learning rate is large, while the final annealing phase (t > t_s… view at source ↗
Figure 4
Figure 4. Compute-optimal scaling. Residual loss L_C − σ₀² as a function of the compute C = NT for different values of the model size N. The dashed lines indicate the theoretical prediction. Parameters: σ = 0.5, m = 5. In the easy phase b = 1.5 and a = 2; in the hard phase b = 2 and a = 1.5. [inset residue: log-log loss dynamics in t, legend T = 32 to 1024, reference slope T^{−1+1/a}, panel (a) "Easy Task Loss Dynami…"] view at source ↗
Figure 5
Figure 5. Optimal schedule and loss dynamics for SGD + momentum. (a) For the easy task regime, the numerically optimized schedule achieves the same scaling law as SGD with optimal schedule, L_T − σ² ∼ T^{−1+1/a}. (b) The optimal momentum dynamics vary significantly across T but only weakly with t. (c) The learning rate for optimal momentum schedules anneals similarly to SGD in the easy phase. (d) In the hard phase… view at source ↗
Figure 6
Figure 6. Width N = 32, depth 12 convolutional ResNets trained with SGD with batch size m = 32. (a) Cross-entropy loss dynamics as a function of training time t for fixed learning rate (blue) and a polynomial annealing schedule set to training horizon T. The final loss follows a better trend when using the annealing schedule. (b) Loss as a function of (η, T) and for fixed learning rate. The optimal learning rat… view at source ↗
Figure 7
Figure 7. Fraction 1 − t_s/T of the total training time spent in the annealing regime. The switching time t_s is defined here as the time at which the schedule η*(t) crosses the level 0.95 η_max for the first time. Same parameters as in Fig. 2a. view at source ↗
read the original abstract

Setting the learning rate (LR) for a deep learning model is a critical part of successful training. Choosing LRs is often done empirically with trial and error. In this work, we explore a solvable model of optimal LR schedules for a power-law random feature model trained with stochastic gradient descent (SGD). We consider the optimal schedule $\eta_T^\star(t)$ where $t$ is the current iterate and $T$ is the training horizon. This schedule is computed both as a numerical optimization problem and also analytically using optimal control theory. Our analysis reveals two regimes which we term the easy phase and hard phase. In the easy phase the optimal schedule is a polynomial decay $\eta_T^\star(t) \simeq T^{-\xi} (1-t/T)^{\delta}$ where $\xi$ and $\delta$ depend on the properties of the features and task. In the hard phase, the optimal schedule resembles warmup-stable-decay with constant initial LR and annealing performed over a vanishing fraction of training steps. We investigate joint optimization of LR and batch size and find batch ramps can improve the wall-clock time in the easy phase. Beyond SGD, we derive optimal schedules for momentum parameter $\beta(t)$ and show that it improves the loss-scaling exponent in the hard phase. We compare our optimal schedule to various benchmarks including (1) optimal constant learning rates $\eta_T(t) \sim T^{-\xi}$ and (2) optimal power laws $\eta_T(t) \sim T^{-\xi} t^{-\chi}$, finding that our schedule achieves better rates than either of these. Our theory suggests that LR transfer across training horizon depends on the structure of the model and task. For ResNet image classification on CIFAR-5M, the learning curves exhibit hard-phase behavior where optimal base LRs are constant under sufficient annealing. GPT-2 style transformers trained in language modeling exhibit easy-phase behavior where optimal LRs shift even under annealing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript develops a theory of optimal learning rate schedules for a power-law random feature model trained with SGD. Using both numerical optimization and optimal control theory, it identifies two regimes: an easy phase where the optimal schedule takes the form η_T^*(t) ≃ T^{-ξ} (1-t/T)^δ with ξ, δ depending on feature and task properties, and a hard phase resembling warmup-stable-decay with constant initial LR and annealing over a vanishing fraction of steps. The derived schedules are shown to outperform optimal constant and power-law baselines; extensions to joint LR-batch size optimization and time-dependent momentum are derived, with suggestive comparisons to ResNet training on CIFAR-5M (hard-phase behavior) and GPT-2 transformers (easy-phase behavior).

Significance. If the derivations hold, the work supplies an analytically tractable setting in which optimal LR schedules and their scaling with horizon T can be obtained exactly, including explicit phase distinctions and outperformance over standard baselines. The combination of optimal-control analysis with numerical verification, together with the extensions to batch-size ramps and momentum, constitutes a clear strength for understanding scaling laws within this solvable model. The mapping to real architectures is presented as suggestive rather than rigorous.

major comments (2)
  1. [§3] Optimal control derivation: the reduction from the continuous-time dynamics to the explicit polynomial form η_T^*(t) ≃ T^{-ξ} (1-t/T)^δ in the easy phase requires the explicit Hamiltonian, adjoint equations, and boundary conditions; without these the dependence of ξ and δ on the power-law eigenvalue spectrum remains opaque and the claim of an analytical solution cannot be verified.
  2. [§5] Hard-phase analysis: the statement that annealing occurs over a vanishing fraction of steps as T→∞ is load-bearing for the warmup-stable-decay characterization, yet the scaling of the annealing interval with T is not derived explicitly from the optimal-control problem; a concrete asymptotic calculation is needed to confirm the limit. (A toy numerical probe of this limit is sketched after the comments below.)
minor comments (2)
  1. The abstract and introduction use η_T^*(t) without first defining the horizon T; a brief parenthetical definition would improve readability.
  2. Figure captions for the ResNet and GPT-2 comparisons should state the precise metric (e.g., test loss or accuracy) and the number of independent runs used to generate the curves.
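
As a toy probe of the limit in major comment 2, one can numerically optimize a discretized schedule on a noisy-quadratic surrogate and track the annealing fraction. The per-mode recursion below is schematic, not the paper's exact mean-field equations, and all constants are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def final_loss(log_eta, lam, E0, sigma2):
    """Excess loss after running the schedule exp(log_eta) through
    E_k(t+1) = (1 - eta_t lam_k)^2 E_k(t) + (eta_t lam_k)^2 sigma2."""
    E = E0.copy()
    for e in np.exp(log_eta):
        E = (1.0 - e * lam) ** 2 * E + (e * lam) ** 2 * sigma2
    return E.sum()

def optimize_schedule(T, K=100, a=1.5, b=2.0, sigma2=0.25):
    k = np.arange(1, K + 1)
    lam, E0 = k ** -a, k ** -b          # power-law spectrum and initialization
    x0 = np.full(T, np.log(0.5 / lam.max()))
    res = minimize(final_loss, x0, args=(lam, E0, sigma2), method="L-BFGS-B")
    return np.exp(res.x)

if __name__ == "__main__":
    for T in (32, 64, 128):
        eta = optimize_schedule(T)
        # Fig. 7's convention: t_s = first time eta drops below 0.95 * max.
        ts = int(np.argmax(eta < 0.95 * eta.max()))
        print(f"T={T:4d}  annealing fraction 1 - t_s/T = {1 - ts / T:.3f}")
```

Under the hard-phase characterization, the printed fraction should shrink toward zero as T grows; the switching-time convention matches the 0.95 η_max crossing used in Fig. 7.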

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading, positive assessment, and constructive suggestions. We address each major comment below and will revise the manuscript accordingly to improve clarity and rigor.

read point-by-point responses
  1. Referee: [§3] Optimal control derivation: the reduction from the continuous-time dynamics to the explicit polynomial form η_T^*(t) ≃ T^{-ξ} (1-t/T)^δ in the easy phase requires the explicit Hamiltonian, adjoint equations, and boundary conditions; without these the dependence of ξ and δ on the power-law eigenvalue spectrum remains opaque and the claim of an analytical solution cannot be verified.

    Authors: We agree that the explicit optimal-control steps would make the derivation more transparent. In the revised manuscript we will add the Hamiltonian, adjoint equations, and boundary conditions to §3 (with a short appendix if needed), explicitly tracing how the power-law eigenvalue spectrum determines the exponents ξ and δ in the polynomial schedule. This will allow direct verification of the analytical solution. revision: yes

  2. Referee: [§5] Hard-phase analysis: the statement that annealing occurs over a vanishing fraction of steps as T→∞ is load-bearing for the warmup-stable-decay characterization, yet the scaling of the annealing interval with T is not derived explicitly from the optimal-control problem; a concrete asymptotic calculation is needed to confirm the limit.

    Authors: We thank the referee for highlighting this point. We will insert a concrete asymptotic calculation in §5 that derives the scaling of the annealing interval with T from the optimal-control problem, confirming that the interval vanishes as T → ∞ (specifically as T^{-α} for α > 0 determined by the model parameters). This will rigorously support the warmup-stable-decay characterization. revision: yes

Circularity Check

0 steps flagged

Derivation self-contained via optimal control on model dynamics

full rationale

The central claims derive optimal LR schedules analytically via optimal control theory applied to the SGD dynamics of the power-law random feature model with quadratic loss and assumed eigenvalue spectrum. The easy/hard phase distinction and explicit forms (polynomial decay or warmup-stable-decay) follow directly from solving the resulting control problem; numerical optimization cross-checks are performed inside the same model. No load-bearing step reduces to a fitted parameter renamed as prediction, a self-citation chain, or an ansatz smuggled from prior work by the same authors. The mapping to ResNet/transformer behavior is presented as suggestive only and does not support the theoretical results.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The derivation rests on the random-feature model having a power-law spectrum, quadratic loss, and SGD dynamics that admit an exact mean-field description; optimal control theory is invoked as a standard tool. No new entities are postulated.

free parameters (1)
  • exponents ξ and δ
    Stated to depend on feature spectrum and task; their specific values are not derived parameter-free in the abstract.
axioms (2)
  • domain assumption The random feature model with power-law eigenvalues and quadratic loss admits an exact description of SGD dynamics.
    Invoked to enable closed-form optimal control solution.
  • standard math Optimal control theory yields the globally optimal schedule for the given finite-horizon objective.
    Standard assumption when applying Pontryagin's principle or similar methods.
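
A minimal sketch instantiating the domain assumption: Gaussian random features whose covariance has the prescribed power-law eigenvalues, paired with a power-law-aligned target and quadratic-loss labels. The exponents a, b and sizes here are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
K, P, a, b = 256, 4096, 1.5, 2.0          # features, samples, exponents

k = np.arange(1, K + 1)
lam = k ** -a                              # prescribed eigenvalues lam_k ~ k^{-a}
X = rng.standard_normal((P, K)) * np.sqrt(lam)   # rows: features, cov = diag(lam)
w_star = np.sqrt(k ** -b / lam)            # task power lam_k w_k^2 ~ k^{-b}
y = X @ w_star + 0.1 * rng.standard_normal(P)    # noisy targets, squared-error loss

# Sanity check: empirical spectrum of the feature covariance vs the target law.
emp = np.linalg.eigvalsh(X.T @ X / P)[::-1]
print(np.c_[emp[:5], lam[:5]])
```

Running SGD on this pair with a candidate schedule reproduces the quadratic-loss dynamics that the ledger assumes admit an exact mean-field description.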

pith-pipeline@v0.9.0 · 5660 in / 1560 out tokens · 32556 ms · 2026-05-16T06:56:07.439810+00:00 · methodology

