
arxiv: 2409.04777 · v4 · pith:3R2LV3LQnew · submitted 2024-09-07 · 💻 cs.LG · math.OC

Optimization Hyper-parameter Laws for Large Language Models

Pith reviewed 2026-05-23 20:42 UTC · model grok-4.3

classification 💻 cs.LG math.OC
keywords optimization hyper-parameters · learning rate schedules · large language models · scaling laws · stochastic differential equations · training divergence detection

The pith

Opt-Laws predict final training loss for learning-rate schedules from small-scale experiments across model and data sizes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops Optimization Hyper-parameter Laws (Opt-Laws) to select dynamic hyper-parameters, such as learning-rate (LR) schedules, for large language model training. Existing scaling laws address model and data size but leave schedule choice unresolved. Opt-Laws derive convergence and escape features from stochastic differential equation (SDE) analyses and use them to forecast final loss as a joint function of schedule, model size, and data size. These features allow schedule candidates to be ranked using only small runs, and the paper reports strong transfer to held-out and out-of-family cases.

Core claim

Opt-Laws, grounded in SDE-based convergence and escape analyses, produce interpretable features that predict final training loss across scales and enable reliable pre-selection of learning-rate schedules from small-scale experiments.

What carries the argument

The Opt-Laws framework, which extracts convergence and escape features from stochastic differential equation models of optimization and uses them to predict final loss.

If this is right

  • Learning-rate schedules can be narrowed to a small set of candidates without running full-scale training.
  • The best schedule family can be identified correctly even when the test configuration lies outside the families seen during feature fitting.
  • Training runs that will diverge can be flagged with an F1 score of 0.92 using the same features.
  • Schedule ranking achieves 94 percent Top-2 hit rate on held-out configurations.
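
The two headline metrics in the list above can be made concrete with a short sketch. The definitions here are assumptions, since this page does not spell them out: a "Top-k hit" is taken to mean that the schedule with the lowest observed loss appears among the k schedules with the lowest predicted loss, and F1 is the usual harmonic mean of precision and recall over diverged/not-diverged labels. The function names, data layout, and toy numbers are all hypothetical and are not results from the paper.

```python
def top_k_hit_rate(predicted, observed, k=2):
    """Fraction of configurations whose observed-best schedule is among the
    k schedules with the lowest predicted loss.

    predicted, observed: one dict per configuration, mapping a schedule name
    to its final loss (hypothetical layout, assumed for illustration).
    """
    hits = 0
    for pred, obs in zip(predicted, observed):
        best_observed = min(obs, key=obs.get)
        top_k = sorted(pred, key=pred.get)[:k]
        hits += best_observed in top_k
    return hits / len(predicted)


def f1_score(predicted_diverged, actually_diverged):
    """Standard F1 for binary divergence labels (True means diverged)."""
    pairs = list(zip(predicted_diverged, actually_diverged))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if a and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy, made-up numbers purely to exercise the functions.
predicted = [
    {"cosine": 2.10, "linear": 2.12, "constant": 2.30},
    {"cosine": 2.50, "linear": 2.40, "constant": 2.45},
]
observed = [
    {"cosine": 2.11, "linear": 2.09, "constant": 2.28},
    {"cosine": 2.38, "linear": 2.44, "constant": 2.47},
]
print(top_k_hit_rate(predicted, observed, k=2))
print(f1_score([True, True, False, False], [True, False, True, False]))
```

Under this reading, a Top-2 criterion tolerates small ranking errors between near-tied schedules, which is why it suits pre-selection of a shortlist rather than picking a single winner.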

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same feature extraction might extend to other dynamic hyper-parameters such as batch-size or optimizer momentum schedules.
  • If the features remain stable under distribution shift, the method could reduce wasted compute on divergent or suboptimal large runs.
  • The approach supplies a concrete route to test whether optimization dynamics admit scale-invariant descriptors beyond loss curves.

Load-bearing premise

The SDE-derived convergence and escape features stay predictive of final loss when model scale and data size change.

What would settle it

Apply the fitted Opt-Laws to a held-out model size or data volume larger than any training example and check whether the top-ranked schedule actually yields the lowest observed loss.

Figures

Figures reproduced from arXiv: 2409.04777 by Kim-Chuan Toh, Kuangyu Ding, Shuicheng Yan, Tianwen Wei, Xingyu Xie.

Figure 1: Contour plots of predicted perplexity, which is the exponential of the predicted training loss, versus warmup steps and peak LR for different token quantities (3B, 6B, 10B, 30B) from the RedPajama-v2 dataset.
Figure 3: Illustration of the criterion for predicting training divergence using a linear warmup and cooldown schedule. The areas S1 (where the learning rate is below the threshold ηL) and S2 (where it exceeds ηL) are compared. A ratio S1/S2 > 1 suggests stable training, while a ratio < 1 indicates likely divergence.
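
The area comparison described in the Figure 3 caption can be sketched as follows. This is one plausible reading, not the paper's definition: S1 and S2 are taken here as the discrete areas under the LR curve over the steps where the LR is at or below, respectively above, the threshold ηL. The schedule shape (linear warmup from 0 to the peak, then linear cooldown to 0), the function names, and the example values are assumptions for illustration.

```python
def warmup_cooldown_lr(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup from 0 to peak_lr, then linear cooldown back to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)


def schedule_areas(total_steps, warmup_steps, peak_lr, threshold):
    """Return (S1, S2): summed LR over steps at/below and above the threshold.

    Assumed reading of the caption; the paper may define the areas differently.
    """
    s1 = s2 = 0.0
    for step in range(total_steps):
        lr = warmup_cooldown_lr(step, total_steps, warmup_steps, peak_lr)
        if lr <= threshold:
            s1 += lr
        else:
            s2 += lr
    return s1, s2


def predicts_stable(total_steps, warmup_steps, peak_lr, threshold):
    """Caption rule: S1/S2 > 1 suggests stable training."""
    s1, s2 = schedule_areas(total_steps, warmup_steps, peak_lr, threshold)
    # If the LR never exceeds the threshold, S2 is zero and the run is
    # treated as stable.
    return s2 == 0.0 or s1 / s2 > 1.0


# Hypothetical values: a peak LR far above the threshold is flagged.
print(predicts_stable(1000, 100, peak_lr=1.0, threshold=0.1))
print(predicts_stable(1000, 100, peak_lr=0.05, threshold=0.1))
```

Under this reading, lengthening the warmup or lowering the peak LR both shift area from S2 to S1, which matches the caption's intuition that time spent below the threshold stabilizes training.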
Figure 2: Smoothed final training loss across various combinations of training parameters, including model sizes from 8 × 0.001B to 8 × 0.3B MoEs, peak LRs from 1e-3 to 1.5e-2, warmup steps from 128 to 6000, and data sizes of 10B and 30B tokens. Each grid point represents the loss for a specific parameter set. Divergent training runs were assigned a loss of 7, reflecting the typical plateau observed in practice.
Figure 4: Illustration of a typical LR schedule comprising four phases: warmup, decay, plateau, and cooldown. This framework encompasses most LR schedules used in LLM training as special cases. We use this example to demonstrate the selection of the hyper-parameters ac and ae in Opt-Laws.
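
A four-phase schedule of the kind named in the Figure 4 caption can be written as a single piecewise function. The sketch below assumes linear interpolation inside the warmup, decay, and cooldown phases; the paper's exact phase shapes are not given on this page, and every parameter name here is hypothetical.

```python
def four_phase_lr(step, warmup, decay, plateau, cooldown,
                  peak_lr, plateau_lr, final_lr=0.0):
    """Piecewise LR schedule: warmup -> decay -> plateau -> cooldown.

    Phase lengths are in steps. Linear segments are an assumption made
    for illustration only.
    """
    if step < warmup:
        # Linear warmup from 0 to peak_lr.
        return peak_lr * step / warmup
    if step < warmup + decay:
        # Linear decay from peak_lr to plateau_lr.
        frac = (step - warmup) / decay
        return peak_lr + frac * (plateau_lr - peak_lr)
    if step < warmup + decay + plateau:
        # Constant plateau.
        return plateau_lr
    # Linear cooldown from plateau_lr to final_lr, clamped at the end.
    start = warmup + decay + plateau
    frac = min(1.0, (step - start) / cooldown) if cooldown > 0 else 1.0
    return plateau_lr + frac * (final_lr - plateau_lr)


# Hypothetical phase lengths and rates.
for s in (0, 50, 100, 200, 400, 900, 1000):
    print(s, four_phase_lr(s, 100, 200, 500, 200, 1e-3, 5e-4))
```

Setting `decay=0`, `plateau=0`, and `plateau_lr=peak_lr` reduces this to a plain linear warmup and cooldown, which illustrates the caption's point that common schedules arise as special cases.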
Figure 5: Comparison of actual training outcomes (left) and loss predictions generated by Opt-Laws (right) for a common LR schedule pattern with linear warmup and cooldown. In regions where R(ηmax, a1, N, S) > 1, the divergence indicator from Eqn. (4), the predicted loss is set to 7 to signify training failure. The average relative error between the predicted and actual losses is within 0.5%, demonstrating the accur…
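
The post-processing described in the Figure 5 caption (overriding the prediction with a loss of 7 wherever the divergence indicator exceeds 1, then scoring by average relative error) can be sketched as below. The indicator R is treated as a given input, since its formula (Eqn. (4) of the paper) is not reproduced on this page; the function names and toy numbers are hypothetical.

```python
DIVERGED_LOSS = 7.0  # value the captions assign to failed runs


def masked_prediction(predicted_loss, divergence_indicator):
    """Replace the predicted loss with the failure value when R > 1."""
    if divergence_indicator > 1.0:
        return DIVERGED_LOSS
    return predicted_loss


def mean_relative_error(predicted, actual):
    """Average of |predicted - actual| / |actual| over all grid points."""
    errors = [abs(p - a) / abs(a) for p, a in zip(predicted, actual)]
    return sum(errors) / len(errors)


# Made-up grid of (raw prediction, indicator R, observed loss).
grid = [(2.31, 0.4, 2.30), (2.45, 0.9, 2.46), (2.20, 1.6, 7.00)]
predicted = [masked_prediction(p, r) for p, r, _ in grid]
actual = [a for _, _, a in grid]
print(predicted)
print(mean_relative_error(predicted, actual))
```

One consequence of this design is that a correctly flagged divergent run contributes zero error, so the reported average mixes regression accuracy with divergence-detection accuracy.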
Figure 6: Training loss comparison for 8×0.1B and 8×0.6B MoEs under three LR schedules (linear decay, cosine decay, and constant followed by linear decay) across pretraining scales of 3B, 10B, and 100B tokens. Initial disparities in training loss at 3B tokens diminish with increased data volume, but larger model sizes slow the convergence of these gaps, highlighting the interplay between model scale and data volume…
Figure 7: Training loss curves for three distinct LR schedules applied to the 8×0.6B model on the 100B token dataset. Despite substantial differences in the schedules, the losses converge to nearly the same final value, in line with Opt-Laws' predictions, demonstrating its effectiveness in accurately forecasting training outcomes.
Figure 8: Loss curves for continual training under five different learning rate schedules with weak data distribution shift. (a) shows the LR schedules, while (b) depicts the resulting loss trajectories. Despite the variance in LR schedules, final losses converge closely, suggesting a limited impact of the LR schedule on final performance when the data distribution shift is weak.
Original abstract

Large Language Models have driven significant AI advancements, yet their training is resource-intensive and highly sensitive to hyper-parameter selection. While scaling laws provide valuable guidance on model size and data requirements, they fall short in choosing dynamic hyper-parameters, such as learning-rate (LR) schedules, that evolve during training. To bridge this gap, we present Optimization Hyper-parameter Laws (Opt-Laws), a framework that predicts final training loss as a function of LR schedule, model size, and data size. Grounded in SDE-based convergence and escape analyses, Opt-Laws yield interpretable convergence and escape features that predict final training loss across model scales, enabling schedule pre-selection from small-scale experiments. Empirically, Opt-Laws achieve a 94% Top-2 hit rate for identifying near-optimal schedule candidates on held-out configurations, correctly identify the best-performing schedule family in all five evaluated out-of-family settings, and detect training divergence with F1 = 0.92.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Optimization Hyper-parameter Laws (Opt-Laws), a framework grounded in SDE-based convergence and escape analyses that extracts interpretable features to predict final LLM training loss as a function of learning-rate schedule, model size N, and data size D. It claims this enables reliable pre-selection of near-optimal schedules from small-scale experiments, reporting a 94% Top-2 hit rate on held-out configurations, correct identification of the best schedule family in all five out-of-family settings, and F1=0.92 for detecting divergence.

Significance. If the SDE-derived features remain predictive under scale changes, the framework would offer a practical method to reduce the cost of schedule tuning for large models by transferring insights from small-scale runs, complementing existing scaling laws for N and D.

major comments (2)
  1. [Empirical Evaluation] Empirical results (held-out and out-of-family evaluations): the 94% Top-2 hit rate and out-of-family success are measured on configurations whose (N, D) ranges relative to the fitting regime are not stated; without explicit demonstration that the feature-to-loss mapping extrapolates beyond the training scales, the central pre-selection utility from small-scale runs is unsupported.
  2. [SDE Analysis] SDE convergence/escape analysis section: the mapping from SDE-derived features to predicted loss is presented as remaining predictive across scales, yet no ablation or scaling test is shown that varies N and D independently while holding the fitted coefficients fixed, leaving the extrapolation assumption untested.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'across model scales' is used without accompanying quantitative ranges or a table summarizing the (N, D) values in fitting vs. test sets.
  2. [Framework] Notation: the precise functional form relating the extracted convergence and escape features to final loss is not written as an explicit equation in the main text, complicating reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for explicit scale information and scaling tests. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Empirical Evaluation] Empirical results (held-out and out-of-family evaluations): the 94% Top-2 hit rate and out-of-family success are measured on configurations whose (N, D) ranges relative to the fitting regime are not stated; without explicit demonstration that the feature-to-loss mapping extrapolates beyond the training scales, the central pre-selection utility from small-scale runs is unsupported.

    Authors: We agree the (N, D) ranges for held-out and out-of-family evaluations were not explicitly compared to the fitting regime. The revised manuscript will add a table listing exact N and D values for fitting, held-out, and out-of-family sets, confirming that held-out points include scales up to 2x larger than the fitting maximum. Out-of-family tests use the same scale band as fitting. We will also report a small additional experiment applying the fitted mapping to one larger held-out scale to directly support the extrapolation claim for pre-selection utility. revision: yes

  2. Referee: [SDE Analysis] SDE convergence/escape analysis section: the mapping from SDE-derived features to predicted loss is presented as remaining predictive across scales, yet no ablation or scaling test is shown that varies N and D independently while holding the fitted coefficients fixed, leaving the extrapolation assumption untested.

    Authors: The SDE features are normalized by construction to remove explicit N and D dependence, which is why we expected the linear mapping to transfer. However, the referee is correct that no explicit test holding the regression coefficients fixed while independently varying N and D appears in the current manuscript. We will add this ablation in revision: coefficients fitted on the smallest scale subset will be applied to predict loss on the largest scale subset, with the resulting error reported to test the extrapolation assumption. revision: yes
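
The fixed-coefficient test promised in this response can be illustrated with a minimal sketch: fit a linear map from features to loss on small-scale runs only, freeze the coefficients, and measure relative error on larger-scale runs. The linear-with-intercept form is an assumption taken from the rebuttal's phrase "linear mapping"; the two feature columns, the helper names, and all numbers are synthetic and hypothetical, generated from a known rule so that the mechanics can be checked. They are not data from the paper.

```python
def fit_linear(features, losses):
    """Ordinary least squares with an intercept, via the normal equations."""
    rows = [[1.0] + list(f) for f in features]
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    m = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * y for r, y in zip(rows, losses))]
         for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - tail) / m[i][i]
    return w


def predict(w, feature):
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], feature))


def mean_relative_error(w, features, losses):
    errs = [abs(predict(w, f) - y) / abs(y) for f, y in zip(features, losses)]
    return sum(errs) / len(errs)


def synthetic_loss(conv, esc):
    # Known generating rule, used only to produce illustrative data.
    return 2.0 + 1.5 * conv - 0.5 * esc


# Hypothetical (convergence, escape) feature pairs.
small_scale = [(0.1, 0.5), (0.2, 0.4), (0.3, 0.9), (0.4, 0.2)]
large_scale = [(0.5, 0.3), (0.8, 0.1)]

w = fit_linear(small_scale, [synthetic_loss(*f) for f in small_scale])
# Coefficients stay frozen; only the evaluation set changes.
print(mean_relative_error(w, large_scale,
                          [synthetic_loss(*f) for f in large_scale]))
```

On real runs, the quantity of interest is how much this held-out error grows relative to the in-sample error; a large gap would indicate that the normalized features still carry hidden dependence on N or D.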

Circularity Check

0 steps flagged

No circularity detected; derivation appears self-contained

Full rationale

The provided abstract and description frame Opt-Laws as deriving interpretable convergence/escape features from SDE analyses to predict final loss across scales, with empirical validation on held-out and out-of-family settings (94% Top-2 hit rate, F1=0.92). No equations, self-citations, or fitted-parameter renamings are visible that would reduce any prediction to its inputs by construction. The central claim rests on the predictive power of SDE-derived features rather than tautological re-use of the target loss or scale-specific fits. This is the expected non-finding when no load-bearing reduction can be exhibited from quoted text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claim rests on the assumption that SDE convergence and escape features transfer across scales; this is a domain assumption with no independent evidence supplied in the abstract.

axioms (1)
  • domain assumption: SDE-based convergence and escape analyses accurately model LLM optimization dynamics.
    Stated as the grounding for the interpretable features in the abstract.

pith-pipeline@v0.9.0 · 5703 in / 1163 out tokens · 25847 ms · 2026-05-23T20:42:57.339009+00:00 · methodology

