Optimization Hyper-parameter Laws for Large Language Models

Kim-Chuan Toh; Kuangyu Ding; Shuicheng Yan; Tianwen Wei; Xingyu Xie

arxiv: 2409.04777 · v4 · pith:3R2LV3LQnew · submitted 2024-09-07 · 💻 cs.LG · math.OC

Xingyu Xie, Kuangyu Ding, Shuicheng Yan, Kim-Chuan Toh, Tianwen Wei

Pith reviewed 2026-05-23 20:42 UTC · model grok-4.3

classification 💻 cs.LG math.OC

keywords optimization hyper-parameters · learning rate schedules · large language models · scaling laws · stochastic differential equations · training divergence detection

The pith

Opt-Laws predict final training loss for learning-rate schedules from small-scale experiments across model and data sizes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops Optimization Hyper-parameter Laws (Opt-Laws) to select dynamic hyper-parameters such as learning-rate schedules for large language model training. Existing scaling laws address model and data size but leave schedule choice unresolved. Opt-Laws derive convergence and escape features from stochastic differential equation (SDE) analyses and use them to forecast final loss as a joint function of schedule, model size, and data size. These features allow schedule candidates to be ranked using only small runs, and the framework is reported to transfer well to held-out and out-of-family cases.

Core claim

Opt-Laws, grounded in SDE-based convergence and escape analyses, produce interpretable features that predict final training loss across scales and enable reliable pre-selection of learning-rate schedules from small-scale experiments.

What carries the argument

Opt-Laws framework that extracts convergence and escape features from stochastic differential equation models of optimization to predict final loss.

If this is right

Learning-rate schedules can be narrowed to a small set of candidates without running full-scale training.
The best schedule family can be identified correctly even when the test configuration lies outside the families seen during feature fitting.
Training runs that will diverge can be flagged early with an F1 score of 0.92 using the same features.
Schedule ranking achieves 94 percent Top-2 hit rate on held-out configurations.
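
The two headline metrics in the list above can be made concrete with a short sketch. It assumes one plausible definition of a Top-k hit (the schedule with the lowest observed loss appears among the k schedules with the lowest predicted loss); the paper's exact definition is not given on this page. The function names and the toy numbers are hypothetical and only exercise the code.

```python
from typing import Dict, List


def top_k_hit(predicted: Dict[str, float], observed: Dict[str, float], k: int = 2) -> bool:
    """True if the schedule with the lowest observed loss is among the
    k schedules with the lowest predicted loss (assumed definition)."""
    ranked = sorted(predicted, key=predicted.get)  # ascending predicted loss
    best_observed = min(observed, key=observed.get)
    return best_observed in ranked[:k]


def top_k_hit_rate(configs: List[dict], k: int = 2) -> float:
    """Fraction of configurations that count as a Top-k hit."""
    hits = sum(top_k_hit(c["predicted"], c["observed"], k) for c in configs)
    return hits / len(configs)


def f1_score(y_true: List[bool], y_pred: List[bool]) -> float:
    """F1 for binary divergence detection (True means the run diverged)."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy, made-up numbers purely to exercise the functions; not results from the paper.
toy_configs = [
    {"predicted": {"linear": 2.31, "cosine": 2.29, "constant": 2.40},
     "observed": {"linear": 2.30, "cosine": 2.32, "constant": 2.41}},
    {"predicted": {"linear": 2.10, "cosine": 2.15, "constant": 2.05},
     "observed": {"linear": 2.08, "cosine": 2.16, "constant": 2.12}},
]
toy_hit_rate = top_k_hit_rate(toy_configs, k=2)
toy_f1 = f1_score([True, False, True, False], [True, False, False, False])
```

Under this definition a Top-2 hit is more forgiving than a Top-1 hit: in both toy configurations the predictor's first choice is wrong, yet the true best schedule is still in its top two.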

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same feature extraction might extend to other dynamic hyper-parameters such as batch-size or optimizer momentum schedules.
If the features remain stable under distribution shift, the method could reduce wasted compute on divergent or suboptimal large runs.
The approach supplies a concrete route to test whether optimization dynamics admit scale-invariant descriptors beyond loss curves.

Load-bearing premise

The SDE-derived convergence and escape features stay predictive of final loss when model scale and data size change.

What would settle it

Apply the fitted Opt-Laws to a held-out model size or data volume larger than any training example and check whether the top-ranked schedule actually yields the lowest observed loss.
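
One way to phrase that test in code is as a selection regret: the gap between the observed loss of the schedule the fitted predictor ranks first and the observed loss of the truly best schedule. This is a minimal sketch, not the paper's evaluation protocol; the function name, the schedule labels, and the numbers are hypothetical placeholders.

```python
from typing import Dict


def selection_regret(predicted: Dict[str, float], observed: Dict[str, float]) -> float:
    """Observed-loss gap between the predictor's top pick and the true best.

    Zero means the top-ranked schedule really did give the lowest loss.
    """
    top_pick = min(predicted, key=predicted.get)
    return observed[top_pick] - min(observed.values())


# Placeholder values for a hypothetical held-out scale; not results from the paper.
predicted_large = {"linear": 1.92, "cosine": 1.95, "const_linear": 1.90}
observed_large = {"linear": 1.93, "cosine": 1.97, "const_linear": 1.94}

regret = selection_regret(predicted_large, observed_large)
passes = regret == 0.0
```

A regret of zero on a scale larger than any fitting example would support the extrapolation claim; a small positive regret, as in the placeholder numbers, shows how far the top pick is from the best schedule.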

Figures

Figures reproduced from arXiv: 2409.04777 by Kim-Chuan Toh, Kuangyu Ding, Shuicheng Yan, Tianwen Wei, Xingyu Xie.

**Figure 1.** Contour plots of predicted perplexity, which is the exponential of the predicted training loss, versus warmup steps and peak LR for different token quantities (3B, 6B, 10B, 30B) from the RedPajama-v2 dataset.

**Figure 3.** Illustration of the criterion for predicting training divergence using a linear warmup and cooldown schedule. The areas S1 (where the learning rate is below the threshold ηL) and S2 (where it exceeds ηL) are compared. A ratio S1/S2 > 1 suggests stable training, while a ratio < 1 indicates likely divergence.

**Figure 2.** Smoothed final training loss across various combinations of training parameters, including model sizes from 8 × 0.001B to 8 × 0.3B MoEs, peak LRs from 1e-3 to 1.5e-2, warmup steps from 128 to 6000, and data sizes of 10B and 30B tokens. Each grid point represents the loss for a specific parameter set. Divergent training runs were assigned a loss of 7, reflecting the typical plateau observed in practice.

**Figure 4.** Illustration of a typical LR schedule comprising four phases: warmup, decay, plateau, and cooldown. This framework encompasses most LR schedules used in LLM training as special cases. We use this example to demonstrate the selection of the hyperparameters ac and ae in Opt-Laws.

**Figure 5.** Comparison of actual training outcomes (left) and loss predictions generated by Opt-Laws (right) for a common LR schedule pattern with linear warmup and cooldown. In regions where R(ηmax, a1, N, S) > 1, the divergence indicator from Eqn. (4), the predicted loss is set to 7 to signify training failure. The average relative error between the predicted and actual losses is within 0.5%, demonstrating the accur…

**Figure 6.** Training loss comparison for 8×0.1B and 8×0.6B MoEs under three LR schedules (linear decay, cosine decay, and constant followed by linear decay) across pretraining scales of 3B, 10B, and 100B tokens. Initial disparities in training loss at 3B tokens diminish with increased data volume, but larger model sizes slow the convergence of these gaps, highlighting the interplay between model scale and data volume…

**Figure 7.** Training loss curves for three distinct LR schedules applied to the 8x0.6B model on the 100B token dataset. Despite substantial differences in the schedules, the losses converge to nearly the same final value, in line with Opt-Laws’s predictions, demonstrating its effectiveness in accurately forecasting training outcomes.

**Figure 8.** Loss curves for continual training under five different learning rate schedules with weak data distribution shift. (a) shows the LR schedules, while (b) depicts the resulting loss trajectories. Despite the variance in LR schedules, final losses converge closely, suggesting a limited impact of the LR schedule on final performance when the data distribution shift is weak.
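
The captions of Figures 3 and 4 describe two pieces that are easy to sketch: a four-phase schedule (warmup, decay, plateau, cooldown) and an area comparison against a learning-rate threshold. The block below is an illustration under stated assumptions, not the paper's implementation. In particular, it assumes that S1 and S2 are the areas under the LR curve over the steps where the LR is at or below, respectively above, the threshold ηL; the paper may define the areas differently. All function names, parameter names, and numeric values are hypothetical.

```python
from typing import List


def four_phase_lr(step: int, peak: float, plateau: float, final: float,
                  warmup: int, decay: int, hold: int, cooldown: int) -> float:
    """Piecewise-linear LR: warmup -> decay -> plateau -> cooldown."""
    if step < warmup:
        return peak * (step + 1) / warmup            # ramp up to the peak LR
    step -= warmup
    if step < decay:
        return peak + (plateau - peak) * (step + 1) / decay
    step -= decay
    if step < hold:
        return plateau                                # constant plateau phase
    step -= hold
    if step < cooldown:
        return plateau + (final - plateau) * (step + 1) / cooldown
    return final


def area_ratio(lrs: List[float], threshold: float) -> float:
    """S1 / S2 under the ASSUMED reading of Figure 3 (unit step width)."""
    s1 = sum(lr for lr in lrs if lr <= threshold)    # LR at or below threshold
    s2 = sum(lr for lr in lrs if lr > threshold)     # LR above threshold
    return float("inf") if s2 == 0 else s1 / s2


# Linear warmup and cooldown as a special case: no decay or plateau phase.
lrs = [four_phase_lr(s, peak=1e-3, plateau=1e-3, final=0.0,
                     warmup=100, decay=0, hold=0, cooldown=900)
       for s in range(1000)]
ratio = area_ratio(lrs, threshold=5.25e-4)  # hypothetical threshold value
likely_stable = ratio > 1.0
```

Setting `decay=0` and `hold=0` recovers the linear warmup and cooldown pattern of Figure 3, which is one way the four-phase form covers simpler schedules as special cases. Raising the hypothetical threshold moves area from S2 to S1 and pushes the ratio above 1.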

Original abstract

Large Language Models have driven significant AI advancements, yet their training is resource-intensive and highly sensitive to hyper-parameter selection. While scaling laws provide valuable guidance on model size and data requirements, they fall short in choosing dynamic hyper-parameters, such as learning-rate (LR) schedules, that evolve during training. To bridge this gap, we present Optimization Hyper-parameter Laws (Opt-Laws), a framework that predicts final training loss as a function of LR schedule, model size, and data size. Grounded in SDE-based convergence and escape analyses, Opt-Laws yield interpretable convergence and escape features that predict final training loss across model scales, enabling schedule pre-selection from small-scale experiments. Empirically, Opt-Laws achieve a 94% Top-2 hit rate for identifying near-optimal schedule candidates on held-out configurations, correctly identify the best-performing schedule family in all five evaluated out-of-family settings, and detect training divergence with F1 = 0.92.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Opt-Laws packages SDE convergence/escape features into loss predictors for LR schedules and reports solid held-out hit rates, but the cross-scale extrapolation claim rests on unshown evidence.

The core offering is a framework that extracts interpretable features from SDE-based convergence and escape analyses, then uses them to predict final training loss as a function of schedule, model size, and data size. This lets you run small experiments and pre-select schedules that should work at larger scale. The empirical side shows 94% top-2 accuracy on held-out configurations, correct identification of the best schedule family in all five out-of-family tests, and F1=0.92 for spotting divergence. Those numbers are the clearest positive signal in the abstract.

The grounding in prior SDE work is also a plus; it gives the features some theoretical motivation rather than pure curve-fitting. What is actually new is the specific combination of those features into scale-aware predictors for dynamic hyper-parameters. The paper does a reasonable job stating the practical goal and the reported performance.

The soft spot is exactly the one the stress-test note flags. Nothing in the provided abstract or description shows that the same feature-to-loss mapping was tested or holds when both model size N and data size D move well beyond the fitting regime. The held-out results could easily be within the same scale band, which would make the pre-selection claim much weaker. Without equations, data splits, or error analysis visible, it is also impossible to check whether the SDE features are doing real work or whether the predictor is mostly learning from the small-scale runs themselves.

This is for groups that already run many small-scale ablations and want a more systematic way to rank schedules before committing big compute. A reader focused on practical training efficiency could extract value if the full derivations and scale ranges check out. It is worth sending to a serious referee because the target problem is real and the reported hit rates are high enough to merit checking the details, even if the extrapolation piece needs more evidence.

Referee Report

2 major / 2 minor

Summary. The paper introduces Optimization Hyper-parameter Laws (Opt-Laws), a framework grounded in SDE-based convergence and escape analyses that extracts interpretable features to predict final LLM training loss as a function of learning-rate schedule, model size N, and data size D. It claims this enables reliable pre-selection of near-optimal schedules from small-scale experiments, reporting a 94% Top-2 hit rate on held-out configurations, correct identification of the best schedule family in all five out-of-family settings, and F1=0.92 for detecting divergence.

Significance. If the SDE-derived features remain predictive under scale changes, the framework would offer a practical method to reduce the cost of schedule tuning for large models by transferring insights from small-scale runs, complementing existing scaling laws for N and D.

major comments (2)

[Empirical Evaluation] Empirical results (held-out and out-of-family evaluations): the 94% Top-2 hit rate and out-of-family success are measured on configurations whose (N, D) ranges relative to the fitting regime are not stated; without explicit demonstration that the feature-to-loss mapping extrapolates beyond the training scales, the central pre-selection utility from small-scale runs is unsupported.
[SDE Analysis] SDE convergence/escape analysis section: the mapping from SDE-derived features to predicted loss is presented as remaining predictive across scales, yet no ablation or scaling test is shown that varies N and D independently while holding the fitted coefficients fixed, leaving the extrapolation assumption untested.

minor comments (2)

[Abstract] Abstract: the phrase 'across model scales' is used without accompanying quantitative ranges or a table summarizing the (N, D) values in fitting vs. test sets.
[Framework] Notation: the precise functional form relating the extracted convergence and escape features to final loss is not written as an explicit equation in the main text, complicating reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for explicit scale information and scaling tests. We address each major comment below and will revise the manuscript accordingly.

Referee: [Empirical Evaluation] Empirical results (held-out and out-of-family evaluations): the 94% Top-2 hit rate and out-of-family success are measured on configurations whose (N, D) ranges relative to the fitting regime are not stated; without explicit demonstration that the feature-to-loss mapping extrapolates beyond the training scales, the central pre-selection utility from small-scale runs is unsupported.

Authors: We agree the (N, D) ranges for held-out and out-of-family evaluations were not explicitly compared to the fitting regime. The revised manuscript will add a table listing exact N and D values for fitting, held-out, and out-of-family sets, confirming that held-out points include scales up to 2x larger than the fitting maximum. Out-of-family tests use the same scale band as fitting. We will also report a small additional experiment applying the fitted mapping to one larger held-out scale to directly support the extrapolation claim for pre-selection utility.
revision: yes

Referee: [SDE Analysis] SDE convergence/escape analysis section: the mapping from SDE-derived features to predicted loss is presented as remaining predictive across scales, yet no ablation or scaling test is shown that varies N and D independently while holding the fitted coefficients fixed, leaving the extrapolation assumption untested.

Authors: The SDE features are normalized by construction to remove explicit N and D dependence, which is why we expected the linear mapping to transfer. However, the referee is correct that no explicit test holding the regression coefficients fixed while independently varying N and D appears in the current manuscript. We will add this ablation in revision: coefficients fitted on the smallest scale subset will be applied to predict loss on the largest scale subset, with the resulting error reported to test the extrapolation assumption.
revision: yes
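
The ablation described in this response can be sketched as a fit-small, predict-large protocol. The block below is a minimal illustration in pure Python: it fits a linear map with intercept on a small-scale subset, freezes the coefficients, and reports mean relative error on a larger-scale subset. The two feature columns stand in for a convergence feature and an escape feature; the data are synthetic and generated from an exact linear rule, so the near-zero error says nothing about the paper's results. All names are hypothetical.

```python
from typing import List, Sequence


def fit_linear(features: Sequence[Sequence[float]], losses: Sequence[float]) -> List[float]:
    """Ordinary least squares with intercept, via the normal equations."""
    rows = [[1.0, *f] for f in features]
    n = len(rows[0])
    # Augmented normal-equation system [X^T X | X^T y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * y for r, y in zip(rows, losses))] for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * p for x, p in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


def predict(coeffs: Sequence[float], feature_row: Sequence[float]) -> float:
    return coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], feature_row))


def mean_relative_error(coeffs, features, losses) -> float:
    errors = [abs(predict(coeffs, f) - y) / abs(y) for f, y in zip(features, losses)]
    return sum(errors) / len(errors)


# SYNTHETIC data from loss = 2.0 + 0.5 * convergence + 0.25 * escape.
small_features = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
small_losses = [2.5, 2.25, 2.75, 3.25]
large_features = [(3.0, 2.0), (4.0, 1.0)]   # outside the fitted feature range
large_losses = [4.0, 4.25]

coeffs = fit_linear(small_features, small_losses)   # fitted once, then frozen
extrapolation_error = mean_relative_error(coeffs, large_features, large_losses)
```

The point of the design is that `coeffs` is never refit on the larger subset, so any error measured there reflects extrapolation rather than in-range interpolation.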

Circularity Check

0 steps flagged

No circularity detected; derivation appears self-contained

The provided abstract and description frame Opt-Laws as deriving interpretable convergence/escape features from SDE analyses to predict final loss across scales, with empirical validation on held-out and out-of-family settings (94% Top-2 hit rate, F1=0.92). No equations, self-citations, or fitted-parameter renamings are visible that would reduce any prediction to its inputs by construction. The central claim rests on the predictive power of SDE-derived features rather than tautological re-use of the target loss or scale-specific fits. This is the expected non-finding when no load-bearing reduction can be exhibited from quoted text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claim rests on the assumption that SDE convergence and escape features transfer across scales; this is a domain assumption with no independent evidence supplied in the abstract.

axioms (1)

domain assumption: SDE-based convergence and escape analyses accurately model LLM optimization dynamics.
Stated as the grounding for the interpretable features in the abstract.
