
arxiv: 2605.21486 · v1 · pith:XXH3CVNXnew · submitted 2026-05-20 · 💻 cs.LG · cond-mat.dis-nn · cs.AI · stat.ML

Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate

Pith reviewed 2026-05-21 04:58 UTC · model grok-4.3

classification 💻 cs.LG · cond-mat.dis-nn · cs.AI · stat.ML
keywords hyperparameter transfer · maximal update parameterization · embedding layer · learning rate · AdamW · scaling laws · training stability · large language models

The pith

The primary reason maximal update parameterization offers better hyperparameter transfer than standard parameterization is its higher learning rate on the embedding layer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework with three metrics to evaluate hyperparameter transfer across scales: the quality of scaling law fits, robustness to extrapolation errors, and the asymptotic loss penalty from parameterization choice. It then uses this to analyze why μP outperforms SP for learning rate transfer in AdamW training of language models. The analysis reveals that the main advantage stems from the embedding layer receiving a learning rate scaled up by the model width in μP, whereas SP uses a much smaller rate that bottlenecks training. Experiments show that simply raising the embedding learning rate in SP by this width factor removes instabilities and matches μP's transfer performance. Additionally, weight decay is found to improve scaling law fits but decrease extrapolation robustness under fixed token-per-parameter conditions.
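To make the "fit a scaling law, then extrapolate" step concrete, here is a minimal sketch in plain Python. It assumes a simple power law for the optimal learning rate as a function of width, fits it by least squares in log-log space, and reports the RMS residual. The function names, the power-law form, and the data are all illustrative assumptions: the widths and optima below are synthetic, and the RMS residual is only a stand-in for fit quality, not the paper's predictability error E.

```python
import math

def fit_power_law(widths, optimal_lrs):
    """Least-squares fit of log(lr*) = intercept + slope * log(width)."""
    xs = [math.log(n) for n in widths]
    ys = [math.log(lr) for lr in optimal_lrs]
    m = len(xs)
    x_mean = sum(xs) / m
    y_mean = sum(ys) / m
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # RMS residual in log space: a generic goodness-of-fit number (assumption).
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    rms = math.sqrt(sum(r * r for r in residuals) / m)
    return slope, intercept, rms

def extrapolate_lr(slope, intercept, width):
    """Predict the optimal learning rate at a width outside the fitted range."""
    return math.exp(intercept + slope * math.log(width))

# Synthetic optima that follow lr* = 0.5 / n exactly (not data from the paper).
widths = [128, 256, 512, 1024]
optimal_lrs = [0.5 / n for n in widths]

slope, intercept, rms = fit_power_law(widths, optimal_lrs)
predicted = extrapolate_lr(slope, intercept, 4096)
```

With real sweeps the residual would be nonzero, and any error in the fitted exponent would be amplified as the target width moves further from the fitted range, which is the kind of sensitivity the robustness metric is meant to capture.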

Core claim

We find that the overwhelming benefit of μP relative to SP when training with AdamW arises simply from maximizing the learning rate of the embedding layer. In SP, the embedding layer learning rate acts as a bottleneck that induces training instabilities; increasing it by a factor of width to match μP dramatically smooths out training while improving hyperparameter transfer.

What carries the argument

The embedding-layer learning rate, which serves as a training bottleneck in standard parameterization but is scaled with width in maximal update parameterization.
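In symbols, using the Θ(·) scalings quoted in the Figure 2 caption, with n the width and η_emb the embedding-layer learning rate (the superscript labels are shorthand introduced here, not the paper's notation):

```latex
\eta_{\text{emb}}^{\text{SP}} = \Theta\!\left(\frac{1}{n}\right), \qquad
\eta_{\text{emb}}^{\mu\text{P}} = \Theta(1), \qquad
\eta_{\text{emb}}^{\text{SP+Embd}} = n \cdot \eta_{\text{emb}}^{\text{SP}} = \Theta(1).
```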

If this is right

  • Raising the embedding layer learning rate by a factor of model width in standard parameterization eliminates training instabilities and matches the hyperparameter transfer of μP.
  • Weight decay improves the fit quality of scaling laws for optimal hyperparameters.
  • Weight decay reduces the robustness of hyperparameter extrapolation in the fixed token-per-parameter regime.
  • The three proposed metrics enable quantitative comparison of how well different parameterizations support scale-invariant training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Targeted learning rate adjustments for specific layers like embeddings could be a simpler alternative to adopting full μP for improving transfer in practice.
  • The results may apply to other model components or optimizers where certain layers act as scale-dependent bottlenecks.
  • This work highlights the value of layer-wise analysis in understanding parameterization effects at large scales.

Load-bearing premise

The comprehensive ablations successfully pinpoint the embedding layer learning rate as the dominant factor distinguishing μP and SP without interference from other training elements or choices.

What would settle it

A direct test would be to train models of varying widths using standard parameterization but with the embedding layer learning rate multiplied by the width, and verify whether training becomes stable and hyperparameter transfer quality approaches that of μP.
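A minimal sketch of how such a test could be configured is shown below. It builds two optimizer parameter groups, shaped like the per-group option dicts that optimizers such as `torch.optim.AdamW` accept, and multiplies only the embedding group's learning rate by the width. Everything here is an assumption for illustration: the helper name, the rule that embedding parameters are identified by "embed" in their name, the example parameter names, and the choice of an SP base rate of the form c/n. Parameter names stand in for the actual tensors so the sketch runs without any deep learning library.

```python
def build_param_groups(param_names, base_lr, width, boost_embedding):
    """Split parameters into an embedding group and an 'everything else' group.

    Only the embedding group's learning rate is changed; all other
    parameters keep the base rate untouched.
    """
    embedding = [p for p in param_names if "embed" in p]      # assumed naming rule
    other = [p for p in param_names if "embed" not in p]
    emb_lr = base_lr * width if boost_embedding else base_lr  # factor-of-width boost
    return [
        {"name": "embedding", "params": embedding, "lr": emb_lr},
        {"name": "other", "params": other, "lr": base_lr},
    ]

# Hypothetical parameter names for a small GPT-style model.
names = ["tok_embed.weight", "attn.qkv.weight", "mlp.fc.weight", "lm_head.weight"]

results = {}
for width in (256, 1024):
    base_lr = 1e-3 / width  # assumed SP-style rate that shrinks with width
    sp = build_param_groups(names, base_lr, width, boost_embedding=False)
    sp_embd = build_param_groups(names, base_lr, width, boost_embedding=True)
    results[width] = (sp, sp_embd)
```

In this toy setup the boosted embedding rate stays at 1e-3 for every width while the other group's rate keeps shrinking, which mirrors the Θ(1) versus Θ(1/n) distinction in the Figure 2 caption. Keeping the non-embedding group identical between the two variants is what isolates the embedding effect.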

Figures

Figures reproduced from arXiv: 2605.21486 by Dayal Singh Kalra, Maissam Barkeshli.

Figure 1: Computing the three transfer metrics for μP. (a) Loss vs. log learning rate ν, with a star marking the optimum ν*(n). (b) Joint fit of the loss model (Equation (6), dashed lines), with a low predictability error E = 0.0034. (c) Loss curves in the normalized coordinates (Equation (8)), with κ = −2.640 indicating robust transfer. (d-f) Scaling laws for optimal loss L*(n), optimal log-learning-rate ν*(n)…
Figure 2: Embedding layer learning rate is the critical difference between SP and μP. Loss vs. log learning rate ν across widths for: (a) SP with Θ(1/n) embedding learning rate; (b) SP modified to use Θ(1) embedding learning rate (SP+Embd); (c) μP modified to use Θ(1/n) embedding learning rate (μP-Embd). Speeding up the embedding in SP eliminates training instabilities and yields smooth, μP-like curves, while slowin…
Figure 3: Transfer metrics for parameterizations interpolating between SP and μP. Parameterizations with ‘+’ denote incremental changes from SP towards μP, while ‘−’ denotes changes from μP towards SP. Green and red regions indicate desirable and undesirable regimes, respectively. The orange arrow highlights SP+Embd (SP with Θ(1) embedding learning rate), which matches μP across all three metrics, suggesting the emb…
Figure 4: (a) Reducing the embedding learning rate from …
Figure 5: Effect of weight decay on the three transfer metrics in the fixed-step (top) and fixed TPP (bottom) settings. In the fixed-step setting, weight decay improves predictability error E at the cost of asymptotic performance R(∞). In the TPP setting, stable parameterizations achieve near-zero R(∞), and landscape predictability trends are similar, but robustness κ degrades as weight decay increases.
Figure 6: (a) Interpolated loss curves for μP across widths with loss filtering threshold f = 1.35. Raw observed points (circles) and the fitted smoothing spline (solid lines) are shown for each width. (b) Per-width curvature fits for μP. For each width, we show the interpolated loss curve (solid line) and the fitted centered quadratic $L(\nu) = L_{\min} + \tfrac{1}{2} H(n) (\nu - \nu^*)^2$ (dashed line).
Figure 7: Fitted β as a function of the lower bound β_min for two cases. Left: a degenerate case (μP) where the step function better fits the observed trend, and we adopt the optimal solution with β > β*_min. Right: a genuine case (μP-Attn) where β monotonically increases with β_min, suggesting that the fitted small β can be trusted.
Figure 8: Transfer metrics for μP with weight decay λ = 0.006. Repeated …
Figure 9: Transfer metrics for SP with weight decay …
Figure 10: Transfer metrics for SP+Embd with weight decay …
Figure 11: Transfer metrics for μP-Embd with weight decay λ = 0.006.
[Plot text, SP+Attn, λ = 0.001: (a) Loss Transfer Curves, L(n; ν) vs. ν; (b) Loss Fit, E = 0.0098; (c) Scaled Transfer Curves, κ = 0.611; (d) Loss Scaling Law, L*(n) vs. n, …; widths 128, 256, 512, 768, 1024, 1536, 2048.]
Figure 12: Transfer metrics for SP+Attn with weight decay …
Figure 13: Transfer metrics for μP-Attn with weight decay λ = 0.006.
Figure 14: Transfer metrics for SP+LN with weight decay …
Figure 15: Transfer metrics for μP-LN with weight decay λ = 0.001.
[Plot text, SP+Last, λ = 0.001: (a) Loss Transfer Curves, L(n; ν) vs. ν; (b) Loss Fit, E = 0.0104; (c) Scaled Transfer Curves, κ = −2.308; (d) Loss Scaling Law, L*(n) vs. n, …; widths 128, 256, 512, 768, 1024, 1536, 2048.]
Figure 16: Transfer metrics for SP+Last with weight decay …
Figure 17: Transfer metrics for μP-Last with weight decay λ = 0.006.
Figure 18: Transfer metrics for parameterizations interpolating between SP and μP. Parameterizations with ‘+’ denote incremental changes from SP towards μP, while ‘−’ denotes changes from μP towards SP. Green and red regions indicate desirable and undesirable regimes, respectively.
Figure 19: CNN experiments on CIFAR with Adam. Training loss as a function of log learning rate ν for four parameterizations across widths. In SP, the optimal learning rate drifts with width, while μP shows substantially less drift. Increasing the learning rate of the input-facing layer in SP (SP+Embd) largely removes this drift and places the optimum in a similar region to μP. By contrast, changing only the last-la…
Figure 20: Loss curves for μP in the compute-optimal setting (20 TPP) under two weight decay scalings. (a) Standard μP convention η · λ = Θ(1): ν*(n) shifts to the left with increasing width. (b) Corrected convention η · λ = Θ(1/n²): ν*(n) doesn't vary much, but the shape of the loss curves around the minimum changes across widths.
Figure 21: Learning rate warmup is essential for observing reliable learning rate transfer in …
Figure 22: Effect of freezing the embedding layer in …
Original abstract

Hyperparameter transfer allows extrapolating optimal optimization hyperparameters from small to large scales, making it critical for training large language models (LLMs). This is done either by fitting a scaling law to the hyperparameters or by a judicious choice of parameterization, such as Maximal Update ($\mu$P), that renders optimal hyperparameters approximately scale invariant. In this paper, we first develop a framework to quantify hyperparameter transfer through three metrics: (1) the quality of the scaling law fit, (2) the robustness to extrapolation errors, and (3) the asymptotic loss penalty due to choice of parameterization. Next, we investigate through a comprehensive series of ablations why $\mu$P appears to offer high-quality learning rate transfer relative to standard parameterization (SP), as existing theory is inadequate. We find that the overwhelming benefit of $\mu$P relative to SP when training with AdamW arises simply from maximizing the learning rate of the embedding layer. In SP, the embedding layer learning rate acts as a bottleneck that induces training instabilities; increasing it by a factor of width to match $\mu$P dramatically smooths out training while improving hyperparameter transfer. We also find that weight decay improves the scaling law fits, while, in the fixed token-per-parameter setting, it hurts the robustness of the extrapolation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript develops a framework with three metrics to quantify hyperparameter transfer: scaling-law fit quality, robustness to extrapolation errors, and asymptotic loss penalty from parameterization choice. It then uses a series of ablations to investigate why Maximal Update Parameterization (μP) yields better learning-rate transfer than Standard Parameterization (SP) under AdamW. The central empirical claim is that the dominant advantage of μP arises simply from scaling the embedding-layer learning rate by model width; in SP this layer acts as a bottleneck inducing instabilities, and correcting only the embedding LR (multiplying by width) largely recovers μP’s stability and transfer benefits. Weight decay is reported to improve scaling-law fits while reducing extrapolation robustness in the fixed tokens-per-parameter regime.

Significance. If the isolation of the embedding-layer effect holds, the result is significant for LLM training practice: it suggests that a single, simple modification to SP can approximate most of μP’s hyperparameter-transfer gains without adopting the full reparameterization. The three-metric framework itself supplies a reproducible, falsifiable way to compare transfer across parameterizations. The work is strengthened by its comprehensive ablations and explicit focus on AdamW training dynamics, which are directly relevant to current large-model pipelines.

major comments (2)
  1. [§4] §4 (ablations isolating embedding LR): the central claim that 'the overwhelming benefit ... arises simply from maximizing the learning rate of the embedding layer' requires explicit verification that the modified-SP condition applies only the width-scaled embedding LR while leaving all other layer-specific scalings (attention, MLP, output) identical to SP. If any μP-style rules for non-embedding layers are inadvertently active, the attribution to embedding LR alone is confounded and the 'simply from' conclusion is not load-bearing.
  2. [results on weight decay] Fixed tokens-per-parameter experiments (reported in results on weight decay and robustness): the claim that weight decay 'hurts the robustness of the extrapolation' must be shown to be insensitive to the precise tokens-per-parameter schedule; otherwise the interaction between embedding-LR correction and the token schedule could still confound the reported stability gains.
minor comments (2)
  1. [§3] Notation for the three metrics should be introduced with explicit equations or pseudocode in §3 so that readers can reproduce the scaling-law fit quality and robustness calculations without ambiguity.
  2. [figures] Figure captions for the training-curve plots should state the exact width values, batch size, and whether error bars represent standard deviation over seeds or runs.
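As an illustration of the kind of pseudocode the minor comment on §3 asks for, the sketch below fits the one formula the figure captions do spell out: the centered quadratic L(ν) = L_min + ½ H(n) (ν − ν*)² from the Figure 6 caption, fitted per width to loss versus log learning rate. This is a plain least-squares parabola fit written for this note, with a hypothetical function name and synthetic data. It does not reproduce the paper's filtering, smoothing-spline, or interpolation steps, and how the fitted curvature enters the three metrics is not specified here.

```python
def fit_centered_quadratic(nus, losses):
    """Least-squares fit of L(nu) = L_min + 0.5 * H * (nu - nu_star)**2.

    Fits the equivalent form c + b*nu + a*nu**2 through the 3x3 normal
    equations, then converts (a, b, c) to (L_min, H, nu_star).
    Assumes the data are genuinely curved, so that a != 0.
    """
    s = [sum(x ** k for x in nus) for k in range(5)]  # power sums of nu
    t = [sum(y * x ** k for x, y in zip(nus, losses)) for k in range(3)]
    # Augmented normal-equation matrix; unknowns ordered (c, b, a).
    A = [
        [s[0], s[1], s[2], t[0]],
        [s[1], s[2], s[3], t[1]],
        [s[2], s[3], s[4], t[2]],
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for i in range(3):
        pivot = max(range(i, 3), key=lambda r: abs(A[r][i]))
        A[i], A[pivot] = A[pivot], A[i]
        for r in range(3):
            if r != i:
                f = A[r][i] / A[i][i]
                A[r] = [vr - f * vi for vr, vi in zip(A[r], A[i])]
    c, b, a = (A[i][3] / A[i][i] for i in range(3))
    H = 2.0 * a                      # curvature at the optimum
    nu_star = -b / (2.0 * a)         # optimal log learning rate
    L_min = c - b * b / (4.0 * a)    # loss at the optimum
    return L_min, H, nu_star

# Synthetic loss curve with L_min = 3.0, H = 0.8, nu_star = -1.5 (not paper data).
nus = [-3.0, -2.5, -2.0, -1.5, -1.0, -0.5, 0.0]
losses = [3.0 + 0.5 * 0.8 * (nu + 1.5) ** 2 for nu in nus]
L_min, H, nu_star = fit_centered_quadratic(nus, losses)
```

On real sweeps the fit would only be trusted near the minimum, since loss curves are typically asymmetric far from the optimum and diverging runs would need to be filtered out first.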

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which help clarify the presentation of our results. We address each major comment in turn below.

Point-by-point responses
  1. Referee: §4 (ablations isolating embedding LR): the central claim that 'the overwhelming benefit ... arises simply from maximizing the learning rate of the embedding layer' requires explicit verification that the modified-SP condition applies only the width-scaled embedding LR while leaving all other layer-specific scalings (attention, MLP, output) identical to SP. If any μP-style rules for non-embedding layers are inadvertently active, the attribution to embedding LR alone is confounded and the 'simply from' conclusion is not load-bearing.

    Authors: In our ablation experiments described in §4, the modified-SP variant scales only the embedding layer's learning rate by the model width, while strictly adhering to standard parameterization (SP) rules for all other components, including attention, MLP, and output layers. No μP-specific scaling is applied to non-embedding layers. This setup is explicitly stated in the methods and ablation descriptions. We will add a more prominent clarification in the revised §4 to emphasize this isolation and prevent any potential misinterpretation. revision: yes

  2. Referee: Fixed tokens-per-parameter experiments (reported in results on weight decay and robustness): the claim that weight decay 'hurts the robustness of the extrapolation' must be shown to be insensitive to the precise tokens-per-parameter schedule; otherwise the interaction between embedding-LR correction and the token schedule could still confound the reported stability gains.

    Authors: Our weight decay experiments were performed in the fixed tokens-per-parameter regime using a consistent schedule across all runs. While we did not explicitly vary the schedule in the current manuscript, the observed reduction in extrapolation robustness with weight decay is consistent with the training dynamics under AdamW and the embedding LR correction. To fully address this, we will include a brief sensitivity analysis or additional discussion in the revision to confirm that the effect is not an artifact of the specific schedule chosen. revision: partial

Circularity Check

0 steps flagged

Empirical framework and ablations are self-contained with independent metrics

full rationale

The paper defines its three metrics for hyperparameter transfer (scaling law fit quality, extrapolation robustness, asymptotic loss penalty) directly from experimental outcomes without any derivation that reduces them to fitted parameters or prior self-citations. The central finding on embedding-layer learning rate is obtained via ablations that compare SP and μP variants; these comparisons do not invoke equations or uniqueness theorems that loop back to the inputs by construction. No load-bearing step relies on self-citation chains or ansatzes smuggled from prior author work. The analysis remains externally falsifiable through the reported training runs and is therefore scored as having no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper relies on standard empirical practices in deep learning rather than new axioms or invented entities; no free parameters are introduced beyond those implicit in scaling-law fitting.

pith-pipeline@v0.9.0 · 5766 in / 1259 out tokens · 38834 ms · 2026-05-21T04:58:05.765382+00:00 · methodology

