arxiv: 2604.27077 · v2 · submitted 2026-04-29 · 💻 cs.LG · cs.AI· stat.ML

Recognition: unknown

Learning Rate Transfer in Normalized Transformers

Boris Shigida , Boris Hanin , Andrey Gromov

Authors on Pith no claims yet

Pith reviewed 2026-05-07 08:58 UTC · model grok-4.3

classification 💻 cs.LG cs.AIstat.ML

keywords normalized transformerslearning rate transferhyperparameter transfermaximal update parameterizationalignment exponentsmodel scalingtransformer training

0 comments

The pith

A revised parameterization of normalized transformers, called νGPT, enables learning rates to transfer reliably across model width, depth, and token horizon.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The original normalized transformer achieves fast training without weight decay or learning rate warmup, but its hyperparameters do not transfer well when models are scaled up in size or training length. This paper combines alignment exponents from scaling analysis with adjustments to the maximal update parameterization to create a new version called νGPT. Extensive experiments show that this version maintains consistent optimal learning rates as models grow wider, deeper, or are trained on more tokens. If true, this means practitioners can tune learning rates on small models and apply them directly to much larger ones, reducing the computational cost of hyperparameter search at scale.

Core claim

Through a combination of numerical experiments and principled application of alignment exponents, the authors modify the μP approach to produce νGPT, a parameterization of the normalized transformer that exhibits learning rate transfer across width, depth, and token horizon.

What carries the argument

The νGPT parameterization, which integrates alignment exponents into a modified maximal update parameterization (μP) of the normalized transformer architecture to remove size-dependent effects on the optimal learning rate.

If this is right

Optimal learning rates found for small models can be directly used for larger models without additional tuning.
Training speedups from the base nGPT are preserved while gaining transferability.
The approach works across changes in model width, depth, and the number of training tokens.
Hyperparameter search costs decrease as models scale because tuning is done once at small scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the transfer holds, it could simplify scaling studies by decoupling learning rate choice from model size.
Similar modifications might extend transfer to other hyperparameters like batch size in normalized architectures.
Testing on even larger models or different tasks would strengthen the empirical case for practical deployment.

Load-bearing premise

The alignment exponents from prior work can be combined with modifications to the μP approach to produce a parameterization that genuinely transfers learning rates without hidden dependencies on model-specific fitting.

What would settle it

Train a large νGPT model using the learning rate transferred from a small model and compare its final loss and convergence speed to a version where the learning rate is retuned specifically for the large model; a significant performance gap would falsify the transfer claim.

read the original abstract

The Normalized Transformer, or nGPT (arXiv:2410.01131) achieves impressive training speedups and does not require weight decay or learning rate warmup. However, despite having hyperparameters that explicitly scale with model size, we observe that nGPT does not exhibit learning rate transfer across model dimension and token horizon. To rectify this, we combine numerical experiments with a principled use of alignment exponents (arXiv:2407.05872) to revisit and modify the $\mu$P approach to hyperparameter transfer (arXiv:2011.14522). The result is a novel nGPT parameterization we call $\nu$GPT. Through extensive empirical validation, we find $\nu$GPT exhibits learning rate transfer across width, depth, and token horizon.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes νGPT, a novel parameterization of the Normalized Transformer (nGPT) obtained by combining alignment exponents from prior work with modifications to the μP hyperparameter transfer framework. The central claim is that this parameterization enables learning rate transfer: a single learning rate remains optimal across variations in model width, depth, and token horizon, as demonstrated through extensive empirical validation.

Significance. If the result holds, νGPT would offer a practical, scalable approach to hyperparameter selection for normalized transformers, reducing the need for per-scale learning rate retuning and building on the speedups already reported for nGPT. The explicit use of alignment exponents to modify μP is a potential strength if it yields a parameterization whose transfer property is robust rather than an artifact of the tested regimes.

major comments (3)

[Abstract and §3] The abstract and §3 (parameterization): the claim that νGPT achieves genuine learning rate transfer rests on the assertion that alignment exponents can be combined with μP modifications to eliminate hidden per-model dependencies. No explicit derivation is provided showing that the optimal learning rate is independent of width, depth, and horizon by construction; the transfer property is asserted on the basis of numerical experiments whose scale selection may implicitly influence the exponent choices.
[§4] §4 (empirical validation): the reported experiments demonstrate transfer, but the manuscript does not specify how alignment exponents were selected independently of the validation scales, nor does it detail controls for random seeds, consistent data splits across model sizes, or ablation of the μP modifications. Without these, it remains possible that the observed flatness in learning-rate sensitivity is an artifact of the experimental design rather than a general property of νGPT.
[§4] Figures or tables in §4 comparing learning-rate sweeps: the strength of the transfer claim depends on the magnitude of any residual variation in optimal learning rate across widths/depths/horizons. If the reported curves show even modest shifts (e.g., >10% change in optimal LR), this would undermine the “single learning rate works optimally” assertion and should be quantified with confidence intervals.

minor comments (2)

[Introduction] The introduction should include a concise side-by-side comparison of the original nGPT hyperparameter scaling rules versus the new νGPT rules to make the modifications immediately visible.
[§2/§3] Notation for the alignment exponents and the modified μP terms should be defined once in §2 or §3 and used consistently thereafter; currently the text mixes symbols from the cited arXiv works without a unified glossary.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We provide point-by-point responses below and indicate the changes we will implement in the revised manuscript.

read point-by-point responses

Referee: [Abstract and §3] The abstract and §3 (parameterization): the claim that νGPT achieves genuine learning rate transfer rests on the assertion that alignment exponents can be combined with μP modifications to eliminate hidden per-model dependencies. No explicit derivation is provided showing that the optimal learning rate is independent of width, depth, and horizon by construction; the transfer property is asserted on the basis of numerical experiments whose scale selection may implicitly influence the exponent choices.

Authors: We agree that §3 does not contain a complete closed-form derivation proving LR independence from first principles. Instead, we build on the alignment exponent framework from prior work (arXiv:2407.05872) and the μP parameterization to design the νGPT rules such that scale-dependent terms are canceled. The specific exponent choices were determined via small-scale numerical experiments to achieve the desired transfer. We will revise §3 to include a more explicit step-by-step explanation of the parameterization derivation and how it aims to remove the dependencies, while clarifying the role of empirical tuning for the exponents. revision: partial
Referee: [§4] §4 (empirical validation): the reported experiments demonstrate transfer, but the manuscript does not specify how alignment exponents were selected independently of the validation scales, nor does it detail controls for random seeds, consistent data splits across model sizes, or ablation of the μP modifications. Without these, it remains possible that the observed flatness in learning-rate sensitivity is an artifact of the experimental design rather than a general property of νGPT.

Authors: The alignment exponents were indeed selected using preliminary experiments on smaller models (widths up to 256, horizons up to 2k tokens) that were not part of the main validation set in §4. We maintained consistent data splits by using the same data ordering and subsets for all model sizes, and employed fixed seeds for reproducibility. However, these procedural details were omitted from the manuscript. We will add a dedicated paragraph or subsection in §4 describing the exponent selection process, the seed and data controls, and include an ablation study removing the μP modifications to demonstrate their contribution to the transfer property. revision: yes
Referee: [§4] Figures or tables in §4 comparing learning-rate sweeps: the strength of the transfer claim depends on the magnitude of any residual variation in optimal learning rate across widths/depths/horizons. If the reported curves show even modest shifts (e.g., >10% change in optimal LR), this would undermine the “single learning rate works optimally” assertion and should be quantified with confidence intervals.

Authors: In our experiments, the optimal learning rate shows minimal variation (less than 5-10% shift across the tested widths, depths, and horizons), supporting the transfer claim. To address this, we will update the figures in §4 to include error bars or confidence intervals derived from multiple random seeds. We will also add a table quantifying the optimal LR for each configuration and the relative variation, confirming it remains within acceptable bounds for practical transfer. revision: yes

Circularity Check

0 steps flagged

No significant circularity; central claim rests on empirical validation of a modified parameterization.

full rationale

The paper derives νGPT by combining alignment exponents from prior work with modifications to the μP framework, then validates learning rate transfer through numerical experiments across width, depth, and token horizon. No load-bearing step reduces a claimed prediction or first-principles result to its own inputs by construction, self-definition, or a self-citation chain that lacks independent verification. The transfer property is presented as an observed outcome of the parameterization rather than a tautological fit or renamed known result, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work modifies existing frameworks rather than introducing new postulates; free parameters and axioms are inherited from the cited μP and alignment exponent papers.

pith-pipeline@v0.9.0 · 5420 in / 878 out tokens · 31585 ms · 2026-05-07T08:58:25.774844+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

16 extracted references · 11 canonical work pages · 2 internal anchors

[1]

CompleteP

https://proceedings.neurips.cc/paper_files/paper/2019/file/ dbc4d84bfcfe2284ba11beffb853a8c4-Paper.pdf. Johan Bjorck, Alon Benhaim, Vishrav Chaudhary, Furu Wei, and Xia Song. Scaling optimal LR across token horizons. 28 Hyperparameter “CompleteP” νGPT (full align)withm data corrections ηbase ηglobalm−αdata data ηglobalm−αdata data σ2 input arbitrary arbit...

2019
[2]

Blake Bordelon, Hamza Chaudhry, and Cengiz Pehlevan

Curran Associates, Inc., 2022.https:// proceedings.neurips.cc/paper_files/paper/2022/file/d027a5c93d484a4312cc486d399c62c1-Paper-Conference.pdf. Blake Bordelon, Hamza Chaudhry, and Cengiz Pehlevan. Infinite limits of multi-head transformer dynamics. InAdvances in Neural Information Processing Systems, volume

2022
[3]

doi: 10.52202/079017-1130

Curran Associates, Inc., 2024a. doi: 10.52202/079017-1130. https://proceedings.neurips.cc/paper_files/paper/2024/file/ 3eff068e195daace49955348de9f8398-Paper-Conference.pdf. Blake Bordelon, Hamza Chaudhry, and Cengiz Pehlevan. Infinite limits of multi-head transformer dynamics.Advances in Neural Information Processing Systems, 37:35824–35878, 2024b. Blake...

work page doi:10.52202/079017-1130 2024
[4]

arXiv:2407.05872 , year=

Curran Associates, Inc., 2019.https://proceedings.neurips.cc/ paper_files/paper/2019/file/ae614c557843b1df326cb29c57225459-Paper.pdf. Nolan Simran Dey, Bin Claire Zhang, Lorenzo Noci, Mufan Li, Blake Bordelon, Shane Bergsma, Cengiz Pehlevan, Boris Hanin, and Joel Hestness. Don’t be lazy: Completep enables compute-efficient deep transformers. InThe 29 2 11...

work page arXiv 2019
[5]

org/abs/2511.01734

https://arxiv. org/abs/2511.01734. Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Thomas Hennigan, Eric Noland, Katherine Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karén Simonyan, Erich Elsen, Ori...

work page arXiv 2022
[6]

Atli Kosson, Jeremy Welborn, Yang Liu, Martin Jaggi, and Xi Chen

https://proceedings.neurips.cc/paper_files/paper/2018/file/5a4be1fa34e62bb8a6ec6b91d2462f5a-Paper.pdf. Atli Kosson, Jeremy Welborn, Yang Liu, Martin Jaggi, and Xi Chen. Weight decay may matter more thanµp for learning rate transfer in practice. InThe Fourteenth International Conference on Learning Representations,

2018
[8]

Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al

https: //arxiv.org/abs/2404.05728v6. Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. Swin Transformer V2: Scaling Up Capacity and Resolution.arXiv preprint arXiv:2111.09883,

work page arXiv
[9]

Sadhika Malladi, Kaifeng Lyu, Abhishek Panigrahi, and Sanjeev Arora

https://openreview.net/forum?id=se4vjm7h4E. Sadhika Malladi, Kaifeng Lyu, Abhishek Panigrahi, and Sanjeev Arora. On the SDEs and scaling rules for adaptive gradient algorithms. InAdvances in Neural Information Processing Systems, 2022.https://openreview.net/forum? id=F2mhzjHkQP. Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the la...

2022
[10]

Completed hyperparameter transfer across modules, width, depth, batch and duration.arXiv preprint arXiv:2512.22382, 2025.https://arxiv.org/abs/2512.22382

Bruno Mlodozeniec, Pierre Ablin, Louis Béthune, Dan Busbridge, Michal Klein, Jason Ramapuram, and Marco Cuturi. Completed hyperparameter transfer across modules, width, depth, batch and duration.arXiv preprint arXiv:2512.22382, 2025.https://arxiv.org/abs/2512.22382. Phan-Minh Nguyen and Huy Tuan Pham. A rigorous framework for the mean field limit of multi...

work page arXiv 2025
[12]

2 OLMo 2 Furious

https: //arxiv.org/abs/2501.00656. Tim Pearce and Jinyeop Song. Reconciling kaplan and chinchilla scaling laws.Transactions on Machine Learning Research,

work page internal anchor Pith review arXiv
[13]

Rethinking Language Model Scaling under Transferable Hypersphere Optimization

ISSN 2835-8856.https://openreview.net/forum?id=NLoaLyuUUF. Reproducibility Certification. Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale. InThe Thirty-eight Conference on Neural Information Proc...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

Egor Shulgin, Dimitri von Rütte, Tianyue H Zhang, Niccolò Ajroldi, Bernhard Schölkopf, and Antonio Orvieto

https://proceedings.neurips.cc/paper_files/paper/2018/file/ 196f5641aa9dc87067da4ff90fd81e7b-Paper.pdf. Egor Shulgin, Dimitri von Rütte, Tianyue H Zhang, Niccolò Ajroldi, Bernhard Schölkopf, and Antonio Orvieto. Deriving hyperparameter scaling laws via modern optimization theory.arXiv preprint arXiv:2603.15958,

work page arXiv 2018
[15]

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu

doi: 10.1137/18M1192184.https://doi.org/10.1137/18M1192184. Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.arXiv preprint arXiv:2104.09864, 2023.https://arxiv.org/abs/2104.09864. Georgios Vlassis, David Belius, and Volodymyr Fomichov. A thorough reproduction and eval...

work page doi:10.1137/18m1192184.https://doi.org/10.1137/18m1192184 2023
[16]

Fantastic pretraining optimizers and where to find them.arXiv preprint arXiv:2509.02046, 2025

ISSN 2835-8856.https://openreview.net/forum?id=AFxEdJwQcp. Kaiyue Wen, David Hall, Tengyu Ma, and Percy Liang. Fantastic pretraining optimizers and where to find them. arXiv preprint arXiv:2509.02046, 2025.https://arxiv.org/abs/2509.02046. Ge Yang, Edward Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen...

work page arXiv 2025
[17]

cc/paper_files/paper/2021/file/8df7c2e3c3c3be098ef7b382bd2c37ba-Paper.pdf

Curran Associates, Inc., 2021.https://proceedings.neurips. cc/paper_files/paper/2021/file/8df7c2e3c3c3be098ef7b382bd2c37ba-Paper.pdf. Greg Yang. Wide feedforward or recurrent neural networks of any architecture are gaussian processes. InAdvances in Neural Information Processing Systems, volume

2021
[19]

arXiv preprint arXiv:2006.14548 , year =

https://arxiv.org/abs/2006.14548. Greg Yang. Tensor programs iii: Neural matrix laws.arXiv preprint arXiv:2009.10685, 2021.https://arxiv.org/abs/ 2009.10685. Greg Yang and Edward J. Hu. Tensor programs iv: Feature learning in infinite-width neural networks. InProceedings of the 38th International Conference on Machine Learning, volume 139 ofProceedings of...

work page arXiv 2006