Negligible in Size, Significant in Effect: On Scale Vectors in Large Language Models

Binghui Li; Kai Shen; Mingze Wang; Shuchen Zhu; Shu Zhong; Yuxin Fang

arXiv: 2605.26895 · v1 · pith:NDSOJ2IGnew · submitted 2026-05-26 · 💻 cs.LG · cs.AI · stat.ML

Mingze Wang, Shuchen Zhu, Yuxin Fang, Binghui Li, Kai Shen, Shu Zhong

Pith reviewed 2026-06-29 19:54 UTC · model grok-4.3

classification: cs.LG · cs.AI · stat.ML

keywords: scale vectors, layer normalization, LLM pre-training, optimization, weight decay, pre-norm architecture, expressivity, preconditioning

The pith

Scale vectors in Pre-Norm LLMs improve optimization by preconditioning linear mappings without increasing expressivity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that scale vectors, despite making up a tiny fraction of parameters, are essential for successful LLM pre-training because removing them causes clear degradation. In Pre-Norm architectures they do not expand what the model can represent; instead they create a self-amplifying preconditioning effect that eases the optimization of the linear layers that follow. The work separates Input-Norm from Output-Norm layers to explain why weight decay helps one case and hurts the other. Three lightweight changes to scale vectors, motivated by this analysis, each improve results and combine into a unified strategy that lowers final loss across model sizes from 0.12B to 2B parameters.

Core claim

In Pre-Norm architectures, scale vectors do not increase expressivity; instead, they improve optimization through a self-amplifying preconditioning effect on subsequent linear mappings. Distinguishing Input-Norm and Output-Norm layers shows that weight decay is beneficial for the former but harmful for the latter because of their distinct roles in optimization and expressivity. Three complementary modifications—branch-specific heterogeneity, improved placement around linear mappings, and magnitude-direction reparameterization—each produce gains, and their combination yields lower terminal loss than tuned baselines while adding negligible overhead.
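
The no-added-expressivity part of this claim can be illustrated with a small sketch. It assumes the simplest setting: a scale vector gamma applied elementwise to an already-normalized input and followed directly by a bias-free linear map W. The helper names (`matvec`, `fold_scale`) and the toy numbers are illustrative and not taken from the paper. In that setting W(gamma ⊙ x) equals (W diag(gamma))x, so any function reachable with the scale vector is also reachable without it.

```python
# Sketch: a scale vector followed by a linear map can be folded into the map.
# Assumption: gamma is applied elementwise right before a bias-free linear layer.

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def fold_scale(W, gamma):
    """Return W @ diag(gamma): column j of W is scaled by gamma[j]."""
    return [[w_ij * g_j for w_ij, g_j in zip(row, gamma)] for row in W]

W = [[1.0, -2.0, 0.5],
     [0.0, 3.0, 4.0]]
gamma = [2.0, 0.5, -1.0]   # hypothetical scale vector
x = [1.0, 2.0, 3.0]        # stands in for a normalized activation

# Path 1: scale the input, then apply W.
with_scale = matvec(W, [g * xi for g, xi in zip(gamma, x)])
# Path 2: fold gamma into W, then apply the folded matrix to the raw input.
folded = matvec(fold_scale(W, gamma), x)

print(with_scale, folded)
```

Both paths compute the same output, which is why any benefit of the scale vector in this setting has to come from how training behaves rather than from a larger function class.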

What carries the argument

The self-amplifying preconditioning effect that scale vectors apply to subsequent linear mappings in Pre-Norm architectures.
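
One standard way to make this effect concrete, offered as a sketch rather than as the paper's own derivation, is to consider a scale vector $\gamma$ followed by a linear map $W$, write the effective matrix as $\widetilde{W} = W \operatorname{diag}(\gamma)$, and take a plain gradient step of size $\eta$ on $W$ with $\gamma$ held fixed. The symbols $\widetilde{W}$, $G$, and $\eta$ are notation introduced here for illustration.

```latex
\begin{aligned}
\widetilde{W} &= W \operatorname{diag}(\gamma), \qquad G = \nabla_{\widetilde{W}} \mathcal{L}, \\
\nabla_{W} \mathcal{L} &= G \operatorname{diag}(\gamma), \\
W^{+} &= W - \eta\, G \operatorname{diag}(\gamma), \\
\widetilde{W}^{+} &= W^{+} \operatorname{diag}(\gamma)
  = \widetilde{W} - \eta\, G \operatorname{diag}(\gamma)^{2}.
\end{aligned}
```

Under these assumptions column $j$ of the effective matrix moves with step size $\eta \gamma_j^2$, so coordinates with larger scale entries are updated faster. The self-amplifying part of the claim additionally depends on how $\gamma$ itself evolves during training, which this sketch does not model.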

If this is right

Removing scale vectors substantially degrades LLM pre-training performance.
Weight decay benefits Input-Norm layers but harms Output-Norm layers.
Each of the three proposed changes to scale vectors improves training when applied separately.
The combined scale-vector strategy produces lower terminal loss and more favorable scaling across dense and MoE models from 0.12B to 2B parameters.
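
The weight-decay implication above could be wired into a training setup roughly as follows. This is a minimal sketch under loud assumptions: the parameter names, the substrings `input_norm` and `output_norm` used to tell the two roles apart, and the decay value 0.1 are all hypothetical, and the treatment of the remaining parameters is a placeholder rather than a recommendation from the paper. The returned structure mirrors the per-group option dictionaries accepted by common optimizers.

```python
# Sketch: assign weight decay to scale vectors by norm role.
# Assumption: names containing "input_norm" / "output_norm" identify the two roles.

def build_param_groups(param_names, weight_decay=0.1):
    """Group parameter names so Input-Norm scales decay and Output-Norm scales do not."""
    input_norm, output_norm, other = [], [], []
    for name in param_names:
        if "input_norm" in name:
            input_norm.append(name)
        elif "output_norm" in name:
            output_norm.append(name)
        else:
            other.append(name)
    return [
        {"params": input_norm, "weight_decay": weight_decay},  # decay reported as helpful
        {"params": output_norm, "weight_decay": 0.0},          # decay reported as harmful
        {"params": other, "weight_decay": weight_decay},       # placeholder choice
    ]

# Hypothetical parameter names for one transformer block.
names = [
    "block0.attn.input_norm.scale",
    "block0.attn.output_norm.scale",
    "block0.attn.qkv.weight",
    "block0.mlp.input_norm.scale",
]
groups = build_param_groups(names)
for group in groups:
    print(group["weight_decay"], group["params"])
```

In a real model the role of each normalization layer would be determined from the architecture definition rather than from name matching.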

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The preconditioning view may explain performance differences observed when normalization is altered in non-transformer architectures.
The placement and reparameterization rules could be tested directly against alternative optimizers or learning-rate schedules not covered in the experiments.
Similar scale-vector adjustments might reduce the need for extensive hyperparameter search when scaling models beyond the 2B regime studied.
Measuring the amplification factor on linear mappings during training would provide an independent check on the optimization claim.

Load-bearing premise

The theoretical distinction between Input-Norm and Output-Norm layers and their opposing effects on weight decay generalizes beyond the analyzed settings to the full range of LLM pre-training configurations.

What would settle it

Train matched LLMs with and without scale vectors under identical optimizer and learning-rate conditions while directly measuring whether the predicted preconditioning amplification on linear-layer gradients appears or is absent.
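
A toy version of such a measurement is sketched below. It is not the paper's protocol: it uses a single linear layer with a squared loss, holds the scale vector fixed, and compares the change in the effective matrix `W @ diag(gamma)` after one gradient step on `W` against the plain step `-lr * G`. The function name `effective_update` and all numbers are hypothetical.

```python
import math

def effective_update(W, gamma, x, target, lr):
    """Take one gradient step on W (gamma fixed); return the change in W @ diag(gamma) and G."""
    W_eff = [[w * g for w, g in zip(row, gamma)] for row in W]
    residual = [sum(w * xj for w, xj in zip(row, x)) - t
                for row, t in zip(W_eff, target)]
    # G[i][j] = dL/dW_eff[i][j] for L = 0.5 * ||W_eff x - target||^2
    G = [[r * xj for xj in x] for r in residual]
    # Chain rule: dL/dW[i][j] = G[i][j] * gamma[j]
    W_new = [[w - lr * g_ij * g_j for w, g_ij, g_j in zip(row, g_row, gamma)]
             for row, g_row in zip(W, G)]
    W_eff_new = [[w * g for w, g in zip(row, gamma)] for row in W_new]
    delta = [[new - old for new, old in zip(r_new, r_old)]
             for r_new, r_old in zip(W_eff_new, W_eff)]
    return delta, G

W = [[0.5, -1.0], [2.0, 0.25]]
gamma = [1.0, 3.0]
x = [1.0, 1.0]
target = [0.0, 0.0]
lr = 0.01

delta, G = effective_update(W, gamma, x, target, lr)
# Per-entry ratio of the actual effective step to the plain step -lr * G.
ratios = [[d / (-lr * g) for d, g in zip(d_row, g_row)]
          for d_row, g_row in zip(delta, G)]
print(ratios)
```

In this toy setting the ratio in column j comes out as gamma[j] squared, so the column with scale 3 moves about nine times faster than the column with scale 1. A real test would track the same quantity on actual linear layers over the course of training.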

Original abstract

Normalization layers in modern large language models (LLMs) consist of a deterministic normalization operation and a learnable scale vector. While the normalization operation has been extensively studied, the scale vector remains poorly understood despite its ubiquitous use. In this work, we present a systematic study of scale vectors in LLMs from the perspectives of expressivity, optimization, and architectural structure. First, we show empirically that although scale vectors constitute only a negligible fraction of model parameters, removing them substantially degrades LLM pre-training. Our theory further shows that, in Pre-Norm architectures, scale vectors do not increase expressivity; instead, they improve optimization through a self-amplifying preconditioning effect on subsequent linear mappings. Second, we investigate the role of weight decay for scale vectors. By distinguishing Input-Norm and Output-Norm layers, we theoretically show that weight decay is beneficial for the former but harmful for the latter, due to their distinct roles in optimization and expressivity. Third, motivated by this understanding, we propose three lightweight and complementary improvements to scale vectors: branch-specific heterogeneity, improved placement around linear mappings, and magnitude-direction reparameterization. Both theory and experiments show that each improvement yields consistent gains. Finally, we combine these improvements into a unified scale-vector strategy and evaluate it through extensive LLM pre-training experiments on dense and mixture-of-experts models ranging from 0.12B to 2B parameters, across multiple optimizers and learning rate schedules, under industrial-scale token budgets. The unified strategy consistently achieves lower terminal loss than well-tuned baselines and exhibits more favorable scaling behavior, while adding negligible parameter and computational overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Scale vectors mainly aid optimization via preconditioning in Pre-Norm setups rather than expressivity, and the three lightweight changes deliver consistent pretraining gains across scales.

The core point is that scale vectors in Pre-Norm LLMs do not boost expressivity but instead provide a self-amplifying preconditioning benefit on later linear layers, while Input-Norm and Output-Norm layers respond oppositely to weight decay. The paper backs this with removal experiments that hurt performance despite the tiny parameter count, plus theory that distinguishes the two roles, and then tests three modifications: branch-specific heterogeneity, improved placement around linear mappings, and magnitude-direction reparameterization.
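
For readers unfamiliar with the third modification, a magnitude-direction reparameterization can be sketched in its most common generic form: the scale vector is written as a scalar magnitude times a unit-norm direction. Whether the paper uses exactly this form is an assumption; the function name and values below are hypothetical.

```python
import math

def scale_from_magnitude_direction(magnitude, direction):
    """Build a scale vector as magnitude * direction / ||direction||_2."""
    norm = math.sqrt(sum(v * v for v in direction))
    return [magnitude * v / norm for v in direction]

# The magnitude alone sets the length of gamma; the direction only sets its orientation.
gamma = scale_from_magnitude_direction(2.0, [3.0, 4.0])
gamma_rescaled = scale_from_magnitude_direction(2.0, [30.0, 40.0])
gamma_norm = math.sqrt(sum(g * g for g in gamma))
print(gamma, gamma_rescaled, gamma_norm)
```

Rescaling the direction vector leaves the resulting scale vector unchanged, which is the property that lets magnitude and direction be learned separately.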

What stands out is the systematic angle—expressivity, optimization, and structure—plus the unified strategy that combines the three changes. Experiments run on dense and MoE models from 0.12B to 2B parameters, multiple optimizers, and industrial token counts, showing lower terminal loss and better scaling than tuned baselines with almost no added cost. The weight-decay distinction is a clean theoretical split that matches the empirical pattern.

The soft spots are modest. The theory is tied to the analyzed Pre-Norm settings and may need broader checks for other schedules or architectures. Gains are consistent, but the visible sections do not report statistical significance or run-to-run variance, so the reliability of the effect sizes is not yet clear. Overlap with prior normalization work is hard to judge without the full references.

This is useful for anyone tuning LLM pretraining or designing small architectural tweaks. It deserves a serious referee because the claims are directly testable, the experiments are broad enough to be informative, and the suggestions are low-overhead enough to try.

Referee Report

2 major / 2 minor

Summary. The paper claims that scale vectors in LLM normalization layers, though negligible in parameter count, significantly impact pre-training performance. In Pre-Norm architectures, they enhance optimization through a self-amplifying preconditioning effect on linear mappings without increasing expressivity. Weight decay benefits Input-Norm layers but harms Output-Norm layers. The authors propose three improvements—branch-specific heterogeneity, improved placement, and magnitude-direction reparameterization—and show that a unified strategy leads to better performance in pre-training experiments from 0.12B to 2B parameters across optimizers and schedules.

Significance. This work provides both theoretical insight into scale vectors' role in optimization and practical recommendations that add negligible overhead. The extensive validation through pre-training runs at multiple scales, with direct tests via removal experiments and reparameterizations, is a notable strength. If the theoretical claims are verified, it could influence how normalization is handled in future LLM architectures.

major comments (2)

Theoretical analysis: The self-amplifying preconditioning effect is central to the claim that scale vectors improve optimization rather than expressivity; however, the manuscript would benefit from explicit equations detailing the mechanism by which the scale vector preconditions subsequent linear mappings.
Experimental evaluation (0.12B–2B models): The pre-training experiments report lower terminal loss for the unified strategy but omit error bars, number of independent runs, or exclusion criteria, which is important for assessing the reliability of the 'consistent gains' claim across optimizers and learning rate schedules.

minor comments (2)

Some figures could include more detailed captions explaining the axes and what the different lines represent for clarity.
The abstract is quite long and dense; consider condensing the contributions for better readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation. We address the major comments point-by-point below.

Referee: Theoretical analysis: The self-amplifying preconditioning effect is central to the claim that scale vectors improve optimization rather than expressivity; however, the manuscript would benefit from explicit equations detailing the mechanism by which the scale vector preconditions subsequent linear mappings.

Authors: We agree that adding explicit equations will clarify the central theoretical claim. The manuscript already derives that scale vectors do not increase expressivity in Pre-Norm settings but instead precondition linear layers; we will expand the theory section in revision with the precise update equations showing the self-amplifying effect on effective step sizes. revision: yes
Referee: Experimental evaluation (0.12B–2B models): The pre-training experiments report lower terminal loss for the unified strategy but omit error bars, number of independent runs, or exclusion criteria, which is important for assessing the reliability of the 'consistent gains' claim across optimizers and learning rate schedules.

Authors: We acknowledge the value of reporting run statistics. Given the industrial-scale compute required for 0.12B–2B pre-training, each configuration used a single run (standard for this regime). We will revise the experimental section to state this explicitly, note the lack of error bars due to cost, and highlight that improvements hold consistently across scales, optimizers, and schedules as supporting evidence of robustness. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper's central claims rest on empirical removal experiments showing performance degradation, followed by independent theoretical analysis of expressivity versus optimization effects (self-amplifying preconditioning in Pre-Norm) and opposing weight-decay roles for Input-Norm versus Output-Norm layers. These distinctions are then used to motivate reparameterizations that are directly tested in pre-training runs from 0.12B to 2B parameters. No load-bearing step reduces a prediction to a fitted quantity by construction, invokes self-citation for uniqueness, or renames an input as output; the theory and experiments remain externally falsifiable and self-contained against the stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; scale vectors are treated as existing learnable components.

pith-pipeline@v0.9.1-grok · 5843 in / 984 out tokens · 27231 ms · 2026-06-29T19:54:41.810048+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Improving Neural Network Training by Decoupling the Magnitude and Direction of Weight Vectors
cs.LG 2026-06 unverdicted novelty 6.0

MD Decoupling factorizes weights into fixed-norm directions and learnable per-row/column magnitudes updated at independent rates, improving Adam and Muon training stability and scale transfer without weight decay or warmup.

[1] [1]

On the optimization of deep networks: Implicit acceleration by overparameterization

Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. InInternational conference on machine learning, pages 244–253. PMLR, 2018

2018

[2] [2]

Layer Normalization

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization.arXiv preprint arXiv:1607.06450, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[3] [3]

Optimization methods for large-scale machine learning.SIAM review, 60(2):223–311, 2018

Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning.SIAM review, 60(2):223–311, 2018

2018

[4] [4]

Seednorm: Self-rescaled dynamic normalization.arXiv preprint arXiv:2510.22777, 2025

Wenrui Cai, Defa Zhu, Qingjie Liu, and Qiyang Min. Seednorm: Self-rescaled dynamic normalization.arXiv preprint arXiv:2510.22777, 2025

work page arXiv 2025

[5] [5]

Post-layernorm is back: Stable, expressive, and deep.arXiv preprint arXiv:2601.19895, 2026

Chen Chen and Lai Wei. Post-layernorm is back: Stable, expressive, and deep.arXiv preprint arXiv:2601.19895, 2026

work page arXiv 2026

[6] [6]

Label noise SGD provably prefers flat global minimizers.Advances in Neural Information Processing Systems, 34:27449–27461, 2021

Alex Damian, Tengyu Ma, and Jason D Lee. Label noise SGD provably prefers flat global minimizers.Advances in Neural Information Processing Systems, 34:27449–27461, 2021

2021

[7] Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. In International conference on machine learning, pages 7480–7512. PMLR, 2023.

[8] Yu Feng and Yuhai Tu. The inverse variance–flatness relation in stochastic gradient descent is critical for finding flat minima. Proceedings of the National Academy of Sciences, 118(9):e2015617118, 2021.

[9] Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1842–1850. PMLR, 10–15 Jul 2018.

[10] Jeff Z HaoChen, Colin Wei, Jason Lee, and Tengyu Ma. Shape matters: Understanding the implicit bias of the noise covariance. In Conference on Learning Theory, pages 2315–2357. PMLR, 2021.

[11] Elad Hazan et al. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.

[12] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 35:30016–30030, 2022.

[13] Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, et al. Minicpm: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395, 2024.

[14] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[15] Andrej Karpathy. NanoGPT. https://github.com/karpathy/nanoGPT, 2022.

[16] Keller Jordan et al. Muon optimizer. URL https://github.com/KellerJordan/Muon?tab=readme-ov-file, 2024.

[17] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[18] Qianxiao Li, Cheng Tai, and E Weinan. Stochastic modified equations and adaptive stochastic gradient algorithms. In International Conference on Machine Learning, pages 2101–2110. PMLR, 2017.

[19] Zhiyuan Li, Tianhao Wang, and Sanjeev Arora. What happens after SGD reaches zero loss?–a mathematical framework. International Conference on Learning Representations, 2022.

[20] Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for llm training. arXiv preprint arXiv:2502.16982, 2025.

[21] Kangqiao Liu, Liu Ziyin, and Masahito Ueda. Noise and fluctuation of finite learning rate stochastic gradient descent. In International Conference on Machine Learning, pages 7045–7056. PMLR, 2021.

[22] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

[23] James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 2408–2417. PMLR, 07–09 Jul 2015.

[24] Takashi Mori, Liu Ziyin, Kangqiao Liu, and Masahito Ueda. Logarithmic landscape and power-law escape rate of SGD. arXiv preprint arXiv:2105.09557, pages 15959–15975, 2021.

[25] Takashi Mori, Liu Ziyin, Kangqiao Liu, and Masahito Ueda. Power-law escape rate of sgd. In International Conference on Machine Learning, pages 15959–15975. PMLR, 2022.

[26] Toan Q Nguyen and Julian Salazar. Transformers without tears: Improving the normalization of self-attention. In Proceedings of the 16th international conference on spoken language translation, 2019.

[27] Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, et al. 2 olmo 2 furious. arXiv preprint arXiv:2501.00656, 2024.

[28] Zihan Qiu, Zeyu Huang, Kaiyue Wen, Peng Jin, Bo Zheng, Yuxin Zhou, Haofeng Huang, Zekun Wang, Xiao Li, Huaqing Zhang, et al. A unified view of attention and residual sinks: Outlier-driven rescaling is essential for transformer training. arXiv preprint arXiv:2601.22966, 2026.

[29] Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient langevin dynamics: a nonasymptotic analysis. In Conference on Learning Theory, pages 1674–1703. PMLR, 2017.

[30] Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. International Conference on Learning Representations, 2014.

[31] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.

[32] Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024.

[33] Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-Bastien Grill, et al. Gemma 3 technical report, 2025.

[34] T. Tieleman and G. Hinton. Lecture 6.5 - RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.

[35] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

[36] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.

[37] Nikhil Vyas, Depen Morwani, Rosie Zhao, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham Kakade. Soap: Improving and stabilizing shampoo using adam for language modeling. In International Conference on Learning Representations, volume 2025, pages 93423–93444, 2025.

[38] Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, and Furu Wei. Deepnet: Scaling transformers to 1,000 layers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(10):6761–6774, 2024.

[39] Jinbo Wang, Mingze Wang, Zhanpeng Zhou, Junchi Yan, Weinan E, and Lei Wu. The sharpness disparity principle in transformers for accelerating language model pre-training. International Conference on Machine Learning, pages 64859–64879, 2025.

[40] Jinbo Wang, Mingze Wang, Jiaqi Zhang, Wei Wang, Peng Pei, Xunliang Cai, Weinan E, and Lei Wu. Gradpower: Powering gradients for faster language model pre-training. International Conference on Machine Learning, 2026.

[41] Mingze Wang and Lei Wu. The noise geometry of stochastic gradient descent: A quantitative and analytical characterization. arXiv preprint arXiv:2310.00692, 2023.

[42] Mingze Wang, Haotian He, Jinbo Wang, Zilin Wang, Guanhua Huang, Feiyu Xiong, Zhiyu Li, Weinan E, and Lei Wu. Improving generalization and convergence by enhancing implicit regularization. Advances in Neural Information Processing Systems, 2024.

[43] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 681–688. Citeseer, 2011.

[44] Stephan Wojtowytsch. Stochastic gradient descent with noise of machine learning type. part II: Continuous time analysis. arXiv preprint arXiv:2106.02588, 2021.

[45] Lei Wu, Mingze Wang, and Weijie Su. The alignment property of sgd noise and how it helps select flat minima: A stability analysis. In Advances in Neural Information Processing Systems, volume 35, pages 4680–4693, 2022.

[46] Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. On layer normalization in the transformer architecture. In International Conference on Machine Learning, pages 10524–10533. PMLR, 2020.

[47] An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024.

[48] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[49] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12104–12113, 2022.

[50] Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019.

[51] Jiachen Zhu, Xinlei Chen, Kaiming He, Yann LeCun, and Zhuang Liu. Transformers without normalization. In Proceedings of the computer vision and pattern recognition conference, pages 14901–14911, 2025.

[52] Shuchen Zhu, Rizhen Hu, Mingze Wang, Mou Sun, Xue Wang, Kun Yuan, and Zaiwen Wen. Accelerating llm pre-training through flat-direction dynamics enhancement. arXiv preprint arXiv:2602.22681, 2026.

[53] Zhijian Zhuo, Yutao Zeng, Ya Wang, Sijun Zhang, Jian Yang, Xiaoqing Li, Xun Zhou, and Jinwen Ma. Hybridnorm: Towards stable and efficient transformer training via hybrid normalization. arXiv preprint arXiv:2503.04598, 2025.

[54] Liu Ziyin, Mingze Wang, Hongchao Li, and Lei Wu. Parameter symmetry and noise equilibrium of stochastic gradient descent. Advances in Neural Information Processing Systems, 37:93874–93906, 2024.
