arxiv: 2605.26895 · v1 · pith:NDSOJ2IGnew · submitted 2026-05-26 · 💻 cs.LG · cs.AI · stat.ML

Negligible in Size, Significant in Effect: On Scale Vectors in Large Language Models

Pith reviewed 2026-06-29 19:54 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML
keywords scale vectors · layer normalization · LLM pre-training · optimization · weight decay · pre-norm architecture · expressivity · preconditioning

The pith

Scale vectors in Pre-Norm LLMs improve optimization by preconditioning linear mappings without increasing expressivity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that scale vectors, despite making up a tiny fraction of parameters, are essential for successful LLM pre-training because removing them causes clear degradation. In Pre-Norm architectures they do not expand what the model can represent; instead they create a self-amplifying preconditioning effect that eases the optimization of the linear layers that follow. The work separates Input-Norm from Output-Norm layers to explain why weight decay helps one case and hurts the other. Three lightweight changes to scale vectors, motivated by this analysis, each improve results and combine into a unified strategy that lowers final loss across model sizes from 0.12B to 2B parameters.
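The expressivity half of this claim can be illustrated with a minimal sketch. It assumes an RMSNorm-style normalization with no bias whose output feeds directly into a linear mapping, which is this page's reading of the Pre-Norm setting and not a statement taken from the paper. The helper names (`rms_norm`, `fold_scale`, and so on) are hypothetical. Under those assumptions the scale vector can be absorbed into the columns of the following weight matrix, so it adds no new functions to the model class:

```python
import math

def rms_norm(x, eps=1e-6):
    """Normalize x to (approximately) unit root-mean-square; no learnable parameters."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def matvec(W, x):
    """Plain matrix-vector product for a list-of-rows matrix."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def with_scale(W, gamma, x):
    """Linear map applied to the scaled, normalized input: W (gamma * norm(x))."""
    h = rms_norm(x)
    return matvec(W, [g * v for g, v in zip(gamma, h)])

def fold_scale(W, gamma):
    """Absorb gamma into the columns of W: W_folded[i][j] = W[i][j] * gamma[j]."""
    return [[w * g for w, g in zip(row, gamma)] for row in W]

def without_scale(W_folded, x):
    """Same normalization, no scale vector, folded weights."""
    return matvec(W_folded, rms_norm(x))

# Arbitrary illustrative values.
W = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
gamma = [2.0, 0.5, -1.0]
x = [0.3, -1.2, 0.8]

y_scaled = with_scale(W, gamma, x)
y_folded = without_scale(fold_scale(W, gamma), x)
```

The two outputs agree up to floating-point rounding, which is why any benefit of the scale vector in this setting has to come from how training behaves and not from what the model can represent.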

Core claim

In Pre-Norm architectures, scale vectors do not increase expressivity; instead, they improve optimization through a self-amplifying preconditioning effect on subsequent linear mappings. Distinguishing Input-Norm and Output-Norm layers shows that weight decay is beneficial for the former but harmful for the latter because of their distinct roles in optimization and expressivity. Three complementary modifications—branch-specific heterogeneity, improved placement around linear mappings, and magnitude-direction reparameterization—each produce gains, and their combination yields lower terminal loss than tuned baselines while adding negligible overhead.
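The summary does not spell out the exact form of the magnitude-direction reparameterization. A minimal sketch, assuming a weight-normalization-style split in which a scalar `g` carries the norm of the scale vector and a free vector `v` carries only its direction (both names are hypothetical), could look like this:

```python
import math

def scale_from_magnitude_direction(g, v, eps=1e-12):
    """Build gamma = g * v / ||v||, so g alone sets the Euclidean norm of gamma."""
    norm = math.sqrt(sum(c * c for c in v)) + eps
    return [g * c / norm for c in v]

def init_from_scale(gamma):
    """Pick (g, v) that reproduce an existing scale vector: g = ||gamma||, v = gamma."""
    g = math.sqrt(sum(c * c for c in gamma))
    return g, list(gamma)

# A scale vector initialized to all ones, as is common for normalization layers.
gamma0 = [1.0, 1.0, 1.0, 1.0]
g, v = init_from_scale(gamma0)
gamma = scale_from_magnitude_direction(g, v)

# Rescaling v leaves gamma unchanged: only the direction of v matters.
rescaled = scale_from_magnitude_direction(g, [10.0 * c for c in v])
```

In this form the magnitude and the direction can be given separate learning rates or separate regularization, which is the kind of freedom a magnitude-direction split is usually meant to provide. Whether the paper uses exactly this parameterization is an assumption.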

What carries the argument

The self-amplifying preconditioning effect that scale vectors apply to subsequent linear mappings in Pre-Norm architectures.
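A standard first-order calculation shows how a scale vector can act as a preconditioner. It is a sketch for plain gradient descent on a single linear map and is not taken from the paper. The symbols are assumptions introduced here: \widetilde{W} = W diag(\gamma) denotes the effective weights seen by the normalized input, G the gradient of the loss with respect to \widetilde{W}, and \eta the step size.

```latex
% Assumed setup: \widetilde{W}_{ij} = W_{ij}\,\gamma_j, \quad G = \partial L / \partial \widetilde{W}
\begin{aligned}
\frac{\partial L}{\partial W_{ij}} &= G_{ij}\,\gamma_j,
\qquad
\frac{\partial L}{\partial \gamma_j} = \sum_{k} G_{kj}\,W_{kj}, \\
\Delta \widetilde{W}_{ij}
&= \gamma_j\,\Delta W_{ij} + W_{ij}\,\Delta \gamma_j + O(\eta^{2}) \\
&= -\eta \Big( \gamma_j^{2}\,G_{ij} + W_{ij} \sum_{k} W_{kj}\,G_{kj} \Big) + O(\eta^{2}).
\end{aligned}
```

Writing w_j and g_j for column j of W and G, the update of column j of the effective weights is -\eta (\gamma_j^2 I + w_j w_j^\top) g_j to first order, whereas a model without the scale vector would take the plain step -\eta g_j. The factor \gamma_j^2 grows as the scale entry grows, which is one possible reading of "self-amplifying"; the paper's own derivation, in particular for adaptive optimizers, may differ.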

If this is right

  • Removing scale vectors substantially degrades LLM pre-training performance.
  • Weight decay benefits Input-Norm layers but harms Output-Norm layers.
  • Each of the three proposed changes to scale vectors improves training when applied separately.
  • The combined scale-vector strategy produces lower terminal loss and more favorable scaling across dense and MoE models from 0.12B to 2B parameters.
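The weight-decay finding in the list above translates into a per-parameter optimizer setting. This summary does not define how Input-Norm and Output-Norm layers are identified, so the sketch below keys only on hypothetical parameter names (`input_norm.scale`, `output_norm.scale`) and a hypothetical default decay value. It returns plain dictionaries whose `params` and `weight_decay` keys follow the layout commonly used for optimizer parameter groups, here holding names instead of tensors:

```python
def build_scale_groups(param_names, input_norm_decay=0.1):
    """Group parameter names by how weight decay is applied to scale vectors.

    Assumed naming convention (hypothetical):
      '...input_norm.scale'  -> Input-Norm scale vector, decayed
      '...output_norm.scale' -> Output-Norm scale vector, not decayed
    Everything else is returned in an 'other' group and left to the caller.
    """
    input_scales, output_scales, other = [], [], []
    for name in param_names:
        if name.endswith("input_norm.scale"):
            input_scales.append(name)
        elif name.endswith("output_norm.scale"):
            output_scales.append(name)
        else:
            other.append(name)
    return {
        "input_norm": {"params": input_scales, "weight_decay": input_norm_decay},
        "output_norm": {"params": output_scales, "weight_decay": 0.0},
        "other": {"params": other},
    }

# Hypothetical parameter names for a two-layer model.
names = [
    "layers.0.attn.input_norm.scale",
    "layers.0.attn.output_norm.scale",
    "layers.0.attn.qkv.weight",
    "layers.1.mlp.input_norm.scale",
]
groups = build_scale_groups(names)
```

The decay value of 0.1 is a placeholder, not a recommendation from the paper; only the direction of the split (decay on Input-Norm, none on Output-Norm) comes from the summary above.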

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The preconditioning view may explain performance differences observed when normalization is altered in non-transformer architectures.
  • The placement and reparameterization rules could be tested directly against alternative optimizers or learning-rate schedules not covered in the experiments.
  • Similar scale-vector adjustments might reduce the need for extensive hyperparameter search when scaling models beyond the 2B regime studied.
  • Measuring the amplification factor on linear mappings during training would provide an independent check on the optimization claim.

Load-bearing premise

The theoretical distinction between Input-Norm and Output-Norm layers and their opposing effects on weight decay generalizes beyond the analyzed settings to the full range of LLM pre-training configurations.

What would settle it

Train matched LLMs with and without scale vectors under identical optimizer and learning-rate conditions while directly measuring whether the predicted preconditioning amplification on linear-layer gradients appears or is absent.
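On a toy scale, the measurement described here can be sketched as follows. The code assumes plain gradient descent, a single linear map whose effective weights are W[i][j] * gamma[j], and a fixed gradient signal G shared by both models. It compares the size of one update to the effective weights with and without a trainable scale vector. All names are hypothetical, and nothing here reproduces the paper's experiments.

```python
import math

def effective(W, gamma):
    """Effective weights W_eff[i][j] = W[i][j] * gamma[j]."""
    return [[w * g for w, g in zip(row, gamma)] for row in W]

def frobenius(A, B):
    """Frobenius norm of A - B."""
    return math.sqrt(sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb)))

def amplification(W, gamma, G, lr):
    """Ratio of effective-weight update sizes: scaled model vs. scale-free model.

    G is the gradient of the loss with respect to the effective weights,
    held fixed so that both models see the same signal.
    """
    cols = range(len(gamma))
    # Chain rule for the model with a trainable scale vector.
    grad_W = [[g * gamma[j] for j, g in enumerate(row)] for row in G]
    grad_gamma = [sum(G[i][j] * W[i][j] for i in range(len(W))) for j in cols]
    W_new = [[w - lr * gw for w, gw in zip(rw, rg)] for rw, rg in zip(W, grad_W)]
    gamma_new = [gamma[j] - lr * grad_gamma[j] for j in cols]
    before = effective(W, gamma)
    step_scaled = frobenius(effective(W_new, gamma_new), before)
    # Scale-free model: the effective weights are trained directly.
    direct_new = [[w - lr * g for w, g in zip(rw, rg)] for rw, rg in zip(before, G)]
    step_plain = frobenius(direct_new, before)
    return step_scaled / step_plain

identity = [[1.0, 0.0], [0.0, 1.0]]
ratio = amplification(identity, [1.0, 1.0], identity, lr=0.1)
```

For this identity example the ratio works out to 1.9: the scaled model moves its effective weights almost twice as far as the scale-free model for the same gradient. A real test would log the same ratio per layer during matched pre-training runs and would have to account for the optimizer actually used.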

Original abstract

Normalization layers in modern large language models (LLMs) consist of a deterministic normalization operation and a learnable scale vector. While the normalization operation has been extensively studied, the scale vector remains poorly understood despite its ubiquitous use. In this work, we present a systematic study of scale vectors in LLMs from the perspectives of expressivity, optimization, and architectural structure. First, we show empirically that although scale vectors constitute only a negligible fraction of model parameters, removing them substantially degrades LLM pre-training. Our theory further shows that, in Pre-Norm architectures, scale vectors do not increase expressivity; instead, they improve optimization through a self-amplifying preconditioning effect on subsequent linear mappings. Second, we investigate the role of weight decay for scale vectors. By distinguishing Input-Norm and Output-Norm layers, we theoretically show that weight decay is beneficial for the former but harmful for the latter, due to their distinct roles in optimization and expressivity. Third, motivated by this understanding, we propose three lightweight and complementary improvements to scale vectors: branch-specific heterogeneity, improved placement around linear mappings, and magnitude-direction reparameterization. Both theory and experiments show that each improvement yields consistent gains. Finally, we combine these improvements into a unified scale-vector strategy and evaluate it through extensive LLM pre-training experiments on dense and mixture-of-experts models ranging from 0.12B to 2B parameters, across multiple optimizers and learning rate schedules, under industrial-scale token budgets. The unified strategy consistently achieves lower terminal loss than well-tuned baselines and exhibits more favorable scaling behavior, while adding negligible parameter and computational overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that scale vectors in LLM normalization layers, though negligible in parameter count, significantly impact pre-training performance. In Pre-Norm architectures, they enhance optimization through a self-amplifying preconditioning effect on linear mappings without increasing expressivity. Weight decay benefits Input-Norm layers but harms Output-Norm layers. The authors propose three improvements—branch-specific heterogeneity, improved placement, and magnitude-direction reparameterization—and show that a unified strategy leads to better performance in pre-training experiments from 0.12B to 2B parameters across optimizers and schedules.

Significance. This work provides both theoretical insight into scale vectors' role in optimization and practical recommendations that add negligible overhead. The extensive validation through pre-training runs at multiple scales, with direct tests via removal experiments and reparameterizations, is a notable strength. If the theoretical claims are verified, it could influence how normalization is handled in future LLM architectures.

major comments (2)
  1. Theoretical analysis: The self-amplifying preconditioning effect is central to the claim that scale vectors improve optimization rather than expressivity; however, the manuscript would benefit from explicit equations detailing the mechanism by which the scale vector preconditions subsequent linear mappings.
  2. Experimental evaluation (0.12B–2B models): The pre-training experiments report lower terminal loss for the unified strategy but omit error bars, number of independent runs, or exclusion criteria, which is important for assessing the reliability of the 'consistent gains' claim across optimizers and learning rate schedules.
minor comments (2)
  1. Some figures could include more detailed captions explaining the axes and what the different lines represent for clarity.
  2. The abstract is quite long and dense; consider condensing the contributions for better readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation. We address the major comments point-by-point below.

Point-by-point responses
  1. Referee: Theoretical analysis: The self-amplifying preconditioning effect is central to the claim that scale vectors improve optimization rather than expressivity; however, the manuscript would benefit from explicit equations detailing the mechanism by which the scale vector preconditions subsequent linear mappings.

    Authors: We agree that adding explicit equations will clarify the central theoretical claim. The manuscript already derives that scale vectors do not increase expressivity in Pre-Norm settings but instead precondition linear layers; we will expand the theory section in revision with the precise update equations showing the self-amplifying effect on effective step sizes. revision: yes

  2. Referee: Experimental evaluation (0.12B–2B models): The pre-training experiments report lower terminal loss for the unified strategy but omit error bars, number of independent runs, or exclusion criteria, which is important for assessing the reliability of the 'consistent gains' claim across optimizers and learning rate schedules.

    Authors: We acknowledge the value of reporting run statistics. Given the industrial-scale compute required for 0.12B–2B pre-training, each configuration used a single run (standard for this regime). We will revise the experimental section to state this explicitly, note the lack of error bars due to cost, and highlight that improvements hold consistently across scales, optimizers, and schedules as supporting evidence of robustness. revision: yes
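If repeated runs were available, the missing error bars could be computed with a few lines. The sketch below is a generic mean and standard-error summary over independent seeds. The function name is hypothetical and the input numbers are placeholders chosen for illustration, not results from the paper.

```python
import math
import statistics

def summarize_runs(terminal_losses):
    """Return (mean, standard error) of terminal losses from independent seeds."""
    if len(terminal_losses) < 2:
        raise ValueError("need at least two independent runs for an error bar")
    mean = statistics.mean(terminal_losses)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = statistics.stdev(terminal_losses) / math.sqrt(len(terminal_losses))
    return mean, sem

# Placeholder numbers for illustration only; they are NOT results from the paper.
placeholder_losses = [2.0, 2.2, 2.4]
mean, sem = summarize_runs(placeholder_losses)
```

With a single run per configuration, as the simulated rebuttal describes, this function raises an error by design: one run yields a point estimate but no estimate of run-to-run variance.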

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper's central claims rest on empirical removal experiments showing performance degradation, followed by independent theoretical analysis of expressivity versus optimization effects (self-amplifying preconditioning in Pre-Norm) and opposing weight-decay roles for Input-Norm versus Output-Norm layers. These distinctions are then used to motivate reparameterizations that are directly tested in pre-training runs from 0.12B to 2B parameters. No load-bearing step reduces a prediction to a fitted quantity by construction, invokes self-citation for uniqueness, or renames an input as output; the theory and experiments remain externally falsifiable and self-contained against the stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; scale vectors are treated as existing learnable components.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Improving Neural Network Training by Decoupling the Magnitude and Direction of Weight Vectors

    cs.LG 2026-06 unverdicted novelty 6.0

    MD Decoupling factorizes weights into fixed-norm directions and learnable per-row/column magnitudes updated at independent rates, improving Adam and Muon training stability and scale transfer without weight decay or warmup.
