pith. machine review for the scientific record.

arxiv: 2604.23434 · v1 · submitted 2026-04-25 · 💻 cs.LG · cs.CL

Recognition: unknown

When Does Removing LayerNorm Help? Activation Bounding as a Regime-Dependent Implicit Regularizer

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 08:25 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords dynamic tanh · layer norm removal · implicit regularization · activation saturation · transformer training · regime dependence · validation loss

The pith

Dynamic Tanh replaces LayerNorm with learned activation bounding that regularizes only in severely data-scarce regimes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether removing LayerNorm via Dynamic Tanh (DyT) is generally beneficial. DyT replaces normalization with a learned tanh(alpha x) that bounds activations, producing measurable saturation. This bounding improves validation loss when training data is extremely limited relative to model size, but increases loss once more tokens become available. Experiments on GPT-2 models from 64M to 3.78B parameters, trained on 1M to 118M tokens, show the switch in sign, with supporting checks on Llama and ViT architectures. Interventions such as HardTanh and alpha scaling isolate the saturation mechanism as the driver.

Core claim

Dynamic Tanh removes LayerNorm by bounding activations with a learned tanh(alpha x). This bounding functions as a regime-dependent implicit regularizer: it improves validation loss by 27.3% at 64M parameters on 1M tokens but worsens loss by 18.8% at 64M on 118M tokens. The benefit disappears with added capacity (flipping to a +1.7% penalty at 3.78B) while the penalty grows to +27.9%. Saturation reaches 49% of activations at 1M tokens versus 23% at 118M tokens. A 500-step saturation heuristic predicts the sign of DyT's effect with 75% raw accuracy on the calibration set. HardTanh reproduces the regime pattern, raising alpha at high data reduces the penalty, and vanilla training plus dropout matches DyT's data-rich loss.

What carries the argument

Learned tanh(alpha x) bounding in Dynamic Tanh, which induces activation saturation that supplies implicit regularization whose net effect flips with the data-to-compute ratio.
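For concreteness, a minimal PyTorch sketch of this mechanism as a drop-in LayerNorm replacement, following the DyT(x) = tanh(αx) · γ + β definition quoted in the Figure 3 caption. The module name, scalar-α choice, and initialization values are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Dynamic Tanh: bounded learned activation in place of LayerNorm.

    DyT(x) = tanh(alpha * x) * gamma + beta. The learnable alpha controls
    saturation depth; gamma and beta are the elementwise affine parameters
    LayerNorm would otherwise carry. Init values are illustrative guesses.
    """
    def __init__(self, dim: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), alpha_init))  # scalar bound rate
        self.gamma = nn.Parameter(torch.ones(dim))                # elementwise scale
        self.beta = nn.Parameter(torch.zeros(dim))                # elementwise shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # tanh bounds the activation before the affine map; large |alpha * x|
        # saturates, which is the implicit regularizer the paper studies.
        return torch.tanh(self.alpha * x) * self.gamma + self.beta
```

The HardTanh intervention from the abstract would amount to replacing torch.tanh with a hard clip such as torch.clamp(self.alpha * x, -1.0, 1.0), keeping the bound while removing tanh's smooth tails.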

If this is right

  • DyT is advantageous only when training remains compute-limited (T/P below roughly 1.84).
  • Raising the learned alpha at high token counts monotonically reduces DyT's performance penalty.
  • HardTanh clipping produces the same regime-dependent pattern as DyT.
  • Saturation rates above approximately 40% correlate with positive DyT effects on validation loss (a measurement sketch follows this list).
  • Llama-scale DyT collapse occurs specifically when SwiGLU gates saturate, separating convergence from failure across seeds.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The saturation heuristic could be used at the start of training to decide whether to keep or remove LayerNorm without running full ablations.
  • Similar learned bounding might substitute for other normalization layers in low-data vision or language settings.
  • Scaling curves that ignore regime-dependent regularization may mis-predict optimal model size when data is the bottleneck.

Load-bearing premise

Performance differences arise from the saturation-based regularization rather than other uncontrolled differences in the training runs.

What would settle it

A controlled comparison in which dropout or weight decay is tuned to match DyT's effective regularization strength without any bounding, yet the low-data benefit and high-data penalty both disappear.
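As a sketch of how that settling experiment could be run, assuming a hypothetical train_and_eval(dropout, weight_decay, tokens) helper that trains the vanilla LayerNorm model and returns final validation loss; nothing here is from the paper beyond the dropout p=0.5 data point.

```python
import itertools

# Hypothetical sweep grids; the paper only reports dropout p=0.5.
DROPOUT_GRID = (0.0, 0.1, 0.3, 0.5)
WEIGHT_DECAY_GRID = (0.0, 0.01, 0.1)

def match_then_transfer(train_and_eval, dyt_loss_low, low_tokens, high_tokens):
    """Match DyT's low-data loss with explicit regularizers, then probe high data.

    Step 1 finds the (dropout, weight_decay) pair whose low-data validation
    loss is closest to DyT's. Step 2 reruns that matched setting at high
    data: if the matched vanilla run reproduces neither DyT's low-data
    benefit nor its high-data penalty, the regime flip is specific to
    bounding rather than to generic regularization strength.
    """
    best = min(
        itertools.product(DROPOUT_GRID, WEIGHT_DECAY_GRID),
        key=lambda pw: abs(train_and_eval(*pw, low_tokens) - dyt_loss_low),
    )
    return best, train_and_eval(*best, high_tokens)
```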

Figures

Figures reproduced from arXiv: 2604.23434 by Lucky Verma.

Figure 1. Phase diagrams reveal opposite regime preferences. ∆ validation loss vs. vanilla across the four primary GPT-2-family scales (64M–1.3B) and data regimes (3 seeds each); the Scale 5 stress test appears in… view at source ↗
Figure 2. Activation saturation tracks DyT's effect. Light red/blue regions show the threshold rule's… view at source ↗
Figure 3. The DyT regularization mechanism is regime-dependent. DyT is defined as DyT(x) = tanh(αx) · γ + β; the learnable α controls saturation depth. The same architecture and hyperparameters produce opposite outcomes depending on the token-to-parameter ratio r = T/P. Left: In the 64M/1M low-T/P cell (r = 0.016), 49% of DyT activations are saturated (|αx| > 2), creating a capacity bottleneck that prevents memo… view at source ↗
Figure 4. Learned α values per DyT layer (64M model). At 1M tokens (overparameterized), deeper layers learn larger α, reducing saturation to preserve some capacity. At 118M tokens, α values are lower throughout, reflecting reduced saturation pressure. Layer numbering: odd = pre-attention (ln_1), even = pre-FFN (ln_2), 25 = final (ln_f). view at source ↗
Figure 5. Representative convergence curves at 118M tokens (64M params, seed 1337). view at source ↗
Figure 6. Train/val loss at 64M params, 1M tokens (seed=1337). Vanilla memorizes completely… view at source ↗
Figure 7. Model-scale dependence (Scales 1–5, 3 seeds). Main panel: DyT's 1M regularization… view at source ↗
read the original abstract

Dynamic Tanh (DyT) removes LayerNorm by bounding activations with a learned tanh(alpha x). We show that this bounding is a regime-dependent implicit regularizer, not a uniformly beneficial replacement. Across GPT-2-family models spanning 64M to 3.78B parameters and 1M to 118M tokens, with Llama and ViT cross-checks, DyT improves validation loss by 27.3% at 64M/1M but worsens it by 18.8% at 64M/118M; the 1M benefit vanishes with capacity (+1.7% at 3.78B), while the 118M penalty reaches +27.9%. The mechanism is measurable: 49% of DyT activations saturate at 1M versus 23% at 118M, and a 500-step saturation heuristic classifies DyT's sign with 75% raw in-sample accuracy on the 12-cell GPT-2 calibration set (AUC 0.75; 64% when adding Scale 5 stress cells), correctly labels 3/3 Llama checks, but only reaches 50% raw leave-one-scale-out accuracy. Three interventions support the bounding explanation: HardTanh reproduces the regime pattern, increasing alpha at 118M monotonically reduces DyT's penalty, and vanilla+dropout(p=0.5) matches DyT's data-rich loss. We also localize Llama-DyT collapse to SwiGLU gating, where saturation separates collapse from convergence in a 3-seed component ablation (r=0.94). Scope: all experiments are compute-limited (T/P < 1.84), below Chinchilla-optimal training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript evaluates Dynamic Tanh (DyT) as a LayerNorm replacement via learned tanh(αx) activation bounding. It claims this bounding functions as a regime-dependent implicit regularizer, improving validation loss by 27.3% at 64M/1M tokens but worsening it by 18.8% at 64M/118M (with the low-data benefit vanishing at 3.78B scale and the high-data penalty reaching 27.9%). The mechanism is supported by saturation rates (49% at 1M vs. 23% at 118M), a 500-step saturation heuristic (75% in-sample accuracy, AUC 0.75 on GPT-2 calibration; 50% leave-one-scale-out), and three interventions (HardTanh reproduction, α-increase at 118M, dropout(p=0.5) matching). Cross-checks on Llama/ViT and localization of Llama-DyT collapse to SwiGLU (r=0.94) are included. All experiments are compute-limited (T/P < 1.84, below Chinchilla optimality).

Significance. If the causal attribution to activation bounding holds, the work supplies a practical, measurable criterion for when LayerNorm removal is beneficial and a concrete proxy (saturation heuristic) for implicit regularization strength. The scale sweeps, cross-architecture checks, and explicit performance deltas constitute a useful empirical contribution in the low-compute regime, though the limited scope restricts immediate generalization.

major comments (3)
  1. [§4.3] Intervention experiments: HardTanh reproduces the regime-dependent pattern but still removes LayerNorm, so observed differences cannot be attributed specifically to tanh bounding rather than to ancillary changes in per-layer scaling, gradient flow, or optimization trajectory. This leaves the central causal claim under-isolated.
  2. [§5.1] Saturation heuristic: The 500-step classifier reaches only 75% raw in-sample accuracy (AUC 0.75) on the 12-cell GPT-2 set and falls to 50% leave-one-scale-out; this modest performance weakens the assertion that saturation percentages (49% vs. 23%) measurably explain DyT's sign and performance differences.
  3. [§4.4] α-tuning results: Monotonic reduction of the 118M DyT penalty with increasing α is consistent with tunable bounding, yet appears inconsistent with the lower saturation rate (23%) being the source of the penalty, since higher α should increase saturation if bounding drives the effect.
minor comments (3)
  1. [Methods] Key deltas (27.3%, 18.8%, etc.) are reported without error bars or seed statistics, complicating assessment of whether regime differences are statistically reliable.
  2. [Abstract, §3] Baseline details for vanilla comparisons (hyperparameter re-tuning after LayerNorm removal, optimizer settings) are insufficiently specified.
  3. [Figures] Activation-distribution plots would benefit from explicit saturation-threshold annotations and per-scale labeling.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and constructive feedback on our manuscript. We have carefully considered each major comment and made revisions to the paper to address the concerns about causal attribution, the strength of the saturation heuristic, and the consistency of the α-tuning results. Our point-by-point responses are provided below.

read point-by-point responses
  1. Referee: [§4.3] Intervention experiments: HardTanh reproduces the regime-dependent pattern but still removes LayerNorm, so observed differences cannot be attributed specifically to tanh bounding rather than to ancillary changes in per-layer scaling, gradient flow, or optimization trajectory. This leaves the central causal claim under-isolated.

    Authors: We agree that using HardTanh as an intervention does not completely isolate the contribution of the tanh bounding function from other effects of removing LayerNorm, such as changes in scaling or optimization dynamics. The HardTanh experiment was intended to show that a fixed bounding activation can reproduce the regime-dependent behavior observed with DyT, supporting the role of activation bounding as the key mechanism. To better address this, we have revised the discussion in §4.3 to emphasize that the interventions collectively point to bounding rather than the specific functional form, and we acknowledge the need for more granular controls in future work. We have not added new experiments due to computational limitations. revision: partial

  2. Referee: [§5.1] Saturation heuristic: The 500-step classifier reaches only 75% raw in-sample accuracy (AUC 0.75) on the 12-cell GPT-2 set and falls to 50% leave-one-scale-out; this modest performance weakens the assertion that saturation percentages (49% vs. 23%) measurably explain DyT's sign and performance differences.

    Authors: The referee correctly notes the limitations in the predictive power of the saturation heuristic. With 75% in-sample accuracy and only 50% in leave-one-scale-out validation, the heuristic provides a suggestive but not definitive link between saturation rates and performance differences. We have updated §5.1 to describe the heuristic more cautiously as a preliminary indicator of implicit regularization strength rather than a reliable classifier, and we highlight the cross-validation results to temper the claims. This revision better reflects the empirical support while retaining the observation that saturation rates differ substantially across regimes. revision: yes

  3. Referee: [§4.4] α-tuning results: Monotonic reduction of the 118M DyT penalty with increasing α is consistent with tunable bounding, yet appears inconsistent with the lower saturation rate (23%) being the source of the penalty, since higher α should increase saturation if bounding drives the effect.

    Authors: This is a perceptive point regarding potential inconsistency in our mechanistic interpretation. If the high-data penalty were primarily due to insufficient bounding (low saturation), increasing α should exacerbate the penalty by increasing saturation, yet we observe the opposite. We interpret this as suggesting that the effect of bounding may depend on factors beyond the raw saturation percentage, such as the timing or distribution of saturated activations during training. We have added a paragraph in §4.4 discussing this observation and its implications for the regime-dependent regularizer hypothesis, noting it as an area for further investigation. No changes to the experimental results were needed. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical measurements and ablations stand independently

full rationale

The paper advances no derivation chain, first-principles equations, or predictions that reduce to fitted quantities defined within the work itself. All load-bearing claims rest on direct experimental outcomes: validation-loss deltas across scales, measured saturation fractions (49% vs 23%), an explicitly in-sample saturation heuristic whose accuracy is reported as such, and three ablation interventions (HardTanh, alpha scaling, dropout matching). No self-citation is invoked to justify uniqueness or to carry the central mechanism; the scope statement and cross-checks (Llama, ViT) are external to any internal fit. The analysis is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Paper is purely empirical; no free parameters, axioms, or invented entities are introduced beyond standard transformer training assumptions.

pith-pipeline@v0.9.0 · 5615 in / 1136 out tokens · 44724 ms · 2026-05-08T08:25:51.173144+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

34 extracted references · 28 canonical work pages · 9 internal anchors

  1. [1]

    Does your optimizer care how you normalize? Normalization-Optimizer coupling in LLM training

    Abdelrahman Abouzeid. Does your optimizer care how you normalize? Normalization-Optimizer coupling in LLM training. ArXiv preprint, abs/2604.01563, 2026. URL https://arxiv.org/abs/2604.01563

  2. [2]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. ArXiv preprint, abs/2305.13245, 2023. URL https://arxiv.org/abs/2305.13245

  3. [3]

    Subcritical Signal Propagation at Initialization in Normalization-Free Transformers

    Sergey Alekseev. Subcritical signal propagation at initialization in normalization-free transformers. ArXiv preprint, abs/2604.11890, 2026. URL https://arxiv.org/abs/2604.11890

  4. [4]

    Bounded hyperbolic tangent: A stable and efficient alternative to pre-layer normalization in large language models

    Hoyoon Byun, Youngjun Choi, Taero Kim, Sungrae Park, and Kyungwoo Song. Bounded hyperbolic tangent: A stable and efficient alternative to pre-layer normalization in large language models. ArXiv preprint, abs/2601.09719, 2026. URL https://arxiv.org/abs/2601.09719

  5. [5]

    WeightWatcher: Diagnostics for deep neural networks

    CalculatedContent. WeightWatcher: Diagnostics for deep neural networks. https://github.com/CalculatedContent/WeightWatcher, 2023

  6. [6]

    DINT transformer

    Yueyang Cang, Yuhang Liu, Xiaoteng Zhang, Erlu Zhao, and Li Shi. DINT transformer. ArXiv preprint, abs/2501.17486, 2025. URL https://arxiv.org/abs/2501.17486

  7. [7]

    Stronger normalization-free transformers

    Mingzhi Chen, Taiming Lu, Jiachen Zhu, Mingjie Sun, and Zhuang Liu. Stronger normalization-free transformers. ArXiv preprint, abs/2512.10938, 2025. URL https://arxiv.org/abs/2512.10938

  8. [8]

    Why do we need weight decay in modern deep learning?

    Francesco D'Angelo, Maksym Andriushchenko, Aditya Varre, and Nicolas Flammarion. Why do we need weight decay in modern deep learning? ArXiv preprint, abs/2310.04415, 2023. URL https://arxiv.org/abs/2310.04415

  9. [9]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ArXiv preprint, abs/2010.11929, 2020. URL https://arxiv.org/abs/2010.11929

  10. [10]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. ArXiv preprint, abs/2203.15556, 2022. URL https://arxiv.org/abs/2203.15556

  11. [11]

    Gated removal of normalization in transformers enables stable training and efficient inference

    Andrei Kanavalau, Carmen Amo Alonso, and Sanjay Lall. Gated removal of normalization in transformers enables stable training and efficient inference. ArXiv preprint, abs/2602.10408, 2026. URL https://arxiv.org/abs/2602.10408

  12. [12]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. ArXiv preprint, abs/2001.08361, 2020. URL https://arxiv.org/abs/2001.08361

  13. [13]

    Normalization in attention dynamics

    Nikita Karagodin, Shu Ge, Yury Polyanskiy, and Philippe Rigollet. Normalization in attention dynamics. ArXiv preprint, abs/2510.22026, 2025. URL https://arxiv.org/abs/2510.22026

  14. [14]

    nanoGPT

    Andrej Karpathy. nanoGPT, 2023. URL https://github.com/karpathy/nanoGPT

  15. [15]

    Peri-LN: Revisiting normalization layer in the transformer architecture

    Jeonghoon Kim, Byeongchan Lee, Cheonbok Park, Yeontaek Oh, Beomjun Kim, Taehwan Yoo, Seongjin Shin, Dongyoon Han, Jinwoo Shin, and Kang Min Yoo. Peri-LN: Revisiting normalization layer in the transformer architecture. ArXiv preprint, abs/2502.02732, 2025. URL https://arxiv.org/abs/2502.02732

  16. [16]

    Grouped differential attention

    Junghwan Lim, Sungmin Lee, Dongseok Kim, Wai Ting Cheung, Beomgyu Kim, Taehwan Kim, Haesol Lee, Junhyeok Lee, Dongpin Oh, and Eunhwan Park. Grouped differential attention. ArXiv preprint, abs/2510.06949, 2025. URL https://arxiv.org/abs/2510.06949

  17. [17]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7

  18. [18]

    nGPT: Normalized transformer with representation learning on the hypersphere

    Ilya Loshchilov, Cheng-Ping Hsieh, Simeng Sun, and Boris Ginsburg. nGPT: Normalized transformer with representation learning on the hypersphere. ArXiv preprint, abs/2410.01131, 2024. URL https://arxiv.org/abs/2410.01131

  19. [19]

    Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data

    Charles H Martin, Tongsu Serena Peng, and Michael W Mahoney. Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data. Nature Communications, 12:4122, 2021. doi:10.1038/s41467-021-24025-8

  20. [20]

    Do transformer modifications transfer across implementations and applications?

    Sharan Narang, Hyung Won Chung, Yi Tay, William Fedus, Thibault Fevry, Michael Matena, Karishma Malkan, Noah Fiedel, Noam Shazeer, Zhenzhong Lan, Yanqi Zhou, Wei Li, Nan Ding, Jake Marcus, Adam Roberts, and Colin Raffel. Do transformer modifications transfer across implementations and applications? ArXiv preprint, abs/2102.11972, 2021. URL https://arxiv.org/abs/2102.11972

  21. [21]

    The LAMBADA dataset: Word prediction requiring a broad discourse context

    Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pa...

  22. [22]

    Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

    Zihan Qiu et al. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. ArXiv preprint, abs/2505.06708, 2025. URL https://arxiv.org/abs/2505.06708

  23. [23]

    GLU Variants Improve Transformer

    Noam Shazeer. GLU variants improve transformer. ArXiv preprint, abs/2002.05202, 2020. URL https://arxiv.org/abs/2002.05202

  24. [24]

    Impact of layer norm on memorization and generalization in transformers

    Rishi Singhal and Jung-Eun Kim. Impact of layer norm on memorization and generalization in transformers. ArXiv preprint, abs/2511.10566, 2025. URL https://arxiv.org/abs/2511.10566

  25. [25]

    The implicit bias of gradient descent on separable data

    Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. Journal of Machine Learning Research, 19(70):1–57, 2018. URL https://jmlr.org/papers/v19/18-188.html

  26. [26]

    RoFormer: Enhanced Transformer with Rotary Position Embedding

    Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. ArXiv preprint, abs/2104.09864, 2021. URL https://arxiv.org/abs/2104.09864

  27. [27]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models. ArXiv preprint, abs/2302.13971, 2023. URL https://arxiv.org/abs/2302.13971

  28. [28]

    BLiMP: The benchmark of linguistic minimal pairs for English

    Alex Warstadt, Alicia Parrish, Haokun Liu, Anhad Mohananey, Wei Peng, Sheng-Fu Wang, and Samuel R. Bowman. BLiMP: The benchmark of linguistic minimal pairs for English. Transactions of the Association for Computational Linguistics, 8:377–392, 2020. doi:10.1162/tacl_a_00321. URL https://aclanthology.org/2020.tacl-1.25

  29. [29]

    PyHessian: Neural networks through the lens of the Hessian

    Zhewei Yao, Amir Gholami, Kurt Keutzer, and Michael W. Mahoney. PyHessian: Neural networks through the lens of the Hessian. ArXiv preprint, abs/1912.07145, 2019. URL https://arxiv.org/abs/1912.07145

  30. [30]

    Differential transformer

    Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, and Furu Wei. Differential transformer. ArXiv preprint, abs/2410.05258, 2024. URL https://arxiv.org/abs/2410.05258

  31. [31]

    Differential transformer v2

    Tianzhu Ye, Li Dong, Yutao Sun, and Furu Wei. Differential transformer v2. Microsoft Research Hugging Face blog, 2026. URL https://huggingface.co/blog/microsoft/diff-attn-v2

  32. [32]

    Root mean square layer normalization

    Biao Zhang and Rico Sennrich. Root mean square layer normalization. In Advances in Neural Information Processing Systems 32, pages 12360–12371, 2019. URL https://proceedings.neurips.cc/paper/2019/hash/1e8a19426224ca89e83cef47f1e7f53b-Abstract.html

  33. [33]

    Transformers without normalization

    Jiachen Zhu, Xinlei Chen, Kaiming He, Yann LeCun, and Zhuang Liu. Transformers without normalization. ArXiv preprint, abs/2503.10622, 2025. URL https://arxiv.org/abs/2503.10622

  34. [34]

    Just one layer norm guarantees stable extrapolation

    Juliusz Ziomek, George Whittle, and Michael A. Osborne. Just one layer norm guarantees stable extrapolation. ArXiv preprint, abs/2505.14512, 2025. URL https://arxiv.org/abs/2505.14512