pith. machine review for the scientific record.

arxiv: 2604.23434 · v1 · submitted 2026-04-25 · 💻 cs.LG · cs.CL

Recognition: unknown

When Does Removing LayerNorm Help? Activation Bounding as a Regime-Dependent Implicit Regularizer

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 08:25 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords dynamic tanh · layer norm removal · implicit regularization · activation saturation · transformer training · regime dependence · validation loss

The pith

Dynamic Tanh replaces LayerNorm with learned activation bounding that regularizes only in severely data-scarce regimes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether removing LayerNorm via Dynamic Tanh (DyT) is generally beneficial. DyT replaces normalization with a learned tanh(alpha x) that bounds activations, producing measurable saturation. This bounding improves validation loss when training data is extremely limited relative to model size, but increases loss once more tokens become available. Experiments on GPT-2 models from 64M to 3.78B parameters, trained on 1M to 118M tokens, show the switch in sign, with supporting checks on Llama and ViT architectures. Interventions such as HardTanh and alpha scaling isolate the saturation mechanism as the driver.

Core claim

Dynamic Tanh removes LayerNorm by bounding activations with a learned tanh(alpha x). This bounding functions as a regime-dependent implicit regularizer: it improves validation loss by 27.3% at 64M parameters on 1M tokens but worsens loss by 18.8% at 64M on 118M tokens. The benefit disappears with added capacity (flipping to a +1.7% penalty at 3.78B) while the penalty grows to +27.9%. Saturation reaches 49% of activations at 1M tokens versus 23% at 118M tokens. A 500-step saturation heuristic predicts the sign of DyT's effect with 75% raw accuracy on the calibration set. HardTanh reproduces the regime pattern, raising alpha at high data reduces the penalty, and vanilla training plus dropout matches DyT's data-rich loss.

What carries the argument

Learned tanh(alpha x) bounding in Dynamic Tanh, which induces activation saturation that supplies implicit regularization whose net effect flips with the data-to-compute ratio.
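For concreteness, a minimal PyTorch sketch of this mechanism as a drop-in LayerNorm replacement, following the DyT(x) = tanh(αx) · γ + β definition quoted in the Figure 3 caption. The module name, scalar-α choice, and initialization values are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Dynamic Tanh: bounded learned activation in place of LayerNorm.

    DyT(x) = tanh(alpha * x) * gamma + beta. The learnable alpha controls
    saturation depth; gamma and beta are the elementwise affine parameters
    LayerNorm would otherwise carry. Init values are illustrative guesses.
    """
    def __init__(self, dim: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), alpha_init))  # scalar bound rate
        self.gamma = nn.Parameter(torch.ones(dim))                # elementwise scale
        self.beta = nn.Parameter(torch.zeros(dim))                # elementwise shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # tanh bounds the activation before the affine map; large |alpha * x|
        # saturates, which is the implicit regularizer the paper studies.
        return torch.tanh(self.alpha * x) * self.gamma + self.beta
```

The HardTanh intervention from the abstract would amount to replacing torch.tanh with a hard clip such as torch.clamp(self.alpha * x, -1.0, 1.0), keeping the bound while removing tanh's smooth tails.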

If this is right

  • DyT is advantageous only when training remains compute-limited (T/P below roughly 1.84).
  • Raising the learned alpha at high token counts monotonically reduces DyT's performance penalty.
  • HardTanh clipping produces the same regime-dependent pattern as DyT.
  • Saturation rates above approximately 40% correlate with positive DyT effects on validation loss (a measurement sketch follows this list).
  • Llama-scale DyT collapse occurs specifically when SwiGLU gates saturate, separating convergence from failure across seeds.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The saturation heuristic could be used at the start of training to decide whether to keep or remove LayerNorm without running full ablations.
  • Similar learned bounding might substitute for other normalization layers in low-data vision or language settings.
  • Scaling curves that ignore regime-dependent regularization may mis-predict optimal model size when data is the bottleneck.

Load-bearing premise

Performance differences arise from the saturation-based regularization rather than other uncontrolled differences in the training runs.

What would settle it

A controlled comparison in which dropout or weight decay is tuned to match DyT's effective regularization strength without any bounding, yet the low-data benefit and high-data penalty both disappear.
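As a sketch of how that settling experiment could be run, assuming a hypothetical train_and_eval(dropout, weight_decay, tokens) helper that trains the vanilla LayerNorm model and returns final validation loss; nothing here is from the paper beyond the dropout p=0.5 data point.

```python
import itertools

# Hypothetical sweep grids; the paper only reports dropout p=0.5.
DROPOUT_GRID = (0.0, 0.1, 0.3, 0.5)
WEIGHT_DECAY_GRID = (0.0, 0.01, 0.1)

def match_then_transfer(train_and_eval, dyt_loss_low, low_tokens, high_tokens):
    """Match DyT's low-data loss with explicit regularizers, then probe high data.

    Step 1 finds the (dropout, weight_decay) pair whose low-data validation
    loss is closest to DyT's. Step 2 reruns that matched setting at high
    data: if the matched vanilla run reproduces neither DyT's low-data
    benefit nor its high-data penalty, the regime flip is specific to
    bounding rather than to generic regularization strength.
    """
    best = min(
        itertools.product(DROPOUT_GRID, WEIGHT_DECAY_GRID),
        key=lambda pw: abs(train_and_eval(*pw, low_tokens) - dyt_loss_low),
    )
    return best, train_and_eval(*best, high_tokens)
```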

Figures

Figures reproduced from arXiv: 2604.23434 by Lucky Verma.

Figure 1. Phase diagrams reveal opposite regime preferences. ∆ validation loss vs. vanilla across the four primary GPT-2-family scales (64M–1.3B) and data regimes (3 seeds each); the Scale 5 stress test appears in… view at source ↗
Figure 2. Activation saturation tracks DyT's effect. Light red/blue regions show the threshold rule's… view at source ↗
Figure 3. The DyT regularization mechanism is regime-dependent. DyT is defined as DyT(x) = tanh(αx) · γ + β; the learnable α controls saturation depth. The same architecture and hyperparameters produce opposite outcomes depending on the token-to-parameter ratio r = T/P. Left: In the 64M/1M low-T/P cell (r = 0.016), 49% of DyT activations are saturated (|αx| > 2), creating a capacity bottleneck that prevents memo… view at source ↗
Figure 4. Learned α values per DyT layer (64M model). At 1M tokens (overparameterized), deeper layers learn larger α, reducing saturation to preserve some capacity. At 118M tokens, α values are lower throughout, reflecting reduced saturation pressure. Layer numbering: odd = pre-attention (ln_1), even = pre-FFN (ln_2), 25 = final (ln_f). view at source ↗
Figure 5. Representative convergence curves at 118M tokens (64M params, seed 1337). view at source ↗
Figure 6. Train/val loss at 64M params, 1M tokens (seed=1337). Vanilla memorizes completely… view at source ↗
Figure 7. Model-scale dependence (Scales 1–5, 3 seeds). Main panel: DyT's 1M regularization… view at source ↗
read the original abstract

Dynamic Tanh (DyT) removes LayerNorm by bounding activations with a learned tanh(alpha x). We show that this bounding is a regime-dependent implicit regularizer, not a uniformly beneficial replacement. Across GPT-2-family models spanning 64M to 3.78B parameters and 1M to 118M tokens, with Llama and ViT cross-checks, DyT improves validation loss by 27.3% at 64M/1M but worsens it by 18.8% at 64M/118M; the 1M benefit vanishes with capacity (+1.7% at 3.78B), while the 118M penalty reaches +27.9%. The mechanism is measurable: 49% of DyT activations saturate at 1M versus 23% at 118M, and a 500-step saturation heuristic classifies DyT's sign with 75% raw in-sample accuracy on the 12-cell GPT-2 calibration set (AUC 0.75; 64% when adding Scale 5 stress cells), correctly labels 3/3 Llama checks, but only reaches 50% raw leave-one-scale-out accuracy. Three interventions support the bounding explanation: HardTanh reproduces the regime pattern, increasing alpha at 118M monotonically reduces DyT's penalty, and vanilla+dropout(p=0.5) matches DyT's data-rich loss. We also localize Llama-DyT collapse to SwiGLU gating, where saturation separates collapse from convergence in a 3-seed component ablation (r=0.94). Scope: all experiments are compute-limited (T/P < 1.84), below Chinchilla-optimal training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript evaluates Dynamic Tanh (DyT) as a LayerNorm replacement via learned tanh(αx) activation bounding. It claims this bounding functions as a regime-dependent implicit regularizer, improving validation loss by 27.3% at 64M/1M tokens but worsening it by 18.8% at 64M/118M (with the low-data benefit vanishing at 3.78B scale and the high-data penalty reaching 27.9%). The mechanism is supported by saturation rates (49% at 1M vs. 23% at 118M), a 500-step saturation heuristic (75% in-sample accuracy, AUC 0.75 on GPT-2 calibration; 50% leave-one-scale-out), and three interventions (HardTanh reproduction, α-increase at 118M, dropout(p=0.5) matching). Cross-checks on Llama/ViT and localization of Llama-DyT collapse to SwiGLU (r=0.94) are included. All experiments are compute-limited (T/P < 1.84, below Chinchilla optimality).

Significance. If the causal attribution to activation bounding holds, the work supplies a practical, measurable criterion for when LayerNorm removal is beneficial and a concrete proxy (saturation heuristic) for implicit regularization strength. The scale sweeps, cross-architecture checks, and explicit performance deltas constitute a useful empirical contribution in the low-compute regime, though the limited scope restricts immediate generalization.

major comments (3)
  1. [§4.3] Intervention experiments: HardTanh reproduces the regime-dependent pattern but still removes LayerNorm, so observed differences cannot be attributed specifically to tanh bounding rather than to ancillary changes in per-layer scaling, gradient flow, or optimization trajectory. This leaves the central causal claim under-isolated.
  2. [§5.1] Saturation heuristic: The 500-step classifier reaches only 75% raw in-sample accuracy (AUC 0.75) on the 12-cell GPT-2 set and falls to 50% leave-one-scale-out; this modest performance weakens the assertion that saturation percentages (49% vs. 23%) measurably explain DyT's sign and performance differences.
  3. [§4.4] α-tuning results: Monotonic reduction of the 118M DyT penalty with increasing α is consistent with tunable bounding, yet appears inconsistent with the lower saturation rate (23%) being the source of the penalty, since higher α should increase saturation if bounding drives the effect.
minor comments (3)
  1. [Methods] Key deltas (27.3%, 18.8%, etc.) are reported without error bars or seed statistics, complicating assessment of whether regime differences are statistically reliable.
  2. [Abstract, §3] Baseline details for vanilla comparisons (hyperparameter re-tuning after LayerNorm removal, optimizer settings) are insufficiently specified.
  3. [Figures] Activation-distribution plots would benefit from explicit saturation-threshold annotations and per-scale labeling.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and constructive feedback on our manuscript. We have carefully considered each major comment and made revisions to the paper to address the concerns about causal attribution, the strength of the saturation heuristic, and the consistency of the α-tuning results. Our point-by-point responses are provided below.

read point-by-point responses
  1. Referee: [§4.3] Intervention experiments: HardTanh reproduces the regime-dependent pattern but still removes LayerNorm, so observed differences cannot be attributed specifically to tanh bounding rather than to ancillary changes in per-layer scaling, gradient flow, or optimization trajectory. This leaves the central causal claim under-isolated.

    Authors: We agree that using HardTanh as an intervention does not completely isolate the contribution of the tanh bounding function from other effects of removing LayerNorm, such as changes in scaling or optimization dynamics. The HardTanh experiment was intended to show that a fixed bounding activation can reproduce the regime-dependent behavior observed with DyT, supporting the role of activation bounding as the key mechanism. To better address this, we have revised the discussion in §4.3 to emphasize that the interventions collectively point to bounding rather than the specific functional form, and we acknowledge the need for more granular controls in future work. We have not added new experiments due to computational limitations. revision: partial

  2. Referee: [§5.1] Saturation heuristic: The 500-step classifier reaches only 75% raw in-sample accuracy (AUC 0.75) on the 12-cell GPT-2 set and falls to 50% leave-one-scale-out; this modest performance weakens the assertion that saturation percentages (49% vs. 23%) measurably explain DyT's sign and performance differences.

    Authors: The referee correctly notes the limitations in the predictive power of the saturation heuristic. With 75% in-sample accuracy and only 50% in leave-one-scale-out validation, the heuristic provides a suggestive but not definitive link between saturation rates and performance differences. We have updated §5.1 to describe the heuristic more cautiously as a preliminary indicator of implicit regularization strength rather than a reliable classifier, and we highlight the cross-validation results to temper the claims. This revision better reflects the empirical support while retaining the observation that saturation rates differ substantially across regimes. revision: yes

  3. Referee: [§4.4] α-tuning results: Monotonic reduction of the 118M DyT penalty with increasing α is consistent with tunable bounding, yet appears inconsistent with the lower saturation rate (23%) being the source of the penalty, since higher α should increase saturation if bounding drives the effect.

    Authors: This is a perceptive point regarding potential inconsistency in our mechanistic interpretation. If the high-data penalty were primarily due to insufficient bounding (low saturation), increasing α should exacerbate the penalty by increasing saturation, yet we observe the opposite. We interpret this as suggesting that the effect of bounding may depend on factors beyond the raw saturation percentage, such as the timing or distribution of saturated activations during training. We have added a paragraph in §4.4 discussing this observation and its implications for the regime-dependent regularizer hypothesis, noting it as an area for further investigation. No changes to the experimental results were needed. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical measurements and ablations stand independently

full rationale

The paper advances no derivation chain, first-principles equations, or predictions that reduce to fitted quantities defined within the work itself. All load-bearing claims rest on direct experimental outcomes: validation-loss deltas across scales, measured saturation fractions (49% vs 23%), an explicitly in-sample saturation heuristic whose accuracy is reported as such, and three ablation interventions (HardTanh, alpha scaling, dropout matching). No self-citation is invoked to justify uniqueness or to carry the central mechanism; the scope statement and cross-checks (Llama, ViT) are external to any internal fit. The analysis is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Paper is purely empirical; no free parameters, axioms, or invented entities are introduced beyond standard transformer training assumptions.

pith-pipeline@v0.9.0 · 5615 in / 1136 out tokens · 44724 ms · 2026-05-08T08:25:51.173144+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

34 extracted references · 28 canonical work pages · 9 internal anchors

  1. [1]

    Does your optimizer care how you normalize? Normalization-Optimizer coupling in LLM training

    Abdelrahman Abouzeid. Does your optimizer care how you normalize? Normalization-Optimizer coupling in LLM training. ArXiv preprint, abs/2604.01563, 2026. URL https://arxiv.org/abs/2604.01563

  2. [2]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. ArXiv preprint, abs/2305.13245, 2023. URL https://arxiv.org/abs/2305.13245

  3. [3]

    Subcritical Signal Propagation at Initialization in Normalization-Free Transformers

    Sergey Alekseev. Subcritical signal propagation at initialization in normalization-free transformers. ArXiv preprint, abs/2604.11890, 2026. URL https://arxiv.org/abs/2604.11890

  4. [4]

    Bounded hyperbolic tangent: A stable and efficient alternative to pre-layer normalization in large language models

    Hoyoon Byun, Youngjun Choi, Taero Kim, Sungrae Park, and Kyungwoo Song. Bounded hyperbolic tangent: A stable and efficient alternative to pre-layer normalization in large language models. ArXiv preprint, abs/2601.09719, 2026. URL https://arxiv.org/abs/2601.09719

  5. [5]

    WeightWatcher: Diagnostics for deep neural networks

    CalculatedContent. WeightWatcher: Diagnostics for deep neural networks. https://github.com/CalculatedContent/WeightWatcher, 2023

  6. [6]

    DINT transformer

    Yueyang Cang, Yuhang Liu, Xiaoteng Zhang, Erlu Zhao, and Li Shi. DINT transformer. ArXiv preprint, abs/2501.17486, 2025. URL https://arxiv.org/abs/2501.17486

  7. [7]

    Stronger normalization-free transformers

    Mingzhi Chen, Taiming Lu, Jiachen Zhu, Mingjie Sun, and Zhuang Liu. Stronger normalization-free transformers. ArXiv preprint, abs/2512.10938, 2025. URL https://arxiv.org/abs/2512.10938

  8. [8]

    Why do we need weight decay in modern deep learning?

    Francesco D'Angelo, Maksym Andriushchenko, Aditya Varre, and Nicolas Flammarion. Why do we need weight decay in modern deep learning? ArXiv preprint, abs/2310.04415, 2023. URL https://arxiv.org/abs/2310.04415

  9. [9]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ArXiv preprint, abs/2010.11929, 2020. URL https://arxiv.org/abs/2010.11929

  10. [10]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. ArXiv preprint, abs/2203.15556, 2022. URL https://arxiv.org/abs/2203.15556

  11. [11]

    Gated removal of normalization in transformers enables stable training and efficient inference

    Andrei Kanavalau, Carmen Amo Alonso, and Sanjay Lall. Gated removal of normalization in transformers enables stable training and efficient inference. ArXiv preprint, abs/2602.10408, 2026. URL https://arxiv.org/abs/2602.10408

  12. [12]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. ArXiv preprint, abs/2001.08361, 2020. URL https://arxiv.org/abs/2001.08361

  13. [13]

    Normalization in attention dynamics

    Nikita Karagodin, Shu Ge, Yury Polyanskiy, and Philippe Rigollet. Normalization in attention dynamics. ArXiv preprint, abs/2510.22026, 2025. URL https://arxiv.org/abs/2510.22026

  14. [14]

    nanoGPT

    Andrej Karpathy. nanoGPT, 2023. URL https://github.com/karpathy/nanoGPT

  15. [15]

    Peri-LN: Revisiting normalization layer in the transformer architecture

    Jeonghoon Kim, Byeongchan Lee, Cheonbok Park, Yeontaek Oh, Beomjun Kim, Taehwan Yoo, Seongjin Shin, Dongyoon Han, Jinwoo Shin, and Kang Min Yoo. Peri-LN: Revisiting normalization layer in the transformer architecture. ArXiv preprint, abs/2502.02732, 2025. URL https://arxiv.org/abs/2502.02732

  16. [16]

    Grouped differential attention

    Junghwan Lim, Sungmin Lee, Dongseok Kim, Wai Ting Cheung, Beomgyu Kim, Taehwan Kim, Haesol Lee, Junhyeok Lee, Dongpin Oh, and Eunhwan Park. Grouped differential attention. ArXiv preprint, abs/2510.06949, 2025. URL https://arxiv.org/abs/2510.06949

  17. [17]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7

  18. [18]

    nGPT: Normalized transformer with representation learning on the hypersphere

    Ilya Loshchilov, Cheng-Ping Hsieh, Simeng Sun, and Boris Ginsburg. nGPT: Normalized transformer with representation learning on the hypersphere. ArXiv preprint, abs/2410.01131, 2024. URL https://arxiv.org/abs/2410.01131

  19. [19]

    Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data

    Charles H Martin, Tongsu Serena Peng, and Michael W Mahoney. Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data. Nature Communications, 12:4122, 2021. doi:10.1038/s41467-021-24025-8

  20. [20]

    Do transformer modifications transfer across implementations and applications?

    Sharan Narang, Hyung Won Chung, Yi Tay, William Fedus, Thibault Fevry, Michael Matena, Karishma Malkan, Noah Fiedel, Noam Shazeer, Zhenzhong Lan, Yanqi Zhou, Wei Li, Nan Ding, Jake Marcus, Adam Roberts, and Colin Raffel. Do transformer modifications transfer across implementations and applications? ArXiv preprint, abs/2102.11972, 2021. URL https://arxiv.org/abs/2102.11972

  21. [21]

    The LAMBADA dataset: Word prediction requiring a broad discourse context

    Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pa...

  22. [22]

    Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

    Zihan Qiu et al. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. ArXiv preprint, abs/2505.06708, 2025. URL https://arxiv.org/abs/2505.06708

  23. [23]

    GLU Variants Improve Transformer

    Noam Shazeer. GLU variants improve transformer. ArXiv preprint, abs/2002.05202, 2020. URL https://arxiv.org/abs/2002.05202

  24. [24]

    Impact of layer norm on memorization and generalization in transformers

    Rishi Singhal and Jung-Eun Kim. Impact of layer norm on memorization and generalization in transformers. ArXiv preprint, abs/2511.10566, 2025. URL https://arxiv.org/abs/2511.10566

  25. [25]

    The implicit bias of gradient descent on separable data

    Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. Journal of Machine Learning Research, 19(70):1–57, 2018. URL https://jmlr.org/papers/v19/18-188.html

  26. [26]

    RoFormer: Enhanced Transformer with Rotary Position Embedding

    Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. ArXiv preprint, abs/2104.09864, 2021. URL https://arxiv.org/abs/2104.09864

  27. [27]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models. ArXiv preprint, abs/2302.13971, 2023. URL https://arxiv.org/abs/2302.13971

  28. [28]

    BLiMP: The benchmark of linguistic minimal pairs for English

    Alex Warstadt, Alicia Parrish, Haokun Liu, Anhad Mohananey, Wei Peng, Sheng-Fu Wang, and Samuel R. Bowman. BLiMP: The benchmark of linguistic minimal pairs for English. Transactions of the Association for Computational Linguistics, 8:377–392, 2020. doi:10.1162/tacl_a_00321. URL https://aclanthology.org/2020.tacl-1.25

  29. [29]

    PyHessian: Neural networks through the lens of the Hessian

    Zhewei Yao, Amir Gholami, Kurt Keutzer, and Michael W. Mahoney. PyHessian: Neural networks through the lens of the Hessian. ArXiv preprint, abs/1912.07145, 2019. URL https://arxiv.org/abs/1912.07145

  30. [30]

    Differential transformer

    Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, and Furu Wei. Differential transformer. ArXiv preprint, abs/2410.05258, 2024. URL https://arxiv.org/abs/2410.05258

  31. [31]

    Differential transformer v2

    Tianzhu Ye, Li Dong, Yutao Sun, and Furu Wei. Differential transformer v2. Microsoft Research Hugging Face blog, 2026. URL https://huggingface.co/blog/microsoft/diff-attn-v2

  32. [32]

    Root mean square layer normalization

    Biao Zhang and Rico Sennrich. Root mean square layer normalization. In Advances in Neural Information Processing Systems 32, pages 12360–12371, 2019. URL https://proceedings.neurips.cc/paper/2019/hash/1e8a19426224ca89e83cef47f1e7f53b-Abstract.html

  33. [33]

    Transformers without normalization

    Jiachen Zhu, Xinlei Chen, Kaiming He, Yann LeCun, and Zhuang Liu. Transformers without normalization. ArXiv preprint, abs/2503.10622, 2025. URL https://arxiv.org/abs/2503.10622

  34. [34]

    Just one layer norm guarantees stable extrapolation

    Juliusz Ziomek, George Whittle, and Michael A. Osborne. Just one layer norm guarantees stable extrapolation. ArXiv preprint, abs/2505.14512, 2025. URL https://arxiv.org/abs/2505.14512