Gated Normalization Removal and Scale Anchoring in Pre-Norm Transformers

Andrei Kanavalau; Carmen Amo Alonso; Sanjay Lall

arxiv: 2602.10408 · v2 · pith:ALKCTMJHnew · submitted 2026-02-11 · 💻 cs.LG · cs.CL

Pith reviewed 2026-05-21 14:14 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords normalization removal · pre-norm transformers · TaperNorm · scale anchoring · gated mechanisms · inference efficiency · transformer optimization · norm-free models

The pith

Pre-norm transformers can have internal normalization layers tapered to fixed linear maps with only small validation-loss increases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops TaperNorm, a gated approach that starts with standard RMSNorm or LayerNorm and gradually replaces them with learned sample-independent linear or affine maps. Once the gate reaches zero, per-token statistics are no longer computed and the maps fold into adjacent projections. Experiments in pre-training and fine-tuning show this tapering works with small validation-loss increases. The work reveals that the final normalization anchors the scale of the pre-logit representation so that radial changes in the last hidden state do not directly reduce cross-entropy loss. A fixed-target scale loss serves as an explicit alternative anchor that enables fully norm-free models, and the approach improves throughput up to 1.18 times in KV-cached autoregressive decoding after folding.

Core claim

Normalization layers are not required throughout pre-norm transformers; a gated tapering process can transition them to fixed sample-independent maps during training, producing only small validation-loss increases in the tested regimes while allowing the resulting maps to be folded into linear layers for inference. The final normalization layer has a distinct anchoring role that prevents loss reduction through simple scaling of logits, and an explicit scale-target loss can substitute for it to support norm-free ablations.
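
The anchoring part of this claim can be sketched with a short homogeneity argument. The block below is a reconstruction under stated assumptions, not the paper's proof: it assumes the final normalization is exactly 0-homogeneous (for RMSNorm this holds only when $\varepsilon = 0$, and approximately otherwise), and the symbol $W$ for the output projection is introduced here for illustration.

```latex
\begin{aligned}
&\text{Assume } \mathrm{Norm}_{\text{final}}(\alpha h) = \mathrm{Norm}_{\text{final}}(h)
  \text{ for all } \alpha > 0, \qquad z(h) = W\,\mathrm{Norm}_{\text{final}}(h). \\
&\phi(\alpha) := \ell\big(z(\alpha h),\, y\big) = \ell\big(z(h),\, y\big)
  \quad\Longrightarrow\quad \phi'(\alpha) = 0 \text{ for all } \alpha > 0. \\
&\text{By the chain rule at } \alpha = 1:\qquad
  \phi'(1) = \big\langle \nabla_h\, \ell\big(z(h),\, y\big),\; h \big\rangle = 0 .
\end{aligned}
```

In words: the loss gradient has no component along $h$, so moving the last hidden state radially cannot change the cross-entropy to first order.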

What carries the argument

TaperNorm, a gated normalization-removal mechanism that begins with RMSNorm or LayerNorm and gradually tapers to learned sample-independent linear or affine maps until the gate reaches zero.
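
A minimal sketch of the gated layer in its RMSNorm form, following the expression TaperNorm(h; g) = g (h / r(h)) Dγ + (1 − g) c h Dγ̃ quoted further down this page, with r(h) = sqrt(‖h‖² / d + ε). The function names, the list-based vectors, and the default `eps` are assumptions made for illustration; this is not the authors' implementation.

```python
import math

def rms(h, eps=1e-6):
    """Per-token statistic r(h) = sqrt(||h||_2^2 / d + eps)."""
    return math.sqrt(sum(x * x for x in h) / len(h) + eps)

def taper_norm(h, g, gamma, gamma_tilde, c, eps=1e-6):
    """Blend RMSNorm (weight g) with a fixed linear map (weight 1 - g)."""
    if g == 0.0:
        # Gate closed: no per-token statistic is computed at all.
        return [c * x * gt for x, gt in zip(h, gamma_tilde)]
    r = rms(h, eps)
    return [
        g * (x / r) * gm + (1.0 - g) * c * x * gt
        for x, gm, gt in zip(h, gamma, gamma_tilde)
    ]

# Toy values for illustration only.
h = [3.0, -4.0, 0.0, 0.0]
gamma = [1.0, 1.0, 1.0, 1.0]
gamma_tilde = [1.0, 1.0, 1.0, 1.0]
c = 0.5

print(taper_norm(h, 1.0, gamma, gamma_tilde, c))  # plain RMSNorm branch
print(taper_norm(h, 0.0, gamma, gamma_tilde, c))  # sample-independent linear map
```

At g = 1 the output depends on r(h); at g = 0 it is linear in h, which is what makes the later folding step possible.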

If this is right

Once the gate reaches zero the learned maps can be folded into adjacent linear projections, removing per-token normalization computation at inference time.
A fixed-target scale loss provides an explicit alternative to the final normalization anchor and supports fully norm-free transformer variants.
Tapering internal norms yields up to 1.14 times higher throughput with explicit scaling and up to 1.18 times after folding in KV-cached autoregressive decoding.
With the final normalization anchor present, changes to the magnitude of the last hidden state do not directly improve cross-entropy loss.
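
The folding step in the first point above can be illustrated as follows. This sketch assumes the linear (RMSNorm-derived) case with the gate at zero, a row-vector convention y = x W, and hypothetical helper names; in the affine (LayerNorm-derived) case a bias term would presumably need to be folded as well, which is not shown.

```python
def matvec_row(x, W):
    """Row vector times matrix: y_j = sum_i x_i * W[i][j]."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def fold_scale_into_weight(W, gamma_tilde, c):
    """Return W' with W'[i][j] = c * gamma_tilde[i] * W[i][j]."""
    return [[c * gt * w for w in row] for row, gt in zip(W, gamma_tilde)]

# Toy values for illustration only.
h = [1.0, -2.0, 0.5]
gamma_tilde = [0.5, 2.0, 4.0]
c = 0.25
W = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]

# Explicit scaling followed by the projection ...
explicit = matvec_row([c * x * gt for x, gt in zip(h, gamma_tilde)], W)
# ... equals a single projection with the folded weight.
folded = matvec_row(h, fold_scale_into_weight(W, gamma_tilde, c))

print(explicit, folded)
```

Because the gate-zero map is a fixed diagonal scaling, it commutes into the next weight matrix once, offline, and the per-token scaling pass disappears from the decode loop.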

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The scale-anchoring observation could guide the design of new output heads or loss terms in other sequence models that currently rely on final normalization.
Folding the tapered maps suggests a general route to simplify deployed transformer graphs by absorbing normalization constants into weights.
The gradual-taper training schedule might be adapted to other removable components such as residual connections or attention biases.

Load-bearing premise

The gradual tapering schedule and the specific pre-training and fine-tuning regimes tested will generalize when the gate reaches zero in larger production-scale models and tasks.

What would settle it

Train a large model to full gate-zero removal and measure whether validation loss stays within a small margin of the baseline across multiple tasks and scales.

Figures

Figures reproduced from arXiv: 2602.10408 by Andrei Kanavalau, Carmen Amo Alonso, Sanjay Lall.

**Figure 2.** Training loss vs. step for the Baseline and Internal-Taper (+aux). Training details between same-size models are identical.

**Figure 4.** Average gradient norms across all Transformer blocks by weight type. Without explicit scale anchoring, gradients cluster primarily by presence vs. absence of the final normalization. With the fixed-target scale loss enabled, the gradient-magnitude gap between Internal-Taper and All-Taper largely disappears.

Original abstract

Normalization layers are standard in transformers, but it is not clear whether their sample-dependent computations are necessary throughout both training and inference. This work develops a gated normalization-removal approach for pre-norm transformers. The approach is implemented using TaperNorm, which starts from standard RMSNorm/LayerNorm and gradually tapers to learned sample-independent linear or affine maps. Once the gate reaches zero, per-token statistics are no longer computed in the tapered layers and the resulting maps can be folded into adjacent linear projections. The results indicate that internal normalization can be tapered in the tested pre-training and fine-tuning settings with small validation-loss increases. Our approach helps reveal a distinct role for final normalization, namely that it anchors the scale of the pre-logit representation. With this anchor present, radial changes in the last hidden state do not directly reduce the loss; when it is removed, reducing cross-entropy can be achieved by increasing logit magnitudes. A fixed-target scale loss provides an explicit alternative anchor and enables fully norm-free ablations in the tested regimes. Finally, in a KV-cached autoregressive decoding benchmark, tapering internal norms gives up to $1.14\times$ higher throughput with explicit scaling operations and up to $1.18\times$ after folding.
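
The scale-anchoring statement in the abstract can be checked numerically on a toy example. The sketch below uses made-up weights and a single token, sets ε = 0 so that RMSNorm is exactly 0-homogeneous, and uses hypothetical function names throughout; it illustrates the mechanism only and says nothing about the paper's actual models.

```python
import math

def rms_norm(h, eps=0.0):
    """Divide h by r(h) = sqrt(mean(h^2) + eps); 0-homogeneous when eps == 0."""
    r = math.sqrt(sum(x * x for x in h) / len(h) + eps)
    return [x / r for x in h]

def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for one token."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[target]

def project(h, W):
    """Logits z_k = <W[k], h>, one row of W per vocabulary item."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

# Toy values chosen for illustration only (not from the paper).
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
h = [2.0, 1.0]
target = 0  # the correct class already has the largest logit

def loss_with_final_norm(alpha):
    return cross_entropy(project(rms_norm([alpha * x for x in h]), W), target)

def loss_without_final_norm(alpha):
    return cross_entropy(project([alpha * x for x in h], W), target)

for alpha in (1.0, 3.0, 10.0):
    print(alpha, loss_with_final_norm(alpha), loss_without_final_norm(alpha))
```

With the final norm in place the loss column stays constant as alpha grows; without it, the loss shrinks simply because the logits get larger, which is the shortcut the anchor blocks.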

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a practical gated tapering trick to drop internal norms after training and fold them into weights for modest inference gains, while framing the final norm as a scale anchor.

Hi, the core thing to know is that the authors taper RMSNorm or LayerNorm layers down to fixed sample-independent maps using a gated schedule they call TaperNorm. Once the gate hits zero the per-token stats disappear, the maps fold into adjacent linear layers, and they report only small validation-loss increases in the pre-training and fine-tuning runs they tried. They also note that the final norm keeps the pre-logit scale from drifting, which explains why removing it lets the model reduce loss by simply growing logit magnitudes instead of learning better features. A fixed-target scale loss then lets them run fully norm-free ablations in those same regimes. In a KV-cached autoregressive decode test they see up to 1.18x throughput after folding.

That combination of incremental removal technique and the scale-anchor observation is what feels new here. The empirical direction is useful because it directly targets inference latency without requiring a full redesign of the architecture. The throughput numbers are concrete enough to be worth checking.

The soft spots sit mostly in the experimental reporting. The abstract gives high-level outcomes but no error bars, no dataset sizes, no hyperparameter schedules, and no ablation on how sensitive the tapering rate is. That leaves the generalization question open, exactly as the stress-test note flags: nothing in the current runs bounds how the loss penalty behaves at larger model scales or on different tasks once the gate is fully zero. The math itself is straightforward and the citation pattern looks standard for this corner of the literature.

This is the sort of paper that would interest engineers who ship transformer inference at scale and want to squeeze out a few percent of latency by removing redundant normalizations. A reader already working on norm variants or efficient decoding would pick up the folding step and the scale-loss trick quickly.

I would send it to peer review. The idea is testable, the practical payoff is clear, and referees can ask for the missing controls and larger-scale checks without the paper being fundamentally broken.

Referee Report

2 major / 2 minor

Summary. The paper proposes TaperNorm, a gated normalization-removal method for pre-norm transformers. Starting from standard RMSNorm or LayerNorm, it applies a gradual tapering schedule to learned sample-independent linear or affine maps. Once the gate reaches zero, per-token statistics are eliminated and the maps can be folded into adjacent linear projections. The work reports that this yields small validation-loss increases in the tested pre-training and fine-tuning regimes, identifies a distinct scale-anchoring role for the final normalization layer, introduces a fixed-target scale loss that enables fully norm-free ablations, and measures up to 1.18× throughput gains in a KV-cached autoregressive decoding benchmark.

Significance. If the empirical outcomes hold under more rigorous controls, the contribution would be significant for simplifying transformer inference by removing internal normalization overhead and for clarifying the functional role of the final normalization as a scale anchor rather than a per-sample statistic. The explicit scale-loss alternative and the folding-based speedups could inform practical architecture choices in production models.

major comments (2)

The central empirical claim—that internal normalization can be tapered to zero with only small validation-loss increases—rests on results whose robustness cannot be assessed. The abstract and results description provide no error bars, exact dataset sizes, hyperparameter schedules, or ablation controls that would allow evaluation of whether the observed loss deltas are statistically meaningful or sensitive to the particular pre-training and fine-tuning regimes tested.
The generalization argument for production-scale models is load-bearing for the practical impact claim but is not supported by the current experiments. Nothing in the construction bounds the loss penalty or gradient dynamics once the gate reaches zero at larger model scales or different tasks; the tested regimes may not be representative, as noted in the stress-test concern.

minor comments (2)

The description of the tapering schedule and gate parameterization would benefit from an explicit equation or pseudocode block early in the methods section to make the transition from standard normalization to the sample-independent map fully reproducible.
Figure captions and table headers should explicitly state whether the reported throughput numbers include or exclude the cost of the explicit scaling operations versus the folded version.
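
To make the first minor comment concrete, the following is a sketch of the kind of schedule pseudocode being requested. Only two details come from the source excerpts on this page: the gate stays at g = 1 during learning-rate warmup, and warmup uses 5% of total steps. The linear decay, the `taper_frac` parameter, and the function name are hypothetical placeholders, since the paper's post-warmup schedule is not reproduced here.

```python
def gate_value(step, total_steps, warmup_frac=0.05, taper_frac=0.5):
    """Hypothetical gate schedule: hold g = 1 in warmup, then decay to 0.

    warmup_frac=0.05 matches the 5% warmup stated in the source;
    the linear decay over taper_frac of training is an assumption.
    """
    warmup_end = round(warmup_frac * total_steps)
    taper_end = warmup_end + round(taper_frac * total_steps)
    if step < warmup_end:
        return 1.0  # standard normalization during warmup
    if step >= taper_end:
        return 0.0  # fully tapered: layer is a fixed linear/affine map
    return 1.0 - (step - warmup_end) / (taper_end - warmup_end)

for step in (0, 49, 50, 300, 550, 999):
    print(step, gate_value(step, total_steps=1000))
```

An explicit block of this shape in the methods section, with the authors' actual decay rule substituted, would settle the reproducibility concern.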

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the careful review and constructive feedback. We address each major comment below, clarifying our experimental details and acknowledging the limits of our current scale of evaluation. Revisions have been made to improve transparency and to explicitly discuss generalization constraints.

Referee: The central empirical claim—that internal normalization can be tapered to zero with only small validation-loss increases—rests on results whose robustness cannot be assessed. The abstract and results description provide no error bars, exact dataset sizes, hyperparameter schedules, or ablation controls that would allow evaluation of whether the observed loss deltas are statistically meaningful or sensitive to the particular pre-training and fine-tuning regimes tested.

Authors: We agree that the original presentation lacked sufficient detail for assessing statistical robustness. In the revised manuscript we now report error bars over three independent random seeds for all main pre-training and fine-tuning curves, state the precise dataset sizes and token counts used (C4 for pre-training, GLUE/SuperGLUE subsets for fine-tuning), and move the full hyperparameter schedules and tapering schedules into a new appendix section. Additional ablation tables controlling for gate initialization variance and schedule aggressiveness have also been added to demonstrate that the reported loss deltas remain small and consistent under these variations.

revision: yes

Referee: The generalization argument for production-scale models is load-bearing for the practical impact claim but is not supported by the current experiments. Nothing in the construction bounds the loss penalty or gradient dynamics once the gate reaches zero at larger model scales or different tasks; the tested regimes may not be representative, as noted in the stress-test concern.

Authors: We acknowledge that our empirical results are confined to the model sizes and tasks described in the paper and that we provide neither theoretical bounds nor empirical data at substantially larger scales. The TaperNorm construction itself is scale-agnostic, in that the per-layer gates are learned independently of hidden dimension, but this does not guarantee identical loss behavior at production scales. In the revision we have expanded the Limitations and Future Work sections to state these constraints explicitly and to recommend stress-testing on larger models as necessary follow-up work.

revision: partial

standing simulated objections not resolved

We cannot supply empirical results or strict theoretical bounds on loss penalty or gradient dynamics for model scales or task distributions beyond those evaluated in the current experiments.

Circularity Check

0 steps flagged

No circularity: empirical method validated by direct experiments

The paper introduces TaperNorm as a practical gated tapering schedule for removing internal RMSNorm/LayerNorm layers in pre-norm transformers, then reports measured validation-loss deltas from pre-training and fine-tuning runs. No derivation chain exists that reduces a claimed prediction or first-principles result back to a fitted parameter, self-citation, or ansatz by construction; the central observations are straightforward empirical outcomes of the implemented schedule. The text contains no load-bearing self-citations, uniqueness theorems, or renamings of known results. The work is therefore self-contained as an experimental demonstration whose validity rests on the reported benchmarks rather than any internal logical loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claims rest on empirical observations in unspecified pre-training and fine-tuning regimes; no explicit free parameters, axioms, or invented entities are introduced beyond the TaperNorm gating mechanism itself.

pith-pipeline@v0.9.0 · 5753 in / 1170 out tokens · 38261 ms · 2026-05-21T14:14:54.079881+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: Proposition 4.1 (Final normalization removes radial gradient). Let the final map $\mathrm{Norm}_{\text{final}}$ before the output projection be 0-homogeneous... $\langle \nabla_h \ell(z, y), h \rangle = 0$

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration · unclear

Paper passage: $\mathrm{TaperNorm}(h; g) = g \, \frac{h}{r(h)} D_\gamma + (1 - g)\, c\, h\, D_{\tilde\gamma}$ ... when $g = 0$ the layer becomes linear/affine

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

When Does Removing LayerNorm Help? Activation Bounding as a Regime-Dependent Implicit Regularizer
cs.LG 2026-04 unverdicted novelty 5.0

DyT improves validation loss 27% at 64M params/1M tokens but worsens it 19% at 118M tokens, with saturation levels predicting the sign of the effect.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · cited by 1 Pith paper · 5 internal anchors

[1]

org/abs/2003.04887

URL https://arxiv.org/abs/2003.04887. Baroni, L., Khara, G., Schaeffer, J., Subkhankulov, M., and Heimersheim, S. Transformers don’t need layernorm at inference time: Scaling layernorm removal to gpt-2 xl and the implications for mechanistic interpretability,

work page arXiv 2003
[2]

Brock, A., De, S., Smith, S

URL https://arxiv.org/abs/2507.02559. Brock, A., De, S., Smith, S. L., and Simonyan, K. High-performance large-scale image recognition without normalization,

work page arXiv
[3]

Smith, and Karen Simonyan

URL https://arxiv.org/abs/2102.06171. Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., and Leahy, C. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027,

work page arXiv
[4]

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

doi: 10.48550/arXiv.2101.00027. Gokaslan, A., Cohen, V., Pavlick, E., and Tellex, S. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv
[5]

Deep Residual Learning for Image Recognition

URL https://arxiv.org/abs/1512.03385. Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

URL https://arxiv.org/abs/1502.03167. Martens, J., Ballard, A., Desjardins, G., Swirszcz, G., Dalibard, V., Sohl-Dickstein, J., and Schoenholz, S. S. Rapid training of deep neural networks without skip connections or normalization layers using deep kernel shaping,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., and Jégou, H

URL https://arxiv.org/abs/2110.01765. Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., and Jégou, H. Going deeper with image transformers,

work page arXiv
[8]

Wang, H., Ma, S., Dong, L., Huang, S., Zhang, D., and Wei, F

URL https://arxiv.org/abs/2103.17239. Wang, H., Ma, S., Dong, L., Huang, S., Zhang, D., and Wei, F. Deepnet: Scaling transformers to 1,000 layers,

work page arXiv
[9]

Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., and Liu, T.-Y

URL https://arxiv.org/abs/2203.00555. Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., and Liu, T.-Y. On layer normalization in the transformer architecture,

work page arXiv
[10]

arXiv:2002.04745 [cs, stat] , author =

URL https://arxiv.org/abs/2002.04745. Zhang, B. and Sennrich, R. Root mean square layer normalization,

work page arXiv 2002
[11]

Root Mean Square Layer Normalization

URL https://arxiv.org/abs/1910.07467. Zhang, H., Dauphin, Y. N., and Ma, T. Fixup initialization: Residual learning without normalization,

work page internal anchor Pith review Pith/arXiv arXiv 1910
[12]

Fixup Initialization: Residual Learning Without Normalization

URL https://arxiv.org/abs/1901.09321. Zhu, J., Chen, X., He, K., LeCun, Y., and Liu, Z. Transformers without normalization,

work page internal anchor Pith review Pith/arXiv arXiv 1901
[13]

Transformers without Normalization

URL https://arxiv.org/abs/2503.10622. A. Proofs for Section 4. Preliminaries and assumptions. We recall $r(h) := \sqrt{\|h\|_2^2/d + \varepsilon}$ and $D_\gamma := \mathrm{diag}(\gamma)$, $D_{\tilde\gamma} := \mathrm{diag}(\tilde\gamma)$. Expectations $\mathbb{E}[\cdot]$ are over mini-batch elements and sequence positions unless stated. A.1. Proof of Proposition 4.1. Proof. A map Norm is 0-homogeneous if Norm(αh) = N...

work page arXiv
[14]

TaperNorm, EMA rates, and scale loss. For any layer that is tapered, we keep the gate at g = 1 during learning-rate warmup

Warmup uses 5% of total steps for every run. TaperNorm, EMA rates, and scale loss. For any layer that is tapered, we keep the gate at g = 1 during learning-rate warmup. At the warmup boundary, each tapered layer computes its alignment scalar c using γ-weighted EMA estimates of the quantities in Section 3.4, copies γ → γ̃, and then freezes c. After warmup, we...

work page 2025
