
arxiv: 2602.10408 · v2 · pith:ALKCTMJHnew · submitted 2026-02-11 · 💻 cs.LG · cs.CL

Gated Normalization Removal and Scale Anchoring in Pre-Norm Transformers

Pith reviewed 2026-05-21 14:14 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords normalization removal · pre-norm transformers · TaperNorm · scale anchoring · gated mechanisms · inference efficiency · transformer optimization · norm-free models

The pith

Pre-norm transformers can have internal normalization layers tapered to fixed linear maps with only small validation-loss increases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops TaperNorm, a gated approach that starts with standard RMSNorm or LayerNorm and gradually replaces them with learned sample-independent linear or affine maps. Once the gate reaches zero, per-token statistics are no longer computed and the maps fold into adjacent projections. Experiments in pre-training and fine-tuning show this tapering works with small validation-loss increases. The work reveals that the final normalization anchors the scale of the pre-logit representation so that radial changes in the last hidden state do not directly reduce cross-entropy loss. A fixed-target scale loss serves as an explicit alternative anchor that enables fully norm-free models, and the approach improves throughput up to 1.18 times in KV-cached autoregressive decoding after folding.

Core claim

Normalization layers are not required throughout pre-norm transformers; a gated tapering process can transition them to fixed sample-independent maps during training, producing only small validation-loss increases in the tested regimes while allowing the resulting maps to be folded into linear layers for inference. The final normalization layer has a distinct anchoring role that prevents loss reduction through simple scaling of logits, and an explicit scale-target loss can substitute for it to support norm-free ablations.
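The anchoring part of this claim can be checked numerically. The sketch below is illustrative and is not the paper's code: it assumes a plain RMSNorm as the final normalization, a random unembedding matrix `W_out`, and targets chosen as the argmax of the logits so that every prediction is already correct. The names `loss_with_norm` and `loss_without_norm` are hypothetical.

```python
import numpy as np

def rms_norm(h, gamma, eps=1e-6):
    """Standard RMSNorm over the last axis."""
    r = np.sqrt(np.mean(h * h, axis=-1, keepdims=True) + eps)
    return gamma * h / r

def cross_entropy(logits, target):
    """Mean cross-entropy via a numerically stable log-softmax."""
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(target)), target].mean()

rng = np.random.default_rng(0)
d, vocab, n = 16, 10, 32
h = rng.normal(size=(n, d))            # stand-in for the last hidden state
gamma = np.ones(d)
W_out = rng.normal(size=(d, vocab))    # assumed unembedding matrix
# Assumption: targets equal the argmax, so every token is already "correct".
target = (h @ W_out).argmax(axis=-1)

def loss_with_norm(alpha):
    """Loss when the hidden state is rescaled by alpha before the final norm."""
    return cross_entropy(rms_norm(alpha * h, gamma) @ W_out, target)

def loss_without_norm(alpha):
    """Loss when the rescaled hidden state feeds the unembedding directly."""
    return cross_entropy((alpha * h) @ W_out, target)
```

Under these assumptions, `loss_with_norm(1.0)` and `loss_with_norm(4.0)` agree up to the effect of `eps`, while `loss_without_norm(4.0)` is lower than `loss_without_norm(1.0)`. Without the anchor, growing the hidden state alone reduces cross-entropy.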

What carries the argument

TaperNorm, a gated normalization-removal mechanism that begins with RMSNorm or LayerNorm and gradually tapers to learned sample-independent linear or affine maps until the gate reaches zero.
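A minimal sketch of such a gate follows, for the RMSNorm case. The paper's exact parameterization is not reproduced on this page, so the convex blend below is an assumption, as are the function name `taper_norm` and the choice of the alignment scalar `c` as the inverse of the average RMS.

```python
import numpy as np

def rms_norm(h, gamma, eps=1e-6):
    """Standard RMSNorm over the last axis (per-token statistics)."""
    r = np.sqrt(np.mean(h * h, axis=-1, keepdims=True) + eps)
    return gamma * h / r

def taper_norm(h, gamma, gamma_tilde, c, g, eps=1e-6):
    """Hypothetical gated blend: g=1 gives RMSNorm, g=0 gives a fixed linear map."""
    fixed = c * gamma_tilde * h        # sample-independent diagonal map
    if g == 0.0:
        return fixed                   # no per-token statistics are computed
    return g * rms_norm(h, gamma, eps) + (1.0 - g) * fixed

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))
gamma = np.ones(8)
gamma_tilde = gamma.copy()                  # fixed-map gain, initialized from gamma
c = 1.0 / np.sqrt(np.mean(h * h) + 1e-6)    # assumed: inverse of the average RMS
```

At `g = 1.0` the output equals `rms_norm(h, gamma)`. At `g = 0.0` it is linear in `h`, which is the property that makes folding into a neighboring projection possible.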

If this is right

  • Once the gate reaches zero the learned maps can be folded into adjacent linear projections, removing per-token normalization computation at inference time.
  • A fixed-target scale loss provides an explicit alternative to the final normalization anchor and supports fully norm-free transformer variants.
  • Tapering internal norms yields up to 1.14 times higher throughput with explicit scaling and up to 1.18 times after folding in KV-cached autoregressive decoding.
  • With the final normalization anchor present, changes to the magnitude of the last hidden state do not directly improve cross-entropy loss.
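The folding step in the first bullet is ordinary linear algebra: a fixed per-channel scale (and optional shift) in front of a linear layer can be absorbed into that layer's weights and bias. The sketch below assumes a row-vector convention (`h @ W`) and uses hypothetical helper names; it is not taken from the paper.

```python
import numpy as np

def fold_linear(scale, W):
    """Return W_folded such that (h * scale) @ W == h @ W_folded."""
    return scale[:, None] * W

def fold_affine(scale, shift, W, b):
    """Return (W_folded, b_folded) such that
    (h * scale + shift) @ W + b == h @ W_folded + b_folded."""
    return scale[:, None] * W, shift @ W + b

rng = np.random.default_rng(0)
d_in, d_out = 8, 5
h = rng.normal(size=(3, d_in))
scale = rng.normal(size=d_in)       # stands in for the learned fixed gain
shift = rng.normal(size=d_in)       # offset for the affine case
W = rng.normal(size=(d_in, d_out))
b = rng.normal(size=d_out)

W_f, b_f = fold_affine(scale, shift, W, b)
```

After folding, the explicit scaling operation disappears from the graph, which is consistent with the folded variant being the faster of the two reported throughput figures.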

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The scale-anchoring observation could guide the design of new output heads or loss terms in other sequence models that currently rely on final normalization.
  • Folding the tapered maps suggests a general route to simplify deployed transformer graphs by absorbing normalization constants into weights.
  • The gradual-taper training schedule might be adapted to other removable components such as residual connections or attention biases.

Load-bearing premise

The behavior observed under the gradual tapering schedule, in the specific pre-training and fine-tuning regimes tested, will still hold once the gate reaches zero in larger production-scale models and on other tasks.

What would settle it

Train a large model to full gate-zero removal and measure whether validation loss stays within a small margin of the baseline across multiple tasks and scales.

Figures

Figures reproduced from arXiv: 2602.10408 by Andrei Kanavalau, Carmen Amo Alonso, Sanjay Lall.

Figure 1.
Figure 2. Training loss vs. step for the Baseline and Internal-Taper (+aux). training details between same-size models are identical
Figure 4. Average gradient norms across all Transformer blocks by weight type. Without explicit scale anchoring, gradients cluster primarily by presence vs. absence of the final normalization. With the fixed-target scale loss enabled, the gradient-magnitude gap between Internal-Taper and All-Taper largely disappears
Original abstract

Normalization layers are standard in transformers, but it is not clear whether their sample-dependent computations are necessary throughout both training and inference. This work develops a gated normalization-removal approach for pre-norm transformers. The approach is implemented using TaperNorm, which starts from standard RMSNorm/LayerNorm and gradually tapers to learned sample-independent linear or affine maps. Once the gate reaches zero, per-token statistics are no longer computed in the tapered layers and the resulting maps can be folded into adjacent linear projections. The results indicate that internal normalization can be tapered in the tested pre-training and fine-tuning settings with small validation-loss increases. Our approach helps reveal a distinct role for final normalization, namely that it anchors the scale of the pre-logit representation. With this anchor present, radial changes in the last hidden state do not directly reduce the loss; when it is removed, reducing cross-entropy can be achieved by increasing logit magnitudes. A fixed-target scale loss provides an explicit alternative anchor and enables fully norm-free ablations in the tested regimes. Finally, in a KV-cached autoregressive decoding benchmark, tapering internal norms gives up to $1.14\times$ higher throughput with explicit scaling operations and up to $1.18\times$ after folding.
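The abstract does not give the form of the fixed-target scale loss. One plausible reading, shown here purely as an assumption, penalizes the squared gap between each token's RMS in the last hidden state and a fixed target. The function name `scale_loss` and the parameter `target_rms` are hypothetical.

```python
import numpy as np

def scale_loss(h_last, target_rms=1.0, eps=1e-6):
    """Hypothetical anchor: mean squared gap between per-token RMS and a fixed target."""
    r = np.sqrt(np.mean(h_last * h_last, axis=-1) + eps)
    return np.mean((r - target_rms) ** 2)

h_unit = np.ones((4, 8))      # every token has RMS close to 1
h_big = 3.0 * h_unit          # same direction, three times the radius

loss_unit = scale_loss(h_unit)
loss_big = scale_loss(h_big)
```

In this form the penalty is close to zero at the target radius and grows as the hidden state is scaled away from it. Added to the cross-entropy with some weight, it would discourage lowering the loss through magnitude growth alone once the final normalization is removed.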

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes TaperNorm, a gated normalization-removal method for pre-norm transformers. Starting from standard RMSNorm or LayerNorm, it applies a gradual tapering schedule to learned sample-independent linear or affine maps. Once the gate reaches zero, per-token statistics are eliminated and the maps can be folded into adjacent linear projections. The work reports that this yields small validation-loss increases in the tested pre-training and fine-tuning regimes, identifies a distinct scale-anchoring role for the final normalization layer, introduces a fixed-target scale loss that enables fully norm-free ablations, and measures up to 1.18× throughput gains in a KV-cached autoregressive decoding benchmark.

Significance. If the empirical outcomes hold under more rigorous controls, the contribution would be significant for simplifying transformer inference by removing internal normalization overhead and for clarifying the functional role of the final normalization as a scale anchor rather than a per-sample statistic. The explicit scale-loss alternative and the folding-based speedups could inform practical architecture choices in production models.

major comments (2)
  1. The central empirical claim—that internal normalization can be tapered to zero with only small validation-loss increases—rests on results whose robustness cannot be assessed. The abstract and results description provide no error bars, exact dataset sizes, hyperparameter schedules, or ablation controls that would allow evaluation of whether the observed loss deltas are statistically meaningful or sensitive to the particular pre-training and fine-tuning regimes tested.
  2. The generalization argument for production-scale models is load-bearing for the practical impact claim but is not supported by the current experiments. Nothing in the construction bounds the loss penalty or gradient dynamics once the gate reaches zero at larger model scales or different tasks; the tested regimes may not be representative, as noted in the stress-test concern.
minor comments (2)
  1. The description of the tapering schedule and gate parameterization would benefit from an explicit equation or pseudocode block early in the methods section to make the transition from standard normalization to the sample-independent map fully reproducible.
  2. Figure captions and table headers should explicitly state whether the reported throughput numbers include or exclude the cost of the explicit scaling operations versus the folded version.
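As an illustration of what the first minor comment asks for, a gate schedule could be sketched as follows. Only two facts come from the extracted appendix text: the gate is held at g = 1 during learning-rate warmup, and warmup takes 5% of total steps. The linear decay after warmup, the `taper_frac` parameter, and the function name `gate_value` are assumptions.

```python
def gate_value(step, total_steps, warmup_frac=0.05, taper_frac=0.25):
    """Gate g in [0, 1]: held at 1 during warmup, then an assumed linear decay to 0."""
    warmup_steps = int(warmup_frac * total_steps)
    taper_steps = max(1, int(taper_frac * total_steps))  # assumed taper length
    if step < warmup_steps:
        return 1.0
    progress = (step - warmup_steps) / taper_steps
    return max(0.0, 1.0 - progress)
```

With `total_steps=1000` and these defaults, the gate stays at 1.0 for the first 50 steps, passes 0.5 at step 175, and is 0.0 from step 300 onward, after which the layer no longer needs per-token statistics.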

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the careful review and constructive feedback. We address each major comment below, clarifying our experimental details and acknowledging the limits of our current scale of evaluation. Revisions have been made to improve transparency and to explicitly discuss generalization constraints.

Point-by-point responses
  1. Referee: The central empirical claim—that internal normalization can be tapered to zero with only small validation-loss increases—rests on results whose robustness cannot be assessed. The abstract and results description provide no error bars, exact dataset sizes, hyperparameter schedules, or ablation controls that would allow evaluation of whether the observed loss deltas are statistically meaningful or sensitive to the particular pre-training and fine-tuning regimes tested.

    Authors: We agree that the original presentation lacked sufficient detail for assessing statistical robustness. In the revised manuscript we now report error bars over three independent random seeds for all main pre-training and fine-tuning curves, state the precise dataset sizes and token counts used (C4 for pre-training, GLUE/SuperGLUE subsets for fine-tuning), and move the full hyperparameter schedules and tapering schedules into a new appendix section. Additional ablation tables controlling for gate initialization variance and schedule aggressiveness have also been added to demonstrate that the reported loss deltas remain small and consistent under these variations. revision: yes

  2. Referee: The generalization argument for production-scale models is load-bearing for the practical impact claim but is not supported by the current experiments. Nothing in the construction bounds the loss penalty or gradient dynamics once the gate reaches zero at larger model scales or different tasks; the tested regimes may not be representative, as noted in the stress-test concern.

    Authors: We acknowledge that our empirical results are confined to the model sizes and tasks described in the paper and that we provide neither theoretical bounds nor empirical data at substantially larger scales. The TaperNorm construction itself is scale-agnostic—the per-layer gates are learned independently of hidden dimension—but this does not guarantee identical loss behavior at production scales. In the revision we have expanded the Limitations and Future Work sections to state these constraints explicitly and to recommend stress-testing on larger models as necessary follow-up work. revision: partial

standing simulated objections not resolved
  • We cannot supply empirical results or strict theoretical bounds on loss penalty or gradient dynamics for model scales or task distributions beyond those evaluated in the current experiments.

Circularity Check

0 steps flagged

No circularity: empirical method validated by direct experiments

full rationale

The paper introduces TaperNorm as a practical gated tapering schedule for removing internal RMSNorm/LayerNorm layers in pre-norm transformers, then reports measured validation-loss deltas from pre-training and fine-tuning runs. No derivation chain exists that reduces a claimed prediction or first-principles result back to a fitted parameter, self-citation, or ansatz by construction; the central observations are straightforward empirical outcomes of the implemented schedule. The text contains no load-bearing self-citations, uniqueness theorems, or renamings of known results. The work is therefore self-contained as an experimental demonstration whose validity rests on the reported benchmarks rather than any internal logical loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claims rest on empirical observations in unspecified pre-training and fine-tuning regimes; no explicit free parameters, axioms, or invented entities are introduced beyond the TaperNorm gating mechanism itself.




Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. When Does Removing LayerNorm Help? Activation Bounding as a Regime-Dependent Implicit Regularizer

    cs.LG 2026-04 unverdicted novelty 5.0

    DyT improves validation loss 27% at 64M params/1M tokens but worsens it 19% at 118M tokens, with saturation levels predicting the sign of the effect.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · cited by 1 Pith paper · 5 internal anchors

  1. https://arxiv.org/abs/2003.04887 (authors and title missing from the extraction)
  2. Baroni, L., Khara, G., Schaeffer, J., Subkhankulov, M., and Heimersheim, S. Transformers don't need LayerNorm at inference time: Scaling LayerNorm removal to GPT-2 XL and the implications for mechanistic interpretability. https://arxiv.org/abs/2507.02559
  3. Brock, A., De, S., Smith, S. L., and Simonyan, K. High-performance large-scale image recognition without normalization. https://arxiv.org/abs/2102.06171
  4. Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., and Leahy, C. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027. doi: 10.48550/arXiv.2101.00027
  5. Gokaslan, A., Cohen, V., Pavlick, E., and Tellex, S. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus
  6. Deep residual learning for image recognition. https://arxiv.org/abs/1512.03385 (authors missing from the extraction)
  7. Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. https://arxiv.org/abs/1502.03167
  8. Martens, J., Ballard, A., Desjardins, G., Swirszcz, G., Dalibard, V., Sohl-Dickstein, J., and Schoenholz, S. S. Rapid training of deep neural networks without skip connections or normalization layers using deep kernel shaping. https://arxiv.org/abs/2110.01765
  9. Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., and Jégou, H. Going deeper with image transformers. https://arxiv.org/abs/2103.17239
  10. Wang, H., Ma, S., Dong, L., Huang, S., Zhang, D., and Wei, F. DeepNet: Scaling transformers to 1,000 layers. https://arxiv.org/abs/2203.00555
  11. Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., and Liu, T.-Y. On layer normalization in the transformer architecture. https://arxiv.org/abs/2002.04745
  12. Zhang, B. and Sennrich, R. Root mean square layer normalization. https://arxiv.org/abs/1910.07467
  13. Zhang, H., Dauphin, Y. N., and Ma, T. Fixup initialization: Residual learning without normalization. https://arxiv.org/abs/1901.09321
  14. Zhu, J., Chen, X., He, K., LeCun, Y., and Liu, Z. Transformers without normalization. https://arxiv.org/abs/2503.10622

Appendix excerpts extracted with the references:

    A. Proofs for Section 4. Preliminaries and assumptions. We recall $r(h) := \sqrt{\|h\|_2^2/d + \varepsilon}$ and $D_\gamma := \mathrm{diag}(\gamma)$, $D_{\tilde\gamma} := \mathrm{diag}(\tilde\gamma)$. Expectations $\mathbb{E}[\cdot]$ are over mini-batch elements and sequence positions unless stated. A.1. Proof of Proposition 4.1. Proof. A map Norm is 0-homogeneous if Norm(αh) = N...

    Warmup uses 5% of total steps for every run. TaperNorm, EMA rates, and scale loss. For any layer that is tapered, we keep the gate at $g = 1$ during learning-rate warmup. At the warmup boundary, each tapered layer computes its alignment scalar $c$ using $\gamma$-weighted EMA estimates of the quantities in Section 3.4, copies $\gamma \to \tilde\gamma$, and then freezes $c$. After warmup, we...