
arxiv: 2407.09577 · v5 · submitted 2024-07-12 · 💻 cs.LG

FlashNorm: Fast Normalization for Transformers

Pith reviewed 2026-05-23 23:15 UTC · model grok-4.3

classification 💻 cs.LG
keywords RMSNorm · normalization · transformer · inference optimization · GPU latency · linear layer · scale invariance · LLM

The pith

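FlashNorm rewrites RMSNorm followed by a linear layer: it folds the normalization weights into the linear layer and defers the RMS division until after the matrix multiplication, so the two operations can run in parallel.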

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the sequential bottleneck between RMSNorm and the following linear layer can be removed by an exact algebraic rewrite. Normalization scale factors are absorbed into the linear weights, and the division by the scalar RMS is applied only to the output of the matrix multiplication. This identity is mathematically exact, and it allows the vector RMS work and the matrix multiply to overlap on hardware with separate execution units. Scale invariance of RMS further allows the first of two RMSNorms that bracket a linear layer to be dropped entirely. The changes produce identical model behavior and measured latency reductions on an NVIDIA T4 GPU.

Core claim

FlashNorm is an exact reformulation of RMSNorm followed by a linear layer that (i) eliminates the normalization weights by folding them into the subsequent linear layer, and (ii) defers the scalar RMS normalization to the output of the matrix multiplication, enabling the two operations to execute in parallel. By the scale invariance of RMS, an RMSNorm followed by a linear layer followed by another RMSNorm allows the first RMSNorm to be eliminated entirely.
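A minimal sketch of the two steps in the core claim, in plain Python with no external libraries. The helper names (`rms`, `vecmat`, `fold_weights`, `flashnorm_linear`) and the row-vector convention are illustrative assumptions rather than the authors' code, and the epsilon that practical RMSNorm implementations add under the square root is omitted. The identity is exact in real arithmetic; in floating point the two paths can differ by rounding error.

```python
import math
import random

def rms(x):
    """Root mean square of a vector (epsilon omitted in this sketch)."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def vecmat(x, W):
    """Row vector x (length n) times matrix W (n x m)."""
    return [sum(xi * row[j] for xi, row in zip(x, W)) for j in range(len(W[0]))]

def rmsnorm_then_linear(x, g, W):
    """Baseline: normalize, scale by the weights g, then project."""
    r = rms(x)
    y = [xi / r * gi for xi, gi in zip(x, g)]
    return vecmat(y, W)

def fold_weights(g, W):
    """Offline step: W_star[i][j] = g[i] * W[i][j]."""
    return [[gi * w for w in row] for gi, row in zip(g, W)]

def flashnorm_linear(x, W_star):
    """Project the raw input, then divide the output by the scalar RMS."""
    z = vecmat(x, W_star)  # does not depend on rms(x)
    r = rms(x)             # independent of z, so it could run concurrently
    return [v / r for v in z]

random.seed(0)
n, m = 8, 4
x = [random.uniform(-1, 1) for _ in range(n)]
g = [random.uniform(0.5, 1.5) for _ in range(n)]
W = [[random.uniform(-1, 1) for _ in range(m)] for _ in range(n)]

baseline = rmsnorm_then_linear(x, g, W)
folded = flashnorm_linear(x, fold_weights(g, W))
max_abs_diff = max(abs(a - b) for a, b in zip(baseline, folded))
```

In this sketch `max_abs_diff` should sit at the level of floating-point rounding. The structural point is that `vecmat(x, W_star)` no longer waits on `rms(x)`, and the weight vector `g` disappears as a separate parameter tensor.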

What carries the argument

The exact algebraic reformulation that folds RMSNorm weights into the linear projection and postpones the division by the scalar RMS to the output of the matrix multiplication.
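Written out under the assumption of a row-vector input \vec{x} of length n, normalization weights \vec{g}, and a weight matrix W of shape n × m, the reformulation is a short identity. The notation is chosen here for illustration, and epsilon is omitted:

```latex
\begin{align*}
\mathrm{RMS}(\vec{x}) &= \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2} \\
\vec{z} &= \left(\frac{\vec{x}}{\mathrm{RMS}(\vec{x})} \odot \vec{g}\right) W
         = \frac{1}{\mathrm{RMS}(\vec{x})}\,\vec{x}\,\mathrm{diag}(\vec{g})\,W
         = \frac{\vec{x}\,W^{*}}{\mathrm{RMS}(\vec{x})},
\qquad W^{*} = \mathrm{diag}(\vec{g})\,W
\end{align*}
```

Because 1/RMS(\vec{x}) is a scalar, it commutes with the matrix product, which is what lets the division move to the output while W* is precomputed offline.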

Load-bearing premise

Hardware has distinct vector and matrix execution units so the RMS calculation blocks the subsequent matrix multiplication.

What would settle it

A latency measurement on an NVIDIA T4 GPU showing no reduction for the norm-then-project operation when the reformulated weights and deferred RMS are used.

Figures

Figures reproduced from arXiv: 2407.09577 by Andrew Wasielewski, Filip Makraduli, Matthew Clapp, Nils Graef.

Figure 1: Mathematically identical implementations of RMSNorm followed by a linear layer: (a) …
Figure 2: Elimination of bias vector β⃗: (a) Before elimination with β⃗ between normalization weights g⃗ and linear layer. (b) Optimized version with new bias term c⃗* = c⃗ + β⃗W at the output.
    1.2 Merging mean centering into a preceding linear layer. Note that LayerNorm consists of mean centering followed by RMSNorm. If the mean centering is preceded by a linear layer with weight matrix V, then we can eliminate th…
Figure 3: Elimination of mean centering: (a) Original weight matrix …
Figure 4: FFN with ReLU and preceding flash normalization: (a) unoptimized version; (b) optimized …
Figure 5: FFN with GLU variant and preceding flash normalization: (a) unoptimized version; (b) …
Figure 6: FFN with ReGLU (or bilinear GLU) and preceding flash normalization: (a) unoptimized …
Figure 7: Flash normalization for scaled dot-product attention with RoPE: (a) unoptimized version; …
Figure 8: Linear layer with flash normalization followed by a second normalization: (a) unoptimized …
Figure 9: QK-normalization with RoPE: (a) unoptimized version; (b) optimized version.
Figure 10: Timing diagrams for n = 512, m = 128: (a) without deferred normalization; (b) with interleaved scaling and vector-matrix multiplication; (c) with deferred normalization.
    … on an M1 MacBook Air. This throughput increases to only 225 tokens per second when we remove RMSNorm entirely. Therefore, the maximum possible speedup of any RMSNorm optimization is ≤ 10% for this model. For many applications, the main ad…
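The bias elimination stated in the Figure 2 caption, c⃗* = c⃗ + β⃗W, can be checked numerically. The following is a small sketch in plain Python. The values and the helper `vecmat` are made up for illustration, and `y` stands in for the output of the normalization weights.

```python
def vecmat(x, W):
    """Row vector x (length n) times matrix W (n x m)."""
    return [sum(xi * row[j] for xi, row in zip(x, W)) for j in range(len(W[0]))]

# Illustrative values only; none of these come from the paper.
y = [0.5, -1.0, 2.0]
beta = [0.1, 0.2, -0.3]                      # bias added before the linear layer
W = [[1.0, 2.0], [0.5, -1.0], [-2.0, 0.25]]
c = [0.3, -0.7]                              # original bias of the linear layer

# (a) Before: add beta, then apply the linear layer with bias c.
shifted = [yi + bi for yi, bi in zip(y, beta)]
before = [zj + cj for zj, cj in zip(vecmat(shifted, W), c)]

# (b) After: precompute c_star = c + beta W once and skip the beta addition.
c_star = [cj + bj for cj, bj in zip(c, vecmat(beta, W))]
after = [zj + cj for zj, cj in zip(vecmat(y, W), c_star)]

bias_max_abs_diff = max(abs(a - b) for a, b in zip(before, after))
```

The two outputs agree up to rounding because (y⃗ + β⃗)W + c⃗ = y⃗W + (β⃗W + c⃗), so the extra vector addition is traded for a one-time update of the output bias.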
Original abstract

Normalization layers are ubiquitous in large language models (LLMs) yet represent a compute bottleneck: on hardware with distinct vector and matrix execution units, the RMS calculation blocks the subsequent matrix multiplication, preventing parallel execution. We present FlashNorm, an exact reformulation of RMSNorm followed by a linear layer that (i) eliminates the normalization weights by folding them into the subsequent linear layer, and (ii) defers the scalar RMS normalization to the output of the matrix multiplication, enabling the two operations to execute in parallel. Additionally, by the scale invariance of RMS, an RMSNorm followed by a linear layer followed by another RMSNorm allows the first RMSNorm to be eliminated entirely. This mathematically identical simplification removes the pre-attention RMSNorm in models using QKV-normalization (e.g., Gemma 4) and in MLA-models with latent normalization (e.g., DeepSeek-V2, Mistral Small 4, and OpenMythos). The same techniques extend to LayerNorm, Dynamic Tanh (DyT), feed-forward networks with GLU variants, and RoPE-based attention. On an NVIDIA T4 GPU, FlashNorm achieves 33-35% lower latency on the norm-then-project operation in the compute-bound (prefill) regime at SmolLM2-135M scale, and 12-14% at Llama-7B scale. We verify zero-loss weight folding on three models. Beyond inference speed, FlashNorm simplifies model implementations by reducing parameter tensor count. Watch our explainer video at https://youtu.be/GEuJv34_XgU?si and see https://github.com/OpenMachine-ai/transformer-tricks for code.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript presents FlashNorm, an exact algebraic reformulation of RMSNorm followed by a linear layer. It folds the normalization weights (gamma) into the subsequent linear weights and defers the scalar RMS division until after the matrix multiplication, enabling parallel execution on hardware with separate vector and matrix units. By RMS scale invariance, an RMSNorm-linear-RMSNorm sequence allows elimination of the first RMSNorm entirely. The same techniques are stated to extend to LayerNorm, DyT, GLU FFNs, and RoPE attention. Reported results include 33-35% latency reduction on T4 GPU for SmolLM2-135M and 12-14% for Llama-7B in the prefill regime, plus zero-loss verification of the folding on three models.

Significance. If the algebraic identities hold, the contribution is a parameter-free, exact simplification that reduces inference latency and parameter count in transformers without altering outputs. The direct use of standard RMS and linear-layer properties, combined with explicit zero-loss checks, provides a reproducible optimization applicable to multiple model families.

minor comments (2)
  1. The zero-loss verification is asserted for three models, but the specific models, layers tested, and numerical tolerance used are not detailed in the abstract or visible excerpts; adding a short table or paragraph with these would improve reproducibility.
  2. The latency measurements are reported for two model scales on a single GPU (T4); a brief note on whether the relative gains hold under different batch sizes or on other hardware with vector/matrix separation would clarify the scope.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment and recommendation to accept the manuscript. Their summary accurately reflects the core contributions of FlashNorm as an exact algebraic reformulation enabling parallel execution and parameter reduction without altering model outputs.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The central claim is an exact algebraic reformulation of RMSNorm + linear layer (weight folding of gamma into W, deferral of the scalar RMS division to post-matmul) plus scale-invariance removal of the first RMSNorm in RMS-Lin-RMS chains. These steps follow directly from the definitions of RMSNorm and matrix multiplication with no fitted parameters, no predictions that reduce to inputs by construction, and no load-bearing self-citations or imported uniqueness theorems. The paper reports empirical verification of zero loss, but the derivation itself is self-contained and does not depend on external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The technique rests on standard algebraic identities of RMSNorm and linear layers with no new free parameters or invented entities.

axioms (1)
  • standard math: RMS is scale invariant
    Invoked to justify complete removal of the first RMSNorm when two RMSNorms bracket a linear layer.
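The axiom can be illustrated with a small plain-Python sketch. It assumes a linear layer without bias and sets epsilon to zero; with a nonzero epsilon in the second norm the two paths would agree only approximately, and how the paper treats epsilon is not covered by the text above. All names and values are illustrative.

```python
import math

def rms(x):
    """Root mean square of a vector (epsilon omitted in this sketch)."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def rmsnorm(x, g):
    """RMSNorm with elementwise weights g."""
    r = rms(x)
    return [xi / r * gi for xi, gi in zip(x, g)]

def vecmat(x, W):
    """Row vector x (length n) times matrix W (n x m)."""
    return [sum(xi * row[j] for xi, row in zip(x, W)) for j in range(len(W[0]))]

x = [0.5, -1.0, 2.0, 0.25]
g1 = [1.0, 0.5, 2.0, 1.5]   # weights of the first RMSNorm
g2 = [0.8, 1.2, 1.0]        # weights of the second RMSNorm
W = [[0.2, -0.1, 0.4], [0.3, 0.5, -0.2], [-0.6, 0.1, 0.3], [0.7, -0.4, 0.1]]

# Scale invariance: a positive rescaling of the input leaves RMSNorm unchanged.
scaled = rmsnorm([3.0 * v for v in x], g1)
unscaled = rmsnorm(x, g1)
invariance_diff = max(abs(a - b) for a, b in zip(scaled, unscaled))

# Full chain: RMSNorm -> linear (no bias) -> RMSNorm.
full = rmsnorm(vecmat(rmsnorm(x, g1), W), g2)

# Reduced chain: fold g1 into W and drop the first normalization entirely.
W_star = [[gi * w for w in row] for gi, row in zip(g1, W)]
reduced = rmsnorm(vecmat(x, W_star), g2)
chain_diff = max(abs(a - b) for a, b in zip(full, reduced))
```

The first RMSNorm contributes only the positive scalar 1/RMS(x) once its weights are folded into `W_star`, and the second RMSNorm cancels any positive scalar on its input, so `chain_diff` stays at rounding level.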

pith-pipeline@v0.9.0 · 5843 in / 1231 out tokens · 23825 ms · 2026-05-23T23:15:44.594560+00:00 · methodology


Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 9 internal anchors

  1. [1]

    Root Mean Square Layer Normalization

    Biao Zhang and Rico Sennrich. Root mean square layer normalization. October 2019. arXiv:1910.07467

  2. [2]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models. February 2023. arXiv:2302.13971

  3. [3]

    Mistral 7B

    Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B. October 2023. arXiv:2310.06825

  4. [4]

    OpenELM: An Efficient Language Model Family with Open Training and Inference Framework

    Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin, Chenfan Sun, Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal, and Mohammad Rastegari. OpenELM: An efficient language model family with open-source training and inference framework. April 2024. arXiv:2404.14619

  5. [5]

    Layer Normalization

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer Normalization. July 2016. arXiv:1607.06450

  6. [6]

    Transformers without Normalization

    Jiachen Zhu, Xinlei Chen, Kaiming He, Yann LeCun, and Zhuang Liu. Transformers without Normalization. 2025. arXiv:2503.10622

  7. [7]

    Slim attention: cut your context memory in half without loss of accuracy – K-cache is all you need for MHA

    Nils Graef and Andrew Wasielewski. Slim attention: cut your context memory in half without loss of accuracy – K-cache is all you need for MHA. 2025. arXiv:2503.05840

  8. [8]

    Transformer tricks

    OpenMachine. Transformer tricks. 2024. URL https://github.com/OpenMachine-ai/transformer-tricks

  9. [9]

    Transformer tricks: Removing weights for skipless transformers

    Nils Graef. Transformer tricks: Removing weights for skipless transformers. April 2024. arXiv:2404.12362

  10. [10]

    Transformer tricks: Precomputing the first layer

    Nils Graef. Transformer tricks: Precomputing the first layer. February 2024. arXiv:2402.13388

  11. [11]

    Attention Is All You Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. June 2017. arXiv:1706.03762

  12. [12]

    FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

    Tri Dao, Daniel Y Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. May 2022. arXiv:2205.14135

  13. [13]

    Rectifier (neural networks), 2024

    Wikipedia. Rectifier (neural networks), 2024. Accessed June-2024

  14. [14]

    GLU Variants Improve Transformer

    Noam Shazeer. GLU Variants Improve Transformer. February 2020. arXiv:2002.05202

  15. [15]

    RoFormer: Enhanced Transformer with Rotary Position Embedding

    Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced transformer with Rotary Position Embedding. April 2021. arXiv:2104.09864

  16. [16]

    Query-key normalization for transformers

    Alex Henry, Prudhvi Raj Dachapally, Shubham Pawar, and Yuxuan Chen. Query-key normalization for transformers. October 2020. arXiv:2010.04245

  17. [17]

    FlashNorm

    OpenMachine. FlashNorm. 2024. URL https://huggingface.co/open-machine/FlashNorm

  18. [18]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, et al. PaLM: Scaling language modeling with Pathways. April 2022. arXiv:2204.02311

  19. [19]

    Transformers

    HuggingFace. Transformers. URL https://huggingface.co/docs/transformers

  20. [20]

    whisper.cpp

    Georgi Gerganov. whisper.cpp. URL https://github.com/ggml-org/whisper.cpp

  21. [21]

    llama.cpp

    Georgi Gerganov. llama.cpp. URL https://github.com/ggml-org/llama.cpp

  22. [22]

    vLLM Project. vLLM. URL https://github.com/vllm-project/vllm

  23. [23]

    llamafile

    Mozilla. llamafile. URL https://github.com/Mozilla-Ocho/llamafile

  24. [24]

    LM Studio

    LM Studio. LM Studio. URL https://lmstudio.ai

  25. [25]

    Ollama. Ollama. URL https://github.com/ollama/ollama

  26. [26]

    SGLang. SGLang. URL https://github.com/sgl-project/sglang