FlashNorm: Fast Normalization for Transformers
Pith reviewed 2026-05-23 23:15 UTC · model grok-4.3
The pith
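FlashNorm rewrites RMSNorm followed by a linear layer: it folds the normalization weights into the linear layer and defers the RMS scaling until after the matrix multiplication, so the two operations can run in parallel.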
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FlashNorm is an exact reformulation of RMSNorm followed by a linear layer that (i) eliminates the normalization weights by folding them into the subsequent linear layer, and (ii) defers the scalar RMS normalization to the output of the matrix multiplication, enabling the two operations to execute in parallel. By the scale invariance of RMS, an RMSNorm followed by a linear layer followed by another RMSNorm allows the first RMSNorm to be eliminated entirely.
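The identity behind this claim can be written out in a few lines. The notation below is an assumption made for illustration, not taken from the paper: $x \in \mathbb{R}^{1 \times n}$ is a row-vector activation, $\gamma \in \mathbb{R}^{n}$ are the RMSNorm weights, $W \in \mathbb{R}^{n \times m}$ is a bias-free linear layer, and the usual small $\epsilon$ inside the square root is omitted.

```latex
\begin{align*}
\operatorname{RMS}(x) &= \sqrt{\tfrac{1}{n}\textstyle\sum_{i=1}^{n} x_i^2},
\qquad
\operatorname{RMSNorm}(x) = \frac{x \odot \gamma}{\operatorname{RMS}(x)} \\[4pt]
y &= \operatorname{RMSNorm}(x)\, W
   = \frac{(x \odot \gamma)\, W}{\operatorname{RMS}(x)}
   = \frac{x \,\operatorname{diag}(\gamma)\, W}{\operatorname{RMS}(x)}
   = \frac{x\, W^{*}}{\operatorname{RMS}(x)},
\qquad
W^{*} = \operatorname{diag}(\gamma)\, W .
\end{align*}
```

Because $1/\operatorname{RMS}(x)$ is a scalar, it commutes with the matrix product, so $x W^{*}$ and $\operatorname{RMS}(x)$ no longer depend on each other, and $W^{*}$ (with entries $W^{*}_{ij} = \gamma_i W_{ij}$) can be precomputed offline.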
What carries the argument
The exact algebraic reformulation that folds RMSNorm weights into the linear projection and postpones the root-mean-square computation until after the matrix multiplication.
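A minimal numerical sketch of this reformulation, in plain Python so it runs without extra dependencies. The function names (`rmsnorm_then_linear`, `fold_weights`, `flashnorm_style`) are hypothetical, epsilon is omitted, and the linear layer is assumed to have no bias; this is not the authors' implementation.

```python
import math
import random

def rms(x):
    # Root mean square of a vector (epsilon omitted in this sketch).
    return math.sqrt(sum(v * v for v in x) / len(x))

def matvec(x, W):
    # Row vector x (length n) times matrix W (n x m) -> vector of length m.
    return [sum(x[i] * W[i][j] for i in range(len(W))) for j in range(len(W[0]))]

def rmsnorm_then_linear(x, gamma, W):
    # Reference path: normalize, scale by gamma, then project.
    r = rms(x)
    normed = [xi * gi / r for xi, gi in zip(x, gamma)]
    return matvec(normed, W)

def fold_weights(gamma, W):
    # Offline step: W_folded[i][j] = gamma[i] * W[i][j].
    return [[gi * w for w in row] for gi, row in zip(gamma, W)]

def flashnorm_style(x, W_folded):
    # The projection and the RMS both read only x, so neither waits on the other.
    z = matvec(x, W_folded)
    r = rms(x)
    # Deferred scalar normalization applied to the matmul output.
    return [zj / r for zj in z]

random.seed(0)
n, m = 8, 4
x = [random.gauss(0, 1) for _ in range(n)]
gamma = [random.gauss(1, 0.1) for _ in range(n)]
W = [[random.gauss(0, 1) for _ in range(m)] for _ in range(n)]

reference = rmsnorm_then_linear(x, gamma, W)
deferred = flashnorm_style(x, fold_weights(gamma, W))
max_abs_diff = max(abs(a - b) for a, b in zip(reference, deferred))
print(max_abs_diff)
```

The two paths agree up to floating-point rounding. On real hardware the benefit would come from scheduling the `matvec` and `rms` steps on different execution units, which this sequential sketch does not model.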
Load-bearing premise
Hardware has distinct vector and matrix execution units so the RMS calculation blocks the subsequent matrix multiplication.
What would settle it
A latency measurement on an NVIDIA T4 GPU showing no reduction for the norm-then-project operation when the reformulated weights and deferred RMS are used.
Original abstract
Normalization layers are ubiquitous in large language models (LLMs) yet represent a compute bottleneck: on hardware with distinct vector and matrix execution units, the RMS calculation blocks the subsequent matrix multiplication, preventing parallel execution. We present FlashNorm, an exact reformulation of RMSNorm followed by a linear layer that (i) eliminates the normalization weights by folding them into the subsequent linear layer, and (ii) defers the scalar RMS normalization to the output of the matrix multiplication, enabling the two operations to execute in parallel. Additionally, by the scale invariance of RMS, an RMSNorm followed by a linear layer followed by another RMSNorm allows the first RMSNorm to be eliminated entirely, a mathematically identical simplification that removes the pre-attention RMSNorm in models using QKV-normalization (e.g., Gemma 4) and in MLA-models with latent normalization (e.g., DeepSeek-V2, Mistral Small 4, and OpenMythos). The same techniques extend to LayerNorm, Dynamic Tanh (DyT), feed-forward networks with GLU variants, and RoPE-based attention. On an NVIDIA T4 GPU, FlashNorm achieves 33-35% lower latency on the norm-then-project operation in the compute-bound (prefill) regime at SmolLM2-135M scale, and 12-14% at Llama-7B scale. We verify zero-loss weight folding on three models. Beyond inference speed, FlashNorm simplifies model implementations by reducing parameter tensor count. Watch our explainer video https://youtu.be/GEuJv34_XgU?si and see https://github.com/OpenMachine-ai/transformer-tricks for code.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents FlashNorm, an exact algebraic reformulation of RMSNorm followed by a linear layer. It folds the normalization weights (gamma) into the subsequent linear weights and defers the scalar RMS division until after the matrix multiplication, enabling parallel execution on hardware with separate vector and matrix units. By RMS scale invariance, an RMSNorm-linear-RMSNorm sequence allows the first RMSNorm to be eliminated entirely. The same techniques are stated to extend to LayerNorm, DyT, GLU FFNs, and RoPE attention. Reported results include a 33-35% latency reduction on a T4 GPU at SmolLM2-135M scale and 12-14% at Llama-7B scale in the prefill regime, plus zero-loss verification of the folding on three models.
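The RMSNorm-linear-RMSNorm elimination mentioned in the summary can be checked numerically. The following is a small sketch under stated assumptions: plain Python lists stand in for tensors, the variable names are hypothetical, the linear layer has no bias, and epsilon is omitted (with a nonzero epsilon in the second norm the match would only be approximate).

```python
import math
import random

def rmsnorm(x, gamma):
    # Standard RMSNorm without epsilon: gamma * x / RMS(x).
    r = math.sqrt(sum(v * v for v in x) / len(x))
    return [v * g / r for v, g in zip(x, gamma)]

def linear(x, W):
    # Bias-free projection: row vector x (n) times W (n x m).
    return [sum(x[i] * W[i][j] for i in range(len(W))) for j in range(len(W[0]))]

random.seed(1)
n, m = 6, 5
x = [random.gauss(0, 3) for _ in range(n)]
gamma1 = [random.gauss(1, 0.2) for _ in range(n)]
gamma2 = [random.gauss(1, 0.2) for _ in range(m)]
W = [[random.gauss(0, 1) for _ in range(m)] for _ in range(n)]

# Original chain: RMSNorm -> linear -> RMSNorm.
original = rmsnorm(linear(rmsnorm(x, gamma1), W), gamma2)

# Simplified chain: fold gamma1 into W and skip the first RMS entirely.
# The dropped factor 1/RMS(x) is a positive scalar, which the second
# RMSNorm cancels by scale invariance.
W_folded = [[g * w for w in row] for g, row in zip(gamma1, W)]
simplified = rmsnorm(linear(x, W_folded), gamma2)

max_abs_diff_chain = max(abs(a - b) for a, b in zip(original, simplified))
print(max_abs_diff_chain)
```

The difference stays at the level of floating-point rounding, consistent with the claim that the first normalization is redundant in this pattern.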
Significance. If the algebraic identities hold, the contribution is a parameter-free, exact simplification that reduces inference latency and parameter count in transformers without altering outputs. The direct use of standard RMS and linear-layer properties, combined with explicit zero-loss checks, provides a reproducible optimization applicable to multiple model families.
minor comments (2)
- The zero-loss verification is asserted for three models, but the specific models, layers tested, and numerical tolerance used are not detailed in the abstract or visible excerpts; adding a short table or paragraph with these would improve reproducibility.
- The latency measurements are reported for two model scales on a single GPU (T4); a brief note on whether the relative gains hold under different batch sizes or on other hardware with vector/matrix separation would clarify the scope.
Simulated Author's Rebuttal
We thank the referee for their positive assessment and recommendation to accept the manuscript. Their summary accurately reflects the core contributions of FlashNorm as an exact algebraic reformulation enabling parallel execution and parameter reduction without altering model outputs.
Circularity Check
No significant circularity identified
The central claim is an exact algebraic reformulation of RMSNorm followed by a linear layer (folding the weights gamma into W and deferring the scalar RMS division until after the matmul), plus removal of the first RMSNorm in RMSNorm-linear-RMSNorm chains by scale invariance. These steps follow directly from the definitions of RMSNorm and matrix multiplication, with no fitted parameters, no predictions that reduce to inputs by construction, and no load-bearing self-citations or imported uniqueness theorems. The paper reports empirical verification of zero loss, but the derivation itself is self-contained and does not depend on external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: RMS is scale invariant
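Stated precisely, the axiom is that RMS is homogeneous of degree one, which makes RMSNorm invariant to positive rescaling of its input. The short derivation below uses assumed notation (a scalar $\alpha > 0$, weights $\gamma$) and omits epsilon.

```latex
\begin{align*}
\operatorname{RMS}(\alpha x)
  &= \sqrt{\tfrac{1}{n}\textstyle\sum_{i=1}^{n} (\alpha x_i)^2}
   = |\alpha| \,\operatorname{RMS}(x), \\[4pt]
\operatorname{RMSNorm}(\alpha x)
  &= \frac{\alpha\, x \odot \gamma}{|\alpha| \,\operatorname{RMS}(x)}
   = \operatorname{RMSNorm}(x)
   \qquad \text{for } \alpha > 0 .
\end{align*}
```

With the $\epsilon$ term used in practical implementations, $\sqrt{\alpha^2 \cdot \tfrac{1}{n}\sum_i x_i^2 + \epsilon}$ is not exactly $|\alpha|$ times the unscaled value, so the invariance is exact only when $\epsilon = 0$ and otherwise holds approximately when $\epsilon$ is small relative to the mean square of the input.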