FlashNorm: Fast Normalization for Transformers

Andrew Wasielewski; Filip Makraduli; Matthew Clapp; Nils Graef

arxiv: 2407.09577 · v5 · submitted 2024-07-12 · 💻 cs.LG

Nils Graef, Filip Makraduli, Andrew Wasielewski, Matthew Clapp

Pith reviewed 2026-05-23 23:15 UTC · model grok-4.3

keywords RMSNorm · normalization · transformer · inference optimization · GPU latency · linear layer · scale invariance · LLM

The pith

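FlashNorm rewrites RMSNorm followed by a linear layer: it folds the normalization weights into the linear weights and defers the RMS division until after the matrix multiplication, so the two operations can run in parallel.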

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the sequential bottleneck between RMSNorm and the following linear layer can be removed by an exact algebraic rewrite. The normalization weights are absorbed into the linear weights, and the division by the scalar RMS is applied to the output of the matrix multiplication instead of to its input. The identity is mathematically exact, so the RMS calculation on the vector unit and the matrix multiply can overlap on hardware with separate execution units. Scale invariance of RMS further allows the first of two RMSNorms that bracket a linear layer to be dropped entirely. The paper reports unchanged model outputs and measured latency reductions on an NVIDIA T4 GPU.

Core claim

FlashNorm is an exact reformulation of RMSNorm followed by a linear layer that (i) eliminates the normalization weights by folding them into the subsequent linear layer, and (ii) defers the scalar RMS normalization to the output of the matrix multiplication, enabling the two operations to execute in parallel. By the scale invariance of RMS, an RMSNorm followed by a linear layer followed by another RMSNorm allows the first RMSNorm to be eliminated entirely.
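The two parts of the claim can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: the function names, the row-vector convention (`x @ W`), and the random test data are assumptions made for illustration, and the sizes n = 512, m = 128 are borrowed from the Figure 10 caption. It verifies only that the rewritten form gives the same result; it does not demonstrate any parallel execution.

```python
import numpy as np

def rmsnorm_linear(x, g, W):
    """Reference path: RMSNorm with weights g, then the linear layer W."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True))
    return ((x / rms) * g) @ W

def fold_weights(g, W):
    """Fold the normalization weights into the linear layer: W* = diag(g) @ W."""
    return g[:, None] * W

def flashnorm_linear(x, W_star):
    """Deferred path: matmul on the raw input, then one scalar division per row."""
    y = x @ W_star                                          # matrix unit
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True))   # vector unit, independent of y
    return y / rms

rng = np.random.default_rng(0)
n, m = 512, 128                      # sizes taken from the Figure 10 caption
x = rng.standard_normal((4, n))      # 4 hypothetical token vectors
g = rng.standard_normal(n)
W = rng.standard_normal((n, m))

ref = rmsnorm_linear(x, g, W)
out = flashnorm_linear(x, fold_weights(g, W))
max_abs_diff = float(np.max(np.abs(ref - out)))
```

In `flashnorm_linear`, the matmul and the RMS both read only `x`, which is the data independence that the parallel-execution argument relies on. Folding happens once, offline, so `g` no longer exists as a separate tensor at inference time.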

What carries the argument

The exact algebraic reformulation that folds the RMSNorm weights into the linear projection and postpones the division by the root mean square until after the matrix multiplication.

Load-bearing premise

Hardware has distinct vector and matrix execution units so the RMS calculation blocks the subsequent matrix multiplication.

What would settle it

A latency measurement on an NVIDIA T4 GPU showing no reduction for the norm-then-project operation when the reformulated weights and deferred RMS are used.

Figures

Figures reproduced from arXiv: 2407.09577 by Andrew Wasielewski, Filip Makraduli, Matthew Clapp, Nils Graef.

**Figure 2.** Elimination of bias vector β⃗: (a) before elimination, with β⃗ between the normalization weights g⃗ and the linear layer; (b) optimized version with new bias term c⃗* = c⃗ + β⃗ W at the output. Excerpt from Section 1.2, "Merging mean centering into a preceding linear layer": Note that LayerNorm consists of mean centering followed by RMSNorm. If the mean centering is preceded by a linear layer with weight matrix V, then we can eliminate th…

**Figure 3.** Elimination of mean centering: (a) Original weight matrix …

**Figure 4.** FFN with ReLU and preceding flash normalization: (a) unoptimized version; (b) optimized …

**Figure 5.** FFN with GLU variant and preceding flash normalization: (a) unoptimized version; (b) …

**Figure 6.** FFN with ReGLU (or bilinear GLU) and preceding flash normalization: (a) unoptimized …

**Figure 7.** Flash normalization for scaled dot-product attention with RoPE: (a) unoptimized version; …

**Figure 8.** Linear layer with flash normalization followed by a second normalization: (a) unoptimized …

**Figure 9.** QK-normalization with RoPE: (a) unoptimized version; (b) optimized version.

**Figure 10.** Timing diagrams for n = 512, m = 128: (a) without deferred normalization; (b) with interleaved scaling and vector-matrix multiplication; (c) with deferred normalization. Excerpt from the surrounding text: … on an M1 MacBook Air. This throughput increases to only 225 tokens per second when we remove RMSNorm entirely. Therefore, the maximum possible speedup of any RMSNorm optimization is ≤ 10% for this model. For many applications, the main ad…
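The bias elimination described in the Figure 2 caption, c⃗* = c⃗ + β⃗ W, follows from distributing the matrix product over the added bias. The sketch below checks it in NumPy under stated assumptions: the variable names, the shapes, and the random array `u` (standing in for the already-normalized activations) are hypothetical and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 16, 8                          # hypothetical layer sizes
u = rng.standard_normal((5, n))       # stand-in for normalized activations
g = rng.standard_normal(n)            # normalization weights
beta = rng.standard_normal(n)         # normalization bias (before the linear layer)
W = rng.standard_normal((n, m))       # linear-layer weights
c = rng.standard_normal(m)            # linear-layer bias

# (a) Before elimination: the bias beta sits between g and the linear layer.
before = (u * g + beta) @ W + c

# (b) After elimination: beta is absorbed into a single output bias c*.
c_star = c + beta @ W
after = (u * g) @ W + c_star
```

Because `c_star` depends only on parameters, it can be precomputed once, leaving a single bias vector at the output.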

Original abstract

Normalization layers are ubiquitous in large language models (LLMs) yet represent a compute bottleneck: on hardware with distinct vector and matrix execution units, the RMS calculation blocks the subsequent matrix multiplication, preventing parallel execution. We present FlashNorm, an exact reformulation of RMSNorm followed by a linear layer that (i) eliminates the normalization weights by folding them into the subsequent linear layer, and (ii) defers the scalar RMS normalization to the output of the matrix multiplication, enabling the two operations to execute in parallel. Additionally, by the scale invariance of RMS, an RMSNorm followed by a linear layer followed by another RMSNorm allows the first RMSNorm to be eliminated entirely, a mathematically identical simplification that removes the pre-attention RMSNorm in models using QKV-normalization (e.g., Gemma 4) and in MLA models with latent normalization (e.g., DeepSeek-V2, Mistral Small 4, and OpenMythos). The same techniques extend to LayerNorm, Dynamic Tanh (DyT), feed-forward networks with GLU variants, and RoPE-based attention. On an NVIDIA T4 GPU, FlashNorm achieves 33-35% lower latency on the norm-then-project operation in the compute-bound (prefill) regime at SmolLM2-135M scale, and 12-14% at Llama-7B scale. We verify zero-loss weight folding on three models. Beyond inference speed, FlashNorm simplifies model implementations by reducing parameter tensor count. Watch our explainer video https://youtu.be/GEuJv34_XgU?si and see https://github.com/OpenMachine-ai/transformer-tricks for code.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

FlashNorm is an exact algebraic rewrite that folds the RMSNorm weights into the linear layer and defers the division by the RMS, so the two operations can run in parallel on hardware with separate vector and matrix units. It also uses scale invariance to drop the first of two RMSNorms that bracket a linear layer.

FlashNorm rewrites RMSNorm followed by a linear layer to fold the scale weights into the linear weights and push the RMS division after the matrix multiply. This lets the two run in parallel on hardware with separate vector and matrix units. They also show that scale invariance lets you remove the first RMSNorm when you have RMSNorm-linear-RMSNorm, which applies to QKV-normalized models and some MLA ones like DeepSeek-V2 and Mistral Small 4.

The algebra is straightforward and exact, with no approximations. They confirm identical outputs after folding on three models. The latency numbers come from an NVIDIA T4: 33-35% lower latency at SmolLM2-135M scale and 12-14% at Llama-7B scale, for the norm-then-project operation in prefill. It also reduces the number of parameter tensors. This is a practical optimization that does what it claims without changing model behavior. The GitHub code and video make it easy to check.

The main limitation is the hardware assumption. If the RMS calculation on your GPU doesn't block the matmul, the speedup won't appear. The T4 is an older card, so gains on current hardware need separate measurement. The extensions to LayerNorm and GLU are mentioned but are not the focus of the experiments.

Engineers working on fast transformer inference will find this directly useful. It is worth sending to peer review because the math is clean and the results are measurable.

Referee Report

0 major / 2 minor

Summary. The manuscript presents FlashNorm, an exact algebraic reformulation of RMSNorm followed by a linear layer. It folds the normalization weights (gamma) into the subsequent linear weights and defers the scalar RMS division until after the matrix multiplication, enabling parallel execution on hardware with separate vector and matrix units. By RMS scale invariance, an RMSNorm-linear-RMSNorm sequence allows elimination of the first RMSNorm entirely. The same techniques are stated to extend to LayerNorm, DyT, GLU FFNs, and RoPE attention. Reported results include 33-35% latency reduction on T4 GPU for SmolLM2-135M and 12-14% for Llama-7B in the prefill regime, plus zero-loss verification of the folding on three models.

Significance. If the algebraic identities hold, the contribution is a parameter-free, exact simplification that reduces inference latency and parameter count in transformers without altering outputs. The direct use of standard RMS and linear-layer properties, combined with explicit zero-loss checks, provides a reproducible optimization applicable to multiple model families.

minor comments (2)

The zero-loss verification is asserted for three models, but the specific models, layers tested, and numerical tolerance used are not detailed in the abstract or visible excerpts; adding a short table or paragraph with these would improve reproducibility.
The latency measurements are reported for two model scales on a single GPU (T4); a brief note on whether the relative gains hold under different batch sizes or on other hardware with vector/matrix separation would clarify the scope.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment and recommendation to accept the manuscript. Their summary accurately reflects the core contributions of FlashNorm as an exact algebraic reformulation enabling parallel execution and parameter reduction without altering model outputs.

Circularity Check

0 steps flagged

No significant circularity identified

The central claim is an exact algebraic reformulation of RMSNorm followed by a linear layer (folding the normalization weights gamma into W, and deferring the scalar RMS division until after the matmul), plus removal of the first RMSNorm in RMSNorm-linear-RMSNorm chains by scale invariance. These steps follow directly from the definitions of RMSNorm and matrix multiplication, with no fitted parameters, no predictions that reduce to inputs by construction, and no load-bearing self-citations or imported uniqueness theorems. The paper reports empirical verification of zero loss, but the derivation is self-contained and does not rely on those measurements.
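The folding and deferral steps can be written out in one line. In this sketch the input row vector \vec{a} of length n is an assumed symbol; \vec{g} denotes the normalization weights (called gamma in the text, g⃗ in the Figure 2 caption) and W the linear-layer weights, following the row-vector convention of the caption's β⃗ W.

```latex
\begin{aligned}
\mathrm{RMS}(\vec{a}) &= \sqrt{\frac{1}{n}\sum_{i=1}^{n} a_i^{2}} \\[4pt]
\vec{z} &= \left(\frac{\vec{a}}{\mathrm{RMS}(\vec{a})} \odot \vec{g}\right) W
         = \frac{1}{\mathrm{RMS}(\vec{a})}\,\vec{a}\,\mathrm{diag}(\vec{g})\,W
         = \frac{\vec{a}\,W^{*}}{\mathrm{RMS}(\vec{a})},
\qquad W^{*} = \mathrm{diag}(\vec{g})\,W
\end{aligned}
```

The second equality uses only that 1/RMS(\vec{a}) is a scalar and therefore commutes with the matrix product; W* is computed once from the trained parameters.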

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The technique rests on standard algebraic identities of RMSNorm and linear layers with no new free parameters or invented entities.

axioms (1)

standard math · RMS is scale invariant
Invoked to justify complete removal of the first RMSNorm when two RMSNorms bracket a linear layer.
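A small numerical check of how this axiom is used, as a sketch under stated assumptions: the helper names, shapes, and weights `g1`, `g2` are hypothetical, and no epsilon term is included in the RMS (with an epsilon inside the square root, the invariance would hold only approximately). The first RMSNorm's weights are still folded into the linear layer; only its division by the RMS disappears.

```python
import numpy as np

def rms(v):
    """Root mean square over the last axis (no epsilon term in this sketch)."""
    return np.sqrt(np.mean(v * v, axis=-1, keepdims=True))

def rmsnorm(v, g):
    """RMSNorm with elementwise weights g."""
    return v / rms(v) * g

rng = np.random.default_rng(1)
x = rng.standard_normal((3, 64))
g1 = rng.standard_normal(64)          # weights of the first RMSNorm
g2 = rng.standard_normal(32)          # weights of the second RMSNorm
W = rng.standard_normal((64, 32))

# Original chain: RMSNorm -> linear -> RMSNorm.
ref = rmsnorm(rmsnorm(x, g1) @ W, g2)

# Simplified chain: fold g1 into W and drop the first normalization.
W_star = g1[:, None] * W
out = rmsnorm(x @ W_star, g2)

# The property being relied on: rescaling the input by a positive
# scalar leaves the RMSNorm output unchanged.
scale_invariant = bool(np.allclose(rmsnorm(3.5 * x, g1), rmsnorm(x, g1)))
```

The first RMSNorm only rescales each row by the positive scalar 1/RMS(x), and the second RMSNorm cancels any such per-row rescaling of its input, which is why `ref` and `out` agree.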

pith-pipeline@v0.9.0 · 5843 in / 1231 out tokens · 23825 ms · 2026-05-23T23:15:44.594560+00:00 · methodology

