LayerNorm Induces Recency Bias in Transformer Decoders
Pith reviewed 2026-05-18 14:07 UTC · model grok-4.3
The pith
LayerNorm combined with stacked causal self-attention produces recency bias in Transformer decoders.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that stacked causal self-attention layers combined with LayerNorm induce recency bias, meaning attention scores become higher for later tokens in the sequence. This differs from the earlier-token bias seen when causal self-attention operates without LayerNorm. The work further examines how residual connections modulate the bias and how the statistical distribution of input token embeddings influences its strength.
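A toy calculation can make the earlier-token baseline concrete. The sketch below assumes uniform causal attention weights (each query averages equally over itself and all earlier positions), which is a simplification chosen here for illustration and not the paper's setting. It tracks how much each input token contributes to the last position after two stacked layers, rather than the attention scores themselves. The function names are hypothetical.

```python
def uniform_causal_attention(n):
    """Row i attends equally to positions 0..i; each row sums to 1."""
    return [[1.0 / (i + 1) if j <= i else 0.0 for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain matrix product for small lists of lists."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)] for row in a]


def two_layer_contribution(n):
    """Effective mixing weights after two uniform causal attention layers (A @ A)."""
    a = uniform_causal_attention(n)
    return matmul(a, a)


# Contribution of each input position to the last position after two layers.
last_row = two_layer_contribution(6)[-1]
for position, weight in enumerate(last_row):
    print(position, round(weight, 4))
```

In this toy case the entry for position j is (1/n) times the sum of 1/(k+1) for k from j to n-1, so the weights shrink as j grows: early tokens are counted in more of the intermediate averages. No LayerNorm or residual path is modelled here.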
What carries the argument
The interaction of stacked causal self-attention with LayerNorm, which reverses the positional preference from early tokens to recent ones.
If this is right
- Residual connections alter the magnitude of the induced recency bias.
- Input embedding statistics determine how strongly the recency effect appears.
- The findings supply theoretical reasons for observed positional patterns in decoder-only models.
- The results point toward targeted adjustments in positional encoding methods.
Where Pith is reading between the lines
- The same LayerNorm interaction could be tested in encoder-only models to see if recency bias emerges there as well.
- Controlling embedding variance at initialization might offer a practical way to tune the strength of the bias without changing LayerNorm.
- Tasks that require balanced attention across long sequences may benefit from normalization variants that weaken this recency induction.
Load-bearing premise
The analysis assumes residual connections are present and that input token embeddings have a statistical distribution that interacts with LayerNorm as described.
What would settle it
Measuring attention score distributions in an otherwise identical decoder stack with LayerNorm removed and checking whether the preference for later tokens disappears.
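The comparison described above can be sketched as a small harness. Everything in it is an assumption made for illustration: a single head, identity query/key/value projections, LayerNorm without learned gain or bias applied after the residual addition, and random Gaussian embeddings. It does not reproduce the paper's setup or results; it only shows the shape of the measurement, with `use_layer_norm` and `use_residual` as hypothetical switches.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def causal_attention_weights(xs):
    """Softmax of scaled dot products; query i only sees keys 0..i."""
    d = len(xs[0])
    weights = []
    for i, q in enumerate(xs):
        scores = [sum(a * b for a, b in zip(q, xs[j])) / math.sqrt(d) for j in range(i + 1)]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def attention_layer(xs, use_layer_norm, use_residual):
    """One causal attention layer; returns (outputs, attention weights)."""
    weights = causal_attention_weights(xs)
    d = len(xs[0])
    outputs = []
    for i, row in enumerate(weights):
        mixed = [sum(row[j] * xs[j][c] for j in range(i + 1)) for c in range(d)]
        if use_residual:
            mixed = [m + x for m, x in zip(mixed, xs[i])]
        outputs.append(layer_norm(mixed) if use_layer_norm else mixed)
    return outputs, weights


def second_layer_attention(xs, use_layer_norm, use_residual=True):
    """Attention distribution of the last query at the second layer."""
    hidden, _ = attention_layer(xs, use_layer_norm, use_residual)
    _, weights = attention_layer(hidden, use_layer_norm, use_residual)
    return weights[-1]


def random_embeddings(n, d, seed=0):
    """Synthetic, position-independent Gaussian embeddings (an assumption)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


xs = random_embeddings(n=8, d=16)
with_ln = second_layer_attention(xs, use_layer_norm=True)
without_ln = second_layer_attention(xs, use_layer_norm=False)
print("with LayerNorm:   ", [round(w, 3) for w in with_ln])
print("without LayerNorm:", [round(w, 3) for w in without_ln])
```

A single random draw says little on its own. A meaningful check would average the two distributions over many seeds and sequence lengths, and would use the embedding distribution and LayerNorm placement the paper actually specifies.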
Original abstract
Causal self-attention provides positional information to Transformer decoders. Prior work has shown that stacks of causal self-attention layers alone induce a positional bias in attention scores toward earlier tokens. However, this differs from the bias toward later tokens typically observed in Transformer decoders, known as recency bias. We address this discrepancy by analyzing the interaction between causal self-attention and other architectural components. We show that stacked causal self-attention layers combined with LayerNorm induce recency bias. Furthermore, we examine the effects of residual connections and the distribution of input token embeddings on this bias. Our results provide new theoretical insights into how positional information interacts with architectural components and suggest directions for improving positional encoding strategies.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that causal self-attention alone produces earlier-token bias, but when combined with LayerNorm in stacked decoder layers it induces recency bias (preference for later tokens). It further analyzes how residual connections and the per-position statistics of input embeddings modulate or enable this effect, offering a theoretical account of the interaction.
Significance. If the derivations are correct, the work supplies a concrete mechanistic explanation for the recency bias routinely observed in trained Transformer decoders. Explicitly characterizing the necessary conditions on residuals and embedding moments is a strength; it turns an empirical observation into a falsifiable architectural claim and suggests targeted interventions for positional encoding.
major comments (1)
- [§3] §3 (main derivation): the recency-bias result is shown only under the assumption that residual branches preserve position-dependent first and second moments of the embeddings through the LayerNorm; the paper should state this premise as a theorem hypothesis and supply a short counter-derivation or numerical check when the residual is removed or when embeddings are drawn from a position-independent distribution.
minor comments (2)
- [Abstract] Abstract: the sentence 'we show that stacked causal self-attention layers combined with LayerNorm induce recency bias' would be clearer if it briefly indicated the role of the residual path.
- [§2] Notation: define the exact placement of LayerNorm relative to the residual addition (pre-norm vs. post-norm) before the first equation that uses it.
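For readers unfamiliar with the distinction raised in the last comment, the two placements commonly called post-norm and pre-norm can be sketched as follows. This is a minimal illustration on a single token vector, assuming LayerNorm without learned gain or bias and an arbitrary `sublayer` callable (hypothetical, standing in for attention or a feed-forward layer). It does not say which placement the paper uses.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_norm_block(x, sublayer):
    """Post-norm: LayerNorm is applied after the residual addition."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_norm_block(x, sublayer):
    """Pre-norm: LayerNorm is applied to the sublayer input; the residual path skips it."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def halve(v):
    """Stand-in sublayer used only for this example."""
    return [0.5 * t for t in v]


x = [1.0, 2.0, 4.0]
post = post_norm_block(x, halve)
pre = pre_norm_block(x, halve)
print("post-norm:", [round(v, 4) for v in post])
print("pre-norm: ", [round(v, 4) for v in pre])
```

The practical difference is that a post-norm output is always renormalized, whereas a pre-norm output keeps the unnormalized residual `x`, so its scale can grow across layers. Which of these applies determines what statistics the next attention layer sees.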
Simulated Author's Rebuttal
We thank the referee for the positive assessment and the helpful suggestion for improving the rigor of our theoretical presentation. We address the comment below.
- Referee: [§3] §3 (main derivation): the recency-bias result is shown only under the assumption that residual branches preserve position-dependent first and second moments of the embeddings through the LayerNorm; the paper should state this premise as a theorem hypothesis and supply a short counter-derivation or numerical check when the residual is removed or when embeddings are drawn from a position-independent distribution.
  Authors: We agree that the main derivation relies on this premise and that stating it explicitly as a theorem hypothesis will improve clarity. In the revised manuscript we will rephrase the statement of the central result in §3 to list the preservation of position-dependent moments by the residual branches as a formal hypothesis. We will also add a concise counter-analysis: when residuals are removed the position-dependent first and second moments are not propagated across layers, and the recency bias disappears (reverting to the earlier-token bias of attention alone); a short derivation of this case will be included. For position-independent embeddings we will supply a brief numerical verification on synthetic inputs confirming that the recency effect vanishes. These additions will be kept short and placed directly in §3.
  Revision: yes
Circularity Check
No circularity: recency bias derived from component interaction under stated assumptions
The paper frames its central result as a theoretical analysis showing that stacked causal self-attention combined with LayerNorm produces recency bias, while separately examining the modulating roles of residual connections and the statistical distribution of input embeddings. No load-bearing step reduces by construction to a self-definition, a fitted parameter re-labeled as a prediction, or a self-citation chain; the derivation is presented as following from the architectural premises and distributional assumptions without re-expressing those premises as the output. The result is explicitly conditional on residuals being present and embeddings having position-dependent moments that survive LayerNorm, but this conditionality is stated up front rather than smuggled in via redefinition. The analysis therefore remains self-contained against external benchmarks of the claimed interaction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Causal self-attention and LayerNorm are composed in the standard residual-block manner used in Transformer decoders.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We show that stacked causal self-attention layers combined with LayerNorm induce recency bias... the attention score at the second layer increases strictly with j, as long as j < i."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.