
arxiv: 2509.21042 · v4 · submitted 2025-09-25 · 💻 cs.CL · cs.LG

LayerNorm Induces Recency Bias in Transformer Decoders

Pith reviewed 2026-05-18 14:07 UTC · model grok-4.3

keywords: LayerNorm · recency bias · causal self-attention · Transformer decoder · positional bias · residual connections · attention scores

The pith

LayerNorm combined with stacked causal self-attention produces recency bias in Transformer decoders.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Prior work found that stacked causal self-attention on its own favors earlier tokens, yet real Transformer decoders usually show the opposite pattern and favor later tokens. The paper attributes this reversal to the interaction of stacked causal attention layers with LayerNorm. The mechanism matters because it explains an observed positional preference that affects how models handle sequence order. The analysis also examines how residual connections and the distribution of input token embeddings shape the bias. Together, these results link specific architectural components to the direction and strength of positional bias in attention.

Core claim

The authors establish that stacked causal self-attention layers combined with LayerNorm induce recency bias, meaning attention scores become higher for later tokens in the sequence. This differs from the earlier-token bias seen when causal self-attention operates without LayerNorm. The work further examines how residual connections modulate the bias and how the statistical distribution of input token embeddings influences its strength.
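For reference, the two components named in the claim can be written in their standard textbook form. The notation below (queries $q_i$, keys $k_j$, head dimension $d$, gain $\gamma$, shift $\beta$) is an assumption made here for illustration and may differ from the paper's own symbols.

```latex
\[
A_{ij} \;=\; \frac{\exp\!\left(q_i^{\top} k_j / \sqrt{d}\right)\,\mathbb{1}[\,j \le i\,]}
                  {\sum_{l \le i} \exp\!\left(q_i^{\top} k_l / \sqrt{d}\right)},
\qquad
\mathrm{LN}(x) \;=\; \gamma \odot \frac{x - \mu(x)}{\sqrt{\sigma^{2}(x) + \epsilon}} + \beta .
\]
```

In this notation, recency bias means that for a fixed query position $i$ the weight $A_{ij}$ tends to be larger for keys $j$ closer to $i$, while an earlier-token bias means it tends to be larger for small $j$.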

What carries the argument

The interaction of stacked causal self-attention with LayerNorm, which reverses the positional preference from early tokens to recent ones.
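The interaction can be probed with a small parameter-free simulation. The sketch below is not the authors' code: it assumes identity query/key/value projections, a pre-norm placement of LayerNorm without gain or shift, and Gaussian inputs, and the names `run_stack`, `use_layer_norm`, and `use_residual` are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance (no gain/shift).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def causal_attention_weights(x):
    # Scaled dot-product attention with identity projections and a causal mask.
    t, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    future = np.triu(np.ones((t, t), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def run_stack(x, num_layers=4, use_layer_norm=True, use_residual=True):
    # Assumed pre-norm layout: h <- h + Attn(LN(h)). Returns per-layer attention maps.
    weights_per_layer = []
    h = x
    for _ in range(num_layers):
        z = layer_norm(h) if use_layer_norm else h
        a = causal_attention_weights(z)
        weights_per_layer.append(a)
        out = a @ z
        h = h + out if use_residual else out
    return weights_per_layer

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 32))  # 16 tokens, 32 dimensions, position-independent inputs
with_ln = run_stack(x, use_layer_norm=True)
without_ln = run_stack(x, use_layer_norm=False)
```

Comparing the rows of `with_ln[-1]` and `without_ln[-1]` is one way to look at how the positional preference differs between the two settings. The sketch only sets up the comparison and does not by itself reproduce the paper's figures.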

If this is right

  • Residual connections alter the magnitude of the induced recency bias.
  • Input embedding statistics determine how strongly the recency effect appears.
  • The findings supply theoretical reasons for observed positional patterns in decoder-only models.
  • The results point toward targeted adjustments in positional encoding methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same LayerNorm interaction could be tested in encoder-only models to see if recency bias emerges there as well.
  • Controlling embedding variance at initialization might offer a practical way to tune the strength of the bias without changing LayerNorm.
  • Tasks that require balanced attention across long sequences may benefit from normalization variants that weaken this recency induction.

Load-bearing premise

The analysis assumes residual connections are present and that input token embeddings have a statistical distribution that interacts with LayerNorm as described.

What would settle it

Measuring attention score distributions in an otherwise identical decoder stack with LayerNorm removed and checking whether the preference for later tokens disappears.
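One way to carry out such a measurement is to average attention weights by query-key distance. The following is a minimal sketch, assuming a single causal attention map `attn` of shape `(T, T)` with `attn[i, j]` the weight of query `i` on key `j`; the helper names are hypothetical and not taken from the paper.

```python
import numpy as np

def mean_attention_by_offset(attn):
    # Entry d is the mean of attn[i, i - d] over all queries i >= d,
    # i.e. the average weight placed on the key d positions back.
    t = attn.shape[0]
    return np.array([np.diagonal(attn, offset=-d).mean() for d in range(t)])

def uniform_causal_attention(t):
    # Baseline with no content or positional preference:
    # query i spreads weight 1 / (i + 1) evenly over keys 0..i.
    lower = np.tril(np.ones((t, t)))
    return lower / lower.sum(axis=-1, keepdims=True)

profile = mean_attention_by_offset(uniform_causal_attention(4))
print(profile)
```

Even the uniform baseline yields a profile that decays with distance, because short rows concentrate their weight on few keys. A measured profile should therefore be compared against this baseline, or normalized in some way, before it is read as evidence of recency bias.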

Figures

Figures reproduced from arXiv: 2509.21042 by Edward Choi, Junu Kim, Lei Ji, Xiao Liu, Yeyun Gong, Zhenghao Lin.

Figure 1: Causal mask induces positional information even in the absence of causal input dependen…
Figure 2: Simulation of a Transformer without parameter and explicit positional encoding results.
Figure 3: Inner-product Gram matrix heatmap of a trained Transformer decoder without positional…
Figure 4: Simulation of a Transformer without parameters using RoPE. We visualize the attention…
Figure 5: Diagonal-normalized attention heatmap of LLMs (first 4 layers). Attention scores were…
Figure 6: Extended result of Figure 2
Figure 7: Simulation of a Transformer without parameter and explicit positional encoding results,…
Figure 8: Attention pattern of a trained Transformer without explicit positional encoding. The plots…
Figure 9: The extended result of Figure 4 with…
Figure 10: The extended result of Figure 4 without causal mask (i.e. Transformer Encoder)
Figure 11: Llama-3-8B Attention Pattern
Figure 12: Phi-4 Per-Layer Attention Pattern
Figure 13: Qwen3-8B Per-Layer Attention Pattern
Figure 14: Llama-3-8B Attention Pattern (Normalized)
Figure 15: Phi-4 Attention Pattern (Normalized)
Figure 16: Qwen-8B Attention Pattern (Normalized)

Original abstract

Causal self-attention provides positional information to Transformer decoders. Prior work has shown that stacks of causal self-attention layers alone induce a positional bias in attention scores toward earlier tokens. However, this differs from the bias toward later tokens typically observed in Transformer decoders, known as recency bias. We address this discrepancy by analyzing the interaction between causal self-attention and other architectural components. We show that stacked causal self-attention layers combined with LayerNorm induce recency bias. Furthermore, we examine the effects of residual connections and the distribution of input token embeddings on this bias. Our results provide new theoretical insights into how positional information interacts with architectural components and suggest directions for improving positional encoding strategies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript starts from the prior finding that stacked causal self-attention alone produces an earlier-token bias, and claims that combining it with LayerNorm in stacked decoder layers instead induces recency bias (a preference for later tokens). It further analyzes how residual connections and the per-position statistics of input embeddings modulate or enable this effect, offering a theoretical account of the interaction.

Significance. If the derivations are correct, the work supplies a concrete mechanistic explanation for the recency bias routinely observed in trained Transformer decoders. Explicitly characterizing the necessary conditions on residuals and embedding moments is a strength; it turns an empirical observation into a falsifiable architectural claim and suggests targeted interventions for positional encoding.

major comments (1)
  1. [§3] §3 (main derivation): the recency-bias result is shown only under the assumption that residual branches preserve position-dependent first and second moments of the embeddings through the LayerNorm; the paper should state this premise as a theorem hypothesis and supply a short counter-derivation or numerical check when the residual is removed or when embeddings are drawn from a position-independent distribution.
minor comments (2)
  1. [Abstract] Abstract: the sentence 'we show that stacked causal self-attention layers combined with LayerNorm induce recency bias' would be clearer if it briefly indicated the role of the residual path.
  2. [§2] Notation: define the exact placement of LayerNorm relative to the residual addition (pre-norm vs. post-norm) before the first equation that uses it.
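
The placement distinction raised in the second minor comment can be made concrete. The sketch below shows the two standard conventions, assuming a LayerNorm without learned gain or shift and an arbitrary `sublayer` callable standing in for attention; which placement the paper uses is not stated in this review, and the function names are illustrative.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Per-token normalization to zero mean and unit variance (no gain/shift).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_norm_block(x, sublayer):
    # Pre-norm: normalize the sublayer input; the residual stream itself is not normalized.
    return x + sublayer(layer_norm(x))

def post_norm_block(x, sublayer):
    # Post-norm: normalize after the residual addition.
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 16))
zero_sublayer = lambda h: np.zeros_like(h)  # isolates the effect of the placement

pre_out = pre_norm_block(x, zero_sublayer)
post_out = post_norm_block(x, zero_sublayer)
```

With a sublayer that outputs zeros, the pre-norm block returns `x` unchanged, so per-position means and variances pass through the residual path, while the post-norm block resets every token to zero mean and unit variance. This difference is why the placement matters for any argument about which moments survive LayerNorm.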

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment and the helpful suggestion for improving the rigor of our theoretical presentation. We address the comment below.

  1. Referee: [§3] §3 (main derivation): the recency-bias result is shown only under the assumption that residual branches preserve position-dependent first and second moments of the embeddings through the LayerNorm; the paper should state this premise as a theorem hypothesis and supply a short counter-derivation or numerical check when the residual is removed or when embeddings are drawn from a position-independent distribution.

    Authors: We agree that the main derivation relies on this premise and that stating it explicitly as a theorem hypothesis will improve clarity. In the revised manuscript we will rephrase the statement of the central result in §3 to list the preservation of position-dependent moments by the residual branches as a formal hypothesis. We will also add a concise counter-analysis: when residuals are removed the position-dependent first and second moments are not propagated across layers, and the recency bias disappears (reverting to the earlier-token bias of attention alone); a short derivation of this case will be included. For position-independent embeddings we will supply a brief numerical verification on synthetic inputs confirming that the recency effect vanishes. These additions will be kept short and placed directly in §3.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: recency bias derived from component interaction under stated assumptions

The paper frames its central result as a theoretical analysis showing that stacked causal self-attention combined with LayerNorm produces recency bias, while separately examining the modulating roles of residual connections and the statistical distribution of input embeddings. No load-bearing step reduces by construction to a self-definition, a fitted parameter re-labeled as a prediction, or a self-citation chain; the derivation is presented as following from the architectural premises and distributional assumptions without re-expressing those premises as the output. The result is explicitly conditional on residuals being present and embeddings having position-dependent moments that survive LayerNorm, but this conditionality is stated up-front rather than smuggled in via redefinition. The analysis therefore remains self-contained against external benchmarks of the claimed interaction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claim rests on standard definitions of causal attention, LayerNorm, and residual connections together with an implicit assumption about the statistical distribution of token embeddings; no new free parameters or invented entities are introduced in the abstract.

axioms (1)
  • Domain assumption: causal self-attention and LayerNorm are composed in the standard residual-block manner used in Transformer decoders.
    The interaction result presupposes the conventional placement of these components.


