pith. machine review for the scientific record.

arxiv: 2509.21042 · v4 · submitted 2025-09-25 · 💻 cs.CL · cs.LG

Recognition: unknown

LayerNorm Induces Recency Bias in Transformer Decoders

authors — no claims on Pith yet
classification 💻 cs.CL · cs.LG
keywords bias · causal · positional · self-attention · decoders · recency · transformer · architectural
original abstract

Causal self-attention provides positional information to Transformer decoders. Prior work has shown that stacks of causal self-attention layers alone induce a positional bias in attention scores toward earlier tokens. However, this differs from the bias toward later tokens typically observed in Transformer decoders, known as recency bias. We address this discrepancy by analyzing the interaction between causal self-attention and other architectural components. We show that stacked causal self-attention layers combined with LayerNorm induce recency bias. Furthermore, we examine the effects of residual connections and the distribution of input token embeddings on this bias. Our results provide new theoretical insights into how positional information interacts with architectural components and suggest directions for improving positional encoding strategies.
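To make the abstract's architectural claim concrete, below is a minimal, illustrative sketch of the kind of probe one could run: push random token embeddings through a stack of causal self-attention layers, with and without LayerNorm, and look at where the final query's attention mass lands. This is not the paper's code or analysis; the single-head attention, untrained random weights, post-LN placement, layer count, dimensions, and the "earlier vs. later keys" metric are all assumptions made here for illustration, and the sketch is not guaranteed to reproduce the paper's theoretical result.

```python
# Minimal sketch (not the paper's code): probe where attention mass goes when
# random token embeddings are pushed through stacked causal self-attention
# layers, with and without LayerNorm. All sizes, the post-LN placement, and
# the "earlier vs. later keys" metric are illustrative assumptions.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, n_layers, seq_len, batch = 64, 6, 128, 32


def causal_self_attention(x, w_q, w_k, w_v):
    # Single-head scaled dot-product attention with a causal (lower-triangular) mask.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / d_model ** 0.5
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    attn = F.softmax(scores, dim=-1)
    return attn @ v, attn


def run_stack(use_layernorm: bool):
    x = torch.randn(batch, seq_len, d_model)       # random "token embeddings"
    ln = torch.nn.LayerNorm(d_model)
    attn = None
    for _ in range(n_layers):
        w_q, w_k, w_v = (torch.randn(d_model, d_model) / d_model ** 0.5
                         for _ in range(3))        # fresh random weights per layer
        out, attn = causal_self_attention(x, w_q, w_k, w_v)
        x = x + out                                # residual connection
        if use_layernorm:
            x = ln(x)                              # post-LN placement (one option)
    # Attention the final query assigns to each key position, averaged over the batch.
    return attn[:, -1, :].mean(dim=0)


for flag in (False, True):
    profile = run_stack(use_layernorm=flag)
    early = profile[: seq_len // 2].sum().item()
    late = profile[seq_len // 2:].sum().item()
    print(f"LayerNorm={flag}: mass on earlier keys={early:.3f}, later keys={late:.3f}")
```

Since the abstract also discusses residual connections and the input embedding distribution, the same sketch could be varied along those axes (drop the `x = x + out` line, switch to pre-LN, or change the embedding distribution) to see how the positional profile shifts.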

This paper has not been read by Pith yet.
