A Structural Theory of Position Bias in Transformers

Hanna Herasimchyk; Robin Labryga; Sören Laue; Tomislav Prusina

arXiv: 2602.16837 · v2 · pith:CWCWQBLGnew · submitted 2026-02-18 · cs.LG

keywords: bias, models, position, theory, attention, behavior, causal, closely

Transformer models systematically favor certain token positions, yet the architectural origins of this position bias remain poorly understood. This bias is closely connected to the Lost-in-the-Middle phenomenon, where models underutilize information placed in the middle of the context. We show that Lost-in-the-Middle-type behavior can arise from the architecture of causal Transformers itself. To do so, we develop a structural theory of position bias based on residual-aware cumulative attention rollout. At finite depth, causal masking and residual connections induce broad, often U-shaped, influence profiles. At infinite depth, our framework resolves a discrepancy between prior attention-only collapse theory and practical Transformer behavior: residual connections fundamentally change cumulative attention dynamics. Empirically, the predicted profiles closely match measured input-token influence in pretrained language models.
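
To make the idea of cumulative attention rollout with residual connections concrete, here is a minimal sketch. It is not the authors' exact formulation, which the abstract does not spell out. It assumes a generic rollout in which each layer contributes a mixing matrix `alpha * I + (1 - alpha) * A`, where `A` is a row-stochastic causal attention matrix and `alpha` is a hypothetical residual weight. The uniform attention pattern, the value `alpha = 0.5`, and the function names are illustrative assumptions only.

```python
import numpy as np


def causal_uniform_attention(n_tokens):
    """Row-stochastic causal attention: token i attends uniformly to tokens 0..i.

    Uniform weights are an illustrative assumption, not a trained model.
    """
    mask = np.tril(np.ones((n_tokens, n_tokens)))
    return mask / mask.sum(axis=1, keepdims=True)


def residual_rollout(attn_layers, alpha=0.5):
    """Compose per-layer mixing matrices alpha * I + (1 - alpha) * A.

    alpha is a hypothetical residual weight; alpha = 0 gives attention-only rollout.
    Layers are applied in order, so later layers multiply from the left.
    """
    n_tokens = attn_layers[0].shape[0]
    identity = np.eye(n_tokens)
    rollout = identity.copy()
    for attn in attn_layers:
        layer_mix = alpha * identity + (1.0 - alpha) * attn
        rollout = layer_mix @ rollout
    return rollout


n_tokens, n_layers = 16, 6
layers = [causal_uniform_attention(n_tokens) for _ in range(n_layers)]
rollout = residual_rollout(layers, alpha=0.5)

# Row i of the rollout distributes the output at position i over input positions.
last_token_profile = rollout[-1]
print(np.round(last_token_profile, 4))
```

Under these assumptions, the last row of `rollout` can be read as an influence profile of the input positions on the final token. Rerunning with `alpha=0.0` gives the attention-only case, which allows a side-by-side comparison of how the residual term changes the profile as `n_layers` grows.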

Forward citations

Cited by 1 Pith paper

Kinetic theory for Transformers and the lost-in-the-middle phenomenon
math.AP · 2026-05 · conditional novelty 8.0

A mean-field kinetic theory derivation produces a closed-form U-shaped token retrieval profile that explains the lost-in-the-middle phenomenon in Transformers.