arxiv: 2602.16837 · v2 · pith:CWCWQBLGnew · submitted 2026-02-18 · 💻 cs.LG

A Structural Theory of Position Bias in Transformers

classification 💻 cs.LG
keywords bias, models, position, theory, attention, behavior, causal, closely

Transformer models systematically favor certain token positions, yet the architectural origins of this position bias remain poorly understood. This bias is closely connected to the Lost-in-the-Middle phenomenon, where models underutilize information placed in the middle of the context. We show that Lost-in-the-Middle-type behavior can arise from the architecture of causal Transformers itself. To do so, we develop a structural theory of position bias based on residual-aware cumulative attention rollout. At finite depth, causal masking and residual connections induce broad, often U-shaped, influence profiles. At infinite depth, our framework resolves a discrepancy between prior attention-only collapse theory and practical Transformer behavior: residual connections fundamentally change cumulative attention dynamics. Empirically, the predicted profiles closely match measured input-token influence in pretrained language models.
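
To make the idea of a residual-aware cumulative rollout concrete, here is a minimal toy sketch in plain Python. It is not the paper's construction, which the abstract does not spell out. It assumes uniform causal attention (token i attends equally to tokens 0..i), a fixed per-layer mixing matrix `residual_weight * I + (1 - residual_weight) * A`, and identical layers. The names `causal_uniform_attention`, `mixing_matrix`, `last_token_profile`, and `residual_weight` are all hypothetical and chosen for illustration only.

```python
def causal_uniform_attention(n):
    """Toy assumption: token i attends uniformly to tokens 0..i (causal mask)."""
    return [[1.0 / (i + 1) if j <= i else 0.0 for j in range(n)]
            for i in range(n)]


def mixing_matrix(attn, residual_weight):
    """Per-layer mixing: residual_weight * I + (1 - residual_weight) * attn.

    The identity term is a stand-in for the residual connection (an assumption
    of this sketch, not a formula taken from the paper).
    """
    n = len(attn)
    return [[(residual_weight if i == j else 0.0)
             + (1.0 - residual_weight) * attn[i][j]
             for j in range(n)]
            for i in range(n)]


def last_token_profile(n, depth, residual_weight=0.5):
    """Influence of each input position on the last token after `depth` layers.

    Computes the last row of layer**depth by repeatedly multiplying a row
    vector (initially one-hot on the last token) by the mixing matrix.
    """
    layer = mixing_matrix(causal_uniform_attention(n), residual_weight)
    row = [0.0] * (n - 1) + [1.0]
    for _ in range(depth):
        row = [sum(row[k] * layer[k][j] for k in range(n)) for j in range(n)]
    return row


if __name__ == "__main__":
    profile = last_token_profile(n=8, depth=2)
    print([round(w, 3) for w in profile])
```

In this toy setting, with a short context and shallow depth, the first and last positions receive more weight than the middle one: the causal mask lets early tokens accumulate influence, while the identity term preserves weight on the most recent token. Because the recency weight shrinks with each added layer, the profile's shape depends on `depth` and `residual_weight`, so this sketch should not be read as reproducing the paper's finite-depth or infinite-depth results.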

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Kinetic theory for Transformers and the lost-in-the-middle phenomenon

    math.AP 2026-05 conditional novelty 8.0

    A mean-field kinetic theory derivation produces a closed-form U-shaped token retrieval profile that explains the lost-in-the-middle phenomenon in Transformers.