Where does Absolute Position come from in decoder-only Transformers?

Fabrizio Silvestri; Umberto Nanni; Valeria Ruscio

arxiv: 2606.06160 · v1 · pith:O3SCTMHInew · submitted 2026-06-04 · 💻 cs.AI · cs.CL

Pith reviewed 2026-06-28 01:21 UTC · model grok-4.3

classification 💻 cs.AI cs.CL

keywords RoPE · absolute position · causal mask · residual stream · attention sinks · decoder-only transformers · positional encoding · transformer architecture

The pith

RoPE-trained decoder-only transformers distinguish absolute position in attention patterns through the causal mask and residual stream.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that relative encodings like RoPE still allow absolute position to appear in attention even though the inner product uses only offsets. The causal mask makes each query's softmax normalization depend on its absolute location by design. The residual stream adds a second path: the token at position zero evolves in isolation and its state gets read by later layers through sink heads. These two sources appear in all studied RoPE variants but shift in strength with NTK scaling or sliding windows. Replacing the BOS token cuts roughly forty percent of the residual contribution at early positions.
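The causal-mask channel can be illustrated with a toy computation. The sketch below is not taken from the paper: it assumes a hypothetical content-free score `score(d) = -0.1 * |d|` that depends only on the query-key offset, and the function name `causal_relative_attention` is made up for illustration. Even with strictly relative logits, the softmax for query `i` is normalized over `i + 1` visible keys, so the weight placed on any fixed offset changes with the absolute position of the query.

```python
import numpy as np

def causal_relative_attention(n, score):
    """Causal attention weights when the logit depends only on the offset i - j."""
    idx = np.arange(n)
    offsets = idx[:, None] - idx[None, :]             # offsets[i, j] = i - j
    logits = np.asarray(score(offsets), dtype=float)  # relative-only logits
    logits = np.where(offsets >= 0, logits, -np.inf)  # causal mask: keys j <= i
    unnormalized = np.exp(logits)                     # masked entries become 0
    denom = unnormalized.sum(axis=1)                  # per-query softmax denominator
    return unnormalized / denom[:, None], denom

# Hypothetical offset-only score (an assumption, not the paper's model).
attn, Z = causal_relative_attention(8, lambda d: -0.1 * np.abs(d))

print("denominator per query position:", np.round(Z, 3))
print("weight on offset 0 per query position:", np.round(np.diag(attn), 3))
```

In this toy setting the denominator grows with every additional visible key, so the weight on the query's own position shrinks as the query moves right, which is one way a purely relative score can still yield position-dependent attention patterns.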

Core claim

What carries the argument

Causal mask and residual stream, where position-zero activations form a closed dynamical system whose trajectory is read by sink-reading heads.

If this is right

NTK scaling suppresses the residual-stream component of absolute position leakage.
Sliding-window attention allows the residual component to accumulate with depth.
Standard RoPE balances the two leakage sources between the other variants.
Replacing the BOS embedding removes forty percent of the residual-stream component at early queries.
Attention sinks act as token-anchored stabilizers that forward a deterministic fingerprint of the position-zero token.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Controlling the initial token embedding offers a direct way to adjust how much absolute position information reaches later layers.
The same mechanism may underlie attention-sink behavior observed across many decoder-only models.
Models that do not auto-prepend a fixed BOS token would carry a fingerprint that changes with the actual first input token.

Load-bearing premise

The per-query softmax denominator under the causal mask depends on absolute query position by construction, and the activation at position 0 runs as a closed dynamical system whose trajectory is read downstream by sink-reading heads.
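The second half of this premise can be checked on a toy model. The following is a minimal sketch under stated assumptions, not the paper's setup: random weights, a single attention head per layer, and no RoPE, normalization, or MLP (these act per position, and RoPE at offset 0 is the identity rotation, so they should not break the closure). The names `layers` and `forward` are hypothetical. Under a causal mask, position 0 attends only to itself, so its per-layer states depend only on the first token's embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 8, 3
# Hypothetical random (Wq, Wk, Wv) per layer; no trained model is involved.
layers = [tuple(rng.normal(scale=0.3, size=(d, d)) for _ in range(3))
          for _ in range(n_layers)]

def forward(x):
    """Causal self-attention with residual connections; returns position-0 states per layer."""
    n = x.shape[0]
    mask = np.tril(np.ones((n, n), dtype=bool))       # query i sees keys j <= i
    trajectory = [x[0].copy()]
    for Wq, Wk, Wv in layers:
        logits = (x @ Wq) @ (x @ Wk).T / np.sqrt(d)
        logits = np.where(mask, logits, -np.inf)
        w = np.exp(logits - logits.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        x = x + w @ (x @ Wv)                          # residual update
        trajectory.append(x[0].copy())
    return np.stack(trajectory)

bos = rng.normal(size=d)                              # stand-in for a fixed first token
seq_a = np.vstack([bos, rng.normal(size=(5, d))])
seq_b = np.vstack([bos, rng.normal(size=(5, d))])     # same first token, different rest
seq_c = np.vstack([rng.normal(size=d), seq_a[1:]])    # different first token, same rest

traj_a, traj_b, traj_c = forward(seq_a), forward(seq_b), forward(seq_c)
print("same first token, same trajectory:", np.allclose(traj_a, traj_b))
print("different first token, same trajectory:", np.allclose(traj_a, traj_c))
```

The position-0 trajectory is unchanged when the later tokens change and differs when the first token changes, which matches the described behavior of a fingerprint that is constant under a fixed BOS and varies otherwise.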

What would settle it

Measure attention weights after replacing the BOS embedding or removing the causal mask and check whether absolute-position distinctions in the patterns disappear.

Figures

Figures reproduced from arXiv: 2606.06160 by Fabrizio Silvestri, Umberto Nanni, Valeria Ruscio.

**Figure 1.** Mean ΔR² by query position bin under causal (baseline) and bidirectional attention, for the three architectures. Query position bin midpoints are plotted on a log scale.

| Model | Baseline ΔR² | Bidirectional ΔR² | Residual-stream share | Causal-mask share |
| --- | --- | --- | --- | --- |
| Llama-3.2-1B | 0.0121 | 0.0022 | 18% | 82% |
| Llama-3.2-3B | 0.0143 | 0.0051 | 35% | 64% |
| Llama-3.1-8B | 0.0135 | 0.0048 | 36% | 64% |
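The share columns in Figure 1 appear to follow from the two ΔR² columns. The sketch below assumes, as an inference from the numbers rather than something the caption states, that the residual-stream share is the ratio of bidirectional to baseline ΔR² (what remains once the causal mask is removed) and that the causal-mask share is the remainder. The function name `channel_shares` is hypothetical.

```python
# ΔR² values copied from the Figure 1 table: (baseline, bidirectional).
delta_r2 = {
    "Llama-3.2-1B": (0.0121, 0.0022),
    "Llama-3.2-3B": (0.0143, 0.0051),
    "Llama-3.1-8B": (0.0135, 0.0048),
}

def channel_shares(baseline, bidirectional):
    """Assumed decomposition: residual share = bidirectional / baseline."""
    residual = bidirectional / baseline
    return residual, 1.0 - residual

shares = {model: channel_shares(*values) for model, values in delta_r2.items()}
for model, (residual, mask) in shares.items():
    print(f"{model}: residual-stream ~{residual:.1%}, causal-mask ~{mask:.1%}")
```

Under this assumption the recomputed shares agree with the reported percentages to within about one percentage point; the small differences are consistent with the ΔR² values having been rounded before display.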

Original abstract

RoPE-trained transformers distinguish absolute position in their attention patterns, even though RoPE encodes only relative offsets in the inner product. We trace this leakage to two architectural components. The causal mask is responsible for the first: its per-query softmax denominator depends on the absolute query position by construction. The residual stream supplies the second. Under causal attention the activation at position 0 attends only to itself and runs as a closed dynamical system from the embedding of the token at that position; downstream attention reads this trajectory through sink-reading heads. Both components appear in all three architectures we study, in architecturally specific balance: NTK scaling suppresses the residual-stream component, sliding-window attention allows it to accumulate with depth, and standard RoPE sits between. Replacing the BOS embedding before the forward pass removes 40% of the residual-stream component at early queries. Attention sinks are token-anchored stabilizers that pass forward a deterministic fingerprint of the token at position 0, constant across inputs when that token is the auto-prepended BOS and varying with it otherwise.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper traces absolute-position leakage in RoPE decoder-only models to the causal mask's growing softmax support and the closed dynamics at position zero.

The core observation is that RoPE supplies only relative inner products, yet attention patterns still encode absolute position. The authors pin this on two standard pieces of the architecture: the causal mask makes the softmax denominator depend on query index, and the position-zero activation runs as its own dynamical system under self-attention before later layers read it.

They show that the balance between these two channels shifts with common variants: NTK scaling damps the residual-stream part, sliding windows let it build with depth, and plain RoPE sits in between. The BOS-replacement intervention, which removes 40% of the residual-stream component at early queries, is a direct test of the second channel.

The account is architectural rather than fitted, which keeps it clean. The mechanisms follow once you accept that RoPE is strictly relative and attention is causal. No extra assumptions appear to be required.

The main limitation is that the abstract supplies no model sizes, datasets, or measurement details, so the 40 % figure and the head classifications cannot yet be checked for robustness or sensitivity to training choices. The term “sink-reading heads” also needs a clear operational definition in the full text.

This is useful reading for anyone tuning position encodings or debugging long-context behavior. The central claim is straightforward enough that a serious referee could evaluate the experiments quickly. I would send it to review.

Referee Report

2 major / 2 minor

Summary. The paper claims that RoPE-trained decoder-only transformers exhibit absolute position information in their attention patterns despite RoPE providing only relative offsets in inner products. This leakage is traced to two sources: the causal mask, whose per-query softmax normalization depends on absolute query position by construction, and the residual stream, where the position-0 activation forms a closed dynamical system under causal attention and is read downstream by sink-reading heads. The relative strength of these channels varies across standard RoPE, NTK scaling, and sliding-window attention; replacing the BOS embedding is reported to remove 40% of the residual-stream component.

Significance. If the tracing and measurements are robust, the work supplies a mechanistic account of how absolute positional cues arise from standard decoder-only components even under relative encodings. The explicit comparison across architectural variants (NTK, sliding windows) and the BOS ablation provide concrete, falsifiable observations that could inform positional encoding design and attention analysis in long-context models.

major comments (2)

[Abstract] The 40% removal figure upon BOS replacement is presented as evidence for the residual-stream channel, yet no measurement protocol, layers/heads, distance metric, or statistical controls are supplied; without these the quantitative attribution to the residual stream cannot be evaluated.
[Abstract] While the causal-mask softmax denominator indeed grows with absolute query index, the manuscript does not show a controlled isolation (e.g., via an equation or ablation) demonstrating how much of the observed absolute-position distinction in attention patterns is produced by this denominator versus the residual-stream channel.
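
For reference, the position dependence of the denominator mentioned in the second comment can be written out as follows. This is a sketch in assumed notation, not an equation from the manuscript: $q_i$ and $k_j$ are the pre-rotation query and key vectors, $R_\delta$ is the RoPE rotation for offset $\delta$, $d$ is the head dimension, positions are indexed from 0, and $g$ is a hypothetical offset-only score used for the special case in the last line.

```latex
% RoPE logit: position enters only through the offset j - i
s_{ij} = \frac{(R_i q_i)^\top (R_j k_j)}{\sqrt{d}}
       = \frac{q_i^\top R_{j-i}\, k_j}{\sqrt{d}}

% Causal softmax for query i: the sum runs over the i + 1 visible keys
a_{ij} = \frac{\exp(s_{ij})}{Z_i}, \qquad
Z_i = \sum_{j=0}^{i} \exp(s_{ij})

% Content-free special case s_{ij} = g(i - j): Z_i is strictly increasing in i
Z_i = \sum_{\delta=0}^{i} e^{g(\delta)}, \qquad
Z_{i+1} - Z_i = e^{g(i+1)} > 0
```

In the special case the weight on a fixed offset $\delta$ is $e^{g(\delta)} / Z_i$, which depends on the absolute index $i$ even though the numerator does not. For content-dependent logits the same structure holds, with the number of summands still set by $i$.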

minor comments (2)

The terms 'sink-reading heads' and 'attention sinks' are used without a concise definition or pointer to their first appearance; a one-sentence gloss would aid readers.
Consider adding a small table or figure panel that tabulates the relative contribution of the two channels for each of the three architectures studied.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and the recommendation of minor revision. We address each major comment below and will revise the manuscript accordingly.

Referee: [Abstract] The 40% removal figure upon BOS replacement is presented as evidence for the residual-stream channel, yet no measurement protocol, layers/heads, distance metric, or statistical controls are supplied; without these the quantitative attribution to the residual stream cannot be evaluated.

Authors: We agree that the abstract does not supply the measurement protocol, layers/heads, distance metric, or statistical controls. We will revise the abstract to include these details so that the 40% figure can be properly evaluated. Revision: yes.

Referee: [Abstract] While the causal-mask softmax denominator indeed grows with absolute query index, the manuscript does not show a controlled isolation (e.g., via an equation or ablation) demonstrating how much of the observed absolute-position distinction in attention patterns is produced by this denominator versus the residual-stream channel.

Authors: We agree that the current manuscript does not include a controlled isolation of the two channels. We will add an equation formalizing the position dependence of the softmax denominator together with an ablation that holds the denominator fixed, to quantify the separate contribution of each source. Revision: yes.

Circularity Check

0 steps flagged

No significant circularity

The paper offers an observational architectural analysis tracing absolute-position leakage to the causal mask's position-dependent softmax normalization and the position-0 residual stream's closed dynamics under self-attention. These follow directly from standard decoder-only definitions once RoPE supplies only relative inner products; no equations, fitted parameters, or self-citations are invoked to force the claimed mechanisms. The account remains self-contained against external architectural benchmarks with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The account rests on standard transformer components (causal mask, residual stream, softmax) plus two newly named constructs (sink-reading heads, attention sinks) whose independent evidence is not supplied in the abstract.

axioms (1)

domain assumption: The causal mask's per-query softmax denominator depends on absolute query position by construction.
Invoked directly in the abstract as the first leakage source.

invented entities (2)

sink-reading heads (no independent evidence)
purpose: Read the trajectory of the position-0 residual stream downstream.
Introduced in the abstract to explain how the closed dynamical system at position 0 propagates absolute information.

attention sinks (no independent evidence)
purpose: Token-anchored stabilizers that pass a deterministic fingerprint of the token at position 0.
Named in the abstract as the mechanism carrying the residual-stream signal.

