pith. sign in

arxiv: 2606.06160 · v1 · pith:O3SCTMHInew · submitted 2026-06-04 · 💻 cs.AI · cs.CL

Where does Absolute Position come from in decoder-only Transformers?

Pith reviewed 2026-06-28 01:21 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords RoPEabsolute positioncausal maskresidual streamattention sinksdecoder-only transformerspositional encodingtransformer architecture
0
0 comments X

The pith

RoPE-trained decoder-only transformers distinguish absolute position in attention patterns through the causal mask and residual stream.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that relative encodings like RoPE still allow absolute position to appear in attention even though the inner product uses only offsets. The causal mask makes each query's softmax normalization depend on its absolute location by design. The residual stream adds a second path: the token at position zero evolves in isolation and its state gets read by later layers through sink heads. These two sources appear in all studied RoPE variants but shift in strength with NTK scaling or sliding windows. Replacing the BOS token cuts roughly forty percent of the residual contribution at early positions.

Core claim

RoPE-trained transformers distinguish absolute position in their attention patterns, even though RoPE encodes only relative offsets in the inner product. We trace this leakage to two architectural components: the causal mask is responsible for the first as its per-query softmax denominator depends on the absolute query position by construction; the residual stream supplies the second as the activation at position 0 attends only to itself and runs as a closed dynamical system from the embedding of the token at that position, which downstream attention reads through sink-reading heads. Both components appear in all three architectures we study, in architecturally specific balance.

What carries the argument

Causal mask and residual stream, where position-zero activations form a closed dynamical system whose trajectory is read by sink-reading heads.

If this is right

  • NTK scaling suppresses the residual-stream component of absolute position leakage.
  • Sliding-window attention allows the residual component to accumulate with depth.
  • Standard RoPE balances the two leakage sources between the other variants.
  • Replacing the BOS embedding removes forty percent of the residual-stream component at early queries.
  • Attention sinks act as token-anchored stabilizers that forward a deterministic fingerprint of the position-zero token.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Controlling the initial token embedding offers a direct way to adjust how much absolute position information reaches later layers.
  • The same mechanism may underlie attention-sink behavior observed across many decoder-only models.
  • Models that do not auto-prepend a fixed BOS token would carry a fingerprint that changes with the actual first input token.

Load-bearing premise

The per-query softmax denominator under the causal mask depends on absolute query position by construction, and the activation at position 0 runs as a closed dynamical system whose trajectory is read downstream by sink-reading heads.

What would settle it

Measure attention weights after replacing the BOS embedding or removing the causal mask and check whether absolute-position distinctions in the patterns disappear.

Figures

Figures reproduced from arXiv: 2606.06160 by Fabrizio Silvestri, Umberto Nanni, Valeria Ruscio.

Figure 1
Figure 1. Figure 1: Mean ∆R2 by query position bin under causal (baseline) and bidirectional attention, for the three architectures. Query position bin midpoints are plotted on a log scale. Model Baseline ∆R 2 Bidirectional ∆R 2 Residual-stream share Causal-mask share Llama-3.2-1B 0.0121 0.0022 18% 82% Llama-3.2-3B 0.0143 0.0051 35% 64% Llama-3.1-8B 0.0135 0.0048 36% 64% [PITH_FULL_IMAGE:figures/full_fig_p008_1.png] view at source ↗
read the original abstract

RoPE-trained transformers distinguish absolute position in their attention patterns, even though RoPE encodes only relative offsets in the inner product. We trace this leakage to two architectural components, The causal mask is responsible for the first: its per-query softmax denominator depends on the absolute query position by construction. The residual stream supplies the second. Under causal attention the activation at position $0$ attends only to itself and runs as a closed dynamical system from the embedding of the token at that position; downstream attention reads this trajectory through sink-reading heads. Both components appear in all three architectures we study, in architecturally specific balance: NTK scaling suppresses the residual-stream component, sliding-window attention allows it to accumulate with depth, and standard RoPE sits between. Replacing the \texttt{BOS} embedding before the forward pass removes $40\%$ of the residual-stream component at early queries. Attention sinks are token-anchored stabilizers that pass forward a deterministic fingerprint of the token at position $0$, constant across inputs when that token is the auto-prepended \texttt{BOS} and varying with it otherwise.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that RoPE-trained decoder-only transformers exhibit absolute position information in their attention patterns despite RoPE providing only relative offsets in inner products. This leakage is traced to two sources: the causal mask, whose per-query softmax normalization depends on absolute query position by construction, and the residual stream, where the position-0 activation forms a closed dynamical system under causal attention and is read downstream by sink-reading heads. The relative strength of these channels varies across standard RoPE, NTK scaling, and sliding-window attention; replacing the BOS embedding is reported to remove 40% of the residual-stream component.

Significance. If the tracing and measurements are robust, the work supplies a mechanistic account of how absolute positional cues arise from standard decoder-only components even under relative encodings. The explicit comparison across architectural variants (NTK, sliding windows) and the BOS ablation provide concrete, falsifiable observations that could inform positional encoding design and attention analysis in long-context models.

major comments (2)
  1. [Abstract] Abstract: the 40% removal figure upon BOS replacement is presented as evidence for the residual-stream channel, yet no measurement protocol, layers/heads, distance metric, or statistical controls are supplied; without these the quantitative attribution to the residual stream cannot be evaluated.
  2. [Abstract] Abstract: while the causal-mask softmax denominator indeed grows with absolute query index, the manuscript does not show a controlled isolation (e.g., via an equation or ablation) demonstrating how much of the observed absolute-position distinction in attention patterns is produced by this denominator versus the residual-stream channel.
minor comments (2)
  1. The terms 'sink-reading heads' and 'attention sinks' are used without a concise definition or pointer to their first appearance; a one-sentence gloss would aid readers.
  2. Consider adding a small table or figure panel that tabulates the relative contribution of the two channels for each of the three architectures studied.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and the recommendation of minor revision. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the 40% removal figure upon BOS replacement is presented as evidence for the residual-stream channel, yet no measurement protocol, layers/heads, distance metric, or statistical controls are supplied; without these the quantitative attribution to the residual stream cannot be evaluated.

    Authors: We agree that the abstract does not supply the measurement protocol, layers/heads, distance metric, or statistical controls. We will revise the abstract to include these details so that the 40% figure can be properly evaluated. revision: yes

  2. Referee: [Abstract] Abstract: while the causal-mask softmax denominator indeed grows with absolute query index, the manuscript does not show a controlled isolation (e.g., via an equation or ablation) demonstrating how much of the observed absolute-position distinction in attention patterns is produced by this denominator versus the residual-stream channel.

    Authors: We agree that the current manuscript does not include a controlled isolation of the two channels. We will add an equation formalizing the position dependence of the softmax denominator together with an ablation that holds the denominator fixed, to quantify the separate contribution of each source. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper offers an observational architectural analysis tracing absolute-position leakage to the causal mask's position-dependent softmax normalization and the position-0 residual stream's closed dynamics under self-attention. These follow directly from standard decoder-only definitions once RoPE supplies only relative inner products; no equations, fitted parameters, or self-citations are invoked to force the claimed mechanisms. The account remains self-contained against external architectural benchmarks with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The account rests on standard transformer components (causal mask, residual stream, softmax) plus two newly named constructs (sink-reading heads, attention sinks) whose independent evidence is not supplied in the abstract.

axioms (1)
  • domain assumption The causal mask's per-query softmax denominator depends on absolute query position by construction.
    Invoked directly in the abstract as the first leakage source.
invented entities (2)
  • sink-reading heads no independent evidence
    purpose: Read the trajectory of the position-0 residual stream downstream.
    Introduced in the abstract to explain how the closed dynamical system at position 0 propagates absolute information.
  • attention sinks no independent evidence
    purpose: Token-anchored stabilizers that pass a deterministic fingerprint of the token at position 0.
    Named in the abstract as the mechanism carrying the residual-stream signal.

pith-pipeline@v0.9.1-grok · 5716 in / 1403 out tokens · 32219 ms · 2026-06-28T01:21:08.426783+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

21 extracted references · 3 canonical work pages · 1 internal anchor

  1. [1]

    Aho and Jeffrey D

    Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

  2. [2]

    Publications Manual , year = "1983", publisher =

  3. [3]

    Chandra and Dexter C

    Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

  4. [4]

    Scalable training of

    Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

  5. [5]

    Dan Gusfield , title =. 1997

  6. [6]

    Tetreault , title =

    Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

  7. [7]

    A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

    Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

  8. [8]

    Neurocomputing , volume=

    Roformer: Enhanced transformer with rotary position embedding , author=. Neurocomputing , volume=. 2024 , publisher=

  9. [9]

    Advances in Neural Information Processing Systems , volume=

    The impact of positional encoding on length generalization in transformers , author=. Advances in Neural Information Processing Systems , volume=

  10. [10]

    Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    Beyond Position: the emergence of wavelet-like properties in Transformers , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  11. [11]

    International Conference on Learning Representations , volume=

    Efficient streaming language models with attention sinks , author=. International Conference on Learning Representations , volume=

  12. [12]

    International Conference on Learning Representations , volume=

    When attention sink emerges in language models: An empirical view , author=. International Conference on Learning Representations , volume=

  13. [13]

    Advances in Neural Information Processing Systems , volume=

    What are you sinking? a geometric approach on attention sink , author=. Advances in Neural Information Processing Systems , volume=

  14. [14]

    arXiv preprint arXiv:2108.12409 , year=

    Train short, test long: Attention with linear biases enables input length extrapolation , author=. arXiv preprint arXiv:2108.12409 , year=

  15. [15]

    International Conference on Learning Representations , volume=

    Yarn: Efficient context window extension of large language models , author=. International Conference on Learning Representations , volume=

  16. [16]

    Understanding Transformer Memorization Recall Through Idioms

    Haviv, Adi and Cohen, Ido and Gidron, Jacob and Schuster, Roei and Goldberg, Yoav and Geva, Mor. Understanding Transformer Memorization Recall Through Idioms. Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. 2023. doi:10.18653/v1/2023.eacl-main.19

  17. [17]

    arXiv preprint arXiv:2209.11895 , year=

    In-context learning and induction heads , author=. arXiv preprint arXiv:2209.11895 , year=

  18. [18]

    Interpretability in the Wild: a Circuit for Indirect Object Identification in

    Kevin Ro Wang and Alexandre Variengien and Arthur Conmy and Buck Shlegeris and Jacob Steinhardt , booktitle=. Interpretability in the Wild: a Circuit for Indirect Object Identification in. 2023 , url=

  19. [19]

    arXiv preprint arXiv:2407.21783 , year=

    The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

  20. [20]

    5-coder technical report , author=

    Qwen2. 5-coder technical report , author=. arXiv preprint arXiv:2409.12186 , year=

  21. [21]

    Jiang, Albert Q. and Sablayrolles, Alexandre and Mensch, Arthur and Bamford, Chris and Chaplot, Devendra Singh and de Las Casas, Diego and Bressand, Florian and Lengyel, Gianna and Lample, Guillaume and Saulnier, Lucile and Lavaud, L. Mistral 7B , url =. CoRR , keywords =. doi:10.48550/ARXIV.2310.06825 , eprint =