
arxiv: 2605.15514 · v1 · pith:DMFM5PS5new · submitted 2026-05-15 · 💻 cs.CL · cs.AI · cs.LG

RoPE Distinguishes Neither Positions Nor Tokens in Long Contexts, Provably

Pith reviewed 2026-05-19 15:29 UTC · model grok-4.3

classification: 💻 cs.CL, cs.AI, cs.LG
keywords: rotary positional embeddings, long-context transformers, attention mechanisms, positional encoding, RoPE limitations

The pith

RoPE-based attention loses locality bias and token relevance consistency as context length grows, with failure probability approaching 0.5.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves that rotary positional embeddings cause attention scores in transformers to become unpredictable in long contexts. It demonstrates that the embeddings no longer favor nearby positions over distant ones and that token relevance rankings can reverse across different positions. The analysis shows attention scores can stay identical even after shifting a token's position or replacing it with a different token. This matters because many current long-context models rely on RoPE to handle extended sequences. Adjusting the base frequency improves token distinction but reduces position distinction, so both properties cannot be maintained together.

Core claim

We prove that as context length increases, RoPE-based attention becomes unpredictable and loses its locality bias and consistency in token relevance, with the probability of failure approaching 0.5. We further prove that the attention score can remain unchanged when a key token is moved to a different position, or even replaced by a different token, indicating a failure to distinguish positions or tokens. Adjusting the RoPE base trades off distinguishing positions against distinguishing tokens but cannot preserve both at the same time.

What carries the argument

The rotation of query and key vectors by position-dependent angles in the RoPE formulation, whose dot-product scores are analyzed for invariance properties over increasing context lengths.
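
As a rough illustration of that machinery, the Python sketch below applies the standard RoPE rotation to a query and a key and takes their dot product. It assumes the usual pair frequencies θ_i = base^(-2i/d) from the RoFormer formulation; the function names `rope_rotate` and `rope_score`, the dimension 64, and the base 10000 are illustrative choices, not values taken from the paper. The point it makes is that the score depends on the two positions only through their offset, which is the quantity an invariance analysis of this kind works with.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate consecutive pairs (x[0], x[1]), (x[2], x[3]), ... by pos * theta_i."""
    dim = x.shape[-1]
    theta = base ** (-2.0 * np.arange(dim // 2) / dim)  # assumed standard schedule
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

def rope_score(q, k, m, n, base=10000.0):
    """Attention logit between a query at position m and a key at position n."""
    return float(rope_rotate(q, m, base) @ rope_rotate(k, n, base))

rng = np.random.default_rng(0)
q, k = rng.standard_normal(64), rng.standard_normal(64)

# Same offset (7) at two different absolute positions.
s1 = rope_score(q, k, m=10, n=3)
s2 = rope_score(q, k, m=1010, n=1003)
print(s1, s2)
```

Up to floating-point rounding the two printed scores agree, because each 2-D pair contributes a term that depends only on (m - n)·θ_i.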

If this is right

  • Increasing the RoPE base hyperparameter helps distinguish different tokens but sacrifices the ability to distinguish positions.
  • Multi-head and multi-layer architectures do not overcome the limitations of RoPE in long contexts.
  • Fundamentally new mechanisms for encoding position and token order may be needed for future long-context transformer models.
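
One way to build intuition for the first bullet is to look at how the base sets the per-pair rotation frequencies. The sketch below assumes the standard RoPE schedule θ_i = base^(-2i/d); the helper names, the head dimension 128, the context length 131072, and the three base values are illustrative assumptions, not settings from the paper. It reports the smallest per-position rotation step and the number of pairs that complete at least one full turn over the context. It is an intuition aid only, not a reproduction of the paper's proof.

```python
import numpy as np

def pair_frequencies(dim, base):
    """Rotation frequency theta_i for each of the dim/2 two-dimensional pairs."""
    i = np.arange(dim // 2)
    return base ** (-2.0 * i / dim)

def summarize(dim, base, context_len):
    theta = pair_frequencies(dim, base)
    total_turns = context_len * theta / (2 * np.pi)
    return {
        "base": base,
        # Angle separating two adjacent positions in the slowest pair.
        "smallest_step_rad": float(theta.min()),
        # Pairs whose angle wraps around at least once within the context.
        "pairs_wrapping": int((total_turns >= 1.0).sum()),
    }

for base in (1e4, 5e5, 1e7):
    print(summarize(dim=128, base=base, context_len=131072))
```

Under these assumptions a larger base shrinks the smallest rotation step, so neighbouring positions differ by a smaller angle in the slow pairs, and it leaves fewer pairs wrapping around within the context. How that maps onto the formal trade-off is left to the paper's proofs.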

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The invariance properties could explain why ordering-sensitive tasks degrade in very long contexts even when models appear to handle length.
  • Encodings that avoid angle-based rotations might preserve both locality and distinction without the observed tradeoff.

Load-bearing premise

The theoretical analysis abstracts away from the specific content of the context and depends only on its length.

What would settle it

Empirical computation of attention score distributions across many random token sequences at increasing context lengths, checking whether the probability of locality bias failure or relevance inconsistency approaches 0.5.
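
A minimal Monte Carlo sketch of such a check is given below. It makes several assumptions that are not from the paper: the key is taken to be identical to the query (so the score is largest at distance zero), the two distances are drawn uniformly without replacement from 1 to the context length, and a "locality failure" is counted whenever the farther key scores strictly higher than the nearer one. The function names, dimension, base, and trial count are hypothetical. The paper's own probability measure may differ, so the printed numbers should be read as exploratory rather than as a test of the theorem.

```python
import numpy as np

def score_at_distance(q, k, dist, base=10000.0):
    """RoPE logit for a key `dist` positions behind the query.

    `dist` may be a scalar or an integer array; the result has the same shape.
    """
    dim = q.shape[0]
    theta = base ** (-2.0 * np.arange(dim // 2) / dim)
    ang = np.asarray(dist, dtype=np.float64)[..., None] * theta
    # Per-pair dot and cross terms of the 2-D sub-vectors.
    dot = q[0::2] * k[0::2] + q[1::2] * k[1::2]
    cross = q[0::2] * k[1::2] - q[1::2] * k[0::2]
    return (dot * np.cos(ang) + cross * np.sin(ang)).sum(axis=-1)

def locality_failure_rate(context_len, dim=64, trials=2000, base=10000.0, seed=0):
    """Fraction of trials in which the farther key outscores the nearer one."""
    rng = np.random.default_rng(seed)
    failures = 0
    for _ in range(trials):
        q = rng.standard_normal(dim)
        k = q.copy()  # assumption: key perfectly aligned with the query
        d = rng.choice(context_len, size=2, replace=False) + 1
        near, far = d.min(), d.max()
        s_near, s_far = score_at_distance(q, k, np.array([near, far]))
        failures += int(s_far > s_near)
    return failures / trials

for length in (64, 1024, 16384, 131072):
    print(length, locality_failure_rate(length))
```

If the claim holds under this setup, the estimate should drift toward 0.5 as the length grows; a rate that stays well below 0.5 would point either to a mismatch between this setup and the paper's measure or to a problem with the claim.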

Figures

Figures reproduced from arXiv: 2605.15514 by Aram Galstyan, Eliu A Huerta, Hao Peng, Minyang Tian, Phillip Harris, Srikanth Ronanki, Subendhu Rongali, Yufeng Du.

Figure 1: Position aliasing induces an attention invariance failure: there exist large numbers of …
Figure 2: Illustration of the RoPE product and its normal approximation when …
Figure 3: Illustrations (a) for position inversion and aliasing, with corresponding probability estima…
Figure 4: Position inversion for key “cat” and query “pet”. Llama …
Figure 5: Heat maps of position aliasing and attention invariance pairs under BF16, showing the …
Figure 6: Illustration of token inversion and aliasing.
Figure 7: Token aliasing probabilities under BF16, for query …
Figure 8: Token inversion for keys “cat”, “number” …
Figure 9: (a) We ask models to extract the k-th element in an array consisting of only integers 0-3. (b, c) Dots and shadows represent mean and standard deviation of accuracy. Selected models: Grattafiori et al. (2024); Mistral AI Team (2024); Yang et al. (2025a); DeepSeek-AI et al. (2025); Team et al. (2026a); OpenAI et al. (2025). The indexing task: in our position identification task, the model is presented with a …
Figure 10: Heat maps of position aliasing and attention invariance pairs. FP16, Llama3.1-8B, Layer …
Figure 11: Distribution and probability of token aliasing for keys “cat” and “dog” and query “pet”.
Original abstract

We identify intrinsic limitations of Rotary Positional Embeddings (RoPE) in Transformer-based long-context language models. Our theoretical analysis abstracts away from the specific content of the context and depends only on its length. We prove that as context length increases, RoPE-based attention becomes unpredictable and loses two properties that are central to its effectiveness. First, it loses its locality bias: RoPE is no more likely to favor nearer positions than substantially farther ones. Second, it loses consistency in token relevance: a key vector that receives a higher attention score than an alternative at one position may receive a lower score at another. In both cases, the probability of failure approaches 0.5, no better than random guessing. We further prove that the attention score can remain unchanged when a key token is moved to a different position, or even replaced by a different token, indicating a failure to distinguish positions or tokens. Adjusting the RoPE base trades off distinguishing positions against distinguishing tokens but cannot preserve both at the same time. Increasing the RoPE base hyperparameter, a common practice in today's long-context models, helps distinguish different tokens, but inevitably sacrifices the ability to distinguish positions. Our empirical analysis shows that multi-head, multi-layer architectures are insufficient to overcome these limitations. Our findings suggest that fundamentally new mechanisms for encoding position and token order may be needed in future Transformer long-context language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that RoPE-based attention in Transformers loses its locality bias and token-relevance consistency as context length grows, with failure probability approaching 0.5. It proves that attention scores can be invariant to position shifts or token swaps for fixed query/key vectors, that RoPE cannot simultaneously distinguish both positions and tokens, and that increasing the base trades off one for the other. Multi-head and multi-layer architectures are shown empirically to be insufficient to overcome these issues, suggesting fundamentally new positional mechanisms may be required for long-context models.

Significance. If the mathematical results hold under the fixed-vector abstraction, the work would provide a rigorous, length-dependent explanation for RoPE's observed weaknesses in long contexts and motivate alternatives. The content-independent proofs are a strength for generality, but the direct leap to trained models requires additional justification that no compensating embeddings exist.

major comments (2)
  1. [§3] §3 (Theoretical Analysis, proofs of locality loss and invariance): The derivations fix query and key vectors while varying only position indices and length. This makes the probability-of-failure claims (approaching 0.5) mathematically clean but load-bearing for the model-level conclusion; the paper does not show that no learned embeddings can align with the rotations to restore locality or distinguishability.
  2. [§4] §4 (Empirical Analysis on multi-head/layer models): The experiments evaluate post-training attention behavior and confirm the theoretical predictions, yet they do not include a training-time search over embeddings to test whether compensating directions can be learned. This leaves the practical implication for long-context Transformers unclosed.
minor comments (2)
  1. [§3] Notation for the rotation angle θ_i in the main theorem statement could be clarified to explicitly separate the base hyperparameter from the position index.
  2. The abstract states 'probability of failure approaches 0.5' without a short parenthetical on the measure (uniform over positions or tokens); adding this would improve readability.
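
For reference, the separation the first minor comment asks for can be written as follows. This uses the standard RoFormer convention and assumes the paper follows it; the symbols b (base), d (head dimension), m and n (query and key positions), and the coefficients a_i and c_i are notation introduced here for illustration, not the paper's own.

```latex
% Base-dependent frequency of pair i (no position index involved):
\[
  \theta_i = b^{-2i/d}, \qquad i = 0, 1, \dots, \tfrac{d}{2} - 1 .
\]
% Position-dependent rotation angle applied to pair i at position m:
\[
  \phi_{m,i} = m\,\theta_i .
\]
% Resulting attention logit between a query at m and a key at n:
\[
  s(m, n) = \sum_{i=0}^{d/2-1}
    \Big[ a_i \cos\big((m-n)\,\theta_i\big) + c_i \sin\big((m-n)\,\theta_i\big) \Big],
\]
\[
  a_i = q_{2i}k_{2i} + q_{2i+1}k_{2i+1}, \qquad
  c_i = q_{2i}k_{2i+1} - q_{2i+1}k_{2i} .
\]
```

Written this way, the base enters only through θ_i and the positions enter only through the offset m - n.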

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the scope of our theoretical abstraction and its implications. We address each major comment below, explaining our reasoning and indicating where revisions will strengthen the manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (Theoretical Analysis, proofs of locality loss and invariance): The derivations fix query and key vectors while varying only position indices and length. This makes the probability-of-failure claims (approaching 0.5) mathematically clean but load-bearing for the model-level conclusion; the paper does not show that no learned embeddings can align with the rotations to restore locality or distinguishability.

    Authors: We thank the referee for highlighting this aspect of our abstraction. The decision to hold query and key vectors fixed isolates the contribution of the RoPE rotations themselves; the proofs establish that, for any such vectors, the attention scores lose locality bias and token-consistency as length grows. Because the result is independent of the particular vector values, it applies equally to vectors produced by any learned embeddings. No choice of base vectors can evade the length-dependent randomization induced by the accumulating rotations. We will add a short clarifying paragraph in §3 that explicitly connects the fixed-vector results to the generality over learned embeddings. revision: partial

  2. Referee: [§4] §4 (Empirical Analysis on multi-head/layer models): The experiments evaluate post-training attention behavior and confirm the theoretical predictions, yet they do not include a training-time search over embeddings to test whether compensating directions can be learned. This leaves the practical implication for long-context Transformers unclosed.

    Authors: We agree that an end-to-end training experiment could be informative. Nevertheless, the content-independent proofs already imply that any embeddings learned at training time remain subject to the same failure modes once context length exceeds the training regime. Exhaustive search over embedding directions during training would therefore be both computationally expensive and, according to the theory, unable to restore the lost properties for arbitrary lengths. We will insert a concise discussion paragraph after the empirical results that makes this connection explicit and notes the consequent implications for long-context training. revision: partial

Circularity Check

0 steps flagged

No circularity; direct mathematical derivation from RoPE definition

full rationale

The paper's core results are obtained by fixing query and key vectors (explicitly abstracting content) and deriving attention-score statistics solely from the length-dependent rotation angles in the standard RoPE formulation. The proofs that locality bias and token-relevance consistency each fail with probability approaching 0.5, and that scores can be invariant under position shifts or token swaps, follow from elementary trigonometric identities and uniform-distribution arguments over position indices; none of these steps presuppose the claimed failure probabilities or invoke self-citations for uniqueness. The trade-off when varying the RoPE base is likewise obtained by direct comparison of the resulting angle spacings. The derivation is therefore self-contained against the definition of RoPE and does not reduce to any fitted input, renamed empirical pattern, or load-bearing self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on a mathematical abstraction that treats attention scores as depending solely on relative positions and total length, with no free parameters or new entities introduced.

axioms (1)
  • domain assumption: RoPE attention properties can be analyzed independently of token content, depending solely on position indices and context length.
    Explicitly stated in the abstract as the foundation for the theoretical analysis.

pith-pipeline@v0.9.0 · 5808 in / 1079 out tokens · 59645 ms · 2026-05-19T15:29:44.440825+00:00 · methodology


Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 3 internal anchors

  1. RoFormer: Enhanced Transformer with Rotary Position Embedding. 2021.
  2. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. 2024.
  3. Fu, Yao and Panda, Rameswar and Niu, Xinyao and Yue, Xiang and Hajishirzi, Hannaneh and Kim, Yoon and Peng, Hao. Proceedings of the 41st International Conference on Machine Learning. 2024.
  4. DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence. 2026.
  5. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. International Conference on Learning Representations.
  6. Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings. 2025.
  7. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. arXiv preprint arXiv:2507.06261.
  8. NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation.
  9. Rope to Nope and Back Again: A New Hybrid Attention Strategy. 2025.
  10. Phi-2: The surprising power of small language models. Microsoft Research Blog.
  11. On the Liapunov limit error in the theory of probability. Ark. Mat. Astr. Fys.
  12. DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence. April 2026.
  13. Length Generalization of Causal Transformers without Position Encoding. 2024.
  14. The Llama 3 Herd of Models. 2024.
  15. Meta. 2025.
  16. OpenAI. 2025.
  17. 100M Token Context Windows. August 2024.
  18. Data engineering for scaling language models to 128k context. arXiv preprint arXiv:2402.10171.
  19. Scaling laws of rope-based extrapolation. arXiv preprint arXiv:2310.05209.
  20. Focused Transformer: Contrastive Training for Context Scaling. 2023.
  21. Understanding the RoPE Extensions of Long-Context LLMs: An Attention Perspective. 2024.
  22. Xu, Mingyu and Men, Xin and Wang, Bingning and Zhang, Qingyu and Lin, Hongyu and Lu, Yaojie and Han, Xianpei and Chen, Weipeng. Base of RoPE Bounds Context Length. doi:10.52202/079017-2773.
  23. Xiong, Wenhan and Liu, Jingyu and Molybog, Igor and Zhang, Hejia and Bhargava, Prajjwal and Hou, Rui and Martin, Louis and Rungta, Rashi and Sankararaman, Karthik Abinav and Oguz, Barlas and Khabsa, Madian and Fang, Han and Mehdad, Yashar and Narang, Sharan and Malik, Kshitiz and Fan, Angela and Bhosale, Shruti and Edunov, Sergey and Lewis, Mike and Wang,... Effective Long-Context Scaling of Foundation Models.
  24. Fortify the Shortest Stave in Attention: Enhancing Context Awareness of Large Language Models for Effective Tool-Use. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024.
  25. Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding. arXiv preprint arXiv:2403.04797.
  26. Dao, Tri and Fu, Daniel Y. and Ermon, Stefano and Rudra, Atri and R. Advances in Neural Information Processing Systems. 2022.
  27. Peng, Bowen and Quesnelle, Jeffrey and Fan, Honglu and Shippole, Enrico. 2024.
  28. Ring Attention with Blockwise Transformers for Near-Infinite Context. arXiv preprint arXiv:2310.01889.
  29. Sequence Parallelism: Long Sequence Training from System Perspective. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.134.
  30. How to Train Long-Context Language Models (Effectively). Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.366.
  31. Liu, Nelson F. and Lin, Kevin and Hewitt, John and Paranjape, Ashwin and Bevilacqua, Michele and Petroni, Fabio and Liang, Percy. Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics. 2024. doi:10.1162/tacl_a_00638.
  32. Efficient Streaming Language Models with Attention Sinks. arXiv.
  33. Gregory Kamradt. GitHub repository. 2023.
  34. RULER: What's the Real Context Size of Your Long-Context Language Models? 2024.
  35. BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack. 2024.
  36. Rotary Positional Embeddings as Phase Modulation: Theoretical Bounds on the RoPE Base for Long-Context Transformers. 2026.
  37. Round and Round We Go! What Makes Rotary Positional Encodings Useful? arXiv preprint arXiv:2410.06205.
  38. Context Length Alone Hurts LLM Performance Despite Perfect Retrieval. 2025.
  39. Attention Is All You Need. Advances in Neural Information Processing Systems.
  40. Attention Residuals. 2026. eprint 2603.15031.
  41. The Atlas of In-Context Learning: How Attention Heads Shape In-Context Retrieval Augmentation. 2025.
  42. Retrieval Head Mechanistically Explains Long-Context Factuality. 2024.
  43. Retrieval Heads are Dynamic. 2026.
  44. Rotary Offset Features in Large Language Models. 2025.
  45. On the token distance modeling ability of higher RoPE attention dimension. 2024.
  46. LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding. 2024.
  47. Turing, A. M. Computing Machinery and Intelligence. Mind.
  48. Mistral-7B-Instruct-v0.3. 2024.
  49. DeepSeek-V3 Technical Report. 2025.
  50. n.d.
  51. Leveraging the bfloat16 Artificial Intelligence Datatype For Higher-Precision Computations. 2019.
  52. Wolf, Thomas and Debut, Lysandre and Sanh, Victor and Chaumond, Julien and Delangue, Clement and Moi, Anthony and Cistac, Pierric and Rault, Tim and Louf, Remi and Funtowicz, Morgan and Davison, Joe and Shleifer, Sam and von Platen, Patrick and Ma, Clara and Jernite, Yacine and Plu, Julien and Xu, Canwen and Le Scao, Teven and Gugger, Sylvain and Drame, M... Transformers: State-of-the-Art Natural Language Processing.
  53. gpt-oss-120b & gpt-oss-20b Model Card. 2025.
  54. Qwen3 Technical Report. 2025.
  55. Kimi K2.5: Visual Agentic Intelligence. 2026.