RoPE Distinguishes Neither Positions Nor Tokens in Long Contexts, Provably
Pith reviewed 2026-05-19 15:29 UTC · model grok-4.3
The pith
RoPE-based attention loses locality bias and token relevance consistency as context length grows, with failure probability approaching 0.5.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We prove that as context length increases, RoPE-based attention becomes unpredictable and loses its locality bias and consistency in token relevance, with the probability of failure approaching 0.5. We further prove that the attention score can remain unchanged when a key token is moved to a different position, or even replaced by a different token, indicating a failure to distinguish positions or tokens. Adjusting the RoPE base trades off distinguishing positions against distinguishing tokens but cannot preserve both at the same time.
What carries the argument
The rotation of query and key vectors by position-dependent angles in the RoPE formulation, whose dot-product scores are analyzed for invariance properties over increasing context lengths.
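The mechanism described above can be sketched in a few lines. This is a minimal illustration of the standard RoPE formulation from RoFormer, not code from the paper; the function names (`rope_frequencies`, `rotate`, `rope_score`) and the default base of 10000 are assumptions chosen for the example.

```python
import cmath

def rope_frequencies(dim, base=10000.0):
    """Per-pair rotation frequencies theta_i = base^(-2i/dim), i = 0..dim/2-1."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]

def rotate(vec, pos, freqs):
    """Rotate each consecutive pair (x, y) of vec by the angle pos * theta_i."""
    out = []
    for i, theta in enumerate(freqs):
        z = complex(vec[2 * i], vec[2 * i + 1]) * cmath.exp(1j * pos * theta)
        out.extend([z.real, z.imag])
    return out

def rope_score(q, k, q_pos, k_pos, base=10000.0):
    """Unnormalized attention score: dot product of the rotated query and key."""
    freqs = rope_frequencies(len(q), base)
    rq = rotate(q, q_pos, freqs)
    rk = rotate(k, k_pos, freqs)
    return sum(a * b for a, b in zip(rq, rk))
```

Because both vectors are rotated, the score depends on the positions only through the offset `q_pos - k_pos`; for example, `rope_score(q, k, 5, 2)` and `rope_score(q, k, 105, 102)` agree up to floating-point error.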
If this is right
- Increasing the RoPE base hyperparameter helps distinguish different tokens but sacrifices the ability to distinguish positions.
- Multi-head and multi-layer architectures do not overcome the limitations of RoPE in long contexts.
- Fundamentally new mechanisms for encoding position and token order may be needed for future long-context transformer models.
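To make the role of the base hyperparameter concrete, the sketch below computes how many positions each rotated pair needs to complete one full turn, and counts the pairs that never do so within a given context length. It illustrates only how the base changes the angle spacing under the standard RoPE frequencies; it does not reproduce the paper's proof, and the helper names and example values are hypothetical.

```python
import math

def rope_wavelengths(dim, base):
    """Positions needed for pair i to complete one full turn: 2*pi / theta_i,
    with the standard frequencies theta_i = base^(-2i/dim)."""
    return [2.0 * math.pi * base ** (2.0 * i / dim) for i in range(dim // 2)]

def pairs_without_full_turn(dim, base, length):
    """Count the pairs whose rotation angle stays below 2*pi over `length` positions."""
    return sum(1 for w in rope_wavelengths(dim, base) if w > length)

# Illustrative comparison (values are assumptions, not taken from the paper).
for base in (1e4, 1e6):
    slow = pairs_without_full_turn(dim=64, base=base, length=4096)
    print(f"base={base:g}: {slow} of 32 pairs never complete a full turn")
```

A larger base stretches every wavelength, so more pairs rotate only slightly across the whole context and their contribution to the score changes little from one position to the next.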
Where Pith is reading between the lines
- The invariance properties could explain why ordering-sensitive tasks degrade in very long contexts even when models appear to handle length.
- Encodings that avoid angle-based rotations might preserve both locality and distinction without the observed tradeoff.
Load-bearing premise
The theoretical analysis abstracts away from the specific content of the context and depends only on its length.
What would settle it
Empirical computation of attention score distributions across many random token sequences at increasing context lengths, checking whether the probability of locality bias failure or relevance inconsistency approaches 0.5.
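One possible shape for such a check is sketched below. It assumes a specific reading of "locality failure" that is not confirmed by the source: for a fixed query and key, two distinct offsets are drawn uniformly from the context, and a failure is counted when the farther offset scores strictly higher than the nearer one. The function names, the sampling scheme, and the choice of vectors are all assumptions.

```python
import math
import random

def relative_score(q, k, offset, base=10000.0):
    """RoPE score for a query placed `offset` positions after the key."""
    dim = len(q)
    total = 0.0
    for i in range(dim // 2):
        theta = base ** (-2.0 * i / dim)
        qx, qy, kx, ky = q[2 * i], q[2 * i + 1], k[2 * i], k[2 * i + 1]
        total += (qx * kx + qy * ky) * math.cos(offset * theta)
        total += (qx * ky - qy * kx) * math.sin(offset * theta)
    return total

def estimate_locality_failure(q, k, length, trials=2000, base=10000.0, seed=0):
    """Fraction of sampled (near, far) offset pairs where the far offset wins."""
    if length < 3:
        raise ValueError("length must be at least 3 to draw two distinct offsets")
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        near, far = sorted(rng.sample(range(1, length), 2))
        if relative_score(q, k, far, base) > relative_score(q, k, near, base):
            failures += 1
    return failures / trials

# Example: an aligned query/key pair (k == q), evaluated at growing lengths.
rng = random.Random(1)
q = [rng.gauss(0.0, 1.0) for _ in range(64)]
for length in (64, 1024, 16384, 262144):
    print(length, estimate_locality_failure(q, q, length))
```

Whether the printed fractions approach 0.5 as `length` grows is exactly what the experiment is meant to test; repeating it over many query/key pairs would be needed before drawing any conclusion.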
Original abstract
We identify intrinsic limitations of Rotary Positional Embeddings (RoPE) in Transformer-based long-context language models. Our theoretical analysis abstracts away from the specific content of the context and depends only on its length. We prove that as context length increases, RoPE-based attention becomes unpredictable and loses two properties that are central to its effectiveness. First, it loses its locality bias: RoPE is no more likely to favor nearer positions than substantially farther ones. Second, it loses consistency in token relevance: a key vector that receives a higher attention score than an alternative at one position may receive a lower score at another. In both cases, the probability of failure approaches 0.5, no better than random guessing. We further prove that the attention score can remain unchanged when a key token is moved to a different position, or even replaced by a different token, indicating a failure to distinguish positions or tokens. Adjusting the RoPE base trades off distinguishing positions against distinguishing tokens but cannot preserve both at the same time. Increasing the RoPE base hyperparameter, a common practice in today's long-context models, helps distinguish different tokens, but inevitably sacrifices the ability to distinguish positions. Our empirical analysis shows that multi-head, multi-layer architectures are insufficient to overcome these limitations. Our findings suggest that fundamentally new mechanisms for encoding position and token order may be needed in future Transformer long-context language models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that RoPE-based attention in Transformers loses its locality bias and token-relevance consistency as context length grows, with failure probability approaching 0.5. It proves that attention scores can be invariant to position shifts or token swaps for fixed query/key vectors, that RoPE cannot simultaneously distinguish both positions and tokens, and that increasing the base trades off one for the other. Multi-head and multi-layer architectures are shown empirically to be insufficient to overcome these issues, suggesting fundamentally new positional mechanisms may be required for long-context models.
Significance. If the mathematical results hold under the fixed-vector abstraction, the work would provide a rigorous, length-dependent explanation for RoPE's observed weaknesses in long contexts and motivate alternatives. The content-independent proofs are a strength for generality, but the direct leap to trained models requires additional justification that no compensating embeddings exist.
major comments (2)
- [§3] §3 (Theoretical Analysis, proofs of locality loss and invariance): The derivations fix query and key vectors while varying only position indices and length. This makes the probability-of-failure claims (approaching 0.5) mathematically clean but load-bearing for the model-level conclusion; the paper does not show that no learned embeddings can align with the rotations to restore locality or distinguishability.
- [§4] §4 (Empirical Analysis on multi-head/layer models): The experiments evaluate post-training attention behavior and confirm the theoretical predictions, yet they do not include a training-time search over embeddings to test whether compensating directions can be learned. This leaves the practical implication for long-context Transformers unclosed.
minor comments (2)
- [§3] Notation for the rotation angle θ_i in the main theorem statement could be clarified to explicitly separate the base hyperparameter from the position index.
- The abstract states 'probability of failure approaches 0.5' without a short parenthetical on the measure (uniform over positions or tokens); adding this would improve readability.
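For reference, the separation requested in the first minor comment could look like the following in the standard RoFormer formulation. The symbols are assumptions and the paper's own notation may differ: $b$ is the base hyperparameter, $d$ the head dimension, $m$ and $n$ the query and key positions, and $\tilde q_i$, $\tilde k_i$ the $i$-th coordinate pairs of the query and key read as complex numbers.

```latex
\theta_i = b^{-2i/d}, \qquad i = 0, 1, \dots, \tfrac{d}{2} - 1

\langle R_m q,\; R_n k \rangle
  = \sum_{i=0}^{d/2-1}
    \operatorname{Re}\!\left[ \tilde q_i \, \overline{\tilde k_i} \,
    e^{\sqrt{-1}\,(m-n)\,\theta_i} \right]
```

Written this way, the base $b$ enters only through $\theta_i$ and the positions enter only through the offset $m - n$.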
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the scope of our theoretical abstraction and its implications. We address each major comment below, explaining our reasoning and indicating where revisions will strengthen the manuscript.
- Referee: [§3] §3 (Theoretical Analysis, proofs of locality loss and invariance): The derivations fix query and key vectors while varying only position indices and length. This makes the probability-of-failure claims (approaching 0.5) mathematically clean but load-bearing for the model-level conclusion; the paper does not show that no learned embeddings can align with the rotations to restore locality or distinguishability.
  Authors: We thank the referee for highlighting this aspect of our abstraction. The decision to hold query and key vectors fixed isolates the contribution of the RoPE rotations themselves; the proofs establish that, for any such vectors, the attention scores lose locality bias and token-consistency as length grows. Because the result is independent of the particular vector values, it applies equally to vectors produced by any learned embeddings. No choice of base vectors can evade the length-dependent randomization induced by the accumulating rotations. We will add a short clarifying paragraph in §3 that explicitly connects the fixed-vector results to the generality over learned embeddings.
  Revision: partial
- Referee: [§4] §4 (Empirical Analysis on multi-head/layer models): The experiments evaluate post-training attention behavior and confirm the theoretical predictions, yet they do not include a training-time search over embeddings to test whether compensating directions can be learned. This leaves the practical implication for long-context Transformers unclosed.
  Authors: We agree that an end-to-end training experiment could be informative. Nevertheless, the content-independent proofs already imply that any embeddings learned at training time remain subject to the same failure modes once context length exceeds the training regime. Exhaustive search over embedding directions during training would therefore be both computationally expensive and, according to the theory, unable to restore the lost properties for arbitrary lengths. We will insert a concise discussion paragraph after the empirical results that makes this connection explicit and notes the consequent implications for long-context training.
  Revision: partial
Circularity Check
No circularity; direct mathematical derivation from RoPE definition
The paper's core results are obtained by fixing query and key vectors (explicitly abstracting content) and deriving attention-score statistics solely from the length-dependent rotation angles in the standard RoPE formulation. The proofs that locality bias and token-relevance consistency each fail with probability approaching 0.5, and that scores can be invariant under position shifts or token swaps, follow from elementary trigonometric identities and uniform-distribution arguments over position indices; none of these steps presuppose the claimed failure probabilities or invoke self-citations for uniqueness. The trade-off when varying the RoPE base is likewise obtained by direct comparison of the resulting angle spacings. The derivation is therefore self-contained against the definition of RoPE and does not reduce to any fitted input, renamed empirical pattern, or load-bearing self-reference.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: RoPE attention properties can be analyzed independently of token content, depending solely on position indices and context length.