Positional Encoding via Token-Aware Phase Attention
Pith reviewed 2026-05-18 16:14 UTC · model grok-4.3
The pith
Token-Aware Phase Attention uses a learnable phase function to eliminate RoPE's distance-dependent bias in attention scores.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We prove under practical assumptions that Rotary Positional Embedding (RoPE) introduces an intrinsic distance-dependent bias in attention scores that limits RoPE's ability to model long-context. RoPE extension methods may alleviate this issue, but they typically require post-hoc adjustments after pretraining, such as rescaling or hyperparameters retuning. This paper introduces Token-Aware Phase Attention (TAPA), a new positional encoding method that incorporates a learnable phase function into the attention mechanism. TAPA preserves token interactions over long range, extends to longer contexts with direct and light continual pretraining, extrapolates to unseen lengths, and attains substantially lower perplexity and stronger retrieval performance in the long-context regime than RoPE-style baselines.
What carries the argument
Token-Aware Phase Attention, which adds a learnable phase function to the attention mechanism to adjust scores without fixed distance bias.
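A minimal sketch of such a score, following the form quoted further down this page from Def. 3.1 (a content term q⊤Mk multiplied by a cosine whose phase depends on the tokens), may help make the mechanism concrete. Several things here are assumptions rather than the authors' implementation: α is treated as an exponent on the distance |m − n|, the phase is the quadratic form ϕ(q,k) = q⊤Nk, and the matrices `M` and `N` are fixed toy values instead of learned parameters. The helper names `bilinear` and `tapa_score` are hypothetical.

```python
import math

def bilinear(q, A, k):
    """Return q^T A k for plain Python lists."""
    return sum(q[i] * A[i][j] * k[j] for i in range(len(q)) for j in range(len(k)))

def tapa_score(q, k, m, n, M, N, alpha):
    """Sketch of q^T M k * cos(2*pi * |m - n|**alpha * phi(q, k)).

    Assumptions (not confirmed by this page): alpha is an exponent on
    |m - n|, and phi is the quadratic phase phi(q, k) = q^T N k.
    """
    content = bilinear(q, M, k)   # position-independent amplitude
    phase = bilinear(q, N, k)     # token-dependent phase phi(q, k)
    distance = abs(m - n)
    return content * math.cos(2.0 * math.pi * distance ** alpha * phase)

# Hypothetical toy values, chosen only for illustration.
M = [[1.0, 0.0], [0.0, 1.0]]
N = [[0.05, 0.0], [0.0, 0.05]]
q = [1.0, 2.0]
k = [0.5, -1.0]

same_position = tapa_score(q, k, m=3, n=3, M=M, N=N, alpha=0.5)
far_apart = tapa_score(q, k, m=0, n=400, M=M, N=N, alpha=0.5)
```

In this sketch the positions enter only through |m − n| inside the cosine, so the magnitude of the score never exceeds |q⊤Mk|, and the rate at which the cosine oscillates with distance depends on the token pair through ϕ(q,k).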
If this is right
- TAPA supports direct and light continual pretraining to extend models to longer contexts without rescaling or hyperparameter retuning.
- TAPA extrapolates to sequence lengths not encountered during training.
- TAPA produces substantially lower perplexity than RoPE-style baselines in the long-context regime.
- TAPA delivers stronger retrieval performance than RoPE-style baselines in long-context settings.
Where Pith is reading between the lines
- Learnable phase functions may reduce reliance on manual scaling techniques across other sequence modeling components.
- The same mechanism could be tested in non-transformer architectures that use attention over long inputs.
- End-to-end optimization of positional phases might enable more uniform attention distributions in very long documents.
- Combining TAPA with existing length-extrapolation tricks could yield hybrid methods for even larger context windows.
Load-bearing premise
RoPE introduces an intrinsic distance-dependent bias in attention scores under the practical assumptions of typical training regimes.
What would settle it
Measure attention scores in a trained RoPE model on sequences longer than the training length and check whether they display systematic distance-dependent decay patterns that TAPA avoids.
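As a rough stand-in for that measurement, the sketch below computes the RoPE score between one fixed query and one fixed key at several relative distances. It uses the standard RoPE construction (pairwise rotations with frequencies θ_d = base^(−2d/dim)) on constant vectors, not queries and keys from a trained model, so it only illustrates how the score depends on distance when the content is held fixed. The function names and the choice of all-ones vectors are assumptions made for illustration.

```python
import math

def rope_rotate(x, pos, base=10000.0):
    """Rotate consecutive pairs of x by the angles pos * theta_d (standard RoPE form)."""
    dim = len(x)
    out = []
    for d in range(dim // 2):
        theta = base ** (-2.0 * d / dim)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x0, x1 = x[2 * d], x[2 * d + 1]
        out.extend([x0 * c - x1 * s, x0 * s + x1 * c])
    return out

def rope_score(q, k, m, n):
    """Dot product of the query rotated to position m and the key rotated to position n."""
    qm, kn = rope_rotate(q, m), rope_rotate(k, n)
    return sum(a * b for a, b in zip(qm, kn))

# Fixed content (an illustrative assumption): identical all-ones query and key.
dim = 64
q = [1.0] * dim
k = [1.0] * dim

# Score as a function of relative distance only.
scores = {dist: rope_score(q, k, m=dist, n=0) for dist in (0, 1, 16, 256, 4096)}
for dist, value in scores.items():
    print(dist, round(value, 3))
```

For this choice of vectors the score at distance zero equals `dim`, and every other distance gives a smaller value because each pair contributes 2·cos(λθ_d). A real test of the premise would replace the constant vectors with activations from a trained RoPE model and compare against a TAPA model.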
Original abstract
We prove under practical assumptions that Rotary Positional Embedding (RoPE) introduces an intrinsic distance-dependent bias in attention scores that limits RoPE's ability to model long-context. RoPE extension methods may alleviate this issue, but they typically require post-hoc adjustments after pretraining, such as rescaling or hyperparameters retuning. This paper introduces Token-Aware Phase Attention (TAPA), a new positional encoding method that incorporates a learnable phase function into the attention mechanism. TAPA preserves token interactions over long range, extends to longer contexts with direct and light continual pretraining, extrapolates to unseen lengths, and attains substantially lower perplexity and stronger retrieval performance in the long-context regime than RoPE-style baselines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims to prove, under practical assumptions, that Rotary Positional Embeddings (RoPE) introduce an intrinsic distance-dependent bias in attention scores that limits long-context modeling. It proposes Token-Aware Phase Attention (TAPA), which incorporates a learnable phase function into the attention mechanism. TAPA is asserted to preserve long-range token interactions, support extension to longer contexts via light continual pretraining, extrapolate to unseen lengths, and deliver substantially lower perplexity with stronger retrieval performance than RoPE-style baselines in the long-context regime.
Significance. If the central claims hold, TAPA could provide a principled positional encoding that mitigates a core limitation of RoPE without post-hoc adjustments. The reported empirical gains in perplexity and retrieval suggest practical value for long-context transformers. The approach of a learnable phase function offers a distinct direction from fixed or rescaled encodings.
major comments (2)
- [§3] §3 (Proof of RoPE bias): The practical assumptions under which RoPE is shown to introduce a distance-dependent bias in attention scores are not validated against typical pretraining regimes (e.g., joint optimization of embeddings and phases or realistic length distributions). This is load-bearing for the central claim, as the motivation for replacing RoPE with a learnable phase function rests directly on the existence of this bias in practice.
- [§5, Table 2] §5 and Table 2: The long-context perplexity and retrieval results report gains over RoPE baselines but omit error bars, multiple random seeds, or statistical tests. Without these, it is difficult to confirm that the improvements are robust rather than sensitive to hyperparameter choices or data splits.
minor comments (2)
- [§4] The mathematical definition of the learnable phase function would be clearer if presented as an explicit equation in the main method section rather than relying on prose description.
- [Figure 3] Figure 3 caption could more explicitly contrast the phase behavior of TAPA versus RoPE at extended positions to aid interpretation of the extrapolation results.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and describe the revisions planned for the next version of the manuscript.
Point-by-point responses
- Referee: [§3] §3 (Proof of RoPE bias): The practical assumptions under which RoPE is shown to introduce a distance-dependent bias in attention scores are not validated against typical pretraining regimes (e.g., joint optimization of embeddings and phases or realistic length distributions). This is load-bearing for the central claim, as the motivation for replacing RoPE with a learnable phase function rests directly on the existence of this bias in practice.
  Authors: The assumptions in §3 are chosen to isolate the effect of the rotary phase under conditions that commonly arise in pretraining, such as fixed embedding matrices and sequence lengths drawn from the training distribution. We acknowledge that direct validation under joint optimization of embeddings and phases, as well as more varied length distributions, would strengthen the practical relevance of the bias result. In the revised manuscript we will add a short empirical subsection to §3 that reports attention-score bias measurements under joint training and under length distributions matching standard pretraining corpora. This addition directly addresses the load-bearing concern while preserving the original analytic argument.
  Revision: yes
- Referee: [§5, Table 2] §5 and Table 2: The long-context perplexity and retrieval results report gains over RoPE baselines but omit error bars, multiple random seeds, or statistical tests. Without these, it is difficult to confirm that the improvements are robust rather than sensitive to hyperparameter choices or data splits.
  Authors: We agree that the current presentation of results in §5 and Table 2 would benefit from explicit measures of variability and statistical support. In the revised version we will rerun the long-context perplexity and retrieval experiments using at least three independent random seeds. Table 2 will be updated to report mean values together with standard deviations, and we will add a brief statistical analysis (paired t-tests or Wilcoxon tests) comparing TAPA against each RoPE-style baseline. These changes will be documented in §5 to confirm that the observed gains are robust.
  Revision: yes
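The reporting promised in the second response (mean and standard deviation across seeds, plus a paired comparison) could look roughly like the sketch below. It uses only the Python standard library and computes the paired t statistic by hand. The perplexity values are placeholders invented for illustration and are not results from the paper, and the function names are hypothetical.

```python
import math
import statistics

def summarize(values):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(values), statistics.stdev(values)

def paired_t_statistic(a, b):
    """t statistic for paired samples a[i], b[i] (same seed, two methods)."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

# Placeholder numbers only: NOT results from the paper.
baseline_ppl = [12.0, 12.5, 11.5]    # hypothetical RoPE-style baseline, one value per seed
candidate_ppl = [10.0, 11.0, 9.0]    # hypothetical TAPA runs with the same seeds

baseline_mean, baseline_std = summarize(baseline_ppl)
candidate_mean, candidate_std = summarize(candidate_ppl)
t_stat = paired_t_statistic(baseline_ppl, candidate_ppl)
degrees_of_freedom = len(baseline_ppl) - 1

print(f"baseline:  {baseline_mean:.2f} +/- {baseline_std:.2f}")
print(f"candidate: {candidate_mean:.2f} +/- {candidate_std:.2f}")
print(f"paired t = {t_stat:.3f} with {degrees_of_freedom} degrees of freedom")
```

One design point worth noting: with only three paired runs, a two-sided Wilcoxon signed-rank test cannot produce a p-value below 0.25, so the paired t-test (or more seeds) is likely the more informative of the two options the authors mention.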
Circularity Check
No significant circularity in derivation chain
full rationale
The paper states a proof under practical assumptions that RoPE creates distance-dependent bias in attention scores, then introduces TAPA as a distinct learnable phase function in the attention mechanism. No equations, fitted parameters, or self-citations are shown reducing the claimed long-context improvements, extrapolation, or lower perplexity to quantities defined by construction from the same inputs or prior author results. The central method adds independent learnable components rather than renaming or re-deriving existing patterns, making the derivation self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- parameters of the learnable phase function
axioms (1)
- domain assumption: RoPE introduces an intrinsic distance-dependent bias in attention scores under practical assumptions
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: $\mathrm{Attn}_{\phi,M,\alpha}(q,k) = q^{\top} M k \cdot \cos\bigl(2\pi\,|m-n|^{\alpha}\,\phi(q,k)\bigr)$ (Def. 3.1); the quadratic phase $\phi(q,k) = q^{\top} N k$ yields stationary-phase cancellation (Thm 3.2).
- IndisputableMonolith/Foundation/AlexanderDuality.lean, theorem alexander_duality_circle_linking (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: RoPE distance bias proved via $\mathbb{Q}$-linear independence of the frequencies $\theta_d$ and Weyl equidistribution (Thm 2.1).
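The equidistribution fact named in the second passage can be checked numerically. The sketch below is not the paper's proof of Thm 2.1. It only illustrates Weyl's result that, for irrational θ, the averages of cos(2πnθ) over n = 1..N tend to zero, in contrast to an integer θ where every term equals one. The function name `mean_cos` and the choice of √2 (in floating point) are assumptions made for illustration.

```python
import math

def mean_cos(theta, num_terms):
    """Average of cos(2*pi*n*theta) over n = 1..num_terms."""
    total = sum(math.cos(2.0 * math.pi * n * theta) for n in range(1, num_terms + 1))
    return total / num_terms

# Floating-point stand-in for an irrational frequency: the average tends to 0.
irrational_mean = mean_cos(math.sqrt(2.0), 10_000)

# Integer frequency: every term is cos(2*pi*n) = 1, so the average stays at 1.
integer_mean = mean_cos(1.0, 10_000)

print(f"theta = sqrt(2): {irrational_mean:.6f}")
print(f"theta = 1:       {integer_mean:.6f}")
```

The connection to RoPE is that each frequency θ_d contributes oscillating terms of this kind to the score at relative distance m − n. How the paper combines them under its assumptions is given in its Thm 2.1, not in this sketch.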
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Positional LSH: Binary Block Matrix Approximation for Attention with Linear Biases
  ALiBi bias is the expectation of positional LSH-induced block masks, yielding spectral and max-norm approximation bounds that reduce long-context biased attention to randomized short-context unbiased attention.