Positional Encoding via Token-Aware Phase Attention

Hongyuan Zhan; Rémi Munos; Sheng Shen; Yuandong Tian; Yu Wang

arXiv: 2509.12635 · v3 · submitted 2025-09-16 · cs.CL · cs.AI

Yu Wang, Sheng Shen, Rémi Munos, Hongyuan Zhan, Yuandong Tian

Pith reviewed 2026-05-18 16:14 UTC · model grok-4.3

keywords: positional encoding · rotary positional embedding · long-context modeling · attention mechanism · phase function · context extrapolation · perplexity · retrieval performance

The pith

Token-Aware Phase Attention uses a learnable phase function to eliminate RoPE's distance-dependent bias in attention scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves that Rotary Positional Embedding introduces an intrinsic distance-dependent bias in attention scores under practical assumptions, which restricts effective modeling of long contexts. It proposes Token-Aware Phase Attention as an alternative that embeds a learnable phase function directly into the attention mechanism. This design maintains token interactions across extended ranges, supports extension to longer contexts through straightforward continual pretraining, and extrapolates to unseen lengths. Readers would care because it promises simpler scaling of context length without post-training fixes and delivers measurable gains in perplexity and retrieval accuracy.

Core claim

We prove under practical assumptions that Rotary Positional Embedding (RoPE) introduces an intrinsic distance-dependent bias in attention scores that limits RoPE's ability to model long-context. RoPE extension methods may alleviate this issue, but they typically require post-hoc adjustments after pretraining, such as rescaling or hyperparameters retuning. This paper introduces Token-Aware Phase Attention (TAPA), a new positional encoding method that incorporates a learnable phase function into the attention mechanism. TAPA preserves token interactions over long range, extends to longer contexts with direct and light continual pretraining, extrapolates to unseen lengths, and attains substanti...

What carries the argument

Token-Aware Phase Attention, which adds a learnable phase function to the attention mechanism to adjust scores without fixed distance bias.
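To make the mechanism concrete, here is a minimal pure-Python sketch of a single TAPA-style score, based on the form quoted from Def. 3.1 in the Lean theorems section of this page, with the quadratic phase ϕ(q,k)=q⊤Nk. It is an illustration, not the authors' implementation. Two assumptions are made: α is read as an exponent on |m−n| (the flattened formula is ambiguous on this point), and the helper names `bilinear` and `tapa_score` and all toy values are hypothetical.

```python
import math


def bilinear(q, W, k):
    """Return q^T W k for plain Python lists (W is a list of rows)."""
    return sum(q[i] * W[i][j] * k[j]
               for i in range(len(q))
               for j in range(len(k)))


def tapa_score(q, k, m, n, M, N, alpha):
    """Sketch of q^T M k * cos(2*pi * |m-n|**alpha * phi(q, k)).

    phi(q, k) = q^T N k is the quadratic phase. M and N would be learnable
    in a real model; here they are fixed toy matrices.
    ASSUMPTION: alpha acts as an exponent on the distance |m - n|.
    """
    amplitude = bilinear(q, M, k)   # content term, independent of position
    phase = bilinear(q, N, k)       # token-dependent phase
    return amplitude * math.cos(2 * math.pi * abs(m - n) ** alpha * phase)


# Toy inputs (hypothetical values chosen for easy arithmetic).
identity = [[1.0, 0.0], [0.0, 1.0]]
phase_matrix = [[0.25, 0.0], [0.0, 0.0]]
q = [1.0, 2.0]
k = [1.0, 0.5]

score_same_position = tapa_score(q, k, 5, 5, identity, phase_matrix, 0.5)
score_distance_four = tapa_score(q, k, 4, 0, identity, phase_matrix, 0.5)
print(score_same_position, score_distance_four)
```

At distance zero the cosine factor is 1, so the score reduces to the content term q⊤Mk. At nonzero distance the modulation depends on the tokens themselves through ϕ(q,k), which is what separates this form from a fixed-frequency rotation.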

If this is right

TAPA supports direct and light continual pretraining to extend models to longer contexts without rescaling or hyperparameter retuning.
TAPA extrapolates to sequence lengths not encountered during training.
TAPA produces substantially lower perplexity than RoPE-style baselines in the long-context regime.
TAPA delivers stronger retrieval performance than RoPE-style baselines in long-context settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Learnable phase functions may reduce reliance on manual scaling techniques across other sequence modeling components.
The same mechanism could be tested in non-transformer architectures that use attention over long inputs.
End-to-end optimization of positional phases might enable more uniform attention distributions in very long documents.
Combining TAPA with existing length-extrapolation tricks could yield hybrid methods for even larger context windows.

Load-bearing premise

RoPE introduces an intrinsic distance-dependent bias in attention scores under the practical assumptions of typical training regimes.

What would settle it

Measure attention scores in a trained RoPE model on sequences longer than the training length and check whether they display systematic distance-dependent decay patterns that TAPA avoids.
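A rough sketch of what such a measurement rests on is shown below. It is not the paper's protocol. It assumes the common RoFormer frequency schedule θ_i = base^(−2i/dim), with no 2π factor in the angle, and the function names are hypothetical. `rope_score` computes the RoPE dot product from the relative offset alone. `distance_kernel` is the average of cos(λθ_i) over frequencies: if, as a simplifying assumption, every pairwise content term q₀k₀+q₁k₁ had the same nonzero mean and the cross term had mean zero, the expected score at offset λ would be proportional to this kernel. In a real test, q and k would be taken from a trained model rather than set by hand.

```python
import math


def rope_frequencies(dim, base=10000.0):
    """Per-pair rotation frequencies theta_i = base**(-2i/dim) (assumed schedule)."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]


def rope_score(q, k, m, n, freqs):
    """Dot product of q rotated to position m with k rotated to position n.

    Uses the closed form that depends only on the offset lam = m - n.
    """
    lam = m - n
    total = 0.0
    for i, theta in enumerate(freqs):
        q0, q1 = q[2 * i], q[2 * i + 1]
        k0, k1 = k[2 * i], k[2 * i + 1]
        total += (q0 * k0 + q1 * k1) * math.cos(lam * theta)
        total += (q0 * k1 - q1 * k0) * math.sin(lam * theta)
    return total


def distance_kernel(lam, freqs):
    """Average of cos(lam * theta_i): the distance-dependent factor in the
    expected score under the simplifying assumption stated above."""
    return sum(math.cos(lam * theta) for theta in freqs) / len(freqs)


freqs = rope_frequencies(64)
profile = [(lam, distance_kernel(lam, freqs)) for lam in (0, 1, 16, 256, 4096)]
for lam, value in profile:
    print(f"offset {lam:5d}: kernel {value:+.4f}")
```

Plotting the same profile for scores extracted from a trained model, at offsets beyond the training length, is the kind of check this section describes.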

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

TAPA adds a learnable phase function to attention to sidestep RoPE's claimed distance bias, with reported long-context gains after light training, though the proof's assumptions look like the main thing to check.

The main point is that this paper introduces Token-Aware Phase Attention, or TAPA, which puts a learnable phase function straight into the attention mechanism. They argue this avoids the intrinsic distance-dependent bias they say RoPE creates under practical assumptions about token stats and fixed phases. The method is meant to keep long-range token interactions intact, support direct continual pretraining for longer contexts, extrapolate to unseen lengths, and deliver lower perplexity plus stronger retrieval than RoPE baselines.

Referee Report

2 major / 2 minor

Summary. The manuscript claims to prove, under practical assumptions, that Rotary Positional Embeddings (RoPE) introduce an intrinsic distance-dependent bias in attention scores that limits long-context modeling. It proposes Token-Aware Phase Attention (TAPA), which incorporates a learnable phase function into the attention mechanism. TAPA is asserted to preserve long-range token interactions, support extension to longer contexts via light continual pretraining, extrapolate to unseen lengths, and deliver substantially lower perplexity with stronger retrieval performance than RoPE-style baselines in the long-context regime.

Significance. If the central claims hold, TAPA could provide a principled positional encoding that mitigates a core limitation of RoPE without post-hoc adjustments. The reported empirical gains in perplexity and retrieval suggest practical value for long-context transformers. The approach of a learnable phase function offers a distinct direction from fixed or rescaled encodings.

major comments (2)

§3 (Proof of RoPE bias): The practical assumptions under which RoPE is shown to introduce a distance-dependent bias in attention scores are not validated against typical pretraining regimes (e.g., joint optimization of embeddings and phases or realistic length distributions). This is load-bearing for the central claim, as the motivation for replacing RoPE with a learnable phase function rests directly on the existence of this bias in practice.
§5 and Table 2: The long-context perplexity and retrieval results report gains over RoPE baselines but omit error bars, multiple random seeds, or statistical tests. Without these, it is difficult to confirm that the improvements are robust rather than sensitive to hyperparameter choices or data splits.

minor comments (2)

[§4] The mathematical definition of the learnable phase function would be clearer if presented as an explicit equation in the main method section rather than relying on prose description.
[Figure 3] Figure 3 caption could more explicitly contrast the phase behavior of TAPA versus RoPE at extended positions to aid interpretation of the extrapolation results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and describe the revisions planned for the next version of the manuscript.

Referee: §3 (Proof of RoPE bias): The practical assumptions under which RoPE is shown to introduce a distance-dependent bias in attention scores are not validated against typical pretraining regimes (e.g., joint optimization of embeddings and phases or realistic length distributions). This is load-bearing for the central claim, as the motivation for replacing RoPE with a learnable phase function rests directly on the existence of this bias in practice.

Authors: The assumptions in §3 are chosen to isolate the effect of the rotary phase under conditions that commonly arise in pretraining, such as fixed embedding matrices and sequence lengths drawn from the training distribution. We acknowledge that direct validation under joint optimization of embeddings and phases, as well as more varied length distributions, would strengthen the practical relevance of the bias result. In the revised manuscript we will add a short empirical subsection to §3 that reports attention-score bias measurements under joint training and under length distributions matching standard pretraining corpora. This addition directly addresses the load-bearing concern while preserving the original analytic argument.
Revision: yes

Referee: §5 and Table 2: The long-context perplexity and retrieval results report gains over RoPE baselines but omit error bars, multiple random seeds, or statistical tests. Without these, it is difficult to confirm that the improvements are robust rather than sensitive to hyperparameter choices or data splits.

Authors: We agree that the current presentation of results in §5 and Table 2 would benefit from explicit measures of variability and statistical support. In the revised version we will rerun the long-context perplexity and retrieval experiments using at least three independent random seeds. Table 2 will be updated to report mean values together with standard deviations, and we will add a brief statistical analysis (paired t-tests or Wilcoxon tests) comparing TAPA against each RoPE-style baseline. These changes will be documented in §5 to confirm that the observed gains are robust.
Revision: yes
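The reporting promised in this response (mean, standard deviation, and a paired comparison across seeds) can be sketched with the standard library alone. The numbers below are made-up placeholders, not results from the paper, and the function names are hypothetical. The sketch computes only the paired t statistic and its degrees of freedom; turning that into a p-value needs a t-distribution table or a statistics package.

```python
import math
import statistics


def summarize(values):
    """Return (mean, sample standard deviation) for per-seed metric values."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t_statistic(a, b):
    """Paired t statistic for matched runs (same seed, same data split).

    Returns (t, degrees_of_freedom). Raises ZeroDivisionError if all paired
    differences are identical, since the statistic is undefined in that case.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


# PLACEHOLDER perplexities for three seeds (hypothetical, for illustration only).
placeholder_baseline = [12.0, 12.4, 12.2]
placeholder_candidate = [11.0, 11.1, 11.5]

base_mean, base_sd = summarize(placeholder_baseline)
cand_mean, cand_sd = summarize(placeholder_candidate)
t_stat, dof = paired_t_statistic(placeholder_candidate, placeholder_baseline)
print(f"baseline  {base_mean:.2f} +/- {base_sd:.2f}")
print(f"candidate {cand_mean:.2f} +/- {cand_sd:.2f}")
print(f"paired t = {t_stat:.3f} with {dof} degrees of freedom")
```

With three seeds the test has only two degrees of freedom, so even a consistent gap needs to be large relative to seed-to-seed variation before it registers as significant.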

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper states a proof under practical assumptions that RoPE creates distance-dependent bias in attention scores, then introduces TAPA as a distinct learnable phase function in the attention mechanism. No equations, fitted parameters, or self-citations are shown reducing the claimed long-context improvements, extrapolation, or lower perplexity to quantities defined by construction from the same inputs or prior author results. The central method adds independent learnable components rather than renaming or re-deriving existing patterns, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the existence of a distance-dependent bias in RoPE under practical assumptions and on the ability of a learnable phase function to mitigate it without post-hoc adjustments. No invented physical entities are introduced.

free parameters (1)

parameters of the learnable phase function
The phase function is described as learnable, implying its parameters are fitted during training to achieve the reported gains.

axioms (1)

Domain assumption: RoPE introduces an intrinsic distance-dependent bias in attention scores under practical assumptions.
Explicitly stated as proven in the first sentence of the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: $\mathrm{Attn}_{\phi,M,\alpha}(q,k) = q^\top M k \cdot \cos\big(2\pi\,|m-n|^{\alpha}\,\phi(q,k)\big)$ (Def. 3.1); the quadratic phase $\phi(q,k) = q^\top N k$ yields stationary-phase cancellation (Thm 3.2).

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: RoPE distance bias proved via $\mathbb{Q}$-linear independence of the frequencies $\theta_d$ and Weyl equidistribution (Thm 2.1).

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Positional LSH: Binary Block Matrix Approximation for Attention with Linear Biases
cs.LG 2026-05 unverdicted novelty 7.0

ALiBi bias is the expectation of positional LSH-induced block masks, yielding spectral and max-norm approximation bounds that reduce long-context biased attention to randomized short-context unbiased attention.
