Jordan-RoPE: Non-Semisimple Relative Positional Encoding via Complex Jordan Blocks
Pith reviewed 2026-05-22 10:35 UTC · model grok-4.3
The pith
Non-semisimple Jordan blocks generate distance-modulated oscillatory features in relative positional encoding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a non-semisimple one-parameter representation realized by a complex Jordan block produces, for causal lag d, the oscillatory-polynomial features e^{-γd} cos(ωd), e^{-γd} sin(ωd), d e^{-γd} cos(ωd) and d e^{-γd} sin(ωd), thereby realizing a distance-modulated phase basis d e^{iωd} rather than merely adjoining a separate distance channel to rotary encoding.
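As a worked check of where these four features come from, consider the smallest case, a 2×2 complex Jordan block. The symbols $\mu$, $N$, and $\rho$ are notation assumed here for illustration and may differ from the paper's. Write the generator as $A = \mu I + N$ with $\mu = -\gamma + i\omega$ and $N = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}$. Since $N$ commutes with $I$ and $N^2 = 0$, the exponential series for the nilpotent part terminates after the linear term:

$$
\rho(d) = \exp(dA) = e^{\mu d}\,\exp(dN) = e^{(-\gamma + i\omega)d}\begin{pmatrix} 1 & d \\ 0 & 1 \end{pmatrix}.
$$

The diagonal entries have real and imaginary parts $e^{-\gamma d}\cos(\omega d)$ and $e^{-\gamma d}\sin(\omega d)$, while the off-diagonal entry $d\,e^{-\gamma d}e^{i\omega d}$ supplies $d\,e^{-\gamma d}\cos(\omega d)$ and $d\,e^{-\gamma d}\sin(\omega d)$. The factor $d$ multiplies the oscillatory carrier inside one matrix entry, which is the sense in which the basis is coupled rather than a direct sum.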
What carries the argument
The argument rests on a non-semisimple complex Jordan block that couples a rotary eigenvalue with a nilpotent element, together with the contragredient query action needed to compensate for the non-orthogonal positional map.
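The role of the contragredient action can be sketched numerically. The example below is a minimal illustration under assumed conventions, not the paper's implementation: keys are moved by a hypothetical map `P(j) = rho(-j)`, queries by `P(i)^{-T}`, and `rho` is the standard real 4×4 form of the exponential of a 2×2 complex Jordan block. The names `rho`, `key_map`, `query_map`, and the values of `GAMMA` and `OMEGA` are all chosen for illustration.

```python
import numpy as np

GAMMA, OMEGA = 0.05, 0.3  # illustrative values, not from the paper


def rho(t: float) -> np.ndarray:
    """Real 4x4 form of exp(t * A) for a 2x2 complex Jordan block."""
    c, s = np.cos(OMEGA * t), np.sin(OMEGA * t)
    rot = np.exp(-GAMMA * t) * np.array([[c, -s], [s, c]])
    out = np.zeros((4, 4))
    out[:2, :2] = rot
    out[:2, 2:] = t * rot  # shear block: t * e^{-gamma t} * R(omega t)
    out[2:, 2:] = rot
    return out


def key_map(j: int) -> np.ndarray:
    # Assumed convention: keys are moved by P(j) = rho(-j).
    return rho(-j)


def query_map(i: int) -> np.ndarray:
    # Contragredient action P(i)^{-T}; since P(i)^{-1} = rho(i), this is rho(i)^T.
    return rho(i).T


def logit(q, k, i, j, contragredient=True):
    # With contragredient=False the query reuses the key map (naive choice).
    qi = query_map(i) @ q if contragredient else key_map(i) @ q
    kj = key_map(j) @ k
    return float(qi @ kj)


rng = np.random.default_rng(0)
q, k = rng.standard_normal(4), rng.standard_normal(4)

relative = float(q @ rho(3) @ k)   # q^T rho(i - j) k with lag i - j = 3
exact_a = logit(q, k, 5, 2)
exact_b = logit(q, k, 12, 9)       # same lag, shifted absolute positions
naive_a = logit(q, k, 5, 2, contragredient=False)
naive_b = logit(q, k, 12, 9, contragredient=False)
print("contragredient:", exact_a, exact_b, "target:", relative)
print("naive:         ", naive_a, naive_b)
```

Under these assumptions the contragredient logits depend only on the lag, because `rho(i).T` and `rho(-j)` combine into `rho(i - j)`. Applying the same non-orthogonal map to both sides leaves a dependence on absolute position, which is the distortion the contragredient action is meant to remove.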
If this is right
- Attention logits can now incorporate query-key interactions in which phase is scaled by distance inside a single basis vector.
- Stabilized variants trade the exact group law for bounded shear and improved numerical behavior.
- The exact representation requires the contragredient query action to keep the non-orthogonal map from distorting the relative operator.
- Kernel diagnostics confirm that the oscillatory-polynomial features appear exactly when the Jordan block is used.
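The trade-off between the exact group law and bounded shear can be illustrated with a toy comparison. The sketch below is an assumption-laden example: the source does not specify how the stabilized variants bound the shear, so the `tanh` surrogate in `bounded_shear`, the constant `CAP`, and the function names are hypothetical stand-ins.

```python
import numpy as np

GAMMA, OMEGA, CAP = 0.05, 0.3, 4.0  # illustrative values; CAP bounds the shear


def exact_op(d: float) -> np.ndarray:
    """Exact 2x2 complex Jordan operator e^{(-gamma + i omega) d} [[1, d], [0, 1]]."""
    return np.exp((-GAMMA + 1j * OMEGA) * d) * np.array([[1.0, d], [0.0, 1.0]])


def bounded_shear(d: float) -> float:
    # Hypothetical stabilizer: behaves like d near 0 and saturates at +/- CAP.
    return CAP * np.tanh(d / CAP)


def stabilized_op(d: float) -> np.ndarray:
    """Same carrier, but the shear entry is replaced by a bounded surrogate."""
    s = bounded_shear(d)
    return np.exp((-GAMMA + 1j * OMEGA) * d) * np.array([[1.0, s], [0.0, 1.0]])


a, b = 6.0, 10.0
# Group-law defect: how far op(a) @ op(b) is from op(a + b), entrywise.
exact_gap = np.abs(exact_op(a) @ exact_op(b) - exact_op(a + b)).max()
stab_gap = np.abs(stabilized_op(a) @ stabilized_op(b) - stabilized_op(a + b)).max()
print(f"group-law defect, exact:      {exact_gap:.2e}")
print(f"group-law defect, stabilized: {stab_gap:.2e}")
```

The exact operator composes additively in the lag, so its defect sits at rounding level. Any bounded shear function fails to be additive, so the stabilized operator cannot satisfy the same law; what it gains is an entry whose magnitude never exceeds `CAP`.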
Where Pith is reading between the lines
- Higher-order nilpotent blocks could generate quadratic or higher polynomial multipliers on the same oscillatory carrier.
- The same non-semisimple construction might be applied to other group representations used for positional encoding.
- Hybrid models could combine the exact Jordan block on some heads with stabilized or ALiBi blocks on others.
- The structural evidence suggests testing whether the distance-modulated basis improves sample efficiency on tasks whose optimal attention patterns contain explicit lag scaling.
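The first extrapolation above follows from the same terminating-series argument used for the 2×2 case. As a sketch in assumed notation (with $\mu = -\gamma + i\omega$ and $N$ the 3×3 nilpotent matrix with ones on the superdiagonal, so that $N^3 = 0$):

$$
\exp\bigl(d(\mu I + N)\bigr) = e^{\mu d}\Bigl(I + dN + \tfrac{d^2}{2}N^2\Bigr)
= e^{(-\gamma + i\omega)d}\begin{pmatrix} 1 & d & d^2/2 \\ 0 & 1 & d \\ 0 & 0 & 1 \end{pmatrix}.
$$

The corner entry contributes $\tfrac{d^2}{2}e^{-\gamma d}e^{i\omega d}$, a quadratic multiplier on the same carrier. More generally, a $k \times k$ block yields polynomial multipliers up to $d^{k-1}/(k-1)!$. This is a property of Jordan blocks in general; whether such higher-order features help in attention is not something the paper's abstract claims.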
Load-bearing premise
The non-semisimple Jordan block representation remains useful and numerically stable once embedded inside transformer attention, and the contragredient query action compensates for non-orthogonality without introducing uncontrolled artifacts.
What would settle it
On the Jordan-friendly synthetic language-model task, if the coupled Jordan basis produces no improvement over RoPE or direct-sum baselines when the target explicitly contains distance-modulated phase interactions, the usefulness claim would be falsified.
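A toy, kernel-level version of this test can be sketched with least squares. The example below is not the paper's synthetic task; the target, the basis choices, and the parameter values are assumptions made for illustration. It asks whether a target of the form $d\,e^{-\gamma d}\cos(\omega d)$ can be fit by a direct-sum basis (rotary carrier plus a separate linear distance channel) compared with the coupled Jordan basis.

```python
import numpy as np

GAMMA, OMEGA = 0.05, 0.6           # illustrative values, not from the paper
d = np.arange(64, dtype=float)     # causal lags d = 0, 1, ..., 63

decay_cos = np.exp(-GAMMA * d) * np.cos(OMEGA * d)
decay_sin = np.exp(-GAMMA * d) * np.sin(OMEGA * d)

# Hypothetical target containing a distance-modulated phase term.
target = d * decay_cos

# Direct-sum basis: rotary carrier plus a separate (ALiBi-like) distance channel.
direct_sum = np.stack([np.ones_like(d), d, decay_cos, decay_sin], axis=1)
# Coupled Jordan basis: the carrier and its distance-modulated copies.
jordan = np.stack([decay_cos, decay_sin, d * decay_cos, d * decay_sin], axis=1)


def relative_residual(basis: np.ndarray, y: np.ndarray) -> float:
    """Relative L2 error of the best least-squares fit of y in span(basis)."""
    coef, *_ = np.linalg.lstsq(basis, y, rcond=None)
    return float(np.linalg.norm(basis @ coef - y) / np.linalg.norm(y))


res_direct = relative_residual(direct_sum, target)
res_jordan = relative_residual(jordan, target)
print(f"direct-sum residual: {res_direct:.3f}")
print(f"Jordan residual:     {res_jordan:.2e}")
```

In this toy setting the Jordan basis contains the target by construction, so its residual is at rounding level, while the direct-sum basis leaves a substantial residual. That only shows the two function classes differ; it says nothing about whether a trained model benefits, which is the question the falsification test addresses.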
Original abstract
Relative positional encodings determine which functions of query-key lag can enter the primitive attention logit. RoPE supplies a rotary phase, while ALiBi supplies an additive distance bias. Motivated by group-theoretic views of linear translation-invariant positional encodings, we study a non-semisimple case in which a complex rotary eigenvalue and a nilpotent response live in the same defective Jordan block. The resulting relative operator generates oscillatory-polynomial features such as $e^{-\gamma d}\cos(\omega d)$, $e^{-\gamma d}\sin(\omega d)$, $d e^{-\gamma d}\cos(\omega d)$, and $d e^{-\gamma d}\sin(\omega d)$, for causal lag $d=i-j\geq 0$. Thus the construction realizes a distance-modulated phase basis $d e^{i\omega d}$, rather than merely adding a separate distance channel to RoPE. We formulate Exact Jordan-RoPE as a non-semisimple one-parameter representation, give its real block form, and specify the contragredient query action required by non-orthogonal positional maps. We also distinguish this exact representation from stabilized variants whose bounded shear improves numerical behavior but breaks the exact group law. Kernel-level diagnostics and a Jordan-friendly synthetic language-model task show that the coupled Jordan basis is useful when the target contains distance-modulated phase interactions. On a small WikiText-103 byte language model, a scaled-exact variant improves over RoPE and direct-sum baselines within the Jordan family, while RoPE+ALiBi remains strongest overall. The evidence is structural rather than a broad performance claim.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Jordan-RoPE, a relative positional encoding based on non-semisimple complex Jordan blocks containing both a rotary eigenvalue and a nilpotent component. It claims this yields an exact relative operator producing coupled oscillatory-polynomial features such as e^{-γd} cos(ωd), e^{-γd} sin(ωd), d e^{-γd} cos(ωd), and d e^{-γd} sin(ωd) for causal lag d, realizing a distance-modulated phase basis rather than a simple additive distance channel. The manuscript gives the real block form, specifies the required contragredient query action for the non-orthogonal map, distinguishes the exact representation from bounded-shear stabilized variants that improve numerics but break the group law, and reports positive results on a synthetic diagnostic language-model task plus a small WikiText-103 byte LM where a scaled-exact variant outperforms RoPE and direct-sum Jordan baselines (while RoPE+ALiBi remains strongest overall).
Significance. If the exact features can be realized stably inside attention, the construction supplies a principled, group-theoretic route to distance-modulated rotary encodings that naturally couple polynomial and oscillatory terms; this could matter for sequence tasks whose target interactions depend on both phase and lag. The structural derivation, explicit real-block realization, synthetic diagnostic that isolates the coupled basis, and explicit separation of exact versus stabilized forms are genuine strengths. The current evidence is deliberately scoped as structural rather than a broad performance claim, and the limited model size plus the fact that RoPE+ALiBi still wins overall keep the practical impact modest pending larger-scale tests.
major comments (2)
- [Abstract and formulation of Exact Jordan-RoPE] The central claim that the non-semisimple Jordan block produces the exact relative operator with features e^{-γd} cos(ωd) and d e^{-γd} cos(ωd) (Abstract) rests on embedding the non-orthogonal positional map and applying the specified contragredient query action. The manuscript itself notes that nilpotent components are numerically fragile and therefore introduces bounded-shear stabilizations that break the exact group law; this directly affects whether the distance-modulated phase basis can appear in practice without uncontrolled perturbations from rounding in the matrix exponential or shear term.
- [Empirical evaluation] On the small WikiText-103 byte LM the scaled-exact variant improves over RoPE and direct-sum baselines within the Jordan family, yet RoPE+ALiBi remains strongest overall (Abstract). Because the empirical scope is deliberately limited and the performance gain is intra-family rather than cross-family, the practical advantage of the coupled Jordan basis for distance-modulated interactions is not yet load-bearing for a broad claim.
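One way to quantify the fragility raised in the first comment, offered as a side calculation that is not taken from the manuscript: the magnitude of the shear entry of the exact operator is $f(d) = d\,e^{-\gamma d}$ for lag $d \geq 0$, and its maximum follows from

$$
f'(d) = (1 - \gamma d)\,e^{-\gamma d} = 0 \;\Longrightarrow\; d^{*} = \frac{1}{\gamma}, \qquad f(d^{*}) = \frac{1}{e\,\gamma}.
$$

For example, $\gamma = 0.01$ gives a peak of about 36.8 at lag 100, and as $\gamma \to 0$ the shear grows like $d$ without bound. Weakly damped blocks therefore produce large transient entries, which is consistent with the stated motivation for bounded-shear variants.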
minor comments (2)
- [Abstract] The causal-lag definition d = i - j ≥ 0 is stated clearly in the Abstract; ensure the same indexing convention is used without ambiguity when the contragredient query action is defined in the main text.
- [Conclusion] The paper appropriately describes its evidence as “structural rather than a broad performance claim”; repeating this framing in the conclusion would help readers calibrate expectations.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. We address the major comments point by point below, providing clarifications and indicating revisions where appropriate.
- Referee: The central claim that the non-semisimple Jordan block produces the exact relative operator with features e^{-γd} cos(ωd) and d e^{-γd} cos(ωd) (Abstract) rests on embedding the non-orthogonal positional map and applying the specified contragredient query action. The manuscript itself notes that nilpotent components are numerically fragile and therefore introduces bounded-shear stabilizations that break the exact group law; this directly affects whether the distance-modulated phase basis can appear in practice without uncontrolled perturbations from rounding in the matrix exponential or shear term.
  Authors: We thank the referee for highlighting this important aspect of the construction. The exact Jordan-RoPE is defined via the non-semisimple representation and the contragredient query to ensure the relative operator produces the coupled features exactly, as derived in Section 3. We explicitly separate this from the stabilized variants in Section 4, where we note the trade-off with the group law. To strengthen the presentation, we have revised the abstract and added a paragraph in the discussion section elaborating on numerical considerations, including the use of higher precision for the matrix exponential and the range of shear parameters that keep perturbations below a threshold. This makes clear that the exact basis is realizable under appropriate implementation conditions.
  Revision: yes
- Referee: On the small WikiText-103 byte LM the scaled-exact variant improves over RoPE and direct-sum baselines within the Jordan family, yet RoPE+ALiBi remains strongest overall (Abstract). Because the empirical scope is deliberately limited and the performance gain is intra-family rather than cross-family, the practical advantage of the coupled Jordan basis for distance-modulated interactions is not yet load-bearing for a broad claim.
  Authors: We agree with the referee that the empirical results are scoped to provide structural evidence rather than a comprehensive performance comparison. The synthetic diagnostic task isolates the benefit of the coupled oscillatory-polynomial features, and the WikiText experiment shows improvement within the Jordan family. We have updated the abstract to emphasize that the contribution is primarily theoretical and representational, with empirical support for the utility in tasks involving distance-modulated phase interactions. We acknowledge that RoPE+ALiBi outperforms in this setting, as it combines complementary biases, and do not claim superiority over all methods.
  Revision: partial
Circularity Check
Derivation of Jordan-RoPE features is constructive from group representation
The paper begins with an external group-theoretic motivation for translation-invariant positional encodings and derives the non-semisimple Jordan block form explicitly, yielding the oscillatory-polynomial features (e.g., e^{-γd} cos(ωd), d e^{-γd} cos(ωd)) as algebraic consequences of the matrix representation and contragredient query action. No step reduces a claimed prediction or first-principles result to a fitted parameter, self-defined quantity, or load-bearing self-citation. Stabilized variants are distinguished precisely because they break the exact law, confirming the core construction stands independently. Empirical diagnostics validate utility without circularity.
Axiom & Free-Parameter Ledger
free parameters (1)
- Jordan block parameter (complex eigenvalue and nilpotent coefficient)
axioms (1)
- Domain assumption: Linear translation-invariant positional encodings admit a group-theoretic classification that includes non-semisimple representations.
invented entities (1)
- Exact Jordan-RoPE operator (non-semisimple one-parameter representation): no independent evidence