Jordan-RoPE: Non-Semisimple Relative Positional Encoding via Complex Jordan Blocks

Yaobo Zhang

arxiv: 2605.04217 · v2 · pith:BEX35MNInew · submitted 2026-05-05 · 💻 cs.LG · cs.CL

Pith reviewed 2026-05-22 10:35 UTC · model grok-4.3

keywords relative positional encoding · Jordan-RoPE · non-semisimple · rotary positional encoding · Jordan blocks · distance-modulated phase · oscillatory-polynomial features · transformer attention

The pith

Non-semisimple Jordan blocks generate distance-modulated oscillatory features in relative positional encoding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Jordan-RoPE, which replaces the semisimple rotation of standard RoPE with a defective Jordan block that couples a complex eigenvalue to a nilpotent part. The relative operator of this single block, evaluated at a query-key lag d, yields damped cosines and sines together with copies of them multiplied by d itself. The construction matters because it supplies a coupled basis of the form d e^{iωd} rather than treating phase and distance as independent channels. The author gives an exact one-parameter formulation, its real block form, and the required contragredient query map, then tests the idea with kernel-level diagnostics and a synthetic task built around distance-modulated phases. The results indicate that the coupled basis helps when the target interactions involve such modulation, although RoPE+ALiBi still leads on the small WikiText-103 byte language model examined.

Core claim

The central claim is that a non-semisimple one-parameter representation realized by a complex Jordan block produces, for causal lag d, the oscillatory-polynomial features e^{-γd} cos(ωd), e^{-γd} sin(ωd), d e^{-γd} cos(ωd) and d e^{-γd} sin(ωd), thereby realizing a distance-modulated phase basis d e^{iωd} rather than merely adjoining a separate distance channel to rotary encoding.
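As a minimal sketch of the core claim (not code from the paper), the snippet below takes the generator A = [[a, 1], [0, a]] with a = -γ + iω, builds exp(dA) by multiplying the one-step operator d times, and reads the four features off the real and imaginary parts of its entries. The unit nilpotent coefficient, the parameter values, and all function names are assumptions made for illustration.

```python
import cmath
import math

def matmul2(a, b):
    """Multiply two 2x2 matrices stored as nested lists."""
    return [
        [a[0][0] * b[0][0] + a[0][1] * b[1][0], a[0][0] * b[0][1] + a[0][1] * b[1][1]],
        [a[1][0] * b[0][0] + a[1][1] * b[1][0], a[1][0] * b[0][1] + a[1][1] * b[1][1]],
    ]

def jordan_step(gamma, omega):
    """One-step operator exp(A) = e^a (I + N) for A = a I + N, a = -gamma + i*omega."""
    lam = cmath.exp(complex(-gamma, omega))
    return [[lam, lam], [0j, lam]]

def relative_operator(gamma, omega, d):
    """exp(d A), obtained by applying the one-step operator d times."""
    out = [[1 + 0j, 0j], [0j, 1 + 0j]]
    step = jordan_step(gamma, omega)
    for _ in range(d):
        out = matmul2(out, step)
    return out

def closed_form_features(gamma, omega, d):
    """The four oscillatory-polynomial features named in the core claim."""
    decay = math.exp(-gamma * d)
    c, s = math.cos(omega * d), math.sin(omega * d)
    return (decay * c, decay * s, d * decay * c, d * decay * s)

gamma, omega, d = 0.05, 0.7, 6  # illustrative values only
R = relative_operator(gamma, omega, d)
# Diagonal entry carries e^{(-gamma + i omega) d}; the off-diagonal entry is d times that.
features = (R[0][0].real, R[0][0].imag, R[0][1].real, R[0][1].imag)
print(features)
print(closed_form_features(gamma, omega, d))
```

The off-diagonal entry is the diagonal carrier multiplied by d, which is the sense in which distance and phase share one basis element instead of occupying separate channels.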

What carries the argument

A single non-semisimple complex Jordan block that holds a rotary eigenvalue and a nilpotent element, combined with the contragredient query action needed to compensate for the non-orthogonal positional map.
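The role of the contragredient query action can be illustrated with a small sketch. It assumes a real 4x4 block form ρ(t) = [[E, tE], [0, E]] with E = e^{-γt} Rot(ωt), keys mapped by ρ(-j), and queries mapped by the inverse transpose of the key map, which equals ρ(i)^T. These sign conventions, the parameter values, and the function names are assumptions for illustration and may differ from the paper's own choices.

```python
import math

def rho(t, gamma=0.05, omega=0.7):
    """Assumed real block form of exp(t A): [[E, t E], [0, E]], E = e^{-gamma t} Rot(omega t)."""
    e = math.exp(-gamma * t)
    c, s = e * math.cos(omega * t), e * math.sin(omega * t)
    return [
        [c, -s, t * c, -t * s],
        [s, c, t * s, t * c],
        [0.0, 0.0, c, -s],
        [0.0, 0.0, s, c],
    ]

def transpose(m):
    return [list(row) for row in zip(*m)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def logit_contragredient(q, k, i, j):
    """Keys use rho(-j); queries use the contragredient map (rho(-i)^{-1})^T = rho(i)^T."""
    return dot(matvec(transpose(rho(i)), q), matvec(rho(-j), k))

def logit_naive(q, k, i, j):
    """Same map on both sides, as RoPE can afford because rotations are orthogonal."""
    return dot(matvec(rho(-i), q), matvec(rho(-j), k))

q = [0.3, -1.2, 0.8, 0.5]   # arbitrary illustrative vectors
k = [1.1, 0.4, -0.7, 0.9]
relative = dot(q, matvec(rho(5), k))      # q^T rho(i - j) k with i - j = 5
a = logit_contragredient(q, k, 9, 4)      # lag 5 at one absolute offset
b = logit_contragredient(q, k, 29, 24)    # lag 5 at another absolute offset
naive_gap = abs(logit_naive(q, k, 9, 4) - logit_naive(q, k, 29, 24))
print(relative, a, b, naive_gap)
```

With the contragredient map, the logit depends only on the lag i - j. Applying the same non-orthogonal map to both sides leaves a dependence on the absolute positions, which is why the adjustment is needed once the positional map stops being a pure rotation.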

If this is right

Attention logits can now incorporate query-key interactions in which phase is scaled by distance inside a single basis vector.
Stabilized variants trade the exact group law for bounded shear and improved numerical behavior.
The exact representation requires the contragredient query action to keep the non-orthogonal map from distorting the relative operator.
Kernel diagnostics confirm that the oscillatory-polynomial features appear exactly when the Jordan block is used.
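The trade-off between bounded shear and the exact group law can be seen in a toy sketch. The saturating shear τ·tanh(t/τ) used here is a hypothetical stand-in chosen only for illustration; the summary above does not say which stabilization the paper uses. The check compares op(s)·op(t) with op(s + t) for the exact operator and for the bounded one.

```python
import cmath
import math

def matmul2(a, b):
    """Multiply two 2x2 matrices stored as nested lists."""
    return [
        [a[0][0] * b[0][0] + a[0][1] * b[1][0], a[0][0] * b[0][1] + a[0][1] * b[1][1]],
        [a[1][0] * b[0][0] + a[1][1] * b[1][0], a[1][0] * b[0][1] + a[1][1] * b[1][1]],
    ]

def exact(t, gamma=0.05, omega=0.7):
    """Exact operator e^{a t} [[1, t], [0, 1]] with a = -gamma + i*omega."""
    lam = cmath.exp(complex(-gamma, omega) * t)
    return [[lam, t * lam], [0j, lam]]

def bounded(t, tau=8.0, gamma=0.05, omega=0.7):
    """Hypothetical stabilized operator: the shear t is replaced by tau * tanh(t / tau)."""
    lam = cmath.exp(complex(-gamma, omega) * t)
    shear = tau * math.tanh(t / tau)
    return [[lam, shear * lam], [0j, lam]]

def group_law_defect(op, s, t):
    """Largest entrywise gap between op(s) @ op(t) and op(s + t)."""
    prod = matmul2(op(s), op(t))
    target = op(s + t)
    return max(abs(prod[r][c] - target[r][c]) for r in range(2) for c in range(2))

print(group_law_defect(exact, 10, 10))    # at rounding level
print(group_law_defect(bounded, 10, 10))  # clearly nonzero
```

The shear stays below τ in magnitude, but the shears at s and t no longer add up to the shear at s + t, so the operator stops being a one-parameter representation.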

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Higher-order nilpotent blocks could generate quadratic or higher polynomial multipliers on the same oscillatory carrier.
The same non-semisimple construction might be applied to other group representations used for positional encoding.
Hybrid models could combine the exact Jordan block on some heads with stabilized or ALiBi blocks on others.
The structural evidence suggests testing whether the distance-modulated basis improves sample efficiency on tasks whose optimal attention patterns contain explicit lag scaling.
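The first extension above can be checked directly for a 3x3 block. This is a sketch of the editorial conjecture, not something taken from the paper: for A = aI + N with N the 3x3 nilpotent shift, exp(dA) = e^{ad}(I + dN + d²N²/2), so the corner entry carries a quadratic multiplier on the same carrier. Function names and parameter values are illustrative.

```python
import cmath

def matmul(a, b):
    """Multiply two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[r][m] * b[m][c] for m in range(n)) for c in range(n)] for r in range(n)]

def step3(gamma, omega):
    """One-step operator exp(A) = e^a (I + N + N^2 / 2) for a 3x3 Jordan generator."""
    lam = cmath.exp(complex(-gamma, omega))
    return [[lam, lam, 0.5 * lam], [0j, lam, lam], [0j, 0j, lam]]

def power(m, d):
    """m raised to the non-negative integer power d by repeated multiplication."""
    n = len(m)
    out = [[(1 + 0j) if r == c else 0j for c in range(n)] for r in range(n)]
    for _ in range(d):
        out = matmul(out, m)
    return out

gamma, omega, d = 0.05, 0.7, 6  # illustrative values only
R3 = power(step3(gamma, omega), d)
carrier = cmath.exp(complex(-gamma, omega) * d)
# First superdiagonal: d * carrier. Corner entry: (d^2 / 2) * carrier.
print(R3[0][1] / carrier, R3[0][2] / carrier)
```

Each additional superdiagonal raises the polynomial degree by one, which also means the entries grow faster with lag when the decay rate γ is small.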

Load-bearing premise

The non-semisimple Jordan block representation remains useful and numerically stable once embedded inside transformer attention, and the contragredient query action compensates for non-orthogonality without introducing uncontrolled artifacts.

What would settle it

On the Jordan-friendly synthetic language-model task, if the coupled Jordan basis produces no improvement over RoPE or direct-sum baselines when the target explicitly contains distance-modulated phase interactions, the usefulness claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.04217 by Yaobo Zhang.

**Figure 1.** Primitive pre-softmax relative-position bases. RoPE supplies phase features, direct-sum …

**Figure 1.** Geometric intuition of Jordan-RoPE. RoPE samples reduced points on the Fourier-character curve. A direct-sum …

**Figure 2.** Kernel-level mixed-target extrapolation. The raw/exact Jordan basis fits the unbounded …

**Figure 3.** Jordan-friendly synthetic LM accuracy. Stabilized Jordan-RoPE preserves high long-lag …

**Figure 3.** A Transformer can use the coupled Jordan mode when the teacher kernel requires it. On the synthetic query-LM …

**Figure 4.** WikiText-103 byte LM validation loss. The Scaled-exact variant with …

Original abstract

Relative positional encodings determine which functions of query-key lag can enter the primitive attention logit. RoPE supplies a rotary phase, while ALiBi supplies an additive distance bias. Motivated by group-theoretic views of linear translation-invariant positional encodings, we study a non-semisimple case in which a complex rotary eigenvalue and a nilpotent response live in the same defective Jordan block. The resulting relative operator generates oscillatory-polynomial features such as $e^{-\gamma d}\cos(\omega d)$, $e^{-\gamma d}\sin(\omega d)$, $d e^{-\gamma d}\cos(\omega d)$, and $d e^{-\gamma d}\sin(\omega d)$, for causal lag $d=i-j\geq 0$. Thus the construction realizes a distance-modulated phase basis $d e^{i\omega d}$, rather than merely adding a separate distance channel to RoPE. We formulate Exact Jordan-RoPE as a non-semisimple one-parameter representation, give its real block form, and specify the contragredient query action required by non-orthogonal positional maps. We also distinguish this exact representation from stabilized variants whose bounded shear improves numerical behavior but breaks the exact group law. Kernel-level diagnostics and a Jordan-friendly synthetic language-model task show that the coupled Jordan basis is useful when the target contains distance-modulated phase interactions. On a small WikiText-103 byte language model, a scaled-exact variant improves over RoPE and direct-sum baselines within the Jordan family, while RoPE+ALiBi remains strongest overall. The evidence is structural rather than a broad performance claim.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Jordan-RoPE gives a clean algebraic way to embed distance-modulated phase features into relative positional encodings using one defective complex Jordan block, but the supporting experiments stay small and the exact version looks numerically delicate.

The main point is that this paper constructs a relative positional operator from a non-semisimple Jordan block that produces terms like d e^{-γd} cos(ωd) and d e^{-γd} sin(ωd), all inside a single matrix exponential rather than by adding separate channels. That is a real structural difference from RoPE or ALiBi, and the group-theoretic starting point leads to an explicit real block form plus the contragredient query adjustment needed for the query-key inner product to depend only on the lag.

Referee Report

2 major / 2 minor

Summary. The paper proposes Jordan-RoPE, a relative positional encoding based on non-semisimple complex Jordan blocks containing both a rotary eigenvalue and a nilpotent component. It claims this yields an exact relative operator producing coupled oscillatory-polynomial features such as e^{-γd} cos(ωd), e^{-γd} sin(ωd), d e^{-γd} cos(ωd), and d e^{-γd} sin(ωd) for causal lag d, realizing a distance-modulated phase basis rather than a simple additive distance channel. The manuscript gives the real block form, specifies the required contragredient query action for the non-orthogonal map, distinguishes the exact representation from bounded-shear stabilized variants that improve numerics but break the group law, and reports positive results on a synthetic diagnostic language-model task plus a small WikiText-103 byte LM where a scaled-exact variant outperforms RoPE and direct-sum Jordan baselines (while RoPE+ALiBi remains strongest overall).

Significance. If the exact features can be realized stably inside attention, the construction supplies a principled, group-theoretic route to distance-modulated rotary encodings that naturally couple polynomial and oscillatory terms; this could matter for sequence tasks whose target interactions depend on both phase and lag. The structural derivation, explicit real-block realization, synthetic diagnostic that isolates the coupled basis, and explicit separation of exact versus stabilized forms are genuine strengths. The current evidence is deliberately scoped as structural rather than a broad performance claim, and the limited model size plus the fact that RoPE+ALiBi still wins overall keep the practical impact modest pending larger-scale tests.

major comments (2)

[Abstract and formulation of Exact Jordan-RoPE] The central claim that the non-semisimple Jordan block produces the exact relative operator with features e^{-γd} cos(ωd) and d e^{-γd} cos(ωd) (Abstract) rests on embedding the non-orthogonal positional map and applying the specified contragredient query action. The manuscript itself notes that nilpotent components are numerically fragile and therefore introduces bounded-shear stabilizations that break the exact group law; this directly affects whether the distance-modulated phase basis can appear in practice without uncontrolled perturbations from rounding in the matrix exponential or shear term.
[Empirical evaluation] On the small WikiText-103 byte LM the scaled-exact variant improves over RoPE and direct-sum baselines within the Jordan family, yet RoPE+ALiBi remains strongest overall (Abstract). Because the empirical scope is deliberately limited and the performance gain is intra-family rather than cross-family, the practical advantage of the coupled Jordan basis for distance-modulated interactions is not yet load-bearing for a broad claim.
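One concrete reading of the numerical concern in the first major comment is the size of the coupled term. The sketch below is an illustration, not an analysis from the paper: it evaluates the envelope d e^{-γd}, which peaks near d = 1/γ with height 1/(eγ) and grows without bound when γ = 0. Function names and the lag range are assumptions.

```python
import math

def shear_envelope(d, gamma):
    """Magnitude of the coupled term d * e^{(-gamma + i omega) d}, independent of omega."""
    return d * math.exp(-gamma * d)

def peak_lag(gamma, max_lag):
    """Integer lag in [0, max_lag] at which the envelope is largest."""
    return max(range(max_lag + 1), key=lambda d: shear_envelope(d, gamma))

for gamma in (0.0, 0.01, 0.05):
    d_star = peak_lag(gamma, 512)
    print(gamma, d_star, shear_envelope(d_star, gamma))
```

A smaller decay rate pushes the peak to longer lags and raises it, so the pre-softmax contribution of the coupled term scales with context length unless γ or a bounded shear keeps it in check.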

minor comments (2)

[Abstract] The causal-lag definition d = i - j ≥ 0 is stated clearly in the Abstract; ensure the same indexing convention is used without ambiguity when the contragredient query action is defined in the main text.
[Conclusion] The paper appropriately describes its evidence as “structural rather than a broad performance claim”; repeating this framing in the conclusion would help readers calibrate expectations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We address the major comments point by point below, providing clarifications and indicating revisions where appropriate.

Referee: The central claim that the non-semisimple Jordan block produces the exact relative operator with features e^{-γd} cos(ωd) and d e^{-γd} cos(ωd) (Abstract) rests on embedding the non-orthogonal positional map and applying the specified contragredient query action. The manuscript itself notes that nilpotent components are numerically fragile and therefore introduces bounded-shear stabilizations that break the exact group law; this directly affects whether the distance-modulated phase basis can appear in practice without uncontrolled perturbations from rounding in the matrix exponential or shear term.

Authors: We thank the referee for highlighting this important aspect of the construction. The exact Jordan-RoPE is defined via the non-semisimple representation and the contragredient query to ensure the relative operator produces the coupled features exactly, as derived in Section 3. We explicitly separate this from the stabilized variants in Section 4, where we note the trade-off with the group law. To strengthen the presentation, we have revised the abstract and added a paragraph in the discussion section elaborating on numerical considerations, including the use of higher precision for the matrix exponential and the range of shear parameters that keep perturbations below a threshold. This makes clear that the exact basis is realizable under appropriate implementation conditions. revision: yes
Referee: On the small WikiText-103 byte LM the scaled-exact variant improves over RoPE and direct-sum baselines within the Jordan family, yet RoPE+ALiBi remains strongest overall (Abstract). Because the empirical scope is deliberately limited and the performance gain is intra-family rather than cross-family, the practical advantage of the coupled Jordan basis for distance-modulated interactions is not yet load-bearing for a broad claim.

Authors: We agree with the referee that the empirical results are scoped to provide structural evidence rather than a comprehensive performance comparison. The synthetic diagnostic task isolates the benefit of the coupled oscillatory-polynomial features, and the WikiText experiment shows improvement within the Jordan family. We have updated the abstract to emphasize that the contribution is primarily theoretical and representational, with empirical support for the utility in tasks involving distance-modulated phase interactions. We acknowledge that RoPE+ALiBi outperforms in this setting, as it combines complementary biases, and do not claim superiority over all methods. revision: partial

Circularity Check

0 steps flagged

Derivation of Jordan-RoPE features is constructive from group representation

The paper begins with an external group-theoretic motivation for translation-invariant positional encodings and derives the non-semisimple Jordan block form explicitly, yielding the oscillatory-polynomial features (e.g., e^{-γd} cos(ωd), d e^{-γd} cos(ωd)) as algebraic consequences of the matrix representation and contragredient query action. No step reduces a claimed prediction or first-principles result to a fitted parameter, self-defined quantity, or load-bearing self-citation. Stabilized variants are distinguished precisely because they break the exact law, confirming the core construction stands independently. Empirical diagnostics validate utility without circularity.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on a group-theoretic view of translation-invariant positional encodings and on the algebraic properties of a single defective Jordan block; one free parameter controls the complex eigenvalue and nilpotent strength, and the new entity is the coupled Jordan positional operator itself.

free parameters (1)

Jordan block parameter (complex eigenvalue and nilpotent coefficient)
Controls the decay rate γ, the frequency ω, and the weight of the lag-proportional (polynomial) term in the generated features; the polynomial degree itself is fixed by the block size. Chosen or tuned per model.

axioms (1)

Domain assumption: linear translation-invariant positional encodings admit a group-theoretic classification that includes non-semisimple representations.
Invoked in the motivation paragraph to justify studying the defective Jordan case.

invented entities (1)

Exact Jordan-RoPE operator (non-semisimple one-parameter representation) · no independent evidence
purpose: Generates the coupled oscillatory-polynomial relative features inside attention logits
New construction introduced by the paper; no independent evidence outside the derivation and small-scale experiments is provided.
