Selective Rotary Position Embedding

Antonio Orvieto; Arshia Afzal; Frank Hutter; Sajad Movahedi; Timur Carstensen; Volkan Cevher

arxiv: 2511.17388 · v2 · submitted 2025-11-21 · 💻 cs.CL · cs.LG

Sajad Movahedi, Timur Carstensen, Arshia Afzal, Frank Hutter, Antonio Orvieto, Volkan Cevher

Pith reviewed 2026-05-17 20:33 UTC · model grok-4.3

keywords: rotary position embeddings · selective mechanisms · transformers · language modeling · gated transformers · sequence tasks · attention mechanisms · position encoding

The pith

Selective RoPE replaces fixed rotation angles in position embeddings with input-dependent ones that work for both linear and softmax transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Selective RoPE as an input-dependent rotary embedding that generalizes standard RoPE by allowing rotations at arbitrary angles. It shows that softmax attention already performs similar hidden rotations on query-key pairs. The method separates forgetting in the real component from positional encoding in the imaginary component within state-space and gated models. When added to gated transformers, Selective RoPE improves results on language modeling and on tasks that require copying, state tracking, and retrieval. Readers would care because this selectivity offers a more flexible way to encode order that might unify handling across different transformer variants.

Core claim

Selective RoPE is an input-dependent rotary embedding mechanism that generalizes RoPE and enables rotation by arbitrary angles in both linear and softmax transformers. The paper also observes that softmax attention already performs a hidden form of these rotations on query-key pairs, and that in state-space models and gated linear transformers the real part manages forgetting while the imaginary part encodes positions through rotations.

What carries the argument

Selective RoPE, the input-dependent rotary embedding that computes rotation angles from the current input rather than fixing them in advance.
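
As a minimal sketch of the contrast, assume each 2D feature pair is treated as one complex number and that the selective angle is a running sum of per-token increments. That accumulation rule, the increment values, and the names `rope_angles`, `selective_angles`, and `score` are illustrative assumptions, not details confirmed by the text above.

```python
import cmath

def rope_angles(seq_len, omega):
    """Standard RoPE: the angle at position t is fixed in advance as t * omega."""
    return [t * omega for t in range(seq_len)]

def selective_angles(increments):
    """Assumed selective form: the angle is a running sum of per-token
    increments, each of which would be computed from the current input."""
    angles, total = [], 0.0
    for step in increments:
        total += step
        angles.append(total)
    return angles

def score(q, k, angle_q, angle_k):
    """Real inner product of a rotated query and a rotated key (2D pair as complex)."""
    q_rot = q * cmath.exp(1j * angle_q)
    k_rot = k * cmath.exp(1j * angle_k)
    return (q_rot * k_rot.conjugate()).real

q, k = complex(0.3, -1.2), complex(0.8, 0.5)
omega = 0.1
fixed = rope_angles(6, omega)
constant = selective_angles([omega] * 6)                      # every token gives the same increment
varying = selective_angles([0.1, 0.4, 0.0, 0.2, 0.05, 0.3])   # made-up input-dependent increments

rope_score = score(q, k, fixed[4], fixed[1])
same_as_rope = score(q, k, constant[4], constant[1])
selective_score = score(q, k, varying[4], varying[1])
```

Under these assumptions, constant increments reproduce the fixed-angle score, which is one way to read "generalizes RoPE", while varying increments give a different score for the same pair of positions.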

If this is right

Gated transformers equipped with Selective RoPE achieve better performance on language modeling.
The approach yields improvements on difficult sequence tasks such as copying, state tracking, and retrieval.
Softmax attention implicitly applies input-dependent rotations to query-key pairs.
In state-space models and gated linear transformers, the real part handles forgetting while the imaginary part encodes positions via rotations.
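
The last point can be illustrated with a scalar complex recurrence. The sketch below assumes a gate of the form exp(-decay + i*theta), a parameterization chosen here for illustration rather than taken from the paper; `complex_gate` and `run_recurrence` are hypothetical names.

```python
import cmath
import math

def complex_gate(decay, theta):
    """Gate exp(-decay + i*theta): the real part of the exponent sets the
    magnitude (forgetting), the imaginary part sets the phase (rotation)."""
    return cmath.exp(complex(-decay, theta))

def run_recurrence(inputs, decays, thetas):
    """h_t = gate_t * h_{t-1} + x_t with a complex scalar state."""
    h = 0j
    for x, d, th in zip(inputs, decays, thetas):
        h = complex_gate(d, th) * h + x
    return h

gate = complex_gate(decay=0.5, theta=math.pi / 3)
magnitude, phase = abs(gate), cmath.phase(gate)

# A unit impulse followed by two zero inputs: the state is gated twice.
h = run_recurrence([1.0, 0.0, 0.0], decays=[0.5] * 3, thetas=[0.2] * 3)
```

After the impulse, the magnitude of the state shrinks by exp(-0.5) per step and its phase advances by 0.2 per step, so the two roles can be read off independently.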

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could be applied to non-gated transformers to check whether the gains extend beyond gated architectures.
Combining Selective RoPE with other selective state mechanisms might produce hybrids that handle longer contexts more efficiently.
Testing on much longer sequences would reveal whether input-dependent angles reduce position-related degradation better than fixed embeddings.
The implicit rotational structure uncovered in attention might prompt new ways to interpret order capture in transformers without explicit positional signals.

Load-bearing premise

Input-dependent rotations will reliably improve performance on language modeling and sequence tasks without introducing instability, overfitting, or requiring extensive hyperparameter tuning across different model scales.

What would settle it

Training gated transformers with Selective RoPE on standard language modeling and sequence benchmarks and finding either no consistent gains or increased instability relative to fixed RoPE would undercut the claimed benefits.

Figures

Figures reproduced from arXiv: 2511.17388 by Antonio Orvieto, Arshia Afzal, Frank Hutter, Sajad Movahedi, Timur Carstensen, Volkan Cevher.

**Figure 2.** The distribution of the phase temperatures in RoPE vs. Selective RoPE. ϵ is the inverse of the RoPE base frequency and the upper-bound of query-key angle in our temperature. Details about the parameterization available in Appendix A.3.1. The equivalence of the RFF kernel in (8). For a limited number of samples, D, we instead choose the variance of the RFFs as shown in Theorem 1 (Appendix A.3), which prov…

**Figure 3.** The effects of windowing on the spectrogram of a finite sample of a sequence.

**Figure 4.** Pseudocode of Selective RoPE. In the implementation of Selective RoPE we make several design choices that go beyond the architecture described in Section 3.3: Following Zhang et al. (2024), where learning the random features introduced by Choromanski et al. (2021) was shown to be more effective, we make the parameters ω in Selective RoPE learnable. This makes the rotations input-dependent and learnabl…

**Figure 5.** Prefill throughput on NVIDIA B200 with batch size=1. We implement Selective RoPE in PyTorch and integrate it into flash-linear-attention (Yang & Zhang, 2024) for our experiments. Using the RoPE trick (cf. section 2), we are able to implement our method as a prelude to RoPE where we determine the sin and cos from the input as shown in …

**Figure 6.** Copying accuracy of GLA with CIs. Dashed line is the training sequence length.

**Figure 7.** MQAR results. We evaluate GLA + Selective RoPE on Multi-Query Associative Recall, following the same experimental setup as in Arora et al. (2024a, …

**Figure 8.** State tracking performance of GLA, Transformer, and DeltaNet with different positional …

**Figure 9.** Selective RoPE in PyTorch.

Abstract

Position information is essential for language modeling. In softmax transformers, Rotary Position Embeddings (RoPE) encode positions through fixed-angle rotations, while in linear transformers, order is handled via input-dependent (selective) gating that decays past key-value associations. Selectivity has generally been shown to improve language-related tasks. Inspired by this, we introduce Selective RoPE, an input-dependent rotary embedding mechanism, that generalizes RoPE, and enables rotation in arbitrary angles for both linear and softmax transformers. We show that softmax attention already performs a hidden form of these rotations on query-key pairs, uncovering an implicit positional structure. We further show that in state-space models and gated linear transformers, the real part manages forgetting while the imaginary part encodes positions through rotations. We validate our method by equipping gated transformers with Selective RoPE, demonstrating that its input-dependent rotations improve performance in language modeling and on difficult sequence tasks like copying, state tracking, and retrieval.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Selective RoPE, an input-dependent rotary position embedding mechanism that generalizes standard fixed-angle RoPE to support arbitrary rotation angles. It applies this to both linear and softmax transformers, claims that standard softmax attention implicitly performs a form of these input-dependent rotations on query-key pairs, and shows that in state-space and gated linear models the real/imaginary parts separately handle forgetting and positional encoding. The method is validated by integrating Selective RoPE into gated transformers, with reported empirical improvements on language modeling and sequence tasks including copying, state tracking, and retrieval.

Significance. If the input-dependent rotations can be shown to preserve (or explicitly relax) RoPE's relative-position inductive bias while delivering the claimed gains, the work would usefully connect rotary embeddings with selective mechanisms already successful in linear attention. The observation that softmax attention performs hidden rotations is potentially insightful for understanding implicit positional structure, and the empirical results on retrieval and state-tracking tasks suggest practical value for long-context modeling if the gains hold under controlled ablations.

major comments (2)

[Introduction and §3] (method definition): the claim that Selective RoPE 'generalizes RoPE' and 'enables rotation in arbitrary angles' is not accompanied by an explicit statement or proof that the input-dependent angle θ_i(x_m, x_n) still yields an effective rotation depending only on relative offset (m-n). Without such a constraint or derivation, the attention score loses the translation invariance that is the core inductive bias of standard RoPE (Eq. (1) in the original RoPE formulation). This directly affects whether gains on copying/retrieval will transfer to standard language modeling.
[§4.2] (empirical validation): the reported improvements on language modeling and sequence tasks lack ablations that isolate the effect of input-dependent angles from other changes in the gated transformer architecture. In particular, it is unclear whether performance gains persist when the selective angles are replaced by fixed but learned angles, which would test whether the input-dependence itself (rather than simply more flexible rotations) is load-bearing.
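
The translation-invariance concern in the first major comment can be checked numerically. The sketch below assumes, for illustration only, that selective angles are cumulative sums of per-token increments; the increment values and the helper `score` are made up for the example.

```python
import cmath
from itertools import accumulate

def score(q, k, angle_q, angle_k):
    """Real inner product of a rotated query and a rotated key (2D pair as complex)."""
    return (q * cmath.exp(1j * angle_q) * (k * cmath.exp(1j * angle_k)).conjugate()).real

q, k = complex(1.0, 0.5), complex(-0.2, 0.7)
omega = 0.3

# Standard RoPE: angle_t = t * omega, so (m, n) = (5, 2) and (8, 5) share offset 3.
rope_a = score(q, k, 5 * omega, 2 * omega)
rope_b = score(q, k, 8 * omega, 5 * omega)

# Assumed selective form: angle_t is the sum of made-up per-token increments up to t.
increments = [0.3, 0.1, 0.5, 0.2, 0.4, 0.3, 0.6, 0.1, 0.0]
angles = list(accumulate(increments))
sel_a = score(q, k, angles[5], angles[2])
sel_b = score(q, k, angles[8], angles[5])

# Under this form the score depends on the increments accumulated between n and m.
between_a = sum(increments[3:6])
from_between = score(q, k, between_a, 0.0)
```

With fixed angles the two scores agree because only the offset m-n enters. Under the assumed cumulative form they differ, since the effective angle is the sum of increments over the tokens between the two positions rather than a function of the offset alone.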

minor comments (2)

Notation for the selective angle function should be introduced once and used consistently; currently the dependence on both query and key (or on single token) is described differently across the abstract, introduction, and method sections.
The statement that 'softmax attention already performs a hidden form of these rotations' would benefit from a short derivation or explicit mapping to the standard QK dot-product under RoPE, rather than leaving it as an observation.
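
One standard route to such a mapping goes through random Fourier features, which the Figure 2 caption also mentions; whether this matches the paper's own derivation is an assumption. For scalar q and k, exp(q*k) factors into norm terms times a Gaussian kernel in (q - k), and that kernel is an average of cosines of angle differences, that is, of inner products between unit vectors rotated by the input-dependent angles w*q and w*k. The function names below are hypothetical.

```python
import math
import random

def softmax_kernel(q, k):
    """Unnormalized softmax weight exp(q * k) for scalar q and k."""
    return math.exp(q * k)

def gaussian_kernel(q, k):
    """exp(-(q - k)^2 / 2), the shift-invariant part of the softmax kernel."""
    return math.exp(-0.5 * (q - k) ** 2)

def rff_estimate(q, k, num_features, seed=0):
    """Monte Carlo estimate of the Gaussian kernel as an average of
    cos(w*q - w*k) with w drawn from a standard normal."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_features):
        w = rng.gauss(0.0, 1.0)
        total += math.cos(w * q - w * k)
    return total / num_features

q, k = 0.7, -0.4
exact = gaussian_kernel(q, k)
approx = rff_estimate(q, k, num_features=20000)

# exp(q*k) = exp((q^2 + k^2) / 2) * exp(-(q - k)^2 / 2)
rebuilt_softmax = math.exp(0.5 * (q * q + k * k)) * exact
```

The estimate converges to the Gaussian kernel as the number of features grows, and multiplying by the norm terms recovers the softmax weight exactly.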

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our paper 'Selective Rotary Position Embedding'. We have carefully considered the major comments and provide point-by-point responses below, along with planned revisions to strengthen the manuscript.

Referee: [Introduction and §3] (method definition): the claim that Selective RoPE 'generalizes RoPE' and 'enables rotation in arbitrary angles' is not accompanied by an explicit statement or proof that the input-dependent angle θ_i(x_m, x_n) still yields an effective rotation depending only on relative offset (m-n). Without such a constraint or derivation, the attention score loses the translation invariance that is the core inductive bias of standard RoPE (Eq. (1) in the original RoPE formulation). This directly affects whether gains on copying/retrieval will transfer to standard language modeling.

Authors: We thank the referee for highlighting this important point regarding the inductive bias. Upon reflection, the current formulation of Selective RoPE allows the rotation angle to depend on the specific input tokens x_m and x_n, which indeed means it does not strictly enforce dependence only on the relative position (m-n) as in standard RoPE. This is intentional to introduce selectivity similar to gating mechanisms. However, we agree that an explicit discussion or derivation is missing. In the revised manuscript, we will add a clarification in Section 3 explaining how the input-dependent rotations relate to relative positions, including any preserved or relaxed properties, and discuss implications for transfer to language modeling tasks. We believe this will address the concern while maintaining the novelty of the selective approach. revision: yes
Referee: [§4.2] (empirical validation): the reported improvements on language modeling and sequence tasks lack ablations that isolate the effect of input-dependent angles from other changes in the gated transformer architecture. In particular, it is unclear whether performance gains persist when the selective angles are replaced by fixed but learned angles, which would test whether the input-dependence itself (rather than simply more flexible rotations) is load-bearing.

Authors: We agree that isolating the input-dependence is crucial for validating the contribution of Selective RoPE. The current experiments integrate Selective RoPE into gated transformers but do not include the suggested ablation with fixed learned angles. We will add these ablations in the revised version, comparing Selective RoPE against variants with fixed but learned rotation angles on the language modeling, copying, state tracking, and retrieval tasks. This will help demonstrate whether the dynamic, input-dependent nature provides additional benefits beyond increased flexibility. revision: yes

Circularity Check

0 steps flagged

No significant circularity; Selective RoPE introduced as independent generalization

The provided abstract and description present Selective RoPE as a novel input-dependent rotary mechanism that generalizes fixed-angle RoPE and reveals implicit rotations already latent in softmax attention. No load-bearing derivation step is shown to reduce by construction to a fitted input, self-citation chain, or renamed ansatz. The claims rest on the proposed mechanism's ability to enable arbitrary-angle rotations for both linear and softmax transformers, with empirical validation on language modeling and sequence tasks offered as external support rather than tautological prediction. The derivation chain is therefore self-contained against the stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based on abstract only; no explicit free parameters, axioms, or invented entities are detailed in the provided text. The work builds on standard transformer assumptions and prior RoPE/selective gating results.

pith-pipeline@v0.9.0 · 5495 in / 1029 out tokens · 23761 ms · 2026-05-17T20:33:13.495869+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/BranchSelection.lean branch_selection echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

A performant linear transformer requires both: (a) rotation and (b) gating.
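
A toy recurrence makes the two ingredients in the quoted passage concrete. The sketch assumes scalar complex queries and keys, real values, and a gate exp(-decay + i*theta); this parameterization and both function names are hypothetical and not taken from the paper. It also shows that the recurrent form matches an attention-style form in which queries and keys are rotated by cumulative angles and past terms are weighted by accumulated decay.

```python
import cmath
import math

def recurrent_outputs(qs, ks, vs, decays, thetas):
    """State update S_t = exp(-decay_t + i*theta_t) * S_{t-1} + k_t * v_t,
    read out as o_t = Re(conj(q_t) * S_t)."""
    state, outputs = 0j, []
    for q, k, v, d, th in zip(qs, ks, vs, decays, thetas):
        state = cmath.exp(complex(-d, th)) * state + k * v
        outputs.append((q.conjugate() * state).real)
    return outputs

def attention_outputs(qs, ks, vs, decays, thetas):
    """Same outputs in attention form: rotate q and k by the cumulative
    angle and weight each past term by the decay accumulated since it."""
    cum_decay, cum_theta, d_sum, th_sum = [], [], 0.0, 0.0
    for d, th in zip(decays, thetas):
        d_sum += d
        th_sum += th
        cum_decay.append(d_sum)
        cum_theta.append(th_sum)
    outputs = []
    for t in range(len(qs)):
        q_rot = qs[t] * cmath.exp(-1j * cum_theta[t])
        total = 0.0
        for s in range(t + 1):
            k_rot = ks[s] * cmath.exp(-1j * cum_theta[s])
            weight = math.exp(-(cum_decay[t] - cum_decay[s]))
            total += weight * (q_rot.conjugate() * k_rot).real * vs[s]
        outputs.append(total)
    return outputs

# Made-up toy sequence of length 4.
qs = [complex(1.0, 0.5), complex(-0.3, 0.2), complex(0.7, -0.9), complex(0.1, 0.4)]
ks = [complex(0.2, -0.1), complex(0.5, 0.5), complex(-0.6, 0.3), complex(0.9, 0.0)]
vs = [1.0, -2.0, 0.5, 3.0]
decays = [0.1, 0.3, 0.05, 0.2]
thetas = [0.4, 0.1, 0.7, 0.25]

rec = recurrent_outputs(qs, ks, vs, decays, thetas)
att = attention_outputs(qs, ks, vs, decays, thetas)
```

Setting every `thetas` entry to zero leaves only gating, and setting every `decays` entry to zero leaves only rotation, which makes it easy to switch each ingredient off in this toy setting.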

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Beyond ZOH: Advanced Discretization Strategies for Vision Mamba
cs.CV · 2026-04 · unverdicted · novelty 4.0

Bilinear discretization improves Vision Mamba accuracy over zero-order hold on classification, segmentation, and detection benchmarks with only modest extra training cost.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · cited by 1 Pith paper · 9 internal anchors

[1] Niccolò Ajroldi. plainlm: Language model pretraining in pytorch. https://github.com/Niccolo-Ajroldi/plainLM, 2024
[2] S. Arora, S. Eyuboglu, A. Timalsina, I. Johnson, M. Poli, J. Zou, A. Rudra, and C. Ré. Zoology: Measuring and Improving Recall in Efficient Language Models. In The Twelfth International Conference on Learning Representations (ICLR '24). ICLR, 2024a
[3] S. Arora, S. Eyuboglu, M. Zhang, A. Timalsina, S. Alberti, J. Zou, A. Rudra, and C. Re. Simple linear attention language models balance the recall-throughput tradeoff. In R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenkamp (eds.), Proceedings of the 41st International Conference on Machine Learning (ICML '24), v...
[4] M. Beck, K. Pöppel, M. Spanring, A. Auer, O. Prudnikova, M. Kopp, G. Klambauer, J. Brandstetter, and S. Hochreiter. xLSTM: Extended Long Short-Term Memory. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (eds.), Proceedings of the 37th International Conference on Advances in Neural Information Processing Systems (...
[5] S. Black, S. Biderman, E. Hallahan, Q. Anthony, L. Gao, L. Golding, H. He, C. Leahy, K. McDonell, J. Phang, M. Pieler, U. Prashanth, S. Purohit, L. Reynolds, J. Tow, B. Wang, and S. Weinbach. GPT-NeoX-20B: An open-source autoregressive language model. arXiv:2204.06745 [cs.CL], 2022
[6] E. Candès and C. Fernandez-Granda. Towards a mathematical theory of super-resolution. Communications on Pure and Applied Mathematics, 67(6):906–956, 2014. doi:10.1002/cpa.21455. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/cpa.21455
[7] T.-C. Chi, T.-H. Fan, P. Ramadge, and A. Rudnicky. KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Proceedings of the 35th International Conference on Advances in Neural Information Processing Systems (NeurIPS '22), 2022
[8] K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin, L. Kaiser, D. Belanger, L. Colwell, and A. Weller. Rethinking attention with performers. In The Ninth International Conference on Learning Representations (ICLR '21). ICLR, 2021
[9] K. Choromanski, H. Chen, H. Lin, Y. Ma, A. Sehanobish, D. Jain, M. Ryoo, J. Varley, A. Zeng, V. Likhosherstov, D. Kalashnikov, V. Sindhwani, and A. Weller. Hybrid Random Features. In The Tenth International Conference on Learning Representations (ICLR '22). ICLR, 2022
[10] N. M. Cirone, A. Orvieto, B. Walker, C. Salvi, and T. Lyons. Theoretical Foundations of Deep Selective State-Space Models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (eds.), Proceedings of the 37th International Conference on Advances in Neural Information Processing Systems (NeurIPS '24), 2024
[11] T. Dao. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. In The Twelfth International Conference on Learning Representations (ICLR '24). ICLR, 2024
[12] T. Dao and A. Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. In R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenkamp (eds.), Proceedings of the 41st International Conference on Machine Learning (ICML '24), volume 251 of Proceedings of Machine Learnin...
[13] T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré. FlashAttention: Fast and memory-efficient exact attention with io-awareness. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Proceedings of the 35th International Conference on Advances in Neural Information Processing Systems (NeurIPS '22), pp. 16344–16359, 2022
[14] S. De, S. L. Smith, A. Fernando, A. Botev, G. Cristian-Muraru, A. Gu, R. Haroun, L. Berrada, Y. Chen, S. Srinivasan, G. Desjardins, A. Doucet, D. Budden, Y. W. Teh, R. Pascanu, N. De Freitas, and C. Gulcehre. Griffin: Mixing gated linear recurrences with local attention for efficient language models. arXiv:2402.19427 [cs.LG], 2024
[15] L. Gao, J. Tow, B. Abbasi, S. Biderman, S. Black, A. DiPofi, C. Foster, L. Golding, J. Hsu, A. Le Noac'h, H. Li, K. McDonell, N. Muennighoff, C. Ociepa, J. Phang, L. Reynolds, H. Schoelkopf, A. Skowron, L. Sutawika, E. Tang, A. Thite, B. Wang, K. Wang, and A. Zou. The language model evaluation harness, 2024. URL https://zenodo.org/records/12608602
[16] R. Grazzi, J. Siems, A. Zela, J. Franke, F. Hutter, and M. Pontil. Unlocking State-Tracking in Linear RNNs Through Negative Eigenvalues. In The Thirteenth International Conference on Learning Representations (ICLR '25). ICLR, 2025
[17] A. Gu and T. Dao. Mamba: Linear time sequence modeling with selective state spaces. arXiv:2312.00752 [cs.LG], 2023
[18] A. Gu, T. Dao, S. Ermon, A. Rudra, and C. Re. HiPPO: Recurrent Memory with Optimal Polynomial Projections. In H. Larochelle, M. Ranzato, R. Hadsell, M.-F. Balcan, and H. Lin (eds.), Proceedings of the 33rd International Conference on Advances in Neural Information Processing Systems (NeurIPS '20), 2020
[19] A. Gu, K. Goel, and C. Re. Efficiently Modeling Long Sequences with Structured State Spaces. In The Tenth International Conference on Learning Representations (ICLR '22). ICLR, 2022a
[20] A. Gu, A. Gupta, K. Goel, and C. Ré. On the Parameterization and Initialization of Diagonal State Space Models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Proceedings of the 35th International Conference on Advances in Neural Information Processing Systems (NeurIPS '22), 2022b
[21] F. Harris. On the use of windows for harmonic analysis with the discrete fourier transform. Proceedings of the IEEE, 66(1):51–83, 2005
[22] A. Henry, P. Dachapally, S. Pawar, and Y. Chen. Query-Key Normalization for Transformers. In B. Webber, T. Cohn, Y. He, and Y. Liu (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2020
[23] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997. Based on TR FKI-207-95, TUM (1995)
[24] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, J. W. Rae, O. Vinyals, and L. Sifre. Training compute-optimal large language models. In S. Koyejo, S. Mohamed, A. ...
[25] J. Hu, Y. Pan, J. Du, D. Lan, X. Tang, Q. Wen, Y. Liang, and W. Sun. Comba: Improving Bilinear RNNs with Closed-loop Control. arXiv:2506.02475 [cs.LG], 2025
[26] J. Smith III. Spectral audio signal processing. 2011
[27] S. Jelassi, D. Brandfonbrener, S. Kakade, and E. Malach. Repeat After Me: Transformers are Better than State Space Models at Copying. In R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenkamp (eds.), Proceedings of the 41st International Conference on Machine Learning (ICML '24), volume 251 of Proceedings of Machin...
[28] A. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. Lavaud, M.-A. Lachaux, P. Stock, T. Le Scao, T. Lavril, T. Wang, T. Lacroix, and W. El Sayed. Mistral 7B. arXiv:2310.06825 [cs.CL], 2023
[29] P. Kacham, V. Mirrokni, and P. Zhong. PolySketchFormer: Fast Transformers via Sketching Polynomial Kernels. arXiv:2310.01655 [cs.LG], 2023
[30] A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. In H. Daume III and A. Singh (eds.), Proceedings of the 37th International Conference on Machine Learning (ICML '20), volume 98. Proceedings of Machine Learning Research, 2020
[31] A. Kazemnejad, I. Padhi, K. Natesan, P. Das, and S. Reddy. The impact of positional encoding on length generalization in transformers. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Proceedings of the 36th International Conference on Advances in Neural Information Processing Systems (NeurIPS '23), 2023
[32] T. Salimans and D. Kingma. Weight Normalization: A simple reparameterization to accelerate training of deep neural networks. In D. Lee, M. Sugiyama, U. von Luxburg, I. Guyon, and R. Garnett (eds.), Proceedings of the 30th International Conference on Advances in Neural Information Processing Systems (NeurIPS '16), volume 29, 2016
[33] S. Li, C. You, G. Guruganesh, J. Ainslie, S. Ontanon, M. Zaheer, S. Sanghai, Y. Yang, S. Kumar, and S. Bhojanapalli. Functional Interpolation for Relative Positions Improves Long Context Transformers. In The Twelfth International Conference on Learning Representations (ICLR '24). ICLR, 2024
[34] Z. Lin, E. Nikishin, X. He, and A. Courville. Forgetting Transformer: Softmax Attention with a Forget Gate. In The Thirteenth International Conference on Learning Representations (ICLR '25). ICLR, 2025
[35] B. Liu, J. Ash, S. Goel, A. Krishnamurthy, and C. Zhang. Transformers Learn Shortcuts to Automata. In The Eleventh International Conference on Learning Representations (ICLR '23). ICLR, 2023
[36] I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. In The Fifth International Conference on Learning Representations (ICLR '17). ICLR, 2017
[37] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In The Seventh International Conference on Learning Representations (ICLR '19). ICLR, 2019
[38] E. Martin and C. Cundy. Parallelizing linear recurrent neural nets over sequence length. In International Conference on Learning Representations, 2018
[39] W. Merrill, J. Petty, and A. Sabharwal. The Illusion of State in State-Space Models. In R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenkamp (eds.), Proceedings of the 41st International Conference on Machine Learning (ICML '24), volume 251 of Proceedings of Machine Learning Research. PMLR, 2024
[40] D. Okpekpe and A. Orvieto. When recalling in-context, Transformers are not SSMs. arXiv:2508.19029 [cs.LG], 2025
[41] A. Oppenheim. Discrete-time signal processing. Pearson Education India, 1999
[42] A. Orvieto and R. Gower. In search of adam's secret sauce. arXiv:2505.21829 [cs.LG], 2025
[43] A. Orvieto, S. L. Smith, A. Gu, A. Fernando, C. Gulcehre, R. Pascanu, and S. De. Resurrecting recurrent neural networks for long sequences. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett (eds.), Proceedings of the 40th International Conference on Machine Learning (ICML '23), volume 202 of Proceedings of Machine Learning R...
[44] A. Orvieto, S. De, C. Gulcehre, R. Pascanu, and S. Smith. Universality of Linear Recurrences Followed by Non-linear Projections: Finite-Width Guarantees and Benefits of Complex Eigenvalues. In R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenkamp (eds.), Proceedings of the 41st International Conference on Machine Le...
[45] G. Penedo, H. Kydlíček, L. Ben Allal, A. Lozhkov, M. Mitchell, C. Raffel, L. Von Werra, and T. Wolf. The fineweb datasets: Decanting the web for the finest text data at scale. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (eds.), Proceedings of the 37th International Conference on Advances in Neural Information P...
[46] B. Peng, J. Quesnelle, H. Fan, and E. Shippole. YaRN: Efficient Context Window Extension of Large Language Models. In The Twelfth International Conference on Learning Representations (ICLR '24). ICLR, 2024
[47] B. Peng, R. Zhang, D. Goldstein, E. Alcaide, X. Du, H. Hou, J. Lin, J. Liu, J. Lu, W. Merrill, G. Song, K. Tan, S. Utpala, N. Wilce, J. Wind, T. Wu, D. Wuttke, and C. Zhou-Zheng. RWKV-7 "Goose" with Expressive Dynamic State Evolution. arXiv:2503.14456 [cs.CL], 2025
[48] H. Peng, N. Pappas, D. Yogatama, R. Schwartz, N. Smith, and L. Kong. Random Feature Attention. In The Ninth International Conference on Learning Representations (ICLR '21). ICLR, 2021
[49] M. Poli, A. W. Thomas, E. Nguyen, P. Ponnusamy, B. Björn Deiseroth, K. Kersting, T. Suzuki, B. Hie, S. Ermon, C. Re, C. Zhang, and S. Massaroli. Mechanistic design and scaling of hybrid architectures. In R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenkamp (eds.), Proceedings of the 41st International Conferenc...
[50] Z. Qin, S. Yang, and Y. Zhong. Hierarchically Gated Recurrent Neural Network for Sequence Modeling. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Proceedings of the 36th International Conference on Advances in Neural Information Processing Systems (NeurIPS '23), 2023
[51] Z. Qin, S. Yang, W. Sun, X. Shen, D. Li, W. Sun, and Y. Zhong. HGRN2: Gated Linear RNNs with State Expansion. arXiv:2404.07904 [cs.CL], 2024
[52] N. Rahaman, A. Baratin, D. Arpit, F. Draxler, M. Lin, F. Hamprecht, Y. Bengio, and A. Courville. On the spectral bias of neural networks. In K. Chaudhuri and R. Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning (ICML '19), volume 97. Proceedings of Machine Learning Research, 2019
[53] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In J. Platt, D. Koller, Y. Singer, and S. Roweis (eds.), Proceedings of the 21st International Conference on Advances in Neural Information Processing Systems (NeurIPS '07), 2007
[54]

Ran-Milo, E

Y. Ran-Milo, E. Lumbroso, E. Cohen-Karlik, R. Giryes, A. Globerson, and N. Cohen. Provable Benefits of Complex Parameterizations for Structured State Space Models . In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (eds.), Proceedings of the 37th International Conference on Advances in Neural Information Processing Syste...

[55]

I. Schlag, K. Irie, and J. Schmidhuber. Linear transformers are secretly fast weight programmers. In M. Meila and T. Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning (ICML '21), volume 139 of Proceedings of Machine Learning Research. PMLR, 2021

[56]

P. Shaw, J. Uszkoreit, and A. Vaswani. Self-attention with relative position representations. arXiv:1803.02155 [cs.CL], 2018

[57]

J. Siems, T. Carstensen, A. Zela, F. Hutter, M. Pontil, and R. Grazzi. DeltaProduct: Increasing the expressivity of deltanet through products of householders. arXiv:2502.10297 [cs.LG], 2025

[58]

J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, and Y. Liu. Roformer: Enhanced transformer with rotary position embedding. arXiv:2104.09864 [cs.CL], 2021

[59]

Y. Sun, L. Dong, S. Huang, S. Ma, Y. Xia, J. Xue, J. Wang, and F. Wei. Retentive Network: A Successor to Transformer for Large Language Models. arXiv:2307.08621 [cs.CL], 2023

[60]

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. LLaMA: Open and efficient foundation language models. arXiv:2302.13971 [cs.CL], 2023

[61]

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In I. Guyon, U. von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Proceedings of the 31st International Conference on Advances in Neural Information Processing Systems (NeurIPS '17). Curran ...

[62]

R. Waleffe, W. Byeon, D. Riach, B. Norick, V. Korthikanti, T. Dao, A. Gu, A. Hatamizadeh, S. Singh, D. Narayanan, G. Kulshreshtha, V. Singh, J. Casper, J. Kautz, M. Shoeybi, and B. Catanzaro. An Empirical Study of Mamba-based Language Models. arXiv:2406.07887 [cs.LG], 2024

[63]

B. Widrow and M. E. Hoff. Adaptive switching circuits, pp. 123–134. MIT Press, Cambridge, MA, USA, 1988

[64]

S. Yang and Y. Zhang. Fla: A triton-based library for hardware-efficient implementations of linear attention mechanism, January 2024. URL https://github.com/fla-org/flash-linear-attention

[65]

S. Yang, B. Wang, Y. Shen, R. Panda, and Y. Kim. Gated Linear Attention Transformers with Hardware-Efficient Training. In R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenkamp (eds.), Proceedings of the 41st International Conference on Machine Learning (ICML '24), volume 251 of Proceedings of Machine Learning Rese...

[66]

S. Yang, B. Wang, Y. Zhang, Y. Shen, and Y. Kim. Parallelizing Linear Transformers with the Delta Rule over Sequence Length. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (eds.), Proceedings of the 37th International Conference on Advances in Neural Information Processing Systems (NeurIPS '24), 2024b

[67]

S. Yang, J. Kautz, and A. Hatamizadeh. Gated delta networks: Improving mamba2 with delta rule. In The Thirteenth International Conference on Learning Representations (ICLR '25). ICLR, 2025a

[68]

S. Yang, Y. Shen, K. Wen, S. Tan, M. Mishra, L. Ren, R. Panda, and Y. Kim. PaTH Attention: Position encoding via accumulating householder transformations. arXiv:2505.16381 [cs.CL], 2025b

[69]

M. Zhang, K. Bhatia, H. Kumbong, and C. Ré. The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry. In The Twelfth International Conference on Learning Representations (ICLR '24). ICLR, 2024

