Selective Rotary Position Embedding
Pith reviewed 2026-05-17 20:33 UTC · model grok-4.3
The pith
Selective RoPE replaces fixed rotation angles in position embeddings with input-dependent ones that work for both linear and softmax transformers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Selective RoPE is an input-dependent rotary embedding mechanism that generalizes RoPE and enables rotation by arbitrary angles in both linear and softmax transformers. The paper makes two further observations. First, softmax attention already performs a hidden form of these rotations on query-key pairs. Second, in state-space models and gated linear transformers, the real part manages forgetting while the imaginary part encodes positions through rotations.
What carries the argument
Selective RoPE, the input-dependent rotary embedding that computes rotation angles from the current input rather than fixing them in advance.
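To make the contrast between fixed and input-dependent angles concrete, here is a minimal sketch for a single two-dimensional channel pair, stored as a complex number. It is not the paper's formulation, which this page does not reproduce. The assumptions are: the rotation angle at a position is the running sum of per-token increments, and each increment is an affine function of a scalar input feature. The names `increments_from_inputs`, `cumulative_angles`, `score`, `w`, and `b` are all hypothetical. Under these assumptions, setting `w = 0` makes every increment equal to `b`, so the relative rotation between two positions matches standard fixed-angle RoPE.

```python
import cmath


def increments_from_inputs(xs, w, b):
    # Hypothetical per-token angle increment: an affine map of a scalar feature.
    # With w == 0 every increment equals b, i.e. a fixed angle step.
    return [w * x + b for x in xs]


def cumulative_angles(increments):
    # Angle at position t is the running sum of increments up to and including t.
    angles, total = [], 0.0
    for inc in increments:
        total += inc
        angles.append(total)
    return angles


def score(q, k, theta_q, theta_k):
    # Real inner product of the two rotated 2-D vectors (stored as complex numbers).
    rq = q * cmath.exp(1j * theta_q)
    rk = k * cmath.exp(1j * theta_k)
    return (rq * rk.conjugate()).real


xs = [0.3, -1.2, 0.8, 2.0]  # made-up scalar token features
q, k = complex(1.0, 0.5), complex(-0.4, 0.9)
omega = 0.1

fixed = cumulative_angles(increments_from_inputs(xs, w=0.0, b=omega))
selective = cumulative_angles(increments_from_inputs(xs, w=0.5, b=omega))

# Query at position 3, key at position 1.
fixed_score = score(q, k, fixed[3], fixed[1])
rope_score = score(q, k, 3 * omega, 1 * omega)  # textbook RoPE angles t * omega
selective_score = score(q, k, selective[3], selective[1])

print(fixed_score, rope_score, selective_score)
```

The first two printed values agree up to floating-point error, while the third differs because the tokens between the two positions changed the accumulated angle.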
If this is right
- Gated transformers equipped with Selective RoPE achieve better performance on language modeling.
- The approach yields improvements on difficult sequence tasks such as copying, state tracking, and retrieval.
- Softmax attention implicitly applies input-dependent rotations to query-key pairs.
- In state-space models and gated linear transformers, the real part handles forgetting while the imaginary part encodes positions via rotations.
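The real/imaginary split in the last point can be illustrated with a scalar complex recurrence. This is a sketch under one reading of the claim, namely that "real part" and "imaginary part" refer to the exponent of a complex diagonal state transition, `exp(-lambda_t + i * omega_t)`. The function name and the numeric values are hypothetical. Feeding in a single impulse shows that the magnitude of the state shrinks according to the accumulated real parts (forgetting), while its phase advances according to the accumulated imaginary parts (rotation).

```python
import cmath
import math


def run_complex_recurrence(inputs, decay_rates, angle_steps):
    # Scalar recurrence h_t = exp(-lambda_t + i * omega_t) * h_{t-1} + x_t.
    h = 0j
    states = []
    for x, lam, om in zip(inputs, decay_rates, angle_steps):
        transition = cmath.exp(complex(-lam, om))
        h = transition * h + x
        states.append(h)
    return states


inputs = [1.0, 0.0, 0.0, 0.0]       # a single impulse at t = 0
decay_rates = [0.0, 0.2, 0.1, 0.3]  # real parts of the exponent (forgetting)
angle_steps = [0.0, 0.4, 0.3, 0.5]  # imaginary parts of the exponent (rotation)

states = run_complex_recurrence(inputs, decay_rates, angle_steps)

magnitude = abs(states[-1])
phase = cmath.phase(states[-1])
expected_magnitude = math.exp(-sum(decay_rates[1:]))
expected_phase = sum(angle_steps[1:])

print(magnitude, expected_magnitude)
print(phase, expected_phase)
```

In a selective model the per-step values would be computed from the input rather than listed by hand; the lists above only stand in for that computation.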
Where Pith is reading between the lines
- The method could be applied to non-gated transformers to check whether the gains extend beyond gated architectures.
- Combining Selective RoPE with other selective state mechanisms might produce hybrids that handle longer contexts more efficiently.
- Testing on much longer sequences would reveal whether input-dependent angles reduce position-related degradation better than fixed embeddings.
- The implicit rotational structure uncovered in attention might prompt new ways to interpret order capture in transformers without explicit positional signals.
Load-bearing premise
Input-dependent rotations will reliably improve performance on language modeling and sequence tasks without introducing instability, overfitting, or requiring extensive hyperparameter tuning across different model scales.
What would settle it
Training gated transformers with Selective RoPE on standard language modeling and sequence benchmarks and finding no consistent gains or increased instability compared to fixed RoPE would disprove the claimed benefits.
Original abstract
Position information is essential for language modeling. In softmax transformers, Rotary Position Embeddings (RoPE) encode positions through fixed-angle rotations, while in linear transformers, order is handled via input-dependent (selective) gating that decays past key-value associations. Selectivity has generally been shown to improve language-related tasks. Inspired by this, we introduce Selective RoPE, an input-dependent rotary embedding mechanism, that generalizes RoPE, and enables rotation in arbitrary angles for both linear and softmax transformers. We show that softmax attention already performs a hidden form of these rotations on query-key pairs, uncovering an implicit positional structure. We further show that in state-space models and gated linear transformers, the real part manages forgetting while the imaginary part encodes positions through rotations. We validate our method by equipping gated transformers with Selective RoPE, demonstrating that its input-dependent rotations improve performance in language modeling and on difficult sequence tasks like copying, state tracking, and retrieval.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Selective RoPE, an input-dependent rotary position embedding mechanism that generalizes standard fixed-angle RoPE to support arbitrary rotation angles. It applies this to both linear and softmax transformers, claims that standard softmax attention implicitly performs a form of these input-dependent rotations on query-key pairs, and shows that in state-space and gated linear models the real/imaginary parts separately handle forgetting and positional encoding. The method is validated by integrating Selective RoPE into gated transformers, with reported empirical improvements on language modeling and sequence tasks including copying, state tracking, and retrieval.
Significance. If the input-dependent rotations can be shown to preserve (or explicitly relax) RoPE's relative-position inductive bias while delivering the claimed gains, the work would usefully connect rotary embeddings with selective mechanisms already successful in linear attention. The observation that softmax attention performs hidden rotations is potentially insightful for understanding implicit positional structure, and the empirical results on retrieval and state-tracking tasks suggest practical value for long-context modeling if the gains hold under controlled ablations.
major comments (2)
- [Introduction and §3] (method definition): the claim that Selective RoPE 'generalizes RoPE' and 'enables rotation in arbitrary angles' is not accompanied by an explicit statement or proof that the input-dependent angle θ_i(x_m, x_n) still yields an effective rotation depending only on relative offset (m-n). Without such a constraint or derivation, the attention score loses the translation invariance that is the core inductive bias of standard RoPE (Eq. (1) in the original RoPE formulation). This directly affects whether gains on copying/retrieval will transfer to standard language modeling.
- [§4.2] (empirical validation): the reported improvements on language modeling and sequence tasks lack ablations that isolate the effect of input-dependent angles from other changes in the gated transformer architecture. In particular, it is unclear whether performance gains persist when the selective angles are replaced by fixed but learned angles, which would test whether the input-dependence itself (rather than simply more flexible rotations) is load-bearing.
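The translation-invariance concern in the first major comment can be checked numerically on a toy case. The sketch below is illustrative only: `relative_score`, `fixed_delta`, and `selective_delta` are hypothetical helpers, and the selective variant assumes the relative angle between positions n and m is the sum of per-token increments over positions n+1..m, which is an assumption of this example and not a formula taken from the paper. With a fixed angle step, shifting both positions by the same amount leaves the score unchanged. With content-dependent increments, the score depends on which tokens lie between the two positions.

```python
import math


def relative_score(q, k, delta):
    # Inner product of 2-D vectors after rotating q by `delta` relative to k.
    qx, qy = q
    kx, ky = k
    c, s = math.cos(delta), math.sin(delta)
    return (c * qx - s * qy) * kx + (s * qx + c * qy) * ky


def fixed_delta(m, n, omega):
    # Standard RoPE: the relative angle depends only on the offset m - n.
    return (m - n) * omega


def selective_delta(m, n, increments):
    # Assumed selective variant: sum of per-token increments over n+1..m.
    return sum(increments[n + 1 : m + 1])


q, k = (1.0, 0.5), (-0.4, 0.9)
omega = 0.1
# Stand-ins for input-dependent increments at positions 0..5 (made-up values).
increments = [0.05, 0.30, -0.10, 0.20, 0.45, 0.15]

# Same offset (2), two different absolute placements.
fixed_a = relative_score(q, k, fixed_delta(3, 1, omega))
fixed_b = relative_score(q, k, fixed_delta(5, 3, omega))
sel_a = relative_score(q, k, selective_delta(3, 1, increments))
sel_b = relative_score(q, k, selective_delta(5, 3, increments))

print(fixed_a, fixed_b)  # equal: the score depends only on the offset
print(sel_a, sel_b)      # differ: the score depends on the intervening tokens
```

Under this assumed parameterization the selective score is still independent of absolute position in one sense: it depends only on the tokens between key and query. Whether the paper's formulation has that property is exactly what the referee asks the authors to state.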
minor comments (2)
- Notation for the selective angle function should be introduced once and used consistently; currently the dependence on both query and key (or on single token) is described differently across the abstract, introduction, and method sections.
- The statement that 'softmax attention already performs a hidden form of these rotations' would benefit from a short derivation or explicit mapping to the standard QK dot-product under RoPE, rather than leaving it as an observation.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our paper 'Selective Rotary Position Embedding'. We have carefully considered the major comments and provide point-by-point responses below, along with planned revisions to strengthen the manuscript.
- Referee: [Introduction and §3] (method definition): the claim that Selective RoPE 'generalizes RoPE' and 'enables rotation in arbitrary angles' is not accompanied by an explicit statement or proof that the input-dependent angle θ_i(x_m, x_n) still yields an effective rotation depending only on relative offset (m-n). Without such a constraint or derivation, the attention score loses the translation invariance that is the core inductive bias of standard RoPE (Eq. (1) in the original RoPE formulation). This directly affects whether gains on copying/retrieval will transfer to standard language modeling.
  Authors: We thank the referee for highlighting this important point regarding the inductive bias. Upon reflection, the current formulation of Selective RoPE allows the rotation angle to depend on the specific input tokens x_m and x_n, which indeed means it does not strictly enforce dependence only on the relative position (m-n) as in standard RoPE. This is intentional to introduce selectivity similar to gating mechanisms. However, we agree that an explicit discussion or derivation is missing. In the revised manuscript, we will add a clarification in Section 3 explaining how the input-dependent rotations relate to relative positions, including any preserved or relaxed properties, and discuss implications for transfer to language modeling tasks. We believe this will address the concern while maintaining the novelty of the selective approach.
  Revision: yes
- Referee: [§4.2] (empirical validation): the reported improvements on language modeling and sequence tasks lack ablations that isolate the effect of input-dependent angles from other changes in the gated transformer architecture. In particular, it is unclear whether performance gains persist when the selective angles are replaced by fixed but learned angles, which would test whether the input-dependence itself (rather than simply more flexible rotations) is load-bearing.
  Authors: We agree that isolating the input-dependence is crucial for validating the contribution of Selective RoPE. The current experiments integrate Selective RoPE into gated transformers but do not include the suggested ablation with fixed learned angles. We will add these ablations in the revised version, comparing Selective RoPE against variants with fixed but learned rotation angles on the language modeling, copying, state tracking, and retrieval tasks. This will help demonstrate whether the dynamic, input-dependent nature provides additional benefits beyond increased flexibility.
  Revision: yes
Circularity Check
No significant circularity; Selective RoPE introduced as independent generalization
The provided abstract and description present Selective RoPE as a novel input-dependent rotary mechanism that generalizes fixed-angle RoPE and reveals implicit rotations already latent in softmax attention. No load-bearing derivation step is shown to reduce by construction to a fitted input, self-citation chain, or renamed ansatz. The claims rest on the proposed mechanism's ability to enable arbitrary-angle rotations for both linear and softmax transformers, with empirical validation on language modeling and sequence tasks offered as external support rather than tautological prediction. The derivation chain is therefore self-contained against the stated benchmarks.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/BranchSelection.lean, branch_selection (echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Paper passage: "A performant linear transformer requires both: (a) rotation and (b) gating."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Beyond ZOH: Advanced Discretization Strategies for Vision Mamba
  Bilinear discretization improves Vision Mamba accuracy over zero-order hold on classification, segmentation, and detection benchmarks with only modest extra training cost.