mGRADE: Minimal Recurrent Gating Meets Delay Convolutions for Lightweight Sequence Modeling

Christian Metzner; Jimmy Weber; Karthik Charan Raghunathan; Laura Kriener; Melika Payvand; Sebastian Billaudelle; Tristan Torchet

arXiv: 2507.01829 · v2 · submitted 2025-07-02 · cs.LG · cs.AI


Pith reviewed 2026-05-19 05:48 UTC · model grok-4.3


keywords: sequence modeling, delay embedding, gated recurrent networks, lightweight architectures, long-range dependencies, multi-timescale modeling, edge-device inference, convolutional recurrence


The pith

mGRADE combines learnable delay embeddings with minimal gated recurrence to model fast and slow sequence dynamics inside a fixed memory budget.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents mGRADE as a hybrid architecture for multi-timescale sequence modeling that must stay within the tight memory limits of edge devices. It pairs a convolution whose temporal spacings can be learned with a lightweight gated recurrent unit. Theory shows the learnable spacings act as a delay embedding that reconstructs partially observed fast dynamics efficiently. The gated recurrent part keeps long-range context with almost no extra memory cost. Benchmarks on the Long-Range Arena and raw-audio speech commands confirm competitive accuracy at up to eight times lower memory than prior state-of-the-art models.
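The delay-embedding reading of the convolution can be written out as below. This is a sketch in assumed notation, not the paper's own derivation: the symbols \phi, h, \tau_k, w_k, m and d are introduced here for illustration, and the dimension condition is the classical Takens bound for uniform delays. Any bound the paper states for non-uniform, learned delays is not reproduced on this page and is not claimed here.

```latex
% Assumed notation (not taken from the paper):
%   \phi_t : flow of the underlying dynamics on a d-dimensional manifold M
%   h : M \to \mathbb{R}, the scalar observation, with u_t = h(x_t)
%   0 \le \tau_1 < \dots < \tau_m : the (learnable) delays
\[
  \Phi(x_t) \;=\; \bigl(u_{t-\tau_1},\, u_{t-\tau_2},\, \dots,\, u_{t-\tau_m}\bigr)
  \;=\; \bigl(h(\phi_{-\tau_1}(x_t)),\, \dots,\, h(\phi_{-\tau_m}(x_t))\bigr)
\]
% A convolution with learnable spacings is a linear readout of this delay vector:
\[
  y_t \;=\; \sum_{k=1}^{m} w_k\, u_{t-\tau_k} \;=\; \langle w, \Phi(x_t) \rangle
\]
% Takens' theorem (uniform delays \tau_k = (k-1)\tau): for generic (\phi, h),
% \Phi is an embedding of M whenever
\[
  m \;\ge\; 2d + 1 .
\]
```

Read this way, the number of taps m plays the role of the embedding dimension, which is why the referee report further down asks for explicit conditions on dimension and delay separation.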

Core claim

mGRADE integrates a convolution with learnable temporal spacings, proven equivalent to a delay embedding, and a minimally gated recurrent component; this combination reconstructs fast dynamics parameter-efficiently while selectively retaining long-range context, all within a constant memory footprint that is up to eight times smaller than existing models on long-range tasks.

What carries the argument

The learnable temporal spacings inside the delay convolution, shown to be equivalent to a delay embedding that enables parameter-efficient reconstruction of fast dynamics, together with the minimal gated recurrent unit that maintains long-range context.
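To make the two components concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. It assumes scalar channels, fractional delays read by linear interpolation, and a minGRU-style gate whose update and candidate depend only on the current input. The names `delayed_tap`, `delay_conv`, and `min_gated_scan`, and all parameter values, are hypothetical.

```python
import math


def delayed_tap(x, t, delay):
    """Read x at fractional position t - delay by linear interpolation.

    Indices outside [0, t] read as 0.0, so the tap is causal and zero-padded.
    """
    pos = t - delay                      # delay >= 0 keeps pos <= t
    lo = math.floor(pos)
    frac = pos - lo

    def at(i):
        return x[i] if 0 <= i <= t else 0.0

    return (1.0 - frac) * at(lo) + frac * at(lo + 1)


def delay_conv(x, weights, delays):
    """y[t] = sum_k weights[k] * x(t - delays[k]); in training the delays would be learned."""
    return [
        sum(w * delayed_tap(x, t, d) for w, d in zip(weights, delays))
        for t in range(len(x))
    ]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def min_gated_scan(u, wz, bz, wh, bh, h0=0.0):
    """h[t] = (1 - z[t]) * h[t-1] + z[t] * cand[t], with z and cand computed from u[t] only."""
    h, out = h0, []
    for u_t in u:
        z = sigmoid(wz * u_t + bz)       # fraction of the state to overwrite
        cand = wh * u_t + bh             # candidate state (assumed linear in the input)
        h = (1.0 - z) * h + z * cand
        out.append(h)
    return out


# Hypothetical toy input and parameters, chosen only so the arithmetic is easy to follow.
x = [1.0, 2.0, 3.0, 4.0]
features = delay_conv(x, weights=[1.0, 1.0], delays=[0.0, 2.0])
states = min_gated_scan(features, wz=0.0, bz=0.0, wh=1.0, bh=0.0)
print(features)
print(states)
```

In a streaming setting this sketch only needs the most recent ceil(max delay) + 1 input samples and one hidden value per channel, so its working memory does not grow with sequence length. Whether mGRADE's actual buffers are laid out this way is not stated on this page.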

If this is right

Models can now handle both fast local dynamics and slow global context without expanding memory footprint.
Parameter count for fast-dynamics reconstruction drops because the delay-embedding equivalence removes the need for explicit high-dimensional state.
Long-range selectivity is preserved by the gated recurrent unit while memory overhead stays near-constant.
The architecture scales to edge-device constraints where prior constant-memory models had to sacrifice either speed or range.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar delay-convolution plus minimal-gating blocks could be inserted into existing transformers to reduce their KV-cache size on long sequences.
The equivalence to delay embeddings suggests analytic stability bounds might be derivable for the combined system.
On streaming sensor data the same inductive bias could allow accurate reconstruction from sparse, irregularly timed observations.

Load-bearing premise

The theoretical equivalence between learnable temporal spacings and delay embeddings will produce real gains in reconstruction accuracy and memory use without hidden costs to training stability or generalization on real data.

What would settle it

An experiment on the Long-Range Arena where mGRADE either exceeds the memory budget of competing models or shows clearly lower accuracy when both are forced to the same small memory limit.

Figures

Figures reproduced from arXiv: 2507.01829 by Christian Metzner, Jimmy Weber, Karthik Charan Raghunathan, Laura Kriener, Melika Payvand, Sebastian Billaudelle, Tristan Torchet.

**Figure 2.** mGRADE-L reconstructs a diffeomorphic mapping of the input dynamics.

**Figure 3.** mGRADE-L predictively models flip-flop languages. A)

**Figure 4.** Memory footprint decomposition of TCN-EID and Accuracy vs Mem

Original abstract

Multi-timescale sequence modeling relies on capturing both local fast dynamics and global slow context; yet, maintaining these capabilities under the strict memory constraints common to edge devices remains an open challenge. Current State-of-the-Art models with constant memory footprints trade off long-range selectivity and high-precision modeling of fast dynamics. To overcome this trade-off within a fixed memory budget, we propose mGRADE (minimally Gated Recurrent Architecture with Delay Embedding), a hybrid-memory system that introduces inductive biases across timescales by integrating a convolution with learnable temporal spacings with a lightweight gated recurrent component. We show theoretically that the learnable spacings are equivalent to a delay embedding, enabling parameter-efficient reconstruction of partially-observed fast dynamics, while the gated recurrent component selectively maintains long-range context with minimal memory overhead. On the challenging Long-Range Arena benchmark and 35-way Google Speech Commands raw audio classification task, mGRADE reduces the memory footprint by up to a factor of 8 compared to other State-of-the-Art models, while maintaining competitive performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

mGRADE pairs learnable-spaced convolutions with minimal gated recurrence for strong memory savings on sequence tasks, but the delay-embedding equivalence lacks direct reconstruction tests.


The core of this paper is a hybrid that adds convolutions with learnable temporal spacings to a minimal gated recurrent unit. The authors frame the spacings as equivalent to a delay embedding, which they say allows efficient reconstruction of fast dynamics while the gate handles long-range context with low memory overhead. On LRA and the 35-way Speech Commands task they report up to 8x memory reduction versus other constant-memory models while staying competitive on accuracy. That practical result is the clearest contribution here.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces mGRADE, a hybrid sequence model that pairs a convolution with learnable temporal spacings and a lightweight gated recurrent component. It asserts a theoretical equivalence between the learnable spacings and delay embeddings that enables parameter-efficient reconstruction of partially observed fast dynamics, while the recurrent gate maintains long-range context under tight memory budgets. Competitive accuracy is reported on the Long-Range Arena benchmark and the 35-way Google Speech Commands raw-audio task, together with up to 8× memory reduction relative to prior constant-memory SOTA models.

Significance. If the claimed equivalence is rigorously established and the reconstruction benefit is observable in practice, the architecture could meaningfully advance memory-constrained multi-timescale modeling. The hybrid inductive bias directly targets a recognized trade-off between long-range selectivity and high-precision fast dynamics. The absence of a derivation, ablation evidence, and direct reconstruction metrics, however, leaves the central efficiency argument unverified at present.

major comments (2)

[Abstract / §3] Abstract and §3 (theoretical claim): the statement that 'the learnable spacings are equivalent to a delay embedding' is presented as a theoretical result enabling parameter-efficient reconstruction, yet no derivation, proof sketch, or set of embedding conditions (dimension, separation, attractor coverage) is supplied. Because this equivalence is load-bearing for the headline memory-efficiency argument, its absence prevents assessment of whether the learned spacings satisfy the necessary conditions or merely approximate them.
[§4] §4 (experiments): only end-task accuracy and memory footprint are reported on LRA and Speech Commands. No ablation isolating the delay-embedding component, no error bars across random seeds, and no direct reconstruction-error measurements on partially observed trajectories are provided. Without these, it is impossible to confirm that the theoretical equivalence yields measurable reconstruction gains rather than incidental performance.

minor comments (2)

[Title / Abstract] The acronym expansion in the title ('Minimal Recurrent Gating Meets Delay Convolutions') differs slightly from the abstract ('minimally Gated Recurrent Architecture with Delay Embedding'); a consistent expansion would improve clarity.
[§2 / §3] Notation for the learnable spacings and the gated recurrent state should be introduced once with explicit dimensions and update equations to avoid ambiguity when the two components are combined.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments. We address each major point below, clarifying the theoretical equivalence and committing to strengthened experimental validation in the revision.


Referee: [Abstract / §3] Abstract and §3 (theoretical claim): the statement that 'the learnable spacings are equivalent to a delay embedding' is presented as a theoretical result enabling parameter-efficient reconstruction, yet no derivation, proof sketch, or set of embedding conditions (dimension, separation, attractor coverage) is supplied. Because this equivalence is load-bearing for the headline memory-efficiency argument, its absence prevents assessment of whether the learned spacings satisfy the necessary conditions or merely approximate them.

Authors: We agree that an explicit derivation strengthens the central claim. Section 3 derives the equivalence by showing that convolution kernels with learnable spacings implement a non-uniform delay embedding: each output channel samples the input at a learned offset, reconstructing the unobserved fast state from partial observations under the conditions of Takens' theorem (sufficient embedding dimension and separation). We will add a concise proof sketch, the precise embedding-dimension bound, and a statement of the attractor-coverage assumption to §3 in the revision. revision: yes
Referee: [§4] §4 (experiments): only end-task accuracy and memory footprint are reported on LRA and Speech Commands. No ablation isolating the delay-embedding component, no error bars across random seeds, and no direct reconstruction-error measurements on partially observed trajectories are provided. Without these, it is impossible to confirm that the theoretical equivalence yields measurable reconstruction gains rather than incidental performance.

Authors: We accept that additional controls are needed. In the revision we will report mean and standard deviation over five random seeds for all LRA and Speech Commands results. We will also add an ablation that replaces learnable spacings with fixed uniform spacings while keeping the gated recurrent component unchanged, thereby isolating the delay-embedding contribution. Direct reconstruction error on synthetic partially observed trajectories will be included in an appendix to quantify the practical benefit of the equivalence. revision: yes
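A toy version of the promised reconstruction-error measurement can be sketched as follows. This is an illustration under stated assumptions, not the authors' experiment: the system is a unit-frequency oscillator with hidden coordinate sin(t), only x(t) = cos(t) is observed, and the function name `reconstruction_rmse` and the fixed measurement error `eps` are hypothetical. Since cos(t - tau) = cos(t)cos(tau) + sin(t)sin(tau), the hidden coordinate is a linear function of the two-tap delay vector (x(t), x(t - tau)).

```python
import math


def reconstruction_rmse(tau, eps=0.0, n=200, dt=0.05):
    """RMSE between the hidden coordinate sin(t) and its estimate from (x(t), x(t - tau)).

    x(t) = cos(t) is the only observed signal; eps is a fixed error added to the
    delayed sample to show how the choice of delay affects noise amplification.
    """
    sq_err = 0.0
    for i in range(n):
        t = i * dt
        x_now = math.cos(t)
        x_past = math.cos(t - tau) + eps
        # Invert cos(t - tau) = cos(t)cos(tau) + sin(t)sin(tau) for sin(t).
        hidden_est = (x_past - math.cos(tau) * x_now) / math.sin(tau)
        sq_err += (hidden_est - math.sin(t)) ** 2
    return math.sqrt(sq_err / n)


for tau in (math.pi / 2, 0.05):
    print(tau, reconstruction_rmse(tau), reconstruction_rmse(tau, eps=0.01))
```

With a clean signal both delays recover the hidden coordinate up to rounding. With the perturbed delayed sample, the error scales as eps / |sin(tau)|, so the short delay amplifies it roughly twentyfold. That is the kind of effect an ablation comparing learned and fixed uniform spacings would be expected to probe.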

Circularity Check

0 steps flagged

Derivation chain is self-contained with no reduction to fitted inputs or self-citations


The central theoretical claim is that learnable temporal spacings in the convolution are equivalent to a delay embedding, shown as a general mathematical result rather than derived from or fitted to the experimental outcomes. This equivalence is invoked to explain parameter efficiency for fast dynamics reconstruction, but the paper does not define the spacings in terms of the embedding or vice versa, nor does it rename a fitted quantity as a prediction. Benchmarks on LRA and Speech Commands report end-task metrics independently of any self-referential proof. No load-bearing step reduces by construction to the inputs; the derivation remains independent of the specific learned values and does not rely on self-citation chains for uniqueness or ansatz.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on two things: an equivalence between learnable spacings and delay embeddings that the abstract asserts but does not prove, and the empirical claim that the hybrid stays competitive under a fixed memory budget. Both rest on standard convolution and recurrence assumptions plus the new learnable parameters.

free parameters (1)

learnable temporal spacings
Spacings in the convolution are learned from data to realize the delay embedding.

axioms (1)

domain assumption: Learnable spacings in convolution are mathematically equivalent to a delay embedding
Invoked to justify parameter-efficient reconstruction of fast dynamics.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: "We show theoretically that the learnable spacings are equivalent to a delay embedding, enabling parameter-efficient reconstruction of partially-observed fast dynamics"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective · unclear
Paper passage: "mGRADE-L stores the value after the last w in one part of its hidden state while the other merely reproduces the input. Learnable delays trigger updates..."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

o ppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael K Kopp, G \

Maximilian Beck, Korbinian P \"o ppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael K Kopp, G \"u nter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. x LSTM : Extended long short-term memory. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=ARAxPPIAhq

work page 2024

[2] [2]

Learning long-term dependencies with gradient descent is difficult.IEEE Transactions on Neural Networks, 5(2):157–166, 1994

Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5 0 (2): 0 157--166, 1994. doi:10.1109/72.279181

work page doi:10.1109/72.279181 1994

[3] [3]

On the A bility and L imitations of T ransformers to R ecognize F ormal L anguages

Satwik Bhattamishra, Kabir Ahuja, and Navin Goyal. On the A bility and L imitations of T ransformers to R ecognize F ormal L anguages. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7096--7116, Online, November 2020. Association for Compu...

work page doi:10.18653/v1/2020.emnlp-main.576 2020

[4] [4]

Minimalist: switched-capacitor circuits for efficient in-memory computation of gated recurrent units, 2025

Sebastian Billaudelle, Laura Kriener, Filippo Moro, Tristan Torchet, and Melika Payvand. Minimalist: switched-capacitor circuits for efficient in-memory computation of gated recurrent units, 2025. URL https://arxiv.org/abs/2505.08599

work page arXiv 2025

[5] Guy E. Blelloch. Prefix sums and their applications. Technical Report CMU-CS-90-190, School of Computer Science, Carnegie Mellon University, November 1990

[6] James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. Quasi-recurrent neural networks. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=H1zJ-v5xl

[7] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/jax-ml/jax

[8] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Alessandro Moschitti, Bo Pang, and Walter Daelemans, editors, Proceedings of the 2014 Conference on Empirical Methods in Natural ... doi:10.3115/v1/d14-1179

[9] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning, December 2014, 2014

[10] Tri Dao, Daniel Y. Fu, Khaled Kamal Saab, Armin W. Thomas, Atri Rudra, and Christopher Ré. Hungry hungry hippos: Towards language modeling with state space models. CoRR, abs/2212.14052, 2022. doi:10.48550/arXiv.2212.14052. URL https://doi.org/10.48550/arXiv.2212.14052

[11] Leo Feng, Frederick Tung, Mohamed Osama Ahmed, Yoshua Bengio, and Hossein Hajimirsadeghi. Were RNNs all we needed?, 2025. URL https://openreview.net/forum?id=GrmFFxGnOR

[12] Felix A. Gers, Jürgen Schmidhuber, and Fred A. Cummins. Learning to forget: Continual prediction with LSTM. Neural Comput., 12(10):2451–2471, 2000. doi:10.1162/089976600300015015. URL https://doi.org/10.1162/089976600300015015

[13] Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. In The International Conference on Learning Representations (ICLR), 2022

[14] Ilyass Hammouamri, Ismail Khalfaoui-Hassani, and Timothée Masquelier. Learning delays in spiking neural networks using dilated convolutions with learnable spacings. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=4r2ybzJnmN

[15] Ismail Khalfaoui Hassani, Thomas Pellegrini, and Timothée Masquelier. Dilated convolution with learnable spacings. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=Q3-1vRh3HOA

[16] Jonathan Heek, Anselm Levskaya, Avital Oliver, Marvin Ritter, Bertrand Rondepierre, Andreas Steiner, and Marc van Zee. Flax: A neural network library and ecosystem for JAX, 2024. URL http://github.com/google/flax

[17] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997

[18] Katherine Howitt, Sathvik Nair, Allison Dods, and Robert Melvin Hopkins. Generalizations across filler-gap dependencies in neural language models. In Libby Barak and Malihe Alikhani, editors, Proceedings of the 28th Conference on Computational Natural Language Learning, pages 269–279, Miami, FL, USA, November 2024. Association for Computational Linguisti... doi:10.18653/v1/2024.conll-1.21

[19] Rob J. Hyndman and Anne B. Koehler. Another look at measures of forecast accuracy. International Journal of Forecasting, 22(4):679–688, 2006. ISSN 0169-2070. doi:10.1016/j.ijforecast.2006.03.001. URL https://www.sciencedirect.com/science/article/pii/S0169207006000239

[20] Ismail Khalfaoui-Hassani, Thomas Pellegrini, and Timothée Masquelier. Learning delays in spiking neural networks using dilated convolutions with learnable spacings. In Differentiable Almost Everything Workshop of the 40th International Conference on Machine Learning, 2023. URL https://arxiv.org/abs/2306.00817

[21] Kenneth Krohn and John Rhodes. Algebraic theory of machines, 1965

[22] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/

[23] Yuhong Li, Tianle Cai, Yi Zhang, Deming Chen, and Debadeepta Dey. What makes convolutional models great on long sequence modeling? In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=TGJSPbRpJX-

[24] Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, and Cyril Zhang. Exposing attention glitches with flip-flop language modeling. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=VzmpXQAn6E

[25] Edward Norton Lorenz. Deterministic nonperiodic flow. Journal of the Atmospheric Sciences, 20:130–141, 1963

[26] Eric Martin and Chris Cundy. Parallelizing linear recurrent neural nets over sequence length. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018. URL https://openreview.net/forum?id=HyUNwulC-

[27] Tomas Mikolov and Geoffrey Zweig. Context dependent recurrent neural network language model. In 2012 IEEE Spoken Language Technology Workshop (SLT), pages 234–239, 2012. doi:10.1109/SLT.2012.6424228

[28] Michael Mozer. Neural net architectures for temporal sequence processing. Santa Fe Institute Studies in The Sciences of Complexity, 15:243–243, 03 1993

[29] Antonio Orvieto, Samuel L. Smith, Albert Gu, Anushan Fernando, Çaglar Gülçehre, Razvan Pascanu, and Soham De. Resurrecting recurrent neural networks for long sequences. CoRR, abs/2303.06349, 2023. doi:10.48550/arXiv.2303.06349. URL https://doi.org/10.48550/arXiv.2303.06349

[30] Antonio Orvieto, Soham De, Caglar Gulcehre, Razvan Pascanu, and Samuel L. Smith. Universality of linear recurrences followed by non-linear projections: Finite-width guarantees and benefits of complex eigenvalues. In ICML, 2024. URL https://openreview.net/forum?id=47ahBl70xb

[31] Mitchell Ostrow, Adam Eisen, and Ila Fiete. Delay embedding theory of neural sequence models. 2024. URL https://arxiv.org/abs/2406.11993v1

[32] Marius Pachitariu and Maneesh Sahani. Regularization and nonlinearities for neural language models: when are they needed?, 2013. URL https://arxiv.org/abs/1301.5650

[33] Zhen Qin, Songlin Yang, and Yiran Zhong. Hierarchically gated recurrent neural network for sequence modeling. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, ...

[34] Yash Sarrof, Yana Veitsman, and Michael Hahn. The expressive capacity of state space models: A formal language perspective. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=eV5YIrJPdy

[35] Noam Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202, 2020

[36] Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed Chi, Nathanael Schärli, and Denny Zhou. Large language models can be easily distracted by irrelevant context. In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org, 2023

[37] Steven Strogatz. 12.4 Chemical chaos and attractor reconstruction. CRC Press, 2015

[38] Floris Takens. Detecting strange attractors in turbulence. Dynamical Systems and Turbulence, Lecture Notes in Mathematics, 898:366–381, 1981

[39] Eugene Tan, Shannon Algar, Débora Corrêa, Michael Small, Thomas Stemler, and David Walker. Selecting embedding delays: An overview of embedding techniques and a new method using persistent homology. Chaos: An Interdisciplinary Journal of Nonlinear Science, 33(3):032101, 03 2023. ISSN 1054-1500. doi:10.1063/5.0137223. URL https://doi.org/10.1063/5.0137223

[40] Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. Long range arena: A benchmark for efficient transformers. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net...

[41] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio, 2016. URL https://arxiv.org/abs/1609.03499

[42] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL http...

[43] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K.J. Lang. Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(3):328–339, 1989. doi:10.1109/29.21701

[44] Ethan Wilcox, Roger Levy, Takashi Morita, and Richard Futrell. What do RNN language models learn about filler–gap dependencies? In Tal Linzen, Grzegorz Chrupała, and Afra Alishahi, editors, Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 211–221, Brussels, Belgium, November 2018. Associ... doi:10.18653/v1/w18-5423

[45] Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, and Shuicheng Yan. Metaformer is actually what you need for vision. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 10809–10819. IEEE, 2022. doi:10.1109/CVPR52688.2022.01055. URL https://doi.org/1...