arxiv: 2605.22791 · v1 · pith:THME4DQBnew · submitted 2026-05-21 · 💻 cs.AI

Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

Pith reviewed 2026-05-22 04:48 UTC · model grok-4.3

classification 💻 cs.AI
keywords linear attention · delta rule · gated mechanisms · memory updates · long-context retrieval · recurrent state · channel-wise gates · erase and write

The pith

Gated DeltaNet-2 decouples erase and write operations through separate channel-wise gates to let linear attention edit its fixed-size memory state more precisely.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Linear attention compresses sequence history into a recurrent state instead of an expanding cache, but the quality of that compression depends on how cleanly the model can remove outdated associations while inserting new ones. Earlier delta-rule approaches controlled both the removal and the insertion with a single scalar gate, which forced a compromise when the two operations needed different strengths across channels. Gated DeltaNet-2 adds an independent channel-wise erase gate and an independent channel-wise write gate, inheriting adaptive forgetting and channel decay from prior work while removing the tie between the two actions. The result is a model that preserves useful prior content more reliably during updates. If the separation holds up, it points to memory-edit precision as a direct lever for making constant-memory recurrent models competitive on tasks that stress long-range associations.
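
The contrast between a fixed-size recurrent state and an expanding cache can be sketched in a few lines of NumPy. This is a minimal illustration of plain (unnormalized, ungated) causal linear attention, not the paper's model; the function names and toy shapes are chosen here for illustration only.

```python
import numpy as np

def linear_attention_recurrent(q, k, v):
    """Unnormalized causal linear attention via a fixed-size state.

    q, k: (T, d_k); v: (T, d_v). The state S has shape (d_k, d_v)
    regardless of the sequence length T.
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.empty((T, d_v))
    for t in range(T):
        S = S + np.outer(k[t], v[t])   # write the new key-value association
        out[t] = q[t] @ S              # read with the current query
    return out, S

def linear_attention_cached(q, k, v):
    """Same outputs computed from the full history of keys and values."""
    scores = np.tril(q @ k.T)          # causal mask: token t sees tokens <= t
    return scores @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(6, 4))
k = rng.normal(size=(6, 4))
v = rng.normal(size=(6, 3))

out_rec, S = linear_attention_recurrent(q, k, v)
out_cache = linear_attention_cached(q, k, v)
outputs_match = np.allclose(out_rec, out_cache)
```

Both paths produce the same outputs, but the recurrent path only ever stores the (d_k, d_v) matrix S. Because this plain update only adds, old associations are never removed, which is the gap the delta-rule family targets.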

Core claim

The paper shows that the delta-rule update improves when the subtraction of prior key associations is scaled by a channel-wise erase gate b_t and the addition of new value information is scaled by a separate channel-wise write gate w_t. This Gated Delta Rule-2 reduces to KDA when both gates collapse to the same scalar, and to Gated DeltaNet when the decay also collapses to a scalar. The update admits a fast-weight view and a chunkwise WY algorithm that absorbs the channel-wise decay into asymmetric erase factors, together with a gate-aware backward pass that keeps parallel training efficient.
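
A minimal sketch of what such a decoupled update could look like, written so that the two stated reductions can be checked numerically. The exact equation is not given on this page, so the form below is an assumption: the state layout (d_k by d_v), the placement of b_t on the key side with shape (d_k,), the placement of w_t on the value side with shape (d_v,), and all function names are hypothetical. The KDA and Gated DeltaNet steps follow the forms in which those updates are commonly written.

```python
import numpy as np

def gdn2_step(S, k, v, alpha, b, w):
    """One ASSUMED Gated Delta Rule-2 style update (illustrative form only).

    S: (d_k, d_v) state; k: (d_k,) key; v: (d_v,) value
    alpha: (d_k,) channel-wise decay; b: (d_k,) erase gate; w: (d_v,) write gate
    """
    S_decayed = alpha[:, None] * S        # channel-wise forgetting
    read = k @ S_decayed                  # current read for key k, shape (d_v,)
    erase = np.outer(b * k, read)         # subtraction scaled by the erase gate
    write = np.outer(k, w * v)            # addition scaled by the write gate
    return S_decayed - erase + write

def kda_step(S, k, v, alpha, beta):
    """KDA-style update: channel-wise decay, one scalar gate beta."""
    S_decayed = alpha[:, None] * S
    return S_decayed - beta * np.outer(k, k @ S_decayed) + beta * np.outer(k, v)

def gated_deltanet_step(S, k, v, alpha_scalar, beta):
    """Gated DeltaNet-style update: scalar decay and scalar gate."""
    S_decayed = alpha_scalar * S
    return S_decayed - beta * np.outer(k, k @ S_decayed) + beta * np.outer(k, v)

rng = np.random.default_rng(0)
d_k, d_v = 4, 3
S = rng.normal(size=(d_k, d_v))
k = rng.normal(size=d_k)
k = k / np.linalg.norm(k)                 # unit-norm key
v = rng.normal(size=d_v)
alpha = rng.uniform(0.8, 1.0, size=d_k)
beta, a = 0.7, 0.9

# Reduction 1: both gates tied to one scalar gives the KDA-style step.
tied = gdn2_step(S, k, v, alpha, np.full(d_k, beta), np.full(d_v, beta))
same_as_kda = np.allclose(tied, kda_step(S, k, v, alpha, beta))

# Reduction 2: decay also a single scalar gives the Gated DeltaNet-style step.
tied_scalar = gdn2_step(S, k, v, np.full(d_k, a), np.full(d_k, beta), np.full(d_v, beta))
same_as_gdn = np.allclose(tied_scalar, gated_deltanet_step(S, k, v, a, beta))

# Decoupled case: full erase, no write (unit-norm key, no decay).
erased = gdn2_step(S, k, v, np.ones(d_k), np.ones(d_k), np.zeros(d_v))
read_after_erase = k @ erased
```

Under this assumed form, setting b to all ones and w to all zeros removes the association stored under k without committing anything new, a pure erase that a single shared scalar gate cannot express.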

What carries the argument

The channel-wise erase gate b_t and write gate w_t that independently scale the subtraction of old content and the addition of new content inside the recurrent state update.

If this is right

  • At 1.3B parameters and 100B FineWeb-Edu training tokens, the model records the strongest overall scores among Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants on language modeling, commonsense reasoning, and retrieval.
  • The largest measured gains appear on long-context multi-key retrieval settings within the RULER benchmark.
  • Performance remains strong in both pure recurrent mode and hybrid recurrent-plus-attention configurations.
  • The formulation reduces to KDA when both gates collapse to the same scalar, and to Gated DeltaNet when the decay also collapses to a scalar.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the independent gates remain stable at larger scales, the same separation could be inserted into other recurrent or state-space layers to raise memory fidelity without increasing state size.
  • Targeted channel-wise control may reduce the state dimension needed for a given task once edits become more selective.
  • The technique could be tested on retrieval-augmented or multi-document settings to check whether the precision benefit grows with context length.

Load-bearing premise

The added channel-wise erase and write gates supply independent, stable control over memory updates without requiring extra regularization or hyperparameter tuning beyond the baselines.

What would settle it

A head-to-head evaluation at the 1.3B scale on the RULER multi-key retrieval task showing no gain over KDA or Gated DeltaNet under the same training budget would indicate that the decoupled gates do not deliver the claimed editing advantage.

Original abstract

Linear attention replaces the unbounded cache of softmax attention with a fixed-size recurrent state, reducing sequence mixing to linear time and decoding to constant memory. The hard part is not just what to forget, but how to edit this compressed memory without scrambling existing associations. Delta-rule models subtract the current read before writing a new value, and Kimi Delta Attention (KDA) sharpens forgetting with channel-wise decay. But the active edit still uses a single scalar gate to control two different things: how much old content to erase on the key side and how much new content to commit on the value side. We introduce Gated DeltaNet-2, which generalizes both Gated DeltaNet and KDA by inheriting adaptive forgetting and channel-wise decay while addressing their shared limitation, the scalar tie between erasing and writing. Gated Delta Rule-2 separates these roles with a channel-wise erase gate b_t and a channel-wise write gate w_t, reducing to KDA when both gates collapse to the same scalar and to Gated DeltaNet when the decay also collapses. We derive a fast-weight update view, a chunkwise WY algorithm with channel-wise decay absorbed into asymmetric erase factors, and a gate-aware backward pass that preserves efficient parallel training. At 1.3B parameters trained on 100B FineWeb-Edu tokens, Gated DeltaNet-2 achieves the strongest overall results among Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants across language modeling, commonsense reasoning, and retrieval. Its advantage is most pronounced on long-context RULER needle-in-a-haystack benchmarks, where it improves the evaluated multi-key retrieval setting and remains strong in both recurrent and hybrid settings. Code is available at https://github.com/NVlabs/GatedDeltaNet-2.
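
The abstract's description of the delta rule, subtracting the current read before writing a new value, can be illustrated on a toy memory. The sketch below is a generic, ungated delta-rule step compared with a purely additive write; the names and toy values are illustrative and are not taken from the paper.

```python
import numpy as np

def delta_rule_step(S, k, v, beta):
    """Delta-rule write: subtract the current read for k, then add the new value."""
    read = k @ S                          # what the memory currently returns for k
    return S + beta * np.outer(k, v - read)

def additive_step(S, k, v):
    """Plain linear-attention write: add without erasing."""
    return S + np.outer(k, v)

k = np.array([1.0, 0.0, 0.0])             # unit-norm key
v_old = np.array([2.0, -1.0])
v_new = np.array([0.5, 3.0])

# Write v_old under k, then overwrite it with v_new.
S_delta = delta_rule_step(np.zeros((3, 2)), k, v_old, beta=1.0)
S_delta = delta_rule_step(S_delta, k, v_new, beta=1.0)

S_add = additive_step(np.zeros((3, 2)), k, v_old)
S_add = additive_step(S_add, k, v_new)

read_delta = k @ S_delta                  # equals v_new: old value was erased
read_add = k @ S_add                      # equals v_old + v_new: values are mixed
```

With a unit-norm key and beta = 1, the delta-rule memory returns exactly the new value, while the additive memory returns the sum of both writes. In this scalar-gated form a single beta sets both how much is erased and how much is written, which is the tie the paper sets out to remove.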

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Gated DeltaNet-2, which extends linear attention by decoupling erase and write operations via separate channel-wise gates b_t (erase) and w_t (write). It generalizes Gated DeltaNet and KDA (reducing to KDA when both gates collapse to the same scalar, and to Gated DeltaNet when the decay also collapses), and derives a fast-weight update, a chunkwise WY algorithm with asymmetric erase factors, and a gate-aware backward pass for efficient training. At 1.3B parameters trained on 100B FineWeb-Edu tokens, it reports the strongest results among Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants on language modeling, commonsense reasoning, and retrieval, with the largest gains on long-context RULER multi-key retrieval and strong results in both recurrent and hybrid settings. Code is released.

Significance. If the performance lift is causally attributable to the decoupled gates rather than added capacity, the work would strengthen the case for role-separated memory control in linear recurrent models and could improve long-context retrieval without quadratic attention costs. The provision of reproducible code, the reduction to prior models, and the derivation of parallel training algorithms are clear strengths that support verifiability.

major comments (2)
  1. [Experiments] Experiments section: the headline claim of superior performance (especially on RULER multi-key retrieval) at fixed 1.3B scale rests on comparisons to scalar-gate baselines, yet no ablation is described that holds total parameter count or FLOPs constant (e.g., by widening other layers when b_t and w_t are tied to scalars). Without this control, it remains unclear whether the observed gains derive from the erase/write separation or from the extra per-channel degrees of freedom.
  2. [§3 and Experiments] §3 (model derivation) and Experiments: while the manuscript states that Gated DeltaNet-2 reduces to KDA when both gates collapse to the same scalar, the empirical tables do not include a direct re-implementation of that collapsed variant under identical hyper-parameters and training budget, leaving the incremental benefit of the full decoupling unisolated.
minor comments (2)
  1. [§3] Notation for the asymmetric erase factors in the chunkwise WY algorithm could be clarified with an explicit equation relating b_t to the decay term.
  2. [Tables] Table captions should explicitly state whether reported numbers are averages over multiple seeds or single runs.
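
The WY idea referred to in minor comment 1 can be illustrated generically: a product of rank-one perturbations of the identity, (I - u_i v_i^T), collapses into a single compact form I - U^T Y, where Y comes from one triangular solve. The sketch below is not the paper's chunkwise algorithm; it omits the decay entirely, and the choice u_i = b_i * k_i, v_i = k_i is an assumption about what "asymmetric erase factors" means. All names are hypothetical.

```python
import numpy as np

def product_sequential(U, V):
    """Left-multiply the factors (I - u_i v_i^T) in time order."""
    d = U.shape[1]
    P = np.eye(d)
    for u, v in zip(U, V):
        P = (np.eye(d) - np.outer(u, v)) @ P
    return P

def product_wy(U, V):
    """Compact form P = I - U^T Y, with Y from one triangular system.

    Row recursion: y_i = v_i - sum_{j<i} (v_i . u_j) y_j,
    which is (I + L) Y = V with L the strictly lower part of V U^T.
    """
    n, d = U.shape
    L = np.tril(V @ U.T, k=-1)              # L[i, j] = v_i . u_j for j < i
    Y = np.linalg.solve(np.eye(n) + L, V)   # unit lower-triangular system
    return np.eye(d) - U.T @ Y

rng = np.random.default_rng(0)
n, d = 5, 4                                  # chunk length, key dimension
K = rng.normal(size=(n, d))
K = K / np.linalg.norm(K, axis=1, keepdims=True)
B = rng.uniform(0.0, 1.0, size=(n, d))       # assumed channel-wise erase gates

U = B * K                                    # gated side of each erase factor
V = K                                        # ungated side
wy_matches = np.allclose(product_sequential(U, V), product_wy(U, V))
```

The sequential loop and the compact form agree, and the compact form replaces a chain of n dependent matrix products with matrix multiplications plus one triangular solve, which is what makes chunkwise parallel training practical. A dedicated triangular solver would exploit the structure better than the general solve used here.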

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. The points raised regarding experimental controls are well-taken and will help strengthen the isolation of the proposed decoupling mechanism. We address each major comment below and commit to revisions that incorporate the suggested controls.

  1. Referee: [Experiments] Experiments section: the headline claim of superior performance (especially on RULER multi-key retrieval) at fixed 1.3B scale rests on comparisons to scalar-gate baselines, yet no ablation is described that holds total parameter count or FLOPs constant (e.g., by widening other layers when b_t and w_t are tied to scalars). Without this control, it remains unclear whether the observed gains derive from the erase/write separation or from the extra per-channel degrees of freedom.

    Authors: We agree that an ablation holding total parameter count fixed would provide stronger evidence that gains arise from the erase/write decoupling rather than added capacity. In the revised manuscript we will add such a control by widening layers in the scalar-gate baselines (Gated DeltaNet and KDA) to match the parameter count of Gated DeltaNet-2 while keeping training budget identical. This will be reported alongside the existing results.
    Revision: yes

  2. Referee: [§3 and Experiments] §3 (model derivation) and Experiments: while the manuscript states that Gated DeltaNet-2 reduces to KDA when both gates collapse to the same scalar, the empirical tables do not include a direct re-implementation of that collapsed variant under identical hyper-parameters and training budget, leaving the incremental benefit of the full decoupling unisolated.

    Authors: The referee is correct that, although the reduction to KDA is derived in §3, the experimental tables do not report a matched re-implementation of the collapsed scalar-gate variant. We will add this baseline (both gates tied to the same scalar under identical hyperparameters and training) to the main results tables in the revision to directly quantify the incremental benefit of channel-wise decoupling.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation or claims

The manuscript defines Gated DeltaNet-2 via explicit new learned parameters (channel-wise erase gate b_t and write gate w_t) that generalize prior scalar-gate models by construction of the recurrence, then derives equivalent reformulations (fast-weight view, chunkwise WY algorithm with asymmetric erase factors, gate-aware backward pass) that are direct algebraic rewrites of the same forward equations rather than fitted predictions or self-referential loops. Central performance claims rest on independent empirical training runs and benchmark evaluations (language modeling, RULER retrieval) at fixed scale, not on any quantity that reduces to the model inputs by definition. Self-citations to Gated DeltaNet and KDA function only as baseline references and are not invoked as uniqueness theorems or load-bearing ansatzes; no step equates a derived result to a fitted input or renames an empirical pattern as a first-principles outcome.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 0 invented entities

The model introduces two new learned gates per channel but does not postulate new physical entities or ungrounded mathematical objects; the main added degrees of freedom are the parameters of the two gates themselves.

free parameters (2)
  • channel-wise erase gate b_t
    Learned per-channel parameter controlling how much old content to subtract before writing.
  • channel-wise write gate w_t
    Learned per-channel parameter controlling how much new content to commit.

pith-pipeline@v0.9.0 · 5870 in / 1210 out tokens · 36844 ms · 2026-05-22T04:48:52.864481+00:00 · methodology
