Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention
Pith reviewed 2026-05-22 04:48 UTC · model grok-4.3
The pith
Gated DeltaNet-2 decouples erase and write operations through separate channel-wise gates to let linear attention edit its fixed-size memory state more precisely.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper shows that the delta-rule update improves when its two roles are gated separately: a channel-wise erase gate b_t scales the subtraction of prior key associations, and a distinct channel-wise write gate w_t scales the addition of new value information. This Gated Delta Rule-2 reduces exactly to KDA when both gates collapse to the same scalar, and to Gated DeltaNet when the decay also collapses to a scalar. The update admits a fast-weight view and a chunkwise WY algorithm that folds the channel-wise decay into asymmetric erase factors, together with a gate-aware backward pass that keeps parallel training efficient.
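As a rough illustration of what a decoupled erase/write step might look like, the sketch below implements one recurrent update in NumPy. The exact form is an assumption, not taken from the paper: it uses a state S of shape (d_k, d_v), applies the erase gate b_t to the key, applies the write gate w_t to the value, and applies the channel-wise decay alpha_t to the key channels of the previous state. The function name and toy sizes are hypothetical.

```python
import numpy as np

def decoupled_delta_step(S, k, v, alpha, b, w):
    """One recurrent step of an ASSUMED decoupled erase/write delta rule.

    S:     (d_k, d_v) memory state
    k, v:  key (d_k,) and value (d_v,)
    alpha: (d_k,) channel-wise decay
    b:     (d_k,) erase gate, applied on the key side (assumption)
    w:     (d_v,) write gate, applied on the value side (assumption)
    """
    S_bar = alpha[:, None] * S                     # decay each key channel of the old state
    old_read = S_bar.T @ (b * k)                   # gated read of what the key retrieves now
    return S_bar + np.outer(k, w * v - old_read)   # subtract the read, commit the gated value

# Hypothetical toy sizes, chosen only for illustration.
d_k, d_v = 4, 3
k = np.zeros(d_k)
k[0] = 1.0                                         # unit-norm key
ones_k, ones_v = np.ones(d_k), np.ones(d_v)
v_old = np.array([1.0, 2.0, 3.0])
v_new = np.array([7.0, 8.0, 9.0])

S = decoupled_delta_step(np.zeros((d_k, d_v)), k, v_old, ones_k, ones_k, ones_v)
# Erase and write gates fully open: the old association is replaced.
S_overwrite = decoupled_delta_step(S, k, v_new, ones_k, ones_k, ones_v)
# Erase gate closed: nothing is subtracted, so the new value lands on top of the old one.
S_no_erase = decoupled_delta_step(S, k, v_new, ones_k, np.zeros(d_k), ones_v)
```

Reading back with `S_overwrite.T @ k` returns the new value alone, while `S_no_erase.T @ k` returns the sum of old and new values. That difference is the kind of control a single shared scalar gate cannot express.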
What carries the argument
The channel-wise erase gate b_t and write gate w_t that independently scale the subtraction of old content and the addition of new content inside the recurrent state update.
If this is right
- At 1.3B parameters and 100B FineWeb-Edu training tokens, the model records the strongest overall scores among Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants on language modeling, commonsense reasoning, and retrieval.
- The largest measured gains appear on long-context multi-key retrieval settings within the RULER benchmark.
- Performance remains strong in both pure recurrent mode and hybrid recurrent-plus-attention configurations.
- The formulation reduces to KDA when the two gates are tied to a single scalar, and to Gated DeltaNet when the channel-wise decay is additionally collapsed to a scalar.
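The reduction claim in the last bullet can be checked numerically under assumptions. The sketch below compares an assumed decoupled step against KDA-style and Gated DeltaNet-style updates as they are commonly written. All three function names are hypothetical, and the conventions (state of shape (d_k, d_v), erase gate on the key, write gate on the value) are illustrative choices rather than the paper's definitions.

```python
import numpy as np

def decoupled_step(S, k, v, alpha, b, w):
    # ASSUMED decoupled rule: channel-wise decay, erase gate b, write gate w.
    S_bar = alpha[:, None] * S
    return S_bar + np.outer(k, w * v - S_bar.T @ (b * k))

def kda_style_step(S, k, v, alpha, beta):
    # Channel-wise decay alpha (vector), one scalar beta shared by erase and write.
    eye = np.eye(k.shape[0])
    return (eye - beta * np.outer(k, k)) @ (alpha[:, None] * S) + beta * np.outer(k, v)

def gated_deltanet_style_step(S, k, v, alpha, beta):
    # Scalar decay alpha and scalar beta.
    eye = np.eye(k.shape[0])
    return alpha * (eye - beta * np.outer(k, k)) @ S + beta * np.outer(k, v)

rng = np.random.default_rng(0)
d_k, d_v = 5, 3                       # hypothetical toy sizes
S = rng.normal(size=(d_k, d_v))
k = rng.normal(size=d_k)
v = rng.normal(size=d_v)
alpha = rng.uniform(0.5, 1.0, size=d_k)
beta, a = 0.7, 0.9

# Both gates tied to the same scalar beta.
tied = decoupled_step(S, k, v, alpha, beta * np.ones(d_k), beta * np.ones(d_v))
# Gates tied and decay collapsed to the scalar a.
collapsed = decoupled_step(S, k, v, a * np.ones(d_k), beta * np.ones(d_k), beta * np.ones(d_v))

matches_kda = np.allclose(tied, kda_style_step(S, k, v, alpha, beta))
matches_gdn = np.allclose(collapsed, gated_deltanet_style_step(S, k, v, a, beta))
```

Under these assumptions both flags come out true, because the three updates are algebraically identical once the gates are tied. Such a check confirms only that the sketch is internally consistent; it says nothing about the released implementation.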
Where Pith is reading between the lines
- If the independent gates remain stable at larger scales, the same separation could be inserted into other recurrent or state-space layers to raise memory fidelity without increasing state size.
- Targeted channel-wise control may reduce the state dimension needed for a given task once edits become more selective.
- The technique could be tested on retrieval-augmented or multi-document settings to check whether the precision benefit grows with context length.
Load-bearing premise
The added channel-wise erase and write gates supply independent, stable control over memory updates without requiring extra regularization or hyperparameter tuning beyond the baselines.
What would settle it
A head-to-head evaluation at the 1.3B scale on the RULER multi-key retrieval task showing no gain over KDA or Gated DeltaNet under the same training budget would indicate that the decoupled gates do not deliver the claimed editing advantage.
Original abstract
Linear attention replaces the unbounded cache of softmax attention with a fixed-size recurrent state, reducing sequence mixing to linear time and decoding to constant memory. The hard part is not just what to forget, but how to edit this compressed memory without scrambling existing associations. Delta-rule models subtract the current read before writing a new value, and Kimi Delta Attention (KDA) sharpens forgetting with channel-wise decay. But the active edit still uses a single scalar gate to control two different things: how much old content to erase on the key side and how much new content to commit on the value side. We introduce Gated DeltaNet-2, which generalizes both Gated DeltaNet and KDA by inheriting adaptive forgetting and channel-wise decay while addressing their shared limitation, the scalar tie between erasing and writing. Gated Delta Rule-2 separates these roles with a channel-wise erase gate b_t and a channel-wise write gate w_t, reducing to KDA when both gates collapse to the same scalar and to Gated DeltaNet when the decay also collapses. We derive a fast-weight update view, a chunkwise WY algorithm with channel-wise decay absorbed into asymmetric erase factors, and a gate-aware backward pass that preserves efficient parallel training. At 1.3B parameters trained on 100B FineWeb-Edu tokens, Gated DeltaNet-2 achieves the strongest overall results among Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants across language modeling, commonsense reasoning, and retrieval. Its advantage is most pronounced on long-context RULER needle-in-a-haystack benchmarks, where it improves the evaluated multi-key retrieval setting and remains strong in both recurrent and hybrid settings. Code is available at https://github.com/NVlabs/GatedDeltaNet-2.
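The abstract's remark that delta-rule models "subtract the current read before writing a new value" can be made concrete with a small comparison. The sketch below contrasts a purely additive linear-attention write with a plain, ungated delta-rule write on a fixed-size state. The function names and toy vectors are hypothetical, and no decay or gating is modeled.

```python
import numpy as np

def additive_write(S, k, v):
    # Plain linear attention: accumulate the outer product of key and value.
    return S + np.outer(k, v)

def delta_write(S, k, v):
    # Delta rule: remove what the key currently reads, then write the new value.
    return S + np.outer(k, v - S.T @ k)

k = np.array([1.0, 0.0, 0.0])          # unit-norm key, reused for both writes
v_old = np.array([1.0, 2.0])
v_new = np.array([-3.0, 5.0])
S0 = np.zeros((3, 2))                  # fixed-size state, independent of sequence length

S_add = additive_write(additive_write(S0, k, v_old), k, v_new)
S_del = delta_write(delta_write(S0, k, v_old), k, v_new)

read_add = S_add.T @ k                 # old and new values are mixed together
read_del = S_del.T @ k                 # only the new value remains
```

With a unit-norm key, the additive state returns the sum of both values on read, while the delta-rule state returns only the most recent one. The state keeps the same shape in both cases, which is the constant-memory property the abstract refers to.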
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Gated DeltaNet-2, which extends linear attention by decoupling erase and write operations via separate channel-wise gates b_t (erase) and w_t (write). It generalizes Gated DeltaNet and KDA (reducing to each when gates collapse), derives a fast-weight update, chunkwise WY algorithm with asymmetric erase factors, and gate-aware backward pass for efficient training. At 1.3B parameters trained on 100B FineWeb-Edu tokens, it reports the strongest results among Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants on language modeling, commonsense reasoning, and retrieval, with largest gains on long-context RULER multi-key retrieval in both recurrent and hybrid settings. Code is released.
Significance. If the performance lift is causally attributable to the decoupled gates rather than added capacity, the work would strengthen the case for role-separated memory control in linear recurrent models and could improve long-context retrieval without quadratic attention costs. The provision of reproducible code, the reduction to prior models, and the derivation of parallel training algorithms are clear strengths that support verifiability.
major comments (2)
- [Experiments] Experiments section: the headline claim of superior performance (especially on RULER multi-key retrieval) at fixed 1.3B scale rests on comparisons to scalar-gate baselines, yet no ablation is described that holds total parameter count or FLOPs constant (e.g., by widening other layers when b_t and w_t are tied to scalars). Without this control, it remains unclear whether the observed gains derive from the erase/write separation or from the extra per-channel degrees of freedom.
- [§3 and Experiments] §3 (model derivation) and Experiments: while the manuscript states that Gated DeltaNet-2 reduces to KDA when both gates collapse to the same scalar, the empirical tables do not include a direct re-implementation of that collapsed variant under identical hyper-parameters and training budget, leaving the incremental benefit of the full decoupling unisolated.
minor comments (2)
- [§3] Notation for the asymmetric erase factors in the chunkwise WY algorithm could be clarified with an explicit equation relating b_t to the decay term.
- [Tables] Table captions should explicitly state whether reported numbers are averages over multiple seeds or single runs.
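For readers unfamiliar with the WY form mentioned in the first minor comment, the sketch below shows the general idea on a decay-free toy case. A product of asymmetric rank-one factors (I - k_i e_i^T) over a chunk can be computed with one triangular solve instead of a sequential loop. This is a generic compact-WY identity, not the paper's algorithm: it ignores channel-wise decay, and the names `wy_product`, `K`, and `E` are hypothetical, with e_i standing in for an erase-gated key.

```python
import numpy as np

def wy_product(K, E):
    """Return P = (I - k_n e_n^T) ... (I - k_1 e_1^T) in compact WY form.

    K, E: (n, d) arrays whose rows are k_i and e_i.
    """
    n, d = K.shape
    L = np.tril(E @ K.T, k=-1)               # L[i, j] = e_i . k_j for j < i
    W = np.linalg.solve(np.eye(n) + L, E)    # row i equals e_i^T P_{i-1}
    return np.eye(d) - K.T @ W               # P = I - sum_i k_i w_i^T

def sequential_product(K, E):
    # Reference implementation: apply the rank-one factors one at a time.
    d = K.shape[1]
    P = np.eye(d)
    for k, e in zip(K, E):
        P = (np.eye(d) - np.outer(k, e)) @ P
    return P

rng = np.random.default_rng(0)
K = rng.normal(size=(4, 6))                  # hypothetical chunk of 4 steps, d = 6
E = rng.normal(size=(4, 6))
agree = np.allclose(wy_product(K, E), sequential_product(K, E))
```

The matrix `np.eye(n) + L` is unit lower triangular, so a dedicated triangular solver could replace `np.linalg.solve`. How the paper absorbs decay into the erase factors is not reproduced here.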
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. The points raised regarding experimental controls are well-taken and will help strengthen the isolation of the proposed decoupling mechanism. We address each major comment below and commit to revisions that incorporate the suggested controls.
- Referee: [Experiments] Experiments section: the headline claim of superior performance (especially on RULER multi-key retrieval) at fixed 1.3B scale rests on comparisons to scalar-gate baselines, yet no ablation is described that holds total parameter count or FLOPs constant (e.g., by widening other layers when b_t and w_t are tied to scalars). Without this control, it remains unclear whether the observed gains derive from the erase/write separation or from the extra per-channel degrees of freedom.
  Authors: We agree that an ablation holding total parameter count fixed would provide stronger evidence that gains arise from the erase/write decoupling rather than added capacity. In the revised manuscript we will add such a control by widening layers in the scalar-gate baselines (Gated DeltaNet and KDA) to match the parameter count of Gated DeltaNet-2 while keeping training budget identical. This will be reported alongside the existing results.
  Revision: yes
- Referee: [§3 and Experiments] §3 (model derivation) and Experiments: while the manuscript states that Gated DeltaNet-2 reduces to KDA when both gates collapse to the same scalar, the empirical tables do not include a direct re-implementation of that collapsed variant under identical hyper-parameters and training budget, leaving the incremental benefit of the full decoupling unisolated.
  Authors: The referee is correct that, although the reduction to KDA is derived in §3, the experimental tables do not report a matched re-implementation of the collapsed scalar-gate variant. We will add this baseline (both gates tied to the same scalar under identical hyperparameters and training) to the main results tables in the revision to directly quantify the incremental benefit of channel-wise decoupling.
  Revision: yes
Circularity Check
No significant circularity in derivation or claims
The manuscript defines Gated DeltaNet-2 via explicit new learned parameters (channel-wise erase gate b_t and write gate w_t) that generalize prior scalar-gate models by construction of the recurrence, then derives equivalent reformulations (fast-weight view, chunkwise WY algorithm with asymmetric erase factors, gate-aware backward pass) that are direct algebraic rewrites of the same forward equations rather than fitted predictions or self-referential loops. Central performance claims rest on independent empirical training runs and benchmark evaluations (language modeling, RULER retrieval) at fixed scale, not on any quantity that reduces to the model inputs by definition. Self-citations to Gated DeltaNet and KDA function only as baseline references and are not invoked as uniqueness theorems or load-bearing ansatzes; no step equates a derived result to a fitted input or renames an empirical pattern as a first-principles outcome.
Axiom & Free-Parameter Ledger
free parameters (2)
- channel-wise erase gate b_t
- channel-wise write gate w_t
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Gated Delta Rule-2 separates these roles with a channel-wise erase gate b_t and a channel-wise write gate w_t
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: local online problem $L_t(S) = \|S - \bar{S}_t\|_F^2 - 2\,\langle S^\top k_t,\; z_t - \bar{S}_t^\top e_t \rangle$
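The local online problem quoted in the last entry has a closed-form minimizer, worked out below. The passage does not define its symbols, so the reading used here is an assumption: $\bar{S}_t = \mathrm{Diag}(\alpha_t)\,S_{t-1}$ is the decayed previous state, $e_t = b_t \odot k_t$ is the erase-gated key, and $z_t = w_t \odot v_t$ is the write-gated value.

```latex
\begin{aligned}
L_t(S) &= \|S - \bar{S}_t\|_F^2 \;-\; 2\,\big\langle S^\top k_t,\; z_t - \bar{S}_t^\top e_t \big\rangle \\
\nabla_S L_t(S) &= 2\,(S - \bar{S}_t) \;-\; 2\,k_t\,\big(z_t - \bar{S}_t^\top e_t\big)^\top = 0 \\
\Longrightarrow\quad S_t &= \bar{S}_t + k_t\,\big(z_t - \bar{S}_t^\top e_t\big)^\top
      \;=\; \big(I - k_t e_t^\top\big)\,\bar{S}_t + k_t z_t^\top
\end{aligned}
```

The objective is strictly convex in $S$, so this stationary point is the unique minimizer. Under the assumed reading, setting $b_t = \beta_t \mathbf{1}$ and $w_t = \beta_t \mathbf{1}$ gives $S_t = (I - \beta_t k_t k_t^\top)\,\bar{S}_t + \beta_t k_t v_t^\top$, which has the scalar-gated delta-rule shape that the page describes as the KDA special case.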
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.