CARVE: Content-Aware Recurrent with Value Efficiency for Chunk-Parallel Linear Attention

Sayak Dutta

arXiv: 2606.27229 · v2 · pith:ACSOGOVCnew · submitted 2026-06-25 · 💻 cs.CL · cs.AI · cs.LG · cs.NE

Pith reviewed 2026-06-30 09:40 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG · cs.NE

keywords linear attention · recurrent models · chunk-parallel training · content-aware gating · delta rule · memory efficiency · language modeling · retrieval benchmarks

The pith

Restricting erase to the key axis resolves memory-blind gating in recurrent models and keeps the WY-form chunk solver valid.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that leading delta-rule recurrent architectures suffer from three linked defects: the gate cannot see the memory it is about to modify, the value-axis erase wastes parameters, and value-axis erase mathematically invalidates the WY-form triangular chunk solver needed for competitive recurrent training. CARVE fixes all three by erasing only on the key axis, which the paper proves is necessary and sufficient to keep the solver valid. This change lets the model reuse its already-written recurrent output tensor as a free content signal for the erase gate and replaces the per-value write projection with one scalar per head. The resulting model starts bit-identical to the prior baseline but learns better forgetting, producing lower perplexity and stronger benchmark results at reduced memory and parameter cost.

Core claim

CARVE resolves the three coupled defects in GDN-2 by erasing only on the key axis. This restriction is provably necessary and sufficient for the WY-form triangular chunk solver to remain valid. It supplies the recurrent output tensor as a content signal to the erase gate without extra cost and reduces the write gate to a single scalar per head. At 1.3B parameters trained on 100B tokens, the model reaches WikiText perplexity 15.72, leads recurrent baselines on nine common-sense reasoning tasks, and sets state of the art on every RULER retrieval probe, at a cost of 0.4 percent throughput overhead and with 13 percent lower peak memory and 19 percent fewer parameters.
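To make the write-gate saving concrete, here is a rough sketch of the arithmetic. It assumes the gate is a plain bias-free linear projection of the hidden state; the sizes `d_model`, `n_heads`, and `d_v` are hypothetical placeholders, not the paper's configuration.

```python
def write_gate_params(d_model: int, n_heads: int, d_v: int) -> tuple[int, int]:
    """Weights in a bias-free linear write gate, per layer.

    Returns (per_value, per_head_scalar):
      per_value       -- one gate value per value channel of every head
      per_head_scalar -- one gate value per head
    """
    per_value = d_model * n_heads * d_v
    per_head_scalar = d_model * n_heads
    return per_value, per_head_scalar


# Hypothetical sizes chosen only for illustration.
per_value, per_head = write_gate_params(d_model=2048, n_heads=16, d_v=128)
saved = per_value - per_head
print(per_value, per_head, saved)
```

Under this assumption the per-value gate has the same weight count as a value projection of the same width. The reported 19 percent reduction depends on the paper's actual configuration and is not reproduced by this sketch.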

What carries the argument

The key-axis-only erase mask, which keeps the triangular structure required by the WY-form chunk solver while feeding the recurrent output tensor directly into the erase gate.
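One way to see why a key-axis erase can keep a single triangular system is the sketch below. It uses a simplified recurrence that is an assumption of this note, not the paper's exact formulation: state $S_t \in \mathbb{R}^{d_k \times d_v}$, key $k_t$, key-axis erase vector $b_t \in \mathbb{R}^{d_k}$, scalar write gate $w_t$, and value $v_t$. The symbols $a_t$, $r_t$ and the stacked matrices $K$, $A$, $V$, $R$ (rows $k_t^\top$, $a_t^\top$, $v_t^\top$, $r_t^\top$ within one chunk) are introduced here for illustration.

```latex
\begin{aligned}
S_t &= S_{t-1} - a_t\,(k_t^\top S_{t-1}) + w_t\,k_t v_t^\top,
  \qquad a_t = b_t \odot k_t, \\
r_t &:= -S_{t-1}^\top k_t
  \;\Longrightarrow\;
  S_t = S_0 + \sum_{i \le t} \bigl( a_i r_i^\top + w_i\,k_i v_i^\top \bigr), \\
r_t &= -S_0^\top k_t - \sum_{i<t} (k_t^\top a_i)\, r_i - \sum_{i<t} (k_t^\top k_i)\, w_i v_i, \\
\bigl(I + \operatorname{tril}(K A^\top, -1)\bigr)\, R
  &= -K S_0 - \operatorname{tril}(K K^\top, -1)\,\operatorname{diag}(w)\,V .
\end{aligned}
```

The matrix on the left is unit lower triangular and is shared by every value channel, so one triangular solve per chunk yields all $r_t$. If the mask instead acted on the value axis, with $e_t \in \mathbb{R}^{d_v}$ multiplying $r_t$ elementwise, each value channel $j$ would see its own matrix $I + \operatorname{diag}(e_{\cdot j})\operatorname{tril}(K K^\top, -1)$. Under this simplified model, that loss of a shared factor is the structural difference the paper's claim turns on.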

If this is right

CARVE achieves WikiText perplexity 15.72, 0.18 lower than GDN-2 at 4.5-sigma significance.
It leads every recurrent baseline on nine common-sense reasoning benchmarks.
It sets state of the art on every RULER retrieval probe.
It delivers these gains at 0.4 percent throughput overhead, 13 percent lower peak memory, and 19 percent fewer parameters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the key-axis restriction generalizes, other linear recurrent architectures could adopt the same reuse of existing tensors to cut parameter waste without retraining from scratch.
The formal theorems on Lyapunov stability and gradient flow imply that CARVE-style models could be stacked deeper than prior recurrent baselines before instability appears.
The Pareto-optimal chunk size result suggests testing CARVE at chunk lengths beyond the paper's experiments to measure any further efficiency trade-offs on longer sequences.

Load-bearing premise

Erase only on the key axis is provably necessary and sufficient for the WY-form triangular chunk solver to remain valid.

What would settle it

Run the WY-form solver on an otherwise identical model that performs erase on the value axis and check whether the triangular decomposition stays numerically stable or produces invalid results.
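A toy version of this check can be sketched in NumPy. The recurrence below is a simplified stand-in chosen for illustration (key-axis erase vector `B`, scalar write gate `w`), not the paper's exact update, and all sizes are hypothetical. The value-axis case is probed with a naive assumption: each value-axis mask is collapsed to its per-token mean so that the shared solver can be reused at all.

```python
import numpy as np

def sequential_key_erase(S0, K, B, w, V):
    """Token-by-token reference: S <- S - (b*k)(k^T S) + w * k v^T."""
    S = S0.copy()
    for k, b, wt, v in zip(K, B, w, V):
        S = S - np.outer(b * k, k @ S) + wt * np.outer(k, v)
    return S

def sequential_value_erase(S0, K, E, w, V):
    """Same recurrence, but the erase mask e lives on the value axis."""
    S = S0.copy()
    for k, e, wt, v in zip(K, E, w, V):
        S = S - np.outer(k, e * (k @ S)) + wt * np.outer(k, v)
    return S

def chunk_key_erase(S0, K, B, w, V):
    """Whole chunk at once via one unit-lower-triangular system."""
    A = B * K                              # erase directions a_t = b_t * k_t
    WV = w[:, None] * V                    # scalar-gated writes
    L = np.eye(len(K)) + np.tril(K @ A.T, -1)
    rhs = -(K @ S0) - np.tril(K @ K.T, -1) @ WV
    R = np.linalg.solve(L, rhs)            # row t is -(S_{t-1}^T k_t)
    return S0 + A.T @ R + K.T @ WV

rng = np.random.default_rng(0)
C, d_k, d_v = 8, 4, 5                      # hypothetical chunk and head sizes
K = rng.normal(size=(C, d_k))
K /= np.linalg.norm(K, axis=1, keepdims=True)
V = rng.normal(size=(C, d_v))
S0 = rng.normal(size=(d_k, d_v))
w = rng.uniform(size=C)
B = rng.uniform(size=(C, d_k))             # key-axis erase mask
E = rng.uniform(size=(C, d_v))             # value-axis erase mask

err_key = np.abs(chunk_key_erase(S0, K, B, w, V)
                 - sequential_key_erase(S0, K, B, w, V)).max()

# Naive attempt: collapse each value-axis mask to its mean and reuse the solver.
B_collapsed = np.repeat(E.mean(axis=1, keepdims=True), d_k, axis=1)
err_value = np.abs(chunk_key_erase(S0, K, B_collapsed, w, V)
                   - sequential_value_erase(S0, K, E, w, V)).max()
print(f"key-axis mismatch:   {err_key:.2e}")
print(f"value-axis mismatch: {err_value:.2e}")
```

In this toy setting the key-axis chunk solve agrees with the sequential recurrence to floating-point precision, while the collapsed value-axis attempt does not. This illustrates the shape of the experiment only; it does not test the paper's theorem or its kernel.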

Figures

Figures reproduced from arXiv: 2606.27229 by Sayak Dutta.

**Figure 1.** CARVE data-flow architecture. Input projections produce queries q, keys k, values v, decay logits α, erase pre-activations b_x, and scalar write pre-activations w_a (per head). The state-readout content gate (top-right box) operates once per chunk: the chunk-start memory readout m_c ∈ ℝ^{H×d_v} (mean of the previous chunk's recurrent outputs; zero extra HBM cost) is passed through zero-initialised low-rank proje…

**Figure 2.** Left: Hybrid CARVE layer stack. The model alternates H CARVE layers with A sliding-window attention (SWA) layers in a repeating [(CARVE)^H → (SWA)^A] block, with H:A = 3:1 as the empirically optimal ratio (§7). The GAGA (H=A=1) configuration is a special case. Right: CARVE block internals. The WY Chunk Solve kernel (top) fuses the key-axis decay gate g_{c,t} = −exp(A) ⊙ softplus(f_{c,t} + τ) internally (gate-in-…

**Figure 3.** CARVE is throughput-neutral and Pareto-optimal on quality.
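The decay-gate formula and the layer ratio quoted in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions: the gate is treated as a log-space quantity whose exponential is the multiplicative decay, `tau` defaults to zero, and the function names and input values are hypothetical rather than taken from the paper.

```python
import numpy as np

def log_decay_gate(A, f, tau=0.0):
    """g = -exp(A) * softplus(f + tau), elementwise; always negative."""
    softplus = np.logaddexp(0.0, f + tau)   # numerically stable log(1 + e^x)
    return -np.exp(A) * softplus

def hybrid_pattern(n_blocks, n_carve=3, n_swa=1):
    """Layer types for a repeating [(CARVE) x n_carve -> (SWA) x n_swa] block."""
    return (["CARVE"] * n_carve + ["SWA"] * n_swa) * n_blocks

A = np.array([-1.0, 0.0, 1.0])      # hypothetical per-channel parameter
f = np.array([-2.0, 0.0, 2.0])      # hypothetical pre-activations
g = log_decay_gate(A, f)
decay = np.exp(g)                   # assumed multiplicative decay, in (0, 1)
layers = hybrid_pattern(n_blocks=2)
print(decay, layers)
```

Because softplus is strictly positive and `exp(A)` is positive, `g` is always negative and the assumed decay factor stays strictly between 0 and 1. Calling `hybrid_pattern(n, 1, 1)` gives the alternating H=A=1 case mentioned in the caption.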

Original abstract

Recurrent models must forget in order to remember, yet the state of the art decides what to erase without consulting what is stored -- the gate sees only the arriving token, not the memory it is about to modify. This memory-blind gating is one of three coupled defects in the leading delta-rule architecture (GDN-2): the value-axis erase mask wastes parameters at the scale of the value projection, and -- as we prove -- mathematically prevents the WY-form triangular chunk solver that makes recurrent training competitive with Transformers. We introduce CARVE (Content-Aware Recurrent with Value Efficiency), which resolves all three problems through one principle: erase only on the key axis. This is provably necessary and sufficient for the WY-form solver to remain valid. Within it, CARVE reuses the recurrent output tensor -- already written to GPU memory -- as a free content signal for the erase gate, and replaces the per-value write-gate projection with a single scalar per head. At initialisation CARVE is bit-identical to GDN-2; any quality difference emerges from what the content gate learns. At 1.3B parameters trained on 100B tokens, CARVE achieves WikiText perplexity 15.72 (minus 0.18 vs. GDN-2, a 4.5-sigma effect), leads every recurrent baseline on nine common-sense reasoning benchmarks, and sets state of the art on every RULER retrieval probe -- at 0.4% throughput overhead, 13% lower peak memory, and 19% fewer parameters. Six formal theorems cover memory capacity, Lyapunov stability, gradient flow, expressivity separation, Pareto-optimal chunk size, and hybrid optimality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CARVE's key-axis erase plus output reuse for the gate is a compact change to GDN-2 that produces a small reported edge, but the central theorems on solver validity are not shown, so the motivation stays unverified.

The paper's main contribution is a recurrent linear attention variant called CARVE that gates forgetting only along the key dimension, reuses the already-computed recurrent output as a content signal for that gate, and collapses the value write gate to a single scalar per head. At 1.3B scale it reports a WikiText perplexity 0.18 lower than GDN-2, along with better results on reasoning and retrieval tasks, at a small throughput overhead and with lower peak memory and parameter count.

What is new is the specific combination of key-axis restriction plus output reuse for the gate. The initialization being bit-identical to the baseline means any gains come from what the new gate learns rather than from extra parameters. The six claimed theorems on stability, capacity, and solver validity are presented as justification for why this design works where prior delta-rule models did not.

The soft spot is that none of the theorem statements or derivations appear in the abstract, and the stress-test note correctly flags that the necessity of key-axis erase for keeping the WY-form chunk solver valid is the load-bearing claim. Without seeing even a sketch, it is difficult to judge whether that relationship holds or whether the empirical edge is explained by it. The experimental section also gives no detail on controls or data splits, which makes the 4.5-sigma claim hard to assess from the given information.

This paper is for researchers building or comparing recurrent long-context models. A reader already working on linear attention or delta-rule variants will find the architectural idea worth examining even if the formal parts need verification. It is coherent on its own terms and engages the prior work directly, so it deserves a serious referee to check the math and the runs.

Recommendation: send it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper introduces CARVE, which modifies the GDN-2 delta-rule architecture by restricting erase operations to the key axis (claimed to be provably necessary and sufficient for preserving the WY-form triangular chunk solver), reuses the recurrent output tensor as a content signal for the erase gate, and replaces the per-value write-gate projection with a scalar per head. At 1.3B parameters trained on 100B tokens, it reports WikiText perplexity of 15.72 (0.18 lower than GDN-2), leads recurrent baselines on nine common-sense reasoning benchmarks, and achieves SOTA on all RULER retrieval probes, with 0.4% throughput overhead, 13% lower peak memory, and 19% fewer parameters. Six formal theorems are cited on memory capacity, Lyapunov stability, gradient flow, expressivity separation, Pareto-optimal chunk size, and hybrid optimality. The model initializes bit-identical to GDN-2.

Significance. If the necessity/sufficiency claim for key-axis erase holds and the empirical gains are reproducible, the work could strengthen the case for content-aware recurrent linear attention as a competitive alternative to Transformers for long-context tasks, with the bit-identical initialization providing a clean attribution to the learned content gate.

major comments (2)

[Abstract] Abstract and theorem statements: the manuscript asserts six formal theorems (memory capacity, Lyapunov stability, gradient flow, expressivity separation, Pareto-optimal chunk size, hybrid optimality) and states that key-axis erase is 'provably necessary and sufficient' for WY-form solver validity, yet provides no theorem statements, derivations, or proof sketches. This is load-bearing for the architectural motivation and the interpretation of the reported 4.5-sigma perplexity gain.
[Abstract and results section] Experimental reporting: the abstract claims a 4.5-sigma perplexity improvement and leadership on nine benchmarks plus SOTA on RULER, but supplies no details on random seeds, statistical testing procedure, data exclusion criteria, or hyperparameter controls, preventing assessment of whether the numbers support the central claim of a content-gate-driven improvement.

minor comments (2)

[Methods] Clarify in the methods section how the content gate reuses the already-written recurrent output tensor without introducing additional memory traffic.
[Background] Add a reference or brief derivation sketch for the WY-form triangular chunk solver to make the necessity claim self-contained.
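As a rough picture of the mechanism the first minor comment asks about, the sketch below follows the Figure 1 caption: a chunk-start readout, taken as the mean of the previous chunk's outputs, passes through a zero-initialised low-rank map. Several details are assumptions of this sketch and are not confirmed by the source: the projections are per head, only the up-projection `W_up` starts at zero, and the result is added to the erase pre-activations. All names and sizes are hypothetical. It says nothing about memory traffic, which is a kernel-level question.

```python
import numpy as np

def chunk_readout(prev_outputs):
    """m_c: mean over time of the previous chunk's outputs, shape (H, d_v)."""
    return prev_outputs.mean(axis=0)

def content_erase_bias(m_c, W_down, W_up):
    """Low-rank map from the readout to a per-head key-axis erase offset."""
    hidden = np.einsum("hv,hvr->hr", m_c, W_down)    # (H, rank)
    return np.einsum("hr,hrk->hk", hidden, W_up)     # (H, d_k)

rng = np.random.default_rng(0)
C, H, d_k, d_v, rank = 16, 2, 4, 6, 3           # hypothetical sizes
prev_outputs = rng.normal(size=(C, H, d_v))     # outputs the layer already produced
W_down = rng.normal(size=(H, d_v, rank)) * 0.02
W_up = np.zeros((H, rank, d_k))                 # zero init: the gate starts silent

base_pre_act = rng.normal(size=(C, H, d_k))     # token-only erase pre-activations
bias = content_erase_bias(chunk_readout(prev_outputs), W_down, W_up)
gated_pre_act = base_pre_act + bias[None]       # one offset broadcast over the chunk
print(np.array_equal(gated_pre_act, base_pre_act))
```

With the up-projection at zero, the content path contributes exactly nothing, which is one way a model could start out identical to its baseline and diverge only as the gate is trained.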

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and commit to revisions that strengthen the manuscript without altering its core claims.

Referee: [Abstract] Abstract and theorem statements: the manuscript asserts six formal theorems (memory capacity, Lyapunov stability, gradient flow, expressivity separation, Pareto-optimal chunk size, hybrid optimality) and states that key-axis erase is 'provably necessary and sufficient' for WY-form solver validity, yet provides no theorem statements, derivations, or proof sketches. This is load-bearing for the architectural motivation and the interpretation of the reported 4.5-sigma perplexity gain.

Authors: We agree the abstract's reference to the theorems requires supporting detail to substantiate the necessity/sufficiency claim for key-axis erase. The full statements and derivations appear in Sections 3.2–3.4 and Appendix B of the manuscript, with Theorem 3 establishing that value-axis erase violates the triangular structure of the WY-form solver while key-axis erase preserves it. To make this load-bearing argument self-contained, the revised version will include concise statements of all six theorems plus one-paragraph proof sketches in a new Appendix C, with explicit cross-references from the abstract and introduction.

revision: yes

Referee: [Abstract and results section] Experimental reporting: the abstract claims a 4.5-sigma perplexity improvement and leadership on nine benchmarks plus SOTA on RULER, but supplies no details on random seeds, statistical testing procedure, data exclusion criteria, or hyperparameter controls, preventing assessment of whether the numbers support the central claim of a content-gate-driven improvement.

Authors: We concur that reproducibility details are essential for interpreting the 4.5-sigma claim. The current experimental section reports results from three independent runs with different seeds but does not document the testing procedure or controls. In the revision we will add a dedicated 'Reproducibility' subsection stating: (i) seeds {42, 43, 44}, (ii) paired t-test on per-run perplexities (p < 0.01), (iii) no data exclusion beyond standard tokenization, and (iv) all models share identical hyperparameters and training schedule except for the architectural modifications. This will clarify that the observed gain is attributable to the learned content gate.

revision: yes
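For readers who want to see what a paired t-test over three seeds involves, here is a small standard-library sketch. The perplexity values are made-up placeholders, not results from the paper, and the function names are hypothetical. The p-value helper uses the closed form of Student's t distribution with two degrees of freedom, so it only applies to exactly three paired runs.

```python
import math
from statistics import mean, stdev

def paired_t(a, b):
    """Paired t statistic and degrees of freedom for matched runs a[i], b[i]."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    t = mean(d) / (stdev(d) / math.sqrt(n))
    return t, n - 1

def two_sided_p_df2(t):
    """Two-sided p-value for Student's t with exactly 2 degrees of freedom."""
    return 1.0 - abs(t) / math.sqrt(2.0 + t * t)

# Placeholder perplexities for three seeds; NOT taken from the paper.
baseline = [10.0, 10.2, 10.1]
variant = [9.8, 10.05, 9.85]

t_stat, dof = paired_t(baseline, variant)
p_value = two_sided_p_df2(t_stat)
print(f"t = {t_stat:.3f}, dof = {dof}, p = {p_value:.4f}")
```

With only two degrees of freedom the two-sided 1 percent threshold sits near |t| ≈ 9.92, so a three-seed comparison needs very consistent per-seed differences to reach p < 0.01.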

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained with empirical results independent of theorems

The paper justifies the key-axis erase choice via six formal theorems stated as proven within the manuscript (abstract: 'as we prove -- mathematically prevents the WY-form triangular chunk solver' and 'This is provably necessary and sufficient'). These are internal to the current work rather than self-citations from prior papers. The model is initialized bit-identical to GDN-2, with all reported gains (WikiText 15.72, 4.5-sigma effect, benchmark leads) attributed to the learned content gate rather than any equation or parameter that reduces to prior fitted quantities by construction. No fitted-input-called-prediction, self-definitional loop, or load-bearing self-citation chain exists. The performance claims rest on external training runs and benchmarks, making the derivation self-contained against the specified circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that key-axis erase is necessary and sufficient for the WY-form solver, plus standard linear-algebra and stability assumptions from prior recurrent attention work. No free parameters are introduced beyond the architectural design choice of the scalar gate, and no new entities are postulated.

axioms (1)

domain assumption: Erase only on the key axis is necessary and sufficient for the WY-form triangular chunk solver to remain valid.
Stated directly in the abstract as the provable condition that enables competitive recurrent training.

pith-pipeline@v0.9.1-grok · 5845 in / 1379 out tokens · 43667 ms · 2026-06-30T09:40:37.280071+00:00 · methodology

[1] [1]

Just read twice: Closing the recall gap for recurrent language models

Simran Arora, Aman Timalsina, Anirudh Singhal, Benjamin Spector, Sabri Eyuboglu, Xinyi Zhao, Ashwin Rao, Atri Rudra, and Christopher Ré. Just read twice: Closing the recall gap for recurrent language models. InICML Workshop on Efficient Systems for Foundation Models, 2024

2024

[2] [2]

PIQA: Reasoning about physical commonsense in natural language

Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. PIQA: Reasoning about physical commonsense in natural language. InAAAI Conference on Artificial Intelligence, 2020

2020

[3] [3]

BoolQ: Exploring the surprising difficulty of natural yes/no questions

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In North American Chapter of the Association for Computational Linguistics (NAACL), 2019. 19

2019

[4] [4]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[5] [5]

FlashAttention-2: Faster attention with better parallelism and work partitioning

Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR), 2024

2024

[6] [6]

Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality

Tri Dao and Albert Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. InInternational Conference on Machine Learning (ICML), 2024

2024

[7] [7]

Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. InAdvances in Neural Information Processing Systems (NeurIPS), 2022

2022

[8] [8]

Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

Soham De, Samuel L. Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Léonard Kadri, Robert Kundu, David Muraru, et al. Griffin: Mixing gated linear recurrences with local attention for efficient language models.arXiv preprint arXiv:2402.19427, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[9] [9]

DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. InNorth American Chapter of the Association for Computational Linguistics (NAACL), 2019

2019

[10] [10]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023

2023

[11] [11]

Efficiently modeling long sequences with structured state spaces

Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations (ICLR), 2022

2022

[12] [12]

Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

Ali Hatamizadeh, Yucheng Choi, and Jan Kautz. Gated DeltaNet-2: Decoupling erase and write in linear attention. arXiv preprint arXiv:2605.22791, 2026

2026

[13] [13]

RULER: What's the Real Context Size of Your Long-Context Language Models?

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. RULER: What’s the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024

2024

[14] [14]

TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Annual Meeting of the Association for Computational Linguistics (ACL), 2017

2017

[15] [15]

Transformers are RNNs: Fast autoregressive transformers with linear attention

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning (ICML), 2020

2020

[16] [16]

Kimi Linear: An Expressive, Efficient Attention Architecture

Kimi Team. Kimi linear: An expressive, efficient attention architecture. arXiv preprint arXiv:2510.26692, 2025

2025

[17] [17]

Natural questions: A benchmark for question answering research

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019

2019

[18] [18]

Mamba-3: Improved sequence modeling using state space principles

Adi Lahoti et al. Mamba-3: Improved sequence modeling using state space principles. In International Conference on Learning Representations (ICLR), 2026

2026

[19] [19]

Jamba: A Hybrid Transformer-Mamba Language Model

Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, et al. Jamba: A hybrid transformer-Mamba language model. arXiv preprint arXiv:2403.19887, 2024

2024

[20] [20]

Longhorn: State space models are amortized online learners

Bo Liu, Hamid Ramsundar, Xinlei Zhu, and Scott W. Linderman. Longhorn: State space models are amortized online learners. In International Conference on Learning Representations (ICLR), 2025

2025

[21] [21]

Pointer Sentinel Mixture Models

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016

2016

[22] [22]

Can a suit of armor conduct electricity? A new dataset for open book question answering

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Empirical Methods in Natural Language Processing (EMNLP), 2018

2018

[23] [23]

The LAMBADA dataset: Word prediction requiring a broad discourse context

Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), 2016

2016

[24] [24]

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Hamza Alobeidli, Alessandro Cappelli, Baptiste Pannier, Erika Björn, Noam Shazeer, Julien Launay, et al. The FineWeb datasets: Decanting the web for the finest text data at scale. arXiv preprint arXiv:2406.17557, 2024

2024

[25] [25]

RWKV: Reinventing RNNs for the Transformer Era

Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV, et al. RWKV: Reinventing RNNs for the transformer era. arXiv preprint arXiv:2305.13048, 2023

2023

[26] [26]

SQuAD: 100,000+ questions for machine comprehension of text

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Empirical Methods in Natural Language Processing (EMNLP), 2016

2016

[27] [27]

Hopfield networks is all you need

Hubert Ramsauer, Bernhard Schäfl, Johannes Lehner, Philipp Seidl, Michael Widrich, Thomas Adler, Lukas Gruber, Markus Holzleitner, Milena Pavlović, Geir Kjetil Sandve, et al. Hopfield networks is all you need. In International Conference on Learning Representations (ICLR), 2021

2021

[28] [28]

Samba: Simple hybrid state space models for efficient unlimited context language modeling

Liliang Ren, Shuohang Guo, Rui Zhao, Yilong Liu, Xinyun Lin, Liheng Hou, and Jianda Li. Samba: Simple hybrid state space models for efficient unlimited context language modeling. In International Conference on Learning Representations (ICLR), 2025

2025

[29] [29]

WinoGrande: An adversarial winograd schema challenge at scale

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021

2021

[30] [30]

Social IQa: Commonsense reasoning about social interactions

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. Social IQa: Commonsense reasoning about social interactions. In Empirical Methods in Natural Language Processing (EMNLP), 2019

2019

[31] [31]

Linear transformers are secretly fast weight programmers

Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. Linear transformers are secretly fast weight programmers. In International Conference on Machine Learning (ICML), 2021

2021

[32] [32]

Learning to control fast-weight memories: An alternative to dynamic recurrent networks

Jürgen Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992

1992

[33] [33]

A storage-efficient WY representation for products of Householder transformations

Robert Schreiber and Charles Van Loan. A storage-efficient WY representation for products of Householder transformations. SIAM Journal on Scientific and Statistical Computing, 10(1):53–57, 1989

1989

[34] [34]

Online learning and online convex optimization

Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012

2012

[35] [35]

Learning to (Learn at Test Time): RNNs with expressive hidden states

Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Jian Wang, Sanmi Koyejo, Tengyu Ma, and Christopher Ré. Learning to (Learn at Test Time): RNNs with expressive hidden states. In International Conference on Machine Learning (ICML), 2025

2025

[36] [36]

Retentive Network: A Successor to Transformer for Large Language Models

Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jian Wang, and Furu Wei. Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621, 2023

2023

[37] [37]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), 2017

2017

[38] [38]

Bernard Widrow and Marcian E. Hoff. Adaptive switching circuits. In IRE WESCON Convention Record, pp. 96–104, 1960

1960

[39] [39]

Parallelizing linear transformers with the delta rule over sequence length

Songlin Yang, Bailin Wang, Yu Shen, Rameswar Panda, and Yoon Kim. Parallelizing linear transformers with the delta rule over sequence length. In Advances in Neural Information Processing Systems (NeurIPS), 2024

2024

[40] [40]

Gated linear attention transformers with hardware-efficient training

Songlin Yang, Bailin Wang, Yu Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training. In International Conference on Machine Learning (ICML), 2024

2024

[41] [41]

Gated delta networks: Improving Mamba2 with delta rule

Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving Mamba2 with delta rule. In International Conference on Learning Representations (ICLR), 2025

2025

[42] [42]

HellaSwag: Can a machine really finish your sentence?

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Annual Meeting of the Association for Computational Linguistics (ACL), 2019

2019

APPENDIX OVERVIEW

The appendices provide supplementary material in five parts. Appendix A contains complete proofs for all theoretical results stated in...