pith. machine review for the scientific record.

arxiv: 2509.22630 · v3 · submitted 2025-09-26 · 💻 cs.CL · cs.AI · cs.LG

StateX: Enhancing RNN Recall via Post-training State Expansion

Pith reviewed 2026-05-18 12:23 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords recurrent neural networks · state expansion · post-training · recall · linear attention · state-space models · in-context learning · long context

The pith

Post-training architectural changes let pre-trained RNNs use larger recurrent states to improve recall with almost no extra parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents StateX, a framework that takes already-trained recurrent models and modifies their architecture afterward to enlarge the fixed-size state that holds context. For linear attention and state-space models this enlargement happens with no or negligible growth in total parameter count. Tests on models reaching 1.3 billion parameters show clear gains on recall-heavy and in-context-learning tasks, while training remains cheap and other capabilities stay intact. A reader would care because constant-time RNNs are attractive for long sequences yet are currently limited by how much information their state can retain.

Core claim

StateX demonstrates that post-training architectural modifications can scale up the recurrent state size of pre-trained RNNs with no or negligible increase in model parameters, thereby enhancing recall and in-context learning performance efficiently without high post-training costs or loss of other capabilities.

What carries the argument

StateX post-training framework, consisting of targeted architectural modifications applied to linear attention and state-space model RNNs that increase recurrent state dimension while keeping parameter count nearly constant.

If this is right

  • RNNs become practical for tasks that need accurate retrieval from long contexts without retraining from scratch.
  • The same post-training step works for both linear-attention and state-space families.
  • Performance improves on models as large as 1.3 billion parameters while keeping post-training cost low.
  • Other model abilities remain unchanged after the state expansion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • State size may be the dominant bottleneck for RNN recall, separable from the rest of the learned weights.
  • Similar state-expansion tricks could be tried on other constant-time sequence models beyond the two families tested.
  • If the modifications generalize, practitioners could start with smaller-state models and enlarge them only when needed.

Load-bearing premise

The post-training architectural modifications can scale up the recurrent state size with no or negligible increase in model parameters while preserving the model's other capabilities.

What would settle it

The claim would be undermined if a model post-trained with StateX showed no improvement in recall accuracy on long-context tasks, a large rise in parameter count, or clear degradation on unrelated benchmarks.
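The three failure conditions can be phrased as a simple before/after check. The sketch below is illustrative only: the function name `statex_claim_refuted`, the metric keys, the thresholds, and the sample numbers are all assumptions made for the example, not values taken from the paper.

```python
def statex_claim_refuted(before, after, max_param_growth=0.01, max_other_drop=0.01):
    """Return the failure conditions triggered by a before/after comparison.

    `before` and `after` are dicts with the hypothetical keys 'recall',
    'params', and 'other' (score on unrelated benchmarks). Both thresholds
    are illustrative assumptions, not numbers from the paper.
    """
    failures = []
    if after["recall"] <= before["recall"]:
        failures.append("no recall improvement")
    growth = (after["params"] - before["params"]) / before["params"]
    if growth > max_param_growth:
        failures.append("large parameter growth")
    if before["other"] - after["other"] > max_other_drop:
        failures.append("degradation on unrelated benchmarks")
    return failures


# Placeholder numbers, invented purely to exercise the function.
baseline = {"recall": 0.20, "params": 1_000_000, "other": 0.50}
expanded = {"recall": 0.60, "params": 1_002_000, "other": 0.50}
bloated = {"recall": 0.20, "params": 1_500_000, "other": 0.40}

print(statex_claim_refuted(baseline, expanded))
print(statex_claim_refuted(baseline, bloated))
```

An empty list means none of the three conditions fired for that comparison; what counts as a "large" rise or a "clear" degradation is a judgment the thresholds merely stand in for.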

Figures

Figures reproduced from arXiv: 2509.22630 by Maosong Sun, Xingyu Shen, Xu Han, Yingfa Chen, Zhen Leng Thai, Zhiyuan Liu.

Figure 1: Difference between the traditional pipeline and StateX for training long-context mod…

Figure 2: Illustration of StateX (our method) for expanding the state size of linear attention and …

Figure 3: Performance on retrieving specific information (i.e., a needle) from synthetically generated long documents up to 64K tokens.

  Model            4K    8K    16K   32K   64K
  GLA, Passkey Retrieval
  Original         0.25  0.01  0.00  0.00  0.00
  LPT              0.74  0.41  0.13  0.01  0.01
  StateX (ours)    0.93  0.77  0.34  0.06  0.01
  Mamba2, NIAH-Single-2
  Original         0.05  0.00  0.00  0.00  0.00
  LPT              0.83  0.43  0.30  0.09  0.01
  StateX (ours)    0.94  0.61  0.32  0.09  0.00
  Common…

Figure 4: Model performance of reinitialization and parameter inheritance.
  … columns “Params” and “Total State” report the number of model parameters and state parameters for each model, respectively. StateX increases the total state sizes by roughly 50%. The main takeaway is that StateX models achieve the highest average performance, underscoring the advantage of larger states. In-Context Learning …

Figure 5: Model performance under varying numbers of expanded layers. Mamba2 has twice as many layers as GLA because it does not have FFN layers.
  [plot: x-axis Post-training Tokens (B); y-axis Training Loss; series LPT-Mamba2, StateX-Mamba2, LPT-GLA, StateX-GLA]
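The needle-retrieval accuracies quoted with Figure 3 can be condensed into one number per method. The sketch below copies the accuracies from the extracted text and averages them over the five context lengths; the unweighted mean is a summary chosen here for illustration, not a statistic the paper reports.

```python
# Accuracies as they appear in the extracted Figure 3 text,
# ordered by context length: 4K, 8K, 16K, 32K, 64K.
CONTEXT_LENGTHS = ["4K", "8K", "16K", "32K", "64K"]
RESULTS = {
    "GLA / Passkey Retrieval": {
        "Original": [0.25, 0.01, 0.00, 0.00, 0.00],
        "LPT": [0.74, 0.41, 0.13, 0.01, 0.01],
        "StateX": [0.93, 0.77, 0.34, 0.06, 0.01],
    },
    "Mamba2 / NIAH-Single-2": {
        "Original": [0.05, 0.00, 0.00, 0.00, 0.00],
        "LPT": [0.83, 0.43, 0.30, 0.09, 0.01],
        "StateX": [0.94, 0.61, 0.32, 0.09, 0.00],
    },
}


def mean(values):
    """Unweighted arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Mean accuracy per task and method (our own summary, not from the paper).
MEANS = {
    task: {method: mean(scores) for method, scores in methods.items()}
    for task, methods in RESULTS.items()
}

for task, methods in MEANS.items():
    for method, value in methods.items():
        print(f"{task:24s} {method:9s} mean accuracy = {value:.3f}")
```

The averages hide the shape of the curves: every method in the table falls to 0.01 or below at 64K, so the differences sit mostly at 4K to 16K.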
Original abstract

Recurrent neural networks (RNNs), such as linear attention and state-space models, have gained popularity due to their constant per-token complexity when processing long contexts. However, these recurrent models struggle with tasks that require accurate recall of contextual information from long contexts, because all contextual information is compressed into a fixed-size recurrent state. Previous studies have shown that recall ability is positively correlated with the recurrent state size, yet directly training RNNs with large recurrent states results in high training costs. In this paper, we introduce StateX, a post-training framework that efficiently expands the states of pre-trained RNNs. For two popular classes of RNNs, linear attention and state-space models, we design post-training architectural modifications in StateX, to scale up the state size with no or negligible increase in model parameters. Experiments on models with up to 1.3B parameters demonstrate that StateX efficiently enhances the recall and in-context learning performance of RNNs without incurring high post-training costs or compromising other capabilities.
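To make the "fixed-size recurrent state" concrete, here is a minimal sketch of an ungated, unnormalized linear-attention recurrence, S_t = S_{t-1} + k_t v_t^T with readout o_t = S_t^T q_t. This is a deliberate simplification: the models discussed in the paper (GLA, Mamba2) add gating or decay, and the function names and toy vectors below are assumptions made for the example.

```python
def linear_attention_step(state, q, k, v):
    """One recurrent step: S <- S + k v^T, then read out o = S^T q.

    `state` is a d_k x d_v matrix stored as a list of rows.
    """
    d_k, d_v = len(k), len(v)
    new_state = [[state[i][j] + k[i] * v[j] for j in range(d_v)] for i in range(d_k)]
    output = [sum(new_state[i][j] * q[i] for i in range(d_k)) for j in range(d_v)]
    return new_state, output


def run(tokens, d_k, d_v):
    """Process (q, k, v) triples in order; the state never changes shape."""
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v in tokens:
        state, out = linear_attention_step(state, q, k, v)
        outputs.append(out)
    return state, outputs


# Toy sequence with d_k = d_v = 2. Tokens 1 and 2 use orthogonal keys;
# token 3 reuses token 1's key, so the two values share a state row.
tokens = [
    ([0.0, 0.0], [1.0, 0.0], [1.0, 2.0]),
    ([1.0, 0.0], [0.0, 1.0], [3.0, 4.0]),  # query matches key 1: clean recall
    ([1.0, 0.0], [1.0, 0.0], [5.0, 6.0]),  # key 1 reused: values are summed
]
final_state, outputs = run(tokens, d_k=2, d_v=2)
print(outputs[1])  # value stored under key 1
print(outputs[2])  # mixture of the two values stored under key 1
```

With only two key directions available, a third stored item must overlap an earlier one, which is the compression effect the abstract attributes to a fixed-size state. A larger d_k gives more non-interfering slots.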

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces StateX, a post-training framework to expand the recurrent state size of pre-trained RNNs (linear attention and state-space models) via targeted architectural modifications. These changes are claimed to increase state dimension with no or negligible growth in parameter count. Experiments on models up to 1.3B parameters are reported to improve recall and in-context learning performance at low post-training cost while preserving other capabilities.

Significance. If the efficiency claims hold, StateX would offer a practical way to retrofit larger effective state sizes into existing efficient RNN architectures without full retraining, directly addressing the known correlation between recurrent state size and recall ability. This could be valuable for long-context applications where training costs are prohibitive.

major comments (1)
  1. [Architectural modifications for linear attention and SSMs] The central efficiency claim—that recurrent state size can be scaled post-training with 'no or negligible increase in model parameters'—is load-bearing but insufficiently specified. For linear attention the state dimension is tied to key/value projection width; for SSMs it multiplies the input/output matrices. The manuscript must detail the exact construction (initialization of new dimensions, whether added weights are learned or frozen, and measured parameter delta) to substantiate that other capabilities remain uncompromised and post-training cost stays low.
minor comments (1)
  1. [Abstract] The abstract states positive results on 1.3B-parameter models but supplies no quantitative metrics, baselines, or experimental details, making it difficult to evaluate the magnitude of gains or the strength of the 'no compromise' claim.
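The coupling the referee describes can be examined with rough bookkeeping. The sketch below counts parameters and state entries for two simplified layer shapes. Everything about these shapes is an assumption made for illustration (the projection layout, the dimensions, and the choice of changing the head count for linear attention); it is not the construction specified in the paper. The expansion factor of 4 for the SSM case matches the ratio E that the paper's Table 6 caption gives for SSMs.

```python
def linear_attention_layer(d_model, num_heads, total_qk_width, total_v_width):
    """Parameter and state counts for a hypothetical multi-head linear-attention layer.

    Assumes Q and K projections of shape d_model x total_qk_width and
    V and O projections of shape d_model x total_v_width.
    """
    d_k = total_qk_width // num_heads
    d_v = total_v_width // num_heads
    params = 2 * d_model * total_qk_width + 2 * d_model * total_v_width
    state = num_heads * d_k * d_v  # one d_k x d_v matrix per head
    return params, state


def ssm_layer(d_model, num_heads, head_dim, state_dim):
    """Parameter and state counts for a hypothetical SSM layer.

    Assumes input/output projections of width num_heads * head_dim and two
    state-dimension projections (often written B and C) shared across heads.
    """
    inner = num_heads * head_dim
    params = 2 * d_model * inner + 2 * d_model * state_dim
    state = num_heads * head_dim * state_dim
    return params, state


# Linear attention: same projection widths, fewer but wider heads.
la_before = linear_attention_layer(2048, num_heads=4, total_qk_width=1024, total_v_width=2048)
la_after = linear_attention_layer(2048, num_heads=1, total_qk_width=1024, total_v_width=2048)

# SSM: state dimension multiplied by 4.
ssm_before = ssm_layer(2048, num_heads=64, head_dim=64, state_dim=128)
ssm_after = ssm_layer(2048, num_heads=64, head_dim=64, state_dim=4 * 128)

for name, (p0, s0), (p1, s1) in [
    ("linear attention", la_before, la_after),
    ("ssm", ssm_before, ssm_after),
]:
    print(f"{name}: params x{p1 / p0:.3f}, state x{s1 / s0:.1f}")
```

In this toy accounting the linear-attention state grows without any change in parameter count, while the SSM layer gains parameters only through the two state-dimension projections. Whether the real models behave this way depends on the exact construction and the measured parameter delta, which is precisely what the referee asks the authors to report.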

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential value of StateX for retrofitting larger effective state sizes into existing RNNs. We address the major comment below and will revise the manuscript accordingly to provide the requested clarifications.

Point-by-point responses
  1. Referee: The central efficiency claim—that recurrent state size can be scaled post-training with 'no or negligible increase in model parameters'—is load-bearing but insufficiently specified. For linear attention the state dimension is tied to key/value projection width; for SSMs it multiplies the input/output matrices. The manuscript must detail the exact construction (initialization of new dimensions, whether added weights are learned or frozen, and measured parameter delta) to substantiate that other capabilities remain uncompromised and post-training cost stays low.

    Authors: We agree that additional specification is needed to fully substantiate the efficiency claims. In the revised manuscript we will expand the description of the architectural modifications (currently in Section 3) with the following details: for linear attention, new dimensions in the key/value projections are initialized to zero and the added parameters are trained while the original weights remain frozen; for SSMs, the state matrices are expanded via low-rank updates initialized from the pre-trained weights with the new components trained during post-training. We will also add a table reporting the exact parameter counts before and after expansion (showing <1% growth across the tested models up to 1.3B parameters) together with ablation results confirming that other capabilities are preserved. These additions will clarify both the construction and the low post-training cost. revision: yes
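The zero-initialization idea in the simulated rebuttal can be checked on a toy example. The sketch below uses causal, unnormalized linear attention in its parallel form, o_t = sum over i <= t of (q_t . k_i) v_i, and widens the query and key vectors. If the added key coordinates start at zero, every score q . k is unchanged, so the widened model reproduces the original outputs before any training. The function names and numbers are hypothetical, and this illustrates only the rebuttal's description, not a confirmed detail of the paper.

```python
def attend(queries, keys, values):
    """Causal unnormalized linear attention: o_t = sum_{i<=t} (q_t . k_i) v_i."""
    outputs = []
    for t, q in enumerate(queries):
        out = [0.0] * len(values[0])
        for k, v in zip(keys[: t + 1], values[: t + 1]):
            score = sum(a * b for a, b in zip(q, k))
            out = [o + score * x for o, x in zip(out, v)]
        outputs.append(out)
    return outputs


def widen(vectors, extra):
    """Append the coordinates in `extra` to every vector."""
    return [vec + list(extra) for vec in vectors]


queries = [[1.0, 2.0], [0.5, -1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]

base = attend(queries, keys, values)

# New key coordinates are zero; new query coordinates are arbitrary.
wide = attend(widen(queries, [0.7, -0.3]), widen(keys, [0.0, 0.0]), values)

print(base)
print(wide)
print(base == wide)
```

The key width, and with it the d_k x d_v state, doubles from 2 to 4 rows here, yet the new rows hold only zeros until the added key weights move away from zero during post-training.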

Circularity Check

0 steps flagged

No circularity: StateX claims rest on explicit architectural modifications and experiments

Full rationale

The paper proposes concrete post-training modifications to expand recurrent state dimensions for linear attention and SSMs while claiming negligible parameter growth. These modifications are presented as new constructions, then evaluated empirically on models up to 1.3B parameters. No step reduces a prediction or uniqueness claim to a fitted parameter, self-citation, or definitional tautology. The derivation chain is self-contained against external benchmarks and does not invoke load-bearing self-references.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the approach appears to build on standard RNN architectures with post-training modifications only.

pith-pipeline@v0.9.0 · 5716 in / 1006 out tokens · 39814 ms · 2026-05-18T12:23:43.177485+00:00 · methodology

Table 6: Overview of GLA and Mamba2, two popular RNNs with matrix-valued recurrent states. H, P, N, d_k, d_v are hyperparameters of the architectures. E is the expansion ratio of StateX for SSMs, which is set to 4, as mentioned in Section 4.2. Columns: Model, Update rule, Query rule, State size, StateX state size. GLA St−1,hd...