arxiv: 2509.22630 · v3 · submitted 2025-09-26 · 💻 cs.CL · cs.AI · cs.LG

StateX: Enhancing RNN Recall via Post-training State Expansion

Xingyu Shen, Yingfa Chen, Zhen Leng Thai, Xu Han, Zhiyuan Liu, Maosong Sun

Pith reviewed 2026-05-18 12:23 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords recurrent neural networks · state expansion · post-training · recall · linear attention · state-space models · in-context learning · long context


The pith

Post-training architectural changes let pre-trained RNNs use larger recurrent states to improve recall with almost no extra parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents StateX, a framework that takes already-trained recurrent models and modifies their architecture afterward to enlarge the fixed-size state that holds context. For linear attention and state-space models this enlargement happens with no or negligible growth in total parameter count. Tests on models reaching 1.3 billion parameters show clear gains on recall-heavy and in-context-learning tasks, while training remains cheap and other capabilities stay intact. A reader would care because constant-time RNNs are attractive for long sequences yet are currently limited by how much information their state can retain.

Core claim

StateX demonstrates that post-training architectural modifications can scale up the recurrent state size of pre-trained RNNs with no or negligible increase in model parameters, thereby enhancing recall and in-context learning performance efficiently without high post-training costs or loss of other capabilities.

What carries the argument

StateX post-training framework, consisting of targeted architectural modifications applied to linear attention and state-space model RNNs that increase recurrent state dimension while keeping parameter count nearly constant.
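
To see why a larger state need not mean more weights, here is a minimal arithmetic sketch for multi-head linear attention. It assumes each head keeps a d_k × d_v state matrix and that the projection widths are split evenly across heads; the layer shape is hypothetical, and changing the head count is shown only as one generic way to grow the state for free, not as a confirmed description of the StateX construction.

```python
def linear_attention_state_size(total_key_dim: int, total_value_dim: int, num_heads: int) -> int:
    """Entries in the recurrent state of one multi-head linear-attention layer.

    Each head keeps a (d_k x d_v) matrix, with d_k = total_key_dim / num_heads
    and d_v = total_value_dim / num_heads.
    """
    if total_key_dim % num_heads or total_value_dim % num_heads:
        raise ValueError("projection widths must be divisible by num_heads")
    d_k = total_key_dim // num_heads
    d_v = total_value_dim // num_heads
    return num_heads * d_k * d_v


def qkv_projection_params(d_model: int, total_key_dim: int, total_value_dim: int) -> int:
    """Weights in the query, key and value projections (biases ignored)."""
    return d_model * (2 * total_key_dim + total_value_dim)


# Hypothetical layer shape, chosen only for illustration.
D_MODEL, KEY_DIM, VALUE_DIM = 2048, 1024, 2048

for heads in (4, 2, 1):
    state = linear_attention_state_size(KEY_DIM, VALUE_DIM, heads)
    params = qkv_projection_params(D_MODEL, KEY_DIM, VALUE_DIM)
    print(f"heads={heads}  state entries={state:,}  projection params={params:,}")
```

The projection weights do not depend on the head count, while the state scales as 1 / num_heads, so in this toy accounting halving the heads doubles the state at zero parameter cost.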

If this is right

RNNs become practical for tasks that need accurate retrieval from long contexts without retraining from scratch.
The same post-training step works for both linear-attention and state-space families.
Performance improves on models as large as 1.3 billion parameters while keeping post-training cost low.
Other model abilities remain unchanged after the state expansion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

State size may be the dominant bottleneck for RNN recall, separable from the rest of the learned weights.
Similar state-expansion tricks could be tried on other constant-time sequence models beyond the two families tested.
If the modifications generalize, practitioners could start with smaller-state models and enlarge them only when needed.
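
The first point above can be probed with a toy experiment. The sketch below is an illustration under strong assumptions (random Gaussian keys and values, an ungated outer-product memory, no learned weights) and is not taken from the paper. It writes key-value pairs into a fixed-size matrix state and measures how often the right value is read back.

```python
import numpy as np


def recall_accuracy(state_dim: int, num_pairs: int, seed: int = 0) -> float:
    """Fraction of stored pairs recovered from an outer-product memory."""
    rng = np.random.default_rng(seed)
    # Keys are scaled so that each has roughly unit norm.
    keys = rng.standard_normal((num_pairs, state_dim)) / np.sqrt(state_dim)
    values = rng.standard_normal((num_pairs, state_dim))

    # Recurrent write S <- S + k v^T for every pair, done here in one matmul.
    state = keys.T @ values                    # (state_dim, state_dim)

    # Read with each stored key, then pick the most similar stored value
    # by dot product.
    readouts = keys @ state                    # (num_pairs, state_dim)
    scores = readouts @ values.T               # (num_pairs, num_pairs)
    predicted = scores.argmax(axis=1)
    return float((predicted == np.arange(num_pairs)).mean())


if __name__ == "__main__":
    for dim in (16, 64, 256):
        row = [recall_accuracy(dim, n) for n in (16, 64, 256, 1024)]
        print(dim, [round(a, 2) for a in row])
```

In this setting interference between pairs grows with the number stored and shrinks with the state dimension, which is the qualitative behaviour the bottleneck reading would predict; it says nothing about how trained models with gating behave.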

Load-bearing premise

The post-training architectural modifications can scale up the recurrent state size with no or negligible increase in model parameters while preserving the model's other capabilities.

What would settle it

The claim would be undermined if a model post-trained with StateX showed no improvement in recall accuracy on long-context tasks, a large rise in parameter count, or clear degradation on unrelated benchmarks.

Figures

Figures reproduced from arXiv: 2509.22630 by Maosong Sun, Xingyu Shen, Xu Han, Yingfa Chen, Zhen Leng Thai, Zhiyuan Liu.

**Figure 2.** Illustration of StateX (our method) for expanding the state size of linear attention and …

**Figure 3.** Performance on retrieving specific information (i.e., a needle) from synthetically generated long documents up to 64K tokens.

| Model | 4K | 8K | 16K | 32K | 64K |
|---|---|---|---|---|---|
| **GLA (Passkey Retrieval)** | | | | | |
| Original | 0.25 | 0.01 | 0.00 | 0.00 | 0.00 |
| LPT | 0.74 | 0.41 | 0.13 | 0.01 | 0.01 |
| StateX (ours) | 0.93 | 0.77 | 0.34 | 0.06 | 0.01 |
| **Mamba2 (NIAH-Single-2)** | | | | | |
| Original | 0.05 | 0.00 | 0.00 | 0.00 | 0.00 |
| LPT | 0.83 | 0.43 | 0.30 | 0.09 | 0.01 |
| StateX (ours) | 0.94 | 0.61 | 0.32 | 0.09 | 0.00 |

Common…

**Figure 4.** Model performance of reinitialization and parameter inheritance. Columns “Params” and “Total State” report the number of model parameters and state parameters for each model, respectively. StateX increases the total state sizes by roughly 50%. The main takeaway is that StateX models achieve the highest average performance, underscoring the advantage of larger states.

**Figure 5.** Model performance under varying numbers of expanded layers. Mamba2 has twice as many layers as GLA because it does not have FFN layers. [Plot: training loss against post-training tokens (B) for LPT-Mamba2, StateX-Mamba2, LPT-GLA, and StateX-GLA.]

Original abstract

Recurrent neural networks (RNNs), such as linear attention and state-space models, have gained popularity due to their constant per-token complexity when processing long contexts. However, these recurrent models struggle with tasks that require accurate recall of contextual information from long contexts, because all contextual information is compressed into a fixed-size recurrent state. Previous studies have shown that recall ability is positively correlated with the recurrent state size, yet directly training RNNs with large recurrent states results in high training costs. In this paper, we introduce StateX, a post-training framework that efficiently expands the states of pre-trained RNNs. For two popular classes of RNNs, linear attention and state-space models, we design post-training architectural modifications in StateX, to scale up the state size with no or negligible increase in model parameters. Experiments on models with up to 1.3B parameters demonstrate that StateX efficiently enhances the recall and in-context learning performance of RNNs without incurring high post-training costs or compromising other capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)


StateX claims a post-training expansion of recurrent states in linear attention and SSMs with almost no extra parameters, yielding recall gains on 1.3B models, but the exact construction for keeping parameter count flat is the part that needs checking.

The paper introduces a framework that modifies pre-trained models to support larger states after the fact, avoiding the cost of training large-state versions from scratch. Experiments up to 1.3B parameters show improvements in recall and in-context learning tasks, with low additional post-training expense and no major loss in other capabilities. This is a direct response to the known correlation between state size and recall ability in these constant-time models.

The post-training angle is the clearest new element, since prior work has mostly focused on training larger states end-to-end. The scale of the tests is reasonable and gives the results some weight.

The soft spot sits in the mechanism itself. Expanding the recurrent state normally enlarges projection matrices or state transition weights, so any claim of negligible parameter growth requires a specific trick such as selective initialization, sharing, or freezing. The abstract asserts this is achieved, but without the precise construction, initialization details, and measured parameter delta laid out, it is hard to judge whether the efficiency and capability-preservation claims fully hold. If the full paper supplies a reproducible recipe and shows the actual overhead is tiny, that concern shrinks.

This work targets researchers building or adapting efficient long-context models where transformers are too expensive. Readers looking for practical upgrades to existing RNN checkpoints would get the most from it. It has enough empirical grounding and a focused idea to merit a serious referee, even if the mechanism section may need expansion or clarification during review. I would recommend sending it out for peer review.

Referee Report

1 major / 1 minor

Summary. The paper introduces StateX, a post-training framework to expand the recurrent state size of pre-trained RNNs (linear attention and state-space models) via targeted architectural modifications. These changes are claimed to increase state dimension with no or negligible growth in parameter count. Experiments on models up to 1.3B parameters are reported to improve recall and in-context learning performance at low post-training cost while preserving other capabilities.

Significance. If the efficiency claims hold, StateX would offer a practical way to retrofit larger effective state sizes into existing efficient RNN architectures without full retraining, directly addressing the known correlation between recurrent state size and recall ability. This could be valuable for long-context applications where training costs are prohibitive.

major comments (1)

[Architectural modifications for linear attention and SSMs] The central efficiency claim—that recurrent state size can be scaled post-training with 'no or negligible increase in model parameters'—is load-bearing but insufficiently specified. For linear attention the state dimension is tied to key/value projection width; for SSMs it multiplies the input/output matrices. The manuscript must detail the exact construction (initialization of new dimensions, whether added weights are learned or frozen, and measured parameter delta) to substantiate that other capabilities remain uncompromised and post-training cost stays low.
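
The parameter delta the referee asks for is simple to account for in the naive case. The sketch below assumes a state is enlarged by widening the query and key projections; the function name, the layer shape, and the 1.3B baseline split are all hypothetical, and this is the naive route rather than anything the manuscript is known to do.

```python
from dataclasses import dataclass


@dataclass
class ExpansionReport:
    added_params: int     # new weights across all layers
    param_growth: float   # added_params / baseline_params
    state_growth: float   # new state entries / old state entries


def naive_key_widening(d_model: int, key_dim: int, new_key_dim: int,
                       num_layers: int, baseline_params: int) -> ExpansionReport:
    """Cost of widening the query/key projections from key_dim to new_key_dim."""
    # Query and key projections each gain (new_key_dim - key_dim) output columns.
    added_per_layer = 2 * d_model * (new_key_dim - key_dim)
    added = num_layers * added_per_layer
    # With the value width and head count fixed, the state scales with key width.
    return ExpansionReport(added, added / baseline_params, new_key_dim / key_dim)


# Hypothetical shape for a model of roughly 1.3B parameters.
report = naive_key_widening(d_model=2048, key_dim=1024, new_key_dim=2048,
                            num_layers=24, baseline_params=1_300_000_000)
print(f"added={report.added_params:,}  "
      f"param growth={report.param_growth:.1%}  "
      f"state growth={report.state_growth:.1f}x")
```

Under these made-up numbers, doubling the state this way costs several percent in parameters, which is why a claim of negligible growth depends on the specific construction being spelled out.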

minor comments (1)

[Abstract] The abstract states positive results on 1.3B-parameter models but supplies no quantitative metrics, baselines, or experimental details, making it difficult to evaluate the magnitude of gains or the strength of the 'no compromise' claim.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential value of StateX for retrofitting larger effective state sizes into existing RNNs. We address the major comment below and will revise the manuscript accordingly to provide the requested clarifications.


Referee: The central efficiency claim—that recurrent state size can be scaled post-training with 'no or negligible increase in model parameters'—is load-bearing but insufficiently specified. For linear attention the state dimension is tied to key/value projection width; for SSMs it multiplies the input/output matrices. The manuscript must detail the exact construction (initialization of new dimensions, whether added weights are learned or frozen, and measured parameter delta) to substantiate that other capabilities remain uncompromised and post-training cost stays low.

Authors: We agree that additional specification is needed to fully substantiate the efficiency claims. In the revised manuscript we will expand the description of the architectural modifications (currently in Section 3) with the following details: for linear attention, new dimensions in the key/value projections are initialized to zero and the added parameters are trained while the original weights remain frozen; for SSMs, the state matrices are expanded via low-rank updates initialized from the pre-trained weights with the new components trained during post-training. We will also add a table reporting the exact parameter counts before and after expansion (showing <1% growth across the tested models up to 1.3B parameters) together with ablation results confirming that other capabilities are preserved. These additions will clarify both the construction and the low post-training cost.

revision: yes
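
The zero-initialization idea in this simulated response can be checked on a toy layer. The sketch below assumes unnormalised, ungated causal linear attention with made-up dimensions. It widens the key projection with zero columns and confirms that the layer output is unchanged at initialization even though the state has grown. It illustrates the general principle only, not the authors' implementation.

```python
import numpy as np


def linear_attention(x, w_q, w_k, w_v):
    """Causal, unnormalised linear attention computed recurrently."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    state = np.zeros((w_k.shape[1], w_v.shape[1]))
    outputs = []
    for q_t, k_t, v_t in zip(q, k, v):
        state = state + np.outer(k_t, v_t)   # write: S <- S + k v^T
        outputs.append(q_t @ state)          # read:  o = q S
    return np.stack(outputs), state


rng = np.random.default_rng(0)
d_model, d_k, d_v, extra, seq_len = 16, 4, 8, 4, 10   # hypothetical sizes
x = rng.standard_normal((seq_len, d_model))
w_q = rng.standard_normal((d_model, d_k))
w_k = rng.standard_normal((d_model, d_k))
w_v = rng.standard_normal((d_model, d_v))

# New key columns start at zero, so the extra state rows stay empty at first.
w_k_wide = np.concatenate([w_k, np.zeros((d_model, extra))], axis=1)
# New query columns may be anything: they only ever multiply zero state rows.
w_q_wide = np.concatenate([w_q, rng.standard_normal((d_model, extra))], axis=1)

out_small, state_small = linear_attention(x, w_q, w_k, w_v)
out_wide, state_wide = linear_attention(x, w_q_wide, w_k_wide, w_v)

print(state_small.shape, state_wide.shape)
print(np.allclose(out_small, out_wide))
```

Only the key extension is zeroed here. If both the query and key extensions started at zero, the gradient reaching each through their product would also start at zero, so a real recipe would need to say which side is zeroed.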

Circularity Check

0 steps flagged

No circularity: StateX claims rest on explicit architectural modifications and experiments


The paper proposes concrete post-training modifications to expand recurrent state dimensions for linear attention and SSMs while claiming negligible parameter growth. These modifications are presented as new constructions, then evaluated empirically on models up to 1.3B parameters. No step reduces a prediction or uniqueness claim to a fitted parameter, self-citation, or definitional tautology. The derivation chain is self-contained against external benchmarks and does not invoke load-bearing self-references.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the approach appears to build on standard RNN architectures with post-training modifications only.

pith-pipeline@v0.9.0 · 5716 in / 1006 out tokens · 39814 ms · 2026-05-18T12:23:43.177485+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "we design post-training architectural modifications to scale up the state size with no or negligible increase in model parameters"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "Experiments on models with up to 1.3B parameters demonstrate that StateX efficiently enhances the recall..."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
