Recognition: no theorem link
Priming: Hybrid State Space Models From Pre-trained Transformers
Pith reviewed 2026-05-12 01:08 UTC · model grok-4.3
The pith
Priming initializes hybrid attention-SSM models from pre-trained transformers and recovers performance with under 0.5% of the original training tokens.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Priming converts hybrid architecture design from a full pre-training problem into a knowledge transfer task. It initializes a hybrid attention-plus-SSM model from a pre-trained transformer, then applies short alignment and post-training phases that recover downstream quality using less than 0.5% of the source model's original token budget. The procedure works across dense and mixture-of-experts transformers of varying families and sizes, and it enables controlled scaling experiments that reveal a consistent expressiveness ordering among SSM variants: Gated KalmaNet outperforms Gated DeltaNet, which in turn outperforms Mamba-2, with the ordering directly forecasting long-context reasoning and
What carries the argument
Priming, the procedure of copying transformer weights into a hybrid attention-SSM architecture followed by alignment and post-training phases that transfer knowledge without full retraining.
If this is right
- Hybrid models produced by Priming deliver up to 2.3 times higher decode throughput than the source transformer while remaining within 1% of its quality.
- The expressiveness ranking GKA greater than GDN greater than Mamba-2 directly predicts which hybrid variant performs best on long-context reasoning tasks.
- At 32B scale the primed GKA hybrid improves average reasoning scores by 3.8 points over its source Qwen3-32B model.
- The released model zoo and training code allow other researchers to repeat or extend the same controlled SSM comparisons at scale.
Where Pith is reading between the lines
- Transformer representations appear general enough that recurrent SSM components can be substituted in with only modest additional training.
- The same priming approach might be tested for initializing hybrids that combine attention with other recurrent or memory-efficient blocks beyond the three SSMs compared here.
- Widespread adoption of primed hybrids could shift inference cost curves for long-context applications such as multi-step reasoning and retrieval-augmented generation.
Load-bearing premise
That a hybrid model started from transformer weights can be aligned and post-trained to recover quality without catastrophic forgetting or loss of pre-trained capabilities.
What would settle it
If a primed hybrid model trained on the same post-training data as its source transformer fails to reach within 1% of the transformer's downstream performance on the reported long-context reasoning benchmarks, the recovery claim would be falsified.
Figures
read the original abstract
Hybrid State-Space models combine Attention with recurrent State-Space Model (SSM) layers, balancing eidetic memory from Attention with compressed fading memory from SSMs. This yields smaller Key-Value caches and faster decoding than Transformers, along with a richer architectural design space. Exploring that design space at scale has so far required training from scratch, a barrier that has kept most large-model Hybrid research within a narrow range of architectures. We introduce Priming, a method that turns Hybrid architecture design from a pre-training problem into a knowledge transfer one. Priming initializes a Hybrid model from a pre-trained Transformer and, through short alignment and post-training phases, recovers downstream quality using less than 0.5% of the source model's pre-training token budget. Priming is agnostic to the source Transformer family (e.g., Qwen, Llama, Mistral), model class (dense or Mixture-of-Experts), and model scale. Priming enables us to run the first controlled comparison of SSM layer types at scale under identical conditions. We evaluate, Gated KalmaNet (GKA), Gated DeltaNet (GDN), and Mamba-2, and show that their expressiveness hierarchy, GKA>GDN>Mamba-2, directly predicts downstream performance on long-context reasoning tasks. We scale Priming to 8B/32B reasoning models with native 128K contexts. Our Hybrid GKA 32B improves over its source Qwen3-32B by +3.8 average reasoning points, while staying within 1% of a Transformer post-trained on the same data and enabling up to 2.3x higher decode throughput. To foster research on Hybrid architectures, we release a model zoo of primed Hybrid models for long-context reasoning and instruction following, together with the Priming training and inference code (Sequence Parallelism algorithms for long-context training, optimized GKA kernels, and vLLM serving plugin), all under Apache~2.0 License.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Priming, a knowledge-transfer method that initializes hybrid models (combining attention with SSM layers such as Gated KalmaNet, Gated DeltaNet, or Mamba-2) from pre-trained Transformers. Short alignment and post-training phases (<0.5% of the source pre-training token budget) recover downstream quality. This enables the first controlled, scale-matched comparison of SSM layer types, revealing an expressiveness hierarchy GKA > GDN > Mamba-2 that predicts long-context reasoning performance. The method is shown to be agnostic to source family, scale, and density (dense/MoE); an 8B/32B GKA hybrid improves +3.8 average reasoning points over Qwen3-32B while remaining within 1% of a same-data Transformer baseline and delivering up to 2.3x decode throughput. A model zoo, training/inference code (including Sequence Parallelism and optimized kernels), and vLLM plugin are released under Apache 2.0.
Significance. If the empirical claims hold, Priming materially lowers the barrier to large-scale hybrid architecture exploration by converting it from a full pre-training problem into a transfer problem. The controlled SSM-layer comparison and released artifacts (models, code, kernels) provide immediate value for the community studying efficient long-context models. The reported hierarchy offers a falsifiable prediction that can be tested by others using the released zoo.
major comments (2)
- The abstract states concrete gains (+3.8 reasoning points, 2.3x throughput, within 1% of Transformer baseline) but supplies no information on data splits, statistical significance, number of runs, or post-hoc selection criteria. The full manuscript must include these controls (e.g., in the experimental section) for the central claim of successful transfer without catastrophic forgetting to be verifiable.
- The hierarchy claim (GKA > GDN > Mamba-2 directly predicts downstream performance) is load-bearing for the paper's contribution on controlled comparison. The manuscript should report the exact metrics, context lengths, and task suite used to establish this ordering, together with any ablation that isolates layer type from other architectural differences.
minor comments (2)
- Clarify the precise definition of 'alignment phase' versus 'post-training phase' and the token budgets allocated to each, ideally with a table or equation.
- The claim of agnosticism to source family/scale is strong; a brief table summarizing results across at least two additional source models (beyond Qwen) would strengthen it.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for minor revision. We are pleased that the significance of Priming for lowering the barrier to hybrid architecture exploration is recognized. We address each major comment below and will update the manuscript to incorporate the requested details.
read point-by-point responses
-
Referee: The abstract states concrete gains (+3.8 reasoning points, 2.3x throughput, within 1% of Transformer baseline) but supplies no information on data splits, statistical significance, number of runs, or post-hoc selection criteria. The full manuscript must include these controls (e.g., in the experimental section) for the central claim of successful transfer without catastrophic forgetting to be verifiable.
Authors: We acknowledge that the abstract is necessarily concise and does not include these details. The full manuscript's experimental section and appendices already describe the post-training data (a curated mix of long-context and reasoning corpora totaling less than 0.5% of the source pre-training budget), the fixed evaluation suite, and results from multiple independent runs. To directly address verifiability, we will add an explicit subsection on 'Reproducibility and Statistical Controls' that states the data splits, confirms three independent runs for the 32B-scale results with standard deviations, notes the absence of post-hoc selection, and reiterates that the same data was used for the matched Transformer baseline. This will make the transfer claims fully verifiable without altering any reported numbers. revision: yes
-
Referee: The hierarchy claim (GKA > GDN > Mamba-2 directly predicts downstream performance) is load-bearing for the paper's contribution on controlled comparison. The manuscript should report the exact metrics, context lengths, and task suite used to establish this ordering, together with any ablation that isolates layer type from other architectural differences.
Authors: We agree that the hierarchy is central and must be documented with precision. The controlled comparison already fixes the source Transformer, attention layers, training recipe, and hyperparameters across GKA, GDN, and Mamba-2 variants, isolating the SSM layer. The ordering is established on long-context reasoning performance at 128K context using average accuracy across the Needle-in-Haystack, multi-hop QA, and long-document reasoning tasks that match the 32B evaluation suite. We will expand Section 5 to explicitly enumerate the metrics, context lengths, and task suite, and add a dedicated ablation table that varies only the SSM layer while holding all other factors constant. This will confirm the hierarchy's predictive power for downstream performance. revision: yes
Circularity Check
No significant circularity
full rationale
The paper describes an empirical knowledge-transfer procedure (Priming) that initializes hybrid SSM-Attention models from existing pre-trained Transformers, followed by short alignment and post-training phases. No equations, derivations, or first-principles results are presented that reduce to their own inputs by construction. The central claims rest on reported downstream performance numbers, model releases, and controlled empirical comparisons of SSM layer types; these are falsifiable experimental outcomes rather than self-referential definitions or fitted parameters renamed as predictions. No load-bearing self-citations, uniqueness theorems, or smuggled ansatzes appear in the provided text.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
GQA : Training generalized multi-query transformer models from multi-head checkpoints
Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai. GQA : Training generalized multi-query transformer models from multi-head checkpoints. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4895--4901, Singa...
-
[2]
Training-free long-context scaling of large language models
Chenxin An, Fei Huang, Jun Zhang, Shansan Gong, Xipeng Qiu, Chang Zhou, and Lingpeng Kong. Training-free long-context scaling of large language models. In International Conference on Machine Learning, pages 1493--1510. PMLR, 2024
work page 2024
-
[3]
Zoology: Measuring and improving recall in efficient language models
Simran Arora, Sabri Eyuboglu, Aman Timalsina, Isys Johnson, Michael Poli, James Zou, Atri Rudra, and Christopher Re. Zoology: Measuring and improving recall in efficient language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=LY3ukUANko
work page 2024
-
[4]
Hybrid Architectures for Language Models: Systematic Analysis and Design Insights
Sangmin Bae, Bilge Acun, Haroun Habeeb, Seungyeon Kim, Chien-Yu Lin, Liang Luo, Junjie Wang, and Carole-Jean Wu. Hybrid architectures for language models: Systematic analysis and design insights. arXiv preprint arXiv:2510.04800, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[5]
Richard Barrett, Michael Berry, Tony F. Chan, James Demmel, June Donato, Jack Dongarra, Victor Eijkhout, Roldan Pozo, Charles Romine, and Henk van der Vorst. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods. Society for Industrial and Applied Mathematics, 1994. doi:10.1137/1.9781611971538. URL https://epubs.siam.org/doi/...
-
[6]
arXiv preprint arXiv:2508.14444 (2025)
Aarti Basant, Abhijit Khairnar, Abhijit Paithankar, Abhinav Khattar, Adithya Renduchintala, Aditya Malte, Akhiad Bercovich, Akshay Hazare, Alejandra Rico, Aleksander Ficek, et al. Nvidia nemotron nano 2: An accurate and efficient hybrid mamba-transformer reasoning model. arXiv preprint arXiv:2508.14444, 2025
-
[7]
o ppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, G \
Maximilian Beck, Korbinian P \"o ppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, G \"u nter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. x LSTM : Extended long short-term memory. Advances in Neural Information Processing Systems, 37: 0 107547--107603, 2024
work page 2024
-
[8]
Transformers to ssms: Distilling quadratic knowledge to subquadratic models
Aviv Bick, Kevin Li, Eric Xing, J Zico Kolter, and Albert Gu. Transformers to ssms: Distilling quadratic knowledge to subquadratic models. Advances in neural information processing systems, 37: 0 31788--31812, 2024
work page 2024
-
[9]
Nvidia nemotron 3: Efficient and open intelligence, 2025
Aaron Blakeman, Aaron Grattafiori, Aarti Basant, Abhibha Gupta, Abhinav Khattar, Adi Renduchintala, Aditya Vavre, Akanksha Shukla, Akhiad Bercovich, Aleksander Ficek, et al. Nvidia nemotron 3: Efficient and open intelligence. arXiv preprint arXiv:2512.20856, 2025 a
-
[10]
Aaron Blakeman, Aaron Grattafiori, Aarti Basant, Abhibha Gupta, Abhinav Khattar, Adi Renduchintala, Aditya Vavre, Akanksha Shukla, Akhiad Bercovich, Aleksander Ficek, et al. Nemotron 3 nano: Open, efficient mixture-of-experts hybrid mamba-transformer model for agentic reasoning. arXiv preprint arXiv:2512.20848, 2025 b
-
[11]
Ruisheng Cao, Mouxiang Chen, Jiawei Chen, Zeyu Cui, Yunlong Feng, Binyuan Hui, Yuheng Jing, Kaixin Li, Mingze Li, Junyang Lin, et al. Qwen3-coder-next technical report. arXiv preprint arXiv:2603.00729, 2026
-
[12]
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, et al. Minimax-m1: Scaling test-time compute efficiently with lightning attention. arXiv preprint arXiv:2506.13585, 2025 a
work page internal anchor Pith review arXiv 2025
-
[13]
Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[14]
Baker, Benjamin Burns, Daniel Adu-Ampratwum, Xuhui Huang, Xia Ning, Song Gao, Yu Su, and Huan Sun
Ziru Chen, Shijie Chen, Yuting Ning, Qianheng Zhang, Boshi Wang, Botao Yu, Yifei Li, Zeyi Liao, Chen Wei, Zitong Lu, Vishal Dey, Mingyi Xue, Frazier N. Baker, Benjamin Burns, Daniel Adu-Ampratwum, Xuhui Huang, Xia Ning, Song Gao, Yu Su, and Huan Sun. Scienceagentbench: Toward rigorous assessment of language agents for data-driven scientific discovery. In ...
work page 2025
-
[15]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[16]
Albert, Pranesh Srinivasan, Haining Pan, Philippe Faist, Brian A Rohr, Michael J
Hao Cui, Zahra Shamsi, Gowoon Cheon, Xuejian Ma, Shutong Li, Maria Tikhanovskaya, Peter Christian Norgaard, Nayantara Mudur, Martyna Beata Plomecka, Paul Raccuglia, Yasaman Bahri, Victor V. Albert, Pranesh Srinivasan, Haining Pan, Philippe Faist, Brian A Rohr, Michael J. Statt, Dan Morris, Drew Purves, Elise Kleeman, Ruth Alcantara, Matthew Abraham, Muqth...
work page 2025
-
[17]
Transformer-xl: Attentive language models beyond a fixed-length context
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. In Proceedings of the 57th annual meeting of the association for computational linguistics, pages 2978--2988, 2019
work page 2019
-
[18]
Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. In International Conference on Machine Learning, pages 10041--10071. PMLR, 2024
work page 2024
-
[19]
Flashattention: Fast and memory-efficient exact attention with io-awareness
Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher R \'e . Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in neural information processing systems, 35: 0 16344--16359, 2022
work page 2022
-
[20]
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
Soham De, Samuel L Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, et al. Griffin: Mixing gated linear recurrences with local attention for efficient language models. arXiv preprint arXiv:2402.19427, 2024
work page internal anchor Pith review arXiv 2024
-
[21]
Fewer truncations improve language modeling
Hantian Ding, Zijian Wang, Giovanni Paolini, Varun Kumar, Anoop Deoras, Dan Roth, and Stefano Soatto. Fewer truncations improve language modeling. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=kRxCDDFNpp
work page 2024
-
[22]
Hymba: A hybrid-head architecture for small language models
Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, ZIJIA CHEN, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van keirsbilck, Min-Hung Chen, Yoshi Suhara, Yingyan Celine Lin, Jan Kautz, and Pavlo Molchanov. Hymba: A hybrid-head architecture for small language models. In The Thirteenth International Conference on Learning Representations, 2025. URL ht...
work page 2025
-
[23]
A mathematical framework for transformer circuits
Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 1 0 (1): 0 12, 2021
work page 2021
-
[24]
AREAL : A large-scale asynchronous reinforcement learning system for language reasoning
Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, WANG JIASHU, Tongkai Yang, Binhang Yuan, and Yi Wu. AREAL : A large-scale asynchronous reinforcement learning system for language reasoning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id...
work page 2026
-
[25]
Extending the context of pretrained llms by dropping their positional embeddings
Yoav Gelberg, Koshi Eguchi, Takuya Akiba, and Edoardo Cetin. Extending the context of pretrained llms by dropping their positional embeddings. arXiv preprint arXiv:2512.12167, 2025
-
[26]
Zamba: A compact 7b SSM.arXiv preprint arXiv:2405.16712,
Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, and Beren Millidge. Zamba: A compact 7b ssm hybrid model, 2024. URL https://arxiv.org/abs/2405.16712
-
[27]
RADLADS : Rapid attention distillation to linear attention decoders at scale
Daniel Goldstein, Eric Alcaide, Janna Lu, and Eugene Cheah. RADLADS : Rapid attention distillation to linear attention decoders at scale. In Second Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=38GehGepDd
work page 2025
-
[28]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[29]
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[30]
Combining recurrent, convolutional, and continuous-time models with linear state space layers
Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher R \'e . Combining recurrent, convolutional, and continuous-time models with linear state space layers. Advances in neural information processing systems, 34: 0 572--585, 2021
work page 2021
-
[31]
Efficiently modeling long sequences with structured state spaces
Albert Gu, Karan Goel, and Christopher Re. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=uYLFoz1vlAC
work page 2022
-
[32]
Jet-nemotron: Efficient language model with post neural architecture search
Yuxian Gu, Qinghao Hu, Haocheng Xi, Junyu Chen, Shang Yang, Song Han, and Han Cai. Jet-nemotron: Efficient language model with post neural architecture search. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025
work page 2025
-
[33]
A survey of model reduction by balanced truncation and some new results
Serkan Gugercin and Athanasios C Antoulas. A survey of model reduction by balanced truncation and some new results. International Journal of Control, 77 0 (8): 0 748--766, 2004
work page 2004
-
[34]
Measuring massive multitask language understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=d7KBjmI3GmQ
work page 2021
-
[35]
Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. RULER : What s the real context size of your long-context language models? In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=kIoBbc76Sy
work page 2024
-
[36]
Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models. arXiv preprint arXiv:2309.14509, 2023
work page internal anchor Pith review arXiv 2023
-
[37]
Livecodebench: Holistic and contamination free evaluation of large language models for code
Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=chfJJYC3iL
work page 2025
-
[38]
Samy Jelassi, David Brandfonbrener, Sham M. Kakade, and eran malach. Repeat after me: Transformers are better than state space models at copying. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=duRRoGeoQT
work page 2024
-
[39]
Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE -bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=VTF8yNQM66
work page 2024
-
[40]
R. E. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82 0 (1): 0 35--45, 1960
work page 1960
-
[41]
Transformers are rnns: Fast autoregressive transformers with linear attention
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and Fran c ois Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156--5165. PMLR, 2020
work page 2020
-
[42]
Reformer: The efficient transformer
Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rkgNKkHtvB
work page 2020
-
[43]
BABIL ong: Testing the limits of LLM s with long context reasoning-in-a-haystack
Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Igorevich Sorokin, Artyom Sorokin, and Mikhail Burtsev. BABIL ong: Testing the limits of LLM s with long context reasoning-in-a-haystack. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum?id=u7m2CG84BQ
work page 2024
-
[44]
Gonzalez, Hao Zhang, and Ion Stoica
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023
work page 2023
-
[45]
Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Christopher Wilhelm, Luca Soldaini, Noah A
Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James Validad Miranda, Alisa Liu, Nouha Dziri, Xinxi Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Christopher Wilhelm, Luca Soldaini, Noah A. Smith, Yizhong Wang, Pradeep Dasigi, and Hannaneh...
work page 2025
-
[46]
Liger: Linearizing large language models to gated recurrent structures
Disen Lan, Weigao Sun, Jiaxi Hu, Jusen Du, and Yu Cheng. Liger: Linearizing large language models to gated recurrent structures. In International Conference on Machine Learning, pages 32452--32466. PMLR, 2025
work page 2025
-
[47]
Distilling to hybrid attention models via kl-guided layer selection
Yanhong Li, Songlin Yang, Shawn Tan, Mayank Mishra, Rameswar Panda, Jiawei Zhou, and Yoon Kim. Distilling to hybrid attention models via kl-guided layer selection. arXiv preprint arXiv:2512.20569, 2025
-
[48]
Jamba: A Hybrid Transformer-Mamba Language Model
Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, et al. Jamba: A hybrid transformer-mamba language model. arXiv preprint arXiv:2403.19887, 2024
work page internal anchor Pith review arXiv 2024
-
[49]
Truthfulqa: Measuring how models mimic human falsehoods
Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. In Proceedings of the 60th annual meeting of the association for computational linguistics (volume 1: long papers), pages 3214--3252, 2022
work page 2022
-
[50]
On the stochastic realization problem
Anders Lindquist and Giorgio Picci. On the stochastic realization problem. SIAM Journal on Control and Optimization, 17 0 (3): 0 365--389, 1979
work page 1979
-
[51]
Ringattention with blockwise transformers for near-infinite context
Hao Liu, Matei Zaharia, and Pieter Abbeel. Ringattention with blockwise transformers for near-infinite context. In International Conference on Learning Representations, volume 2024, pages 3992--4008, 2024
work page 2024
-
[52]
Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. Advances in neural information processing systems, 36: 0 21558--21572, 2023
work page 2023
-
[53]
PICASO : Permutation-invariant context composition with state space models
Tian Yu Liu, Alessandro Achille, Matthew Trager, Aditya Golatkar, Luca Zancato, and Stefano Soatto. PICASO : Permutation-invariant context composition with state space models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=88TC1AWV27
work page 2025
-
[54]
L. Ljung. System Identification: Theory for the User. Prentice Hall information and system sciences series. Prentice Hall PTR, 1999. ISBN 9780136566953. URL https://books.google.com/books?id=nHFoQgAACAAJ
work page 1999
-
[55]
Error propagation properties of recursive least-squares adaptation algorithms
Stefan Ljung and Lennart Ljung. Error propagation properties of recursive least-squares adaptation algorithms. Automatica, 21 0 (2): 0 157--167, 1985. ISSN 0005-1098. doi:https://doi.org/10.1016/0005-1098(85)90110-4. URL https://www.sciencedirect.com/science/article/pii/0005109885901104
-
[56]
Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Hannaneh Hajishirzi, and Daniel Khashabi
Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Hannaneh Hajishirzi, and Daniel Khashabi. When not to trust language models: Investigating effectiveness and limitations of parametric and non-parametric memories. arXiv preprint arXiv:2212.10511, 2022
-
[57]
AMC/AIME : MAA invitational competitions, 2025
Mathematical Association of America . AMC/AIME : MAA invitational competitions, 2025. URL https://maa.org/maa-invitational-competitions/. Accessed: 2025
work page 2025
-
[58]
Linearizing large language models
Jean Mercat, Igor Vasiljevic, Sedrick Scott Keh, Kushal Arora, Achal Dave, Adrien Gaidon, and Thomas Kollar. Linearizing large language models. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=soGxskHGox
work page 2024
-
[59]
Landmark attention: Random-access infinite context length for transformers
Amirkeivan Mohtashami and Martin Jaggi. Landmark attention: Random-access infinite context length for transformers. In Workshop on Efficient Systems for Foundation Models @ ICML2023, 2023. URL https://openreview.net/forum?id=PkoGERXS1B
work page 2023
-
[60]
Tsendsuren Munkhdalai, Manaal Faruqui, and Siddharth Gopal. Leave no context behind: Efficient infinite context transformers with infini-attention. arXiv preprint arXiv:2404.07143, 101, 2024
-
[61]
Arthur G. O. Mutambara. Decentralized Estimation and Control for Multisensor Systems. CRC Press, 1998
work page 1998
-
[62]
Expansion span: Combining fading memory and retrieval in hybrid state space models
Elvis Nunez, Luca Zancato, Benjamin Bowman, Aditya Golatkar, Wei Xia, and Stefano Soatto. Expansion span: Combining fading memory and retrieval in hybrid state space models. In International Conference on Neuro-symbolic Systems, pages 570--596. PMLR, 2025
work page 2025
-
[63]
In-context Learning and Induction Heads
Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[64]
Resurrecting recurrent neural networks for long sequences
Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and Soham De. Resurrecting recurrent neural networks for long sequences. In International Conference on Machine Learning, pages 26670--26698. PMLR, 2023
work page 2023
-
[65]
Marconi: Prefix caching for the era of hybrid LLM s
Rui Pan, Zhuang Wang, Zhen Jia, Can Karakus, Luca Zancato, Tri Dao, Yida Wang, and Ravi Netravali. Marconi: Prefix caching for the era of hybrid LLM s. In Eighth Conference on Machine Learning and Systems, 2025. URL https://openreview.net/forum?id=RUaMUu7vMX
work page 2025
-
[66]
Patil, Huanzhi Mao, Charlie Cheng-Jie Ji, Fanjia Yan, Vishnu Suresh, Ion Stoica, and Joseph E
Shishir G. Patil, Huanzhi Mao, Charlie Cheng-Jie Ji, Fanjia Yan, Vishnu Suresh, Ion Stoica, and Joseph E. Gonzalez. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. In Advances in Neural Information Processing Systems, 2024
work page 2024
-
[67]
Time-Varying Systems and Computations
Alle-Jan Veen Patrick Dewilde. Time-Varying Systems and Computations. Springer New York, NY, 1998. doi:10.1007/978-1-4757-2817-0
-
[68]
Ya RN : Efficient context window extension of large language models
Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Ya RN : Efficient context window extension of large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=wHBfxhZu1u
work page 2024
-
[69]
Gated kalmanet: A fading memory layer through test-time ridge regression
Liangzu Peng, Aditya Chattopadhyay, Luca Zancato, Elvis Nunez, Wei Xia, and Stefano Soatto. Gated kalmanet: A fading memory layer through test-time ridge regression. arXiv preprint arXiv:2511.21016, 2025
-
[70]
Generalizing verifiable instruction following, 2025
Valentina Pyatkin, Saumya Malik, Victoria Graf, Hamish Ivison, Shengyi Huang, Pradeep Dasigi, Nathan Lambert, and Hannaneh Hajishirzi. Generalizing verifiable instruction following, 2025
work page 2025
-
[71]
Ulysses sequence parallelism in the hugging face ecosystem, 2025
Kashif Rasul and Bekman Stas. Ulysses sequence parallelism in the hugging face ecosystem, 2025. URL https://huggingface.co/blog/ulysses-sp. Blog post
work page 2025
-
[72]
David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA : A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=Ti67584b98
work page 2024
-
[73]
Samba: Simple hybrid state space models for efficient unlimited context language modeling
Liliang Ren, Yang Liu, Yadong Lu, yelong shen, Chen Liang, and Weizhu Chen. Samba: Simple hybrid state space models for efficient unlimited context language modeling. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=bIlnpVM4bc
work page 2025
-
[74]
Understanding and improving length generalization in recurrent models, 2025
Ricardo Buitrago Ruiz and Albert Gu. Understanding and improving length generalization in recurrent models, 2025. URL https://arxiv.org/abs/2507.02782
-
[75]
H. Sandberg and A. Rantzer. Balanced truncation of linear time-varying systems. IEEE Transactions on Automatic Control, 49 0 (2): 0 217--229, 2004. doi:10.1109/TAC.2003.822862
-
[76]
A.H. Sayed. Fundamentals of Adaptive Filtering. IEEE Press. Wiley, 2003. ISBN 9780471461265. URL https://books.google.com/books?id=VaAV4uqMuKYC
work page 2003
-
[77]
A.H. Sayed. Adaptive Filters. IEEE Press. Wiley, 2011. ISBN 9781118210840. URL https://books.google.com/books?id=VBaenqIVftUC
work page 2011
-
[78]
Flashattention-3: Fast and accurate attention with asynchrony and low-precision
Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=tVConYid20
work page 2024
-
[79]
HybridFlow: A Flexible and Efficient RLHF Framework , url=
Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, EuroSys '25, page 1279–1297, New York, NY, USA, 2025. Association for Computing Machinery. ISBN 9798400711961. doi:1...
-
[80]
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1909
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.