Recognition: no theorem link
Attention to Mamba: A Recipe for Cross-Architecture Distillation
Pith reviewed 2026-05-13 23:01 UTC · model grok-4.3
The pith
Distilling a Transformer into Mamba through an intermediate linearized-attention stage nearly preserves the teacher's downstream performance: the student reaches a perplexity of 14.11 versus the teacher's 13.86.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By first distilling knowledge from a traditional Transformer into a linearized version of attention using an adaptation of the kernel trick, and then distilling that linearized model into an adapted Mamba model that uses no attention blocks, the distilled Mamba model preserves the original Pythia-1B Transformer's performance on downstream tasks, maintaining a perplexity of 14.11, close to the teacher's 13.86.
What carries the argument
The two-stage distillation pipeline that routes knowledge from Transformer through a kernel-trick-adapted linearized attention model into a final Mamba student equipped with principled initialization.
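The paper's exact linearization is not reproduced on this page, so as a hedged illustration only: Stage 1 replaces softmax attention with a kernelized form that admits a causal, recurrent computation, which is what makes the subsequent hand-off to an SSM like Mamba natural. A generic causal linear attention in this spirit (the ELU+1 feature map and the stabilizing epsilon are illustrative choices, not the paper's definition) can be sketched as:

```python
import numpy as np

def phi(x):
    # ELU(x) + 1 feature map: strictly positive, so the implied
    # attention weights are non-negative and normalizable.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # Causal linear attention: softmax(QK^T)V is approximated by
    # phi(Q) (phi(K)^T V) with row normalization, computed as a
    # recurrence in O(T * d * d_v) instead of O(T^2 * d).
    Qf, Kf = phi(Q), phi(K)              # (T, d)
    T, d = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d, d_v))               # running sum of phi(k_t) v_t^T
    z = np.zeros(d)                      # running sum of phi(k_t)
    out = np.zeros((T, d_v))
    for t in range(T):
        S += np.outer(Kf[t], V[t])
        z += Kf[t]
        out[t] = Qf[t] @ S / (Qf[t] @ z + 1e-6)
    return out
```

The recurrent state (S, z) plays the role that the SSM state plays in Mamba, which is why a linearized-attention intermediate is a plausible bridge between the two architectures.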
Load-bearing premise
The intermediate linearized-attention model retains enough of the teacher's representational power to serve as an effective teacher for the final Mamba stage.
What would settle it
A direct single-stage distillation from the Transformer to Mamba that achieved perplexity comparable to 14.11 without the intermediate linearized-attention step would falsify the necessity of the two-stage recipe.
Original abstract
State Space Models (SSMs) such as Mamba have become a popular alternative to Transformer models, due to their reduced memory consumption and higher throughput at generation compared to their Attention-based counterparts. On the other hand, the community has built up a considerable body of knowledge on how to train Transformers, and many pretrained Transformer models are readily available. To facilitate the adoption of SSMs while leveraging existing pretrained Transformers, we aim to identify an effective recipe to distill an Attention-based model into a Mamba-like architecture. In prior work on cross-architecture distillation, however, it has been shown that a naïve distillation procedure from Transformers to Mamba fails to preserve the original teacher performance, a limitation often overcome with hybrid solutions combining Attention and SSM blocks. The key argument from our work is that, by equipping Mamba with a principled initialization, we can recover an overall better recipe for cross-architectural distillation. To this end, we propose a principled two-stage approach: first, we distill knowledge from a traditional Transformer into a linearized version of Attention, using an adaptation of the kernel trick. Then, we distill the linearized version into an adapted Mamba model that does not use any Attention block. Overall, the distilled Mamba model is able to preserve the original Pythia-1B Transformer performance in downstream tasks, maintaining a perplexity of 14.11 close to the teacher's 13.86. To show the efficacy of our recipe, we conduct thorough ablations at 1B scale with 10B tokens varying sequence mixer architecture, scaling analysis on model sizes and total distillation tokens, and a sensitivity analysis on tokens allocation between stages.
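The abstract does not spell out the distillation objective. A standard Hinton-style soft-target KL loss, which a pipeline like this would plausibly apply at each stage (the temperature and the exact teacher/student pairing are assumptions, not details from the paper), can be sketched as:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 as in Hinton et al. (2015).
    p = softmax(teacher_logits / T)
    q = softmax(student_logits / T)
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    return (T * T) * kl.mean()
```

In a two-stage recipe, the same loss shape would be reused with the Transformer as teacher in Stage 1 and the linearized-attention model as teacher in Stage 2.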
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a two-stage cross-architecture distillation recipe to transfer knowledge from a pretrained Transformer (Pythia-1B) to a pure Mamba model. Stage 1 distills the Transformer into a linearized attention model via a kernel-trick adaptation; Stage 2 then distills the linearized model into an adapted Mamba architecture without attention blocks. The authors claim that principled initialization via this pipeline recovers performance close to the teacher (perplexity 14.11 vs. 13.86) on downstream tasks, outperforming naive distillation, supported by ablations on sequence-mixer variants, scaling with model size and tokens, and token-allocation sensitivity at 1B scale with 10B tokens.
Significance. If the empirical results hold under scrutiny, the work provides a concrete, non-hybrid recipe for leveraging existing Transformer pretraining to initialize efficient SSMs, addressing a known limitation in direct Transformer-to-Mamba distillation. The scaling and ablation analyses add value by exploring practical hyperparameters, though the absence of intermediate-stage metrics limits attribution of gains specifically to the proposed initialization.
Major comments (2)
- [Two-stage distillation procedure] The load-bearing assumption is that the Stage-1 linearized-attention model retains sufficient representational capacity from the original Pythia-1B teacher. No perplexity, downstream scores, or attention-pattern metrics are reported for this intermediate model, so it is impossible to verify whether the kernel-trick adaptation introduces non-negligible distortion before Stage 2 begins. An ablation reporting Stage-1 performance (or an oracle comparison) is required to substantiate the two-stage claim.
- [Experimental results] Table or figure reporting final perplexity (14.11 vs. 13.86) lacks error bars, standard deviations across seeds, or statistical significance tests. Given the small gap and the reader's note on missing loss formulations and data-selection details, reproducibility of the central result cannot be assessed from the provided evidence.
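For context on how small the reported gap is in loss space: perplexity is the exponential of the mean per-token negative log-likelihood, so 14.11 vs. 13.86 corresponds to roughly ln(14.11) − ln(13.86) ≈ 0.018 nats per token. This back-of-envelope conversion (not a figure from the paper) can be checked directly:

```python
import math

def perplexity(nll_per_token):
    # Perplexity = exp(mean per-token negative log-likelihood in nats).
    return math.exp(sum(nll_per_token) / len(nll_per_token))

# ln(14.11) ~ 2.647 and ln(13.86) ~ 2.629 nats per token,
# a gap of about 0.018 nats -- small enough that seed variance matters.
student_nats = math.log(14.11)
teacher_nats = math.log(13.86)
gap = student_nats - teacher_nats
```

A gap this small is exactly why the referee's request for error bars across seeds is material.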
Minor comments (2)
- [Abstract] The abstract and methods would benefit from an explicit equation or pseudocode block defining the kernel-trick adaptation objective and the Mamba initialization derived from it.
- [Ablations] Ablation figures on token allocation between stages should include a baseline that allocates all tokens to a single-stage Mamba distillation for direct comparison.
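As an illustration of what the requested pseudocode might cover, linearization work such as Hedgehog trains the feature map by matching the teacher's attention matrix row-by-row. A minimal attention-matrix-matching objective in that style (the feature map, normalization, and loss form here are assumptions for illustration, not the paper's definition) looks like:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attn_matching_loss(Q, K, phi):
    # Cross-entropy between the teacher's softmax attention rows and the
    # row-normalized linear attention induced by feature map phi.
    # Minimizing this over phi's parameters trains the linearized student
    # to mimic the teacher's attention patterns.
    A_soft = softmax(Q @ K.T)                        # teacher attention (T, T)
    A_lin = phi(Q) @ phi(K).T                        # unnormalized linear scores
    A_lin = A_lin / A_lin.sum(axis=-1, keepdims=True)
    return -(A_soft * np.log(A_lin + 1e-12)).sum(axis=-1).mean()
```

The minimizer of such a loss would then supply the "principled initialization" for the Mamba parameters that play the roles of keys and queries.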
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below and indicate where revisions will be made to strengthen the paper.
Point-by-point responses
Referee: [Two-stage distillation procedure] The load-bearing assumption is that the Stage-1 linearized-attention model retains sufficient representational capacity from the original Pythia-1B teacher. No perplexity, downstream scores, or attention-pattern metrics are reported for this intermediate model, so it is impossible to verify whether the kernel-trick adaptation introduces non-negligible distortion before Stage 2 begins. An ablation reporting Stage-1 performance (or an oracle comparison) is required to substantiate the two-stage claim.
Authors: We agree that the performance of the intermediate Stage-1 model is important to substantiate the two-stage claim and to quantify any potential distortion from the kernel-trick adaptation. In the revised manuscript we will add a dedicated section and table reporting perplexity and downstream task scores for the linearized attention model immediately after Stage 1. This will allow direct verification that sufficient representational capacity is retained before distillation into the Mamba architecture proceeds. Revision: yes.
Referee: [Experimental results] Table or figure reporting final perplexity (14.11 vs. 13.86) lacks error bars, standard deviations across seeds, or statistical significance tests. Given the small gap and the reader's note on missing loss formulations and data-selection details, reproducibility of the central result cannot be assessed from the provided evidence.
Authors: We acknowledge the need for greater statistical transparency and reproducibility details. In the revision we will expand the experimental section and appendix to include the precise loss formulations, data-selection criteria, and token-allocation procedure. Because of the substantial compute required for 1B-scale runs, the main results were obtained with a single seed; however, we will add multi-seed variance estimates for the smaller-scale ablations and scaling experiments already present in the paper. We believe the consistent advantage over naive distillation across ablations and model sizes still supports the central claim, and the added details will make the reported perplexity gap easier to assess. Revision: partial.
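A multi-seed variance estimate of the kind promised here can be summarized with a simple bootstrap standard error over per-example metric scores. A minimal sketch (all names illustrative; the resampling count is a placeholder):

```python
import random
import statistics

def bootstrap_sem(scores, n_boot=1000, seed=0):
    # Standard error of the mean metric score, estimated by resampling
    # the per-example scores with replacement and taking the standard
    # deviation of the resampled means.
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        sample = [rng.choice(scores) for _ in scores]
        means.append(sum(sample) / len(sample))
    return statistics.stdev(means)
```

Reporting such an error bar alongside the 14.11 vs. 13.86 gap would let readers judge whether the gap exceeds run-to-run noise.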
Circularity Check
No significant circularity detected in derivation chain
Full rationale
The paper's central claims rest on empirical outcomes from a two-stage distillation procedure (Transformer to linearized attention via a kernel-trick adaptation, then to Mamba), with reported metrics such as the 14.11 vs. 13.86 perplexity arising from training runs and ablations rather than from any mathematical derivation. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the provided text that would make the results tautological by construction. The recipe is presented as a procedural method validated by experiments at 1B scale against external benchmarks, so the argument does not loop back on its own assumptions.