pith. machine review for the scientific record.

arxiv: 2604.14191 · v1 · submitted 2026-04-01 · 💻 cs.CL · cs.LG

Recognition: no theorem link

Attention to Mamba: A Recipe for Cross-Architecture Distillation

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 23:01 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords cross-architecture distillation · Mamba · Transformer · state space models · linearized attention · kernel trick · knowledge distillation · perplexity

The pith

Distilling a Transformer into Mamba through an intermediate linearized-attention stage nearly preserves the teacher's downstream performance, with a student perplexity of 14.11 against the teacher's 13.86.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that direct distillation from a Transformer to Mamba loses performance, but a two-stage process recovers most of it. Knowledge first moves from the Transformer to a linearized attention model via a kernel-trick adaptation, then from that intermediate model to an adapted Mamba student with no attention blocks. This matters because Mamba offers lower memory use and higher generation speed than Transformers, so an effective transfer recipe lets practitioners reuse abundant pretrained Transformer checkpoints instead of training Mamba from scratch. Experiments at the 1B-parameter scale with 10 billion distillation tokens confirm that the distilled Mamba nearly matches the Pythia-1B teacher on downstream tasks, supported by ablations on architecture choices, model scaling, and token allocation between stages.
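To make the pipeline concrete, here is a minimal sketch of one distillation stage. The abstract does not specify the paper's loss formulation, temperature, or optimizer, so the KL-based objective and the stage wiring in the trailing comments are assumptions, not the authors' method.

```python
import torch
import torch.nn.functional as F

def distill_stage(teacher, student, batches, lr=1e-4, temperature=1.0):
    """One distillation stage: freeze `teacher`, train `student` to match its
    next-token distribution with a KL objective (illustrative, not the paper's loss)."""
    teacher.eval()
    opt = torch.optim.AdamW(student.parameters(), lr=lr)
    for tokens in batches:                        # tokens: (batch, seq_len) int64
        with torch.no_grad():
            t_logits = teacher(tokens)            # (batch, seq_len, vocab)
        s_logits = student(tokens)
        loss = F.kl_div(
            F.log_softmax(s_logits / temperature, dim=-1),
            F.log_softmax(t_logits / temperature, dim=-1),
            log_target=True,
            reduction="batchmean",
        ) * temperature ** 2
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student

# Stage 1: Pythia Transformer teacher -> linearized-attention student.
# Stage 2: the Stage-1 model then teaches the pure-Mamba student.
# linear_attn = distill_stage(pythia_teacher, linear_attn, stage1_batches)
# mamba_model = distill_stage(linear_attn, mamba_model, stage2_batches)
```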

Core claim

Knowledge is first distilled from a traditional Transformer into a linearized version of attention using an adaptation of the kernel trick, and that linearized model is then distilled into an adapted Mamba model with no attention blocks. The resulting Mamba student preserves the original Pythia-1B Transformer's downstream-task performance, maintaining a perplexity of 14.11 against the teacher's 13.86.
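For scale, perplexity is the exponential of the mean per-token cross-entropy, so 14.11 versus 13.86 amounts to a gap of roughly 0.018 nats per token. A minimal computation, assuming standard next-token logits and targets:

```python
import math
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Perplexity = exp(mean next-token cross-entropy).
    logits: (batch, seq_len, vocab), targets: (batch, seq_len) int64."""
    nll = F.cross_entropy(logits.flatten(0, 1), targets.flatten())  # nats/token
    return math.exp(nll.item())

# The reported numbers correspond to mean losses of about
# ln(13.86) ~ 2.629 nats/token (teacher) vs. ln(14.11) ~ 2.647 nats/token (student).
```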

What carries the argument

The two-stage distillation pipeline that routes knowledge from Transformer through a kernel-trick-adapted linearized attention model into a final Mamba student equipped with principled initialization.
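The linearization step rests on the kernel trick: softmax attention is replaced by a feature-map factorization so the sequence mixing can be computed in linear time. The sketch below uses the common elu(x) + 1 feature map from the linear-attention literature as a stand-in; the paper's actual kernel adaptation and any learned feature map are not given here and may differ.

```python
import torch
import torch.nn.functional as F

def phi(x: torch.Tensor) -> torch.Tensor:
    # A common positive feature map from the linear-attention literature (elu + 1);
    # the paper's kernel-trick adaptation may use a different map.
    return F.elu(x) + 1.0

def linear_attention(q, k, v):
    """Non-causal linearized attention: softmax(Q K^T) V is approximated by
    phi(Q) (phi(K)^T V), normalized per query, in O(n d^2) rather than O(n^2 d)."""
    q, k = phi(q), phi(k)                                   # (batch, seq, dim)
    kv = torch.einsum("bnd,bne->bde", k, v)                 # sum_n phi(k_n) v_n^T
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + 1e-6)
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)
```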

Load-bearing premise

The intermediate linearized-attention model retains enough of the teacher's representational power to serve as an effective teacher for the final Mamba stage.

What would settle it

A direct single-stage distillation from the Transformer to Mamba that achieved a comparable perplexity (near 14.11) without the intermediate linearized-attention step would falsify the necessity of the two-stage recipe.

read the original abstract

State Space Models (SSMs) such as Mamba have become a popular alternative to Transformer models, due to their reduced memory consumption and higher throughput at generation compared to their Attention-based counterparts. On the other hand, the community has built up a considerable body of knowledge on how to train Transformers, and many pretrained Transformer models are readily available. To facilitate the adoption of SSMs while leveraging existing pretrained Transformers, we aim to identify an effective recipe to distill an Attention-based model into a Mamba-like architecture. In prior work on cross-architecture distillation, however, it has been shown that a naïve distillation procedure from Transformers to Mamba fails to preserve the original teacher performance, a limitation often overcome with hybrid solutions combining Attention and SSM blocks. The key argument from our work is that, by equipping Mamba with a principled initialization, we can recover an overall better recipe for cross-architectural distillation. To this end, we propose a principled two-stage approach: first, we distill knowledge from a traditional Transformer into a linearized version of Attention, using an adaptation of the kernel trick. Then, we distill the linearized version into an adapted Mamba model that does not use any Attention block. Overall, the distilled Mamba model is able to preserve the original Pythia-1B Transformer performance in downstream tasks, maintaining a perplexity of 14.11 close to the teacher's 13.86. To show the efficacy of our recipe, we conduct thorough ablations at 1B scale with 10B tokens varying sequence mixer architecture, scaling analysis on model sizes and total distillation tokens, and a sensitivity analysis on tokens allocation between stages.
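Why a linearized-attention intermediate is a plausible bridge to Mamba: causal linear attention can be rewritten as a recurrence over a matrix-valued state, which is structurally the degenerate (non-selective) case of a state-space update. A hedged sketch of that recurrent view, with notation assumed rather than taken from the paper:

```python
import torch

def causal_linear_attention_recurrent(q, k, v, phi):
    """Causal linear attention as a recurrence over a matrix-valued state.
    S_t = S_{t-1} + phi(k_t) v_t^T is the degenerate (A = I) case of a
    Mamba-style update S_t = A_t * S_{t-1} + B_t x_t, which is why the
    linearized model is a plausible initialization for an SSM student."""
    b, n, d = q.shape
    e = v.shape[-1]
    S = torch.zeros(b, d, e, device=q.device, dtype=q.dtype)   # matrix-valued state
    z = torch.zeros(b, d, device=q.device, dtype=q.dtype)      # running normalizer
    outputs = []
    for t in range(n):
        qt, kt, vt = phi(q[:, t]), phi(k[:, t]), v[:, t]
        S = S + torch.einsum("bd,be->bde", kt, vt)
        z = z + kt
        denom = torch.einsum("bd,bd->b", qt, z).clamp_min(1e-6)
        outputs.append(torch.einsum("bd,bde->be", qt, S) / denom.unsqueeze(-1))
    return torch.stack(outputs, dim=1)                          # (batch, seq, value_dim)
```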

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a two-stage cross-architecture distillation recipe to transfer knowledge from a pretrained Transformer (Pythia-1B) to a pure Mamba model. Stage 1 distills the Transformer into a linearized attention model via a kernel-trick adaptation; Stage 2 then distills the linearized model into an adapted Mamba architecture without attention blocks. The authors claim that principled initialization via this pipeline recovers performance close to the teacher (perplexity 14.11 vs. 13.86) on downstream tasks, outperforming naive distillation, supported by ablations on sequence-mixer variants, scaling with model size and tokens, and token-allocation sensitivity at 1B scale with 10B tokens.

Significance. If the empirical results hold under scrutiny, the work provides a concrete, non-hybrid recipe for leveraging existing Transformer pretraining to initialize efficient SSMs, addressing a known limitation in direct Transformer-to-Mamba distillation. The scaling and ablation analyses add value by exploring practical hyperparameters, though the absence of intermediate-stage metrics limits attribution of gains specifically to the proposed initialization.

major comments (2)
  1. [Two-stage distillation procedure] The load-bearing assumption is that the Stage-1 linearized-attention model retains sufficient representational capacity from the original Pythia-1B teacher. No perplexity, downstream scores, or attention-pattern metrics are reported for this intermediate model, so it is impossible to verify whether the kernel-trick adaptation introduces non-negligible distortion before Stage 2 begins. An ablation reporting Stage-1 performance (or an oracle comparison) is required to substantiate the two-stage claim.
  2. [Experimental results] Table or figure reporting final perplexity (14.11 vs. 13.86) lacks error bars, standard deviations across seeds, or statistical significance tests. Given the small gap and the reader's note on missing loss formulations and data-selection details, reproducibility of the central result cannot be assessed from the provided evidence.
minor comments (2)
  1. [Abstract] The abstract and methods would benefit from an explicit equation or pseudocode block defining the kernel-trick adaptation objective and the Mamba initialization derived from it.
  2. [Ablations] Ablation figures on token allocation between stages should include a baseline that allocates all tokens to a single-stage Mamba distillation for direct comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below and indicate where revisions will be made to strengthen the paper.

read point-by-point responses
  1. Referee: [Two-stage distillation procedure] The load-bearing assumption is that the Stage-1 linearized-attention model retains sufficient representational capacity from the original Pythia-1B teacher. No perplexity, downstream scores, or attention-pattern metrics are reported for this intermediate model, so it is impossible to verify whether the kernel-trick adaptation introduces non-negligible distortion before Stage 2 begins. An ablation reporting Stage-1 performance (or an oracle comparison) is required to substantiate the two-stage claim.

    Authors: We agree that the performance of the intermediate Stage-1 model is important to substantiate the two-stage claim and to quantify any potential distortion from the kernel-trick adaptation. In the revised manuscript we will add a dedicated section and table reporting perplexity and downstream task scores for the linearized attention model immediately after Stage 1. This will allow direct verification that sufficient representational capacity is retained before distillation into the Mamba architecture proceeds. revision: yes

  2. Referee: [Experimental results] Table or figure reporting final perplexity (14.11 vs. 13.86) lacks error bars, standard deviations across seeds, or statistical significance tests. Given the small gap and the reader's note on missing loss formulations and data-selection details, reproducibility of the central result cannot be assessed from the provided evidence.

    Authors: We acknowledge the need for greater statistical transparency and reproducibility details. In the revision we will expand the experimental section and appendix to include the precise loss formulations, data-selection criteria, and token-allocation procedure. Because of the substantial compute required for 1B-scale runs, the main results were obtained with a single seed; however, we will add multi-seed variance estimates for the smaller-scale ablations and scaling experiments already present in the paper. We believe the consistent advantage over naive distillation across ablations and model sizes still supports the central claim, but the added details will improve assessability of the reported perplexity gap. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper's central claims rest on empirical outcomes from a two-stage distillation procedure (Transformer to linearized attention via kernel-trick adaptation, then to Mamba), with reported metrics such as perplexity 14.11 vs. teacher's 13.86 arising from training runs and ablations rather than any mathematical derivation. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the provided text that would reduce the results to tautological inputs by construction. The recipe is presented as a procedural method validated by experiments at 1B scale, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated.

pith-pipeline@v0.9.0 · 5623 in / 1073 out tokens · 53930 ms · 2026-05-13T23:01:24.336790+00:00 · methodology

