Hybrid Architectures for Language Models: Systematic Analysis and Design Insights
Pith reviewed 2026-05-18 10:14 UTC · model grok-4.3
The pith
Hybrid language models achieve better efficiency and long-context performance through targeted inter-layer or intra-layer fusion of self-attention and state space models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Hybrid architectures achieve a compelling balance between modeling quality and computational efficiency by combining self-attention mechanisms with structured state space models. The analysis of inter-layer sequential fusion and intra-layer parallel fusion reveals the most critical elements for each strategy based on their computational primitives, leading to proposed optimal design recipes for hybrid models.
What carries the argument
Inter-layer sequential fusion and intra-layer parallel fusion between self-attention and structured state space models, which together determine the balance of quality and efficiency.
If this is right
- Following the identified recipes produces hybrids with stronger long-context performance than either pure attention or pure state-space models.
- Training and inference costs drop while quality on standard tasks stays the same or rises.
- Different fusion choices prove more important for certain task families than for others.
- The relative importance of each component's primitive can be used to predict which hybridization will scale best.
Where Pith is reading between the lines
- The same analysis method could be applied to hybrids that mix attention with other state-space variants or with linear attention.
- Hardware-specific constraints might favor one fusion pattern over the other when memory bandwidth or compute type changes.
- The recipes could be checked at much larger scales to see whether the critical elements remain stable.
Load-bearing premise
The five evaluation dimensions of language modeling, downstream tasks, long-context capabilities, scaling, and efficiency are sufficient to expose the decisive factors and support design recommendations that hold across model sizes and tasks.
What would settle it
A hybrid model constructed from the paper's optimal design recipes shows no improvement over baselines in long-context benchmarks or exhibits worse scaling efficiency when model size increases.
read the original abstract
Recent progress in large language models demonstrates that hybrid architectures--combining self-attention mechanisms with structured state space models like Mamba--can achieve a compelling balance between modeling quality and computational efficiency, particularly for long-context tasks. While these hybrid models show promising performance, systematic comparisons of hybridization strategies and analyses on the key factors behind their effectiveness have not been clearly shared to the community. In this work, we present a holistic evaluation of hybrid architectures based on inter-layer (sequential) or intra-layer (parallel) fusion. We comprehensively evaluate these designs across multiple dimensions: language modeling and downstream task performance, long-context capabilities, scaling analysis, and training and inference efficiency. By investigating the core characteristics of their computational primitive, we identify the most critical elements for each hybridization strategy and further propose optimal design recipes for hybrid models. Our comprehensive analysis provides practical guidance and valuable insights for developing hybrid language models, facilitating the optimization of architectural configurations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a holistic evaluation of hybrid architectures for language models that combine self-attention mechanisms with structured state space models such as Mamba. It systematically compares inter-layer (sequential) and intra-layer (parallel) fusion strategies across multiple dimensions: language modeling and downstream task performance, long-context capabilities, scaling analysis, and training and inference efficiency. By investigating the core characteristics of their computational primitives, the authors identify the most critical elements for each hybridization strategy and propose optimal design recipes for hybrid models, aiming to provide practical guidance for developing such architectures.
Significance. If the results and analyses hold up under scrutiny, this work would be significant for the field by filling a gap in systematic comparisons of hybridization strategies. It could offer actionable insights into balancing modeling quality and efficiency, particularly beneficial for long-context language modeling tasks. The multi-dimensional evaluation framework, if rigorously implemented with appropriate controls, represents a valuable contribution to architectural design in large language models.
major comments (2)
- [Abstract] Abstract: The abstract asserts comprehensive evaluation and identification of critical elements for each hybridization strategy, yet supplies no quantitative results, model sizes, datasets, or statistical details, leaving the central claims unsupported by visible evidence.
- [Evaluation] Evaluation dimensions: The assumption that the five listed evaluation axes (language modeling, downstream tasks, long-context capabilities, scaling, and efficiency) are jointly sufficient to isolate what drives effectiveness and support generalizable design recommendations is load-bearing for the strongest claim. Scaling curves and efficiency metrics can be dominated by implementation details (e.g., kernel fusion, memory layout) rather than the hybridization strategy itself; without explicit controls that hold total FLOPs or parameter count fixed while varying only the fusion topology, observed differences may not generalize beyond the tested model sizes and tasks.
minor comments (1)
- Clarify the exact definitions and implementation details of inter-layer versus intra-layer fusion, perhaps with pseudocode or additional diagrams, to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive review and positive assessment of the work's potential significance. We address each major comment below with specific plans for revision where appropriate.
read point-by-point responses
-
Referee: [Abstract] Abstract: The abstract asserts comprehensive evaluation and identification of critical elements for each hybridization strategy, yet supplies no quantitative results, model sizes, datasets, or statistical details, leaving the central claims unsupported by visible evidence.
Authors: We agree that the abstract would be strengthened by including key quantitative highlights to better ground the claims. In the revised version, we will incorporate specific details such as the range of model sizes evaluated (e.g., from 350M to 7B parameters), primary datasets and benchmarks (WikiText-103 for language modeling, standard downstream suites including MMLU and long-context tasks), and representative results (e.g., perplexity improvements and efficiency gains relative to pure attention and Mamba baselines). revision: yes
-
Referee: [Evaluation] Evaluation dimensions: The assumption that the five listed evaluation axes (language modeling, downstream tasks, long-context capabilities, scaling, and efficiency) are jointly sufficient to isolate what drives effectiveness and support generalizable design recommendations is load-bearing for the strongest claim. Scaling curves and efficiency metrics can be dominated by implementation details (e.g., kernel fusion, memory layout) rather than the hybridization strategy itself; without explicit controls that hold total FLOPs or parameter count fixed while varying only the fusion topology, observed differences may not generalize beyond the tested model sizes and tasks.
Authors: We appreciate this methodological point. Our experimental design matched total parameter counts across inter-layer, intra-layer, and baseline models, and we report both theoretical and measured FLOPs for training and inference. To strengthen the manuscript, we will add an expanded subsection on experimental controls, explicitly describing how fusion topology was varied while holding parameter count and high-level computational budget fixed, along with details on kernel implementations and hardware to mitigate concerns about confounding factors. We note that complete isolation from all low-level optimizations remains difficult in practice, but the multi-axis evaluation and scaling trends provide evidence that the observed differences are driven primarily by the hybridization strategy rather than implementation artifacts alone. revision: partial
Circularity Check
Empirical evaluation with no circular derivation chain
full rationale
The paper performs a systematic empirical study of hybrid architectures through direct experimentation on language modeling, downstream tasks, long-context, scaling, and efficiency. Claims about critical elements and design recipes are derived from observed performance trends rather than from equations, fitted parameters renamed as predictions, or self-referential citations. No load-bearing steps reduce to inputs by construction; the analysis remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 5 Pith papers
-
Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation
The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.
-
SoK: Honeypots & LLMs, More Than the Sum of Their Parts?
A systematization of knowledge paper that taxonomizes honeypot detection vectors, synthesizes LLM-honeypot literature into canonical architecture and evaluation methods, and proposes a roadmap for autonomous deception...
-
Priming: Hybrid State Space Models From Pre-trained Transformers
Priming transfers knowledge from pre-trained Transformers to hybrid SSM-attention models, recovering performance with minimal additional tokens and showing Gated KalmaNet outperforming Mamba-2 on long-context reasonin...
-
Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression
Gated KalmaNet uses exact Kalman gain computation with adaptive gating and Chebyshev iteration to improve SSM performance on long-context tasks over prior approximations like DeltaNet.
-
MDN: Parallelizing Stepwise Momentum for Delta Linear Attention
MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.
Reference graph
Works this paper leans on
-
[1]
Composer: A search framework for hybrid neural architecture design.arXiv preprint arXiv:2510.00379,
Bilge Acun, Prasoon Sinha, Newsha Ardalani, Sangmin Bae, Alicia Golden, Chien-Yu Lin, Meghana Madhyastha, Fei Sun, Neeraja J Yadwadkar, and Carole-Jean Wu. Composer: A search framework for hybrid neural architecture design.arXiv preprint arXiv:2510.00379,
-
[2]
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints.arXiv preprint arXiv:2305.13245,
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985,
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model
Aarti Basant, Abhijit Khairnar, Abhijit Paithankar, Abhinav Khattar, Adi Renduchintala, Adithya Renduchintala, Aditya Malte, Akhiad Bercovich, Akshay Hazare, Alejandra Rico, et al. Nvidia nemotron nano 2: An accurate and efficient hybrid mamba-transformer reasoning model.arXiv preprint arXiv:2508.14444,
work page internal anchor Pith review arXiv
-
[5]
Titans: Learning to Memorize at Test Time
Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time.arXiv preprint arXiv:2501.00663,
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
Longformer: The Long-Document Transformer
Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer.arXiv preprint arXiv:2004.05150,
work page internal anchor Pith review Pith/arXiv arXiv 2004
-
[7]
Decimamba: Exploring the length extrapolation potential of mamba.arXiv preprint arXiv:2406.14528,
Assaf Ben-Kish, Itamar Zimerman, Shady Abu-Hussein, Nadav Cohen, Amir Globerson, Lior Wolf, and Raja Giryes. Decimamba: Exploring the length extrapolation potential of mamba.arXiv preprint arXiv:2406.14528,
-
[8]
L., Fernando, A., Muraru, G.- C., Haroun, R., Berrada, L., Pascanu, R., Sessa, P
Aleksandar Botev, Soham De, Samuel L Smith, Anushan Fernando, George-Cristian Muraru, Ruba Haroun, Leonard Berrada, Razvan Pascanu, Pier Giuseppe Sessa, Robert Dadashi, et al. Recurrentgemma: Moving past transformers for efficient open language models.arXiv preprint arXiv:2404.07839,
-
[9]
Command a: An enterprise-ready large language model,
Team Cohere, Arash Ahmadian, Marwan Ahmed, Jay Alammar, Milad Alizadeh, Yazeed Alnumay, Sophia Althammer, Arkady Arkhangorodsky, Viraat Aryabumi, Dennis Aumiller, et al. Command a: An enterprise-ready large language model.arXiv preprint arXiv:2504.00698,
-
[10]
Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality.arXiv preprint arXiv:2405.21060,
work page internal anchor Pith review Pith/arXiv arXiv
-
[11]
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
Soham De, Samuel L Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, et al. Griffin: Mixing gated linear recurrences with local attention for efficient language models.arXiv preprint arXiv:2402.19427,
work page internal anchor Pith review Pith/arXiv arXiv
-
[12]
DeepSeek-AI. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437,
work page internal anchor Pith review Pith/arXiv arXiv
-
[13]
Hymba: A hybrid-head architecture for small language models
Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, et al. Hymba: A hybrid-head architecture for small language models.arXiv preprint arXiv:2411.13676,
-
[14]
Hungry hungry hippos: To- wards language modeling with state space models
11 Daniel Y Fu, Tri Dao, Khaled K Saab, Armin W Thomas, Atri Rudra, and Christopher Ré. Hungry hungry hippos: Towards language modeling with state space models.arXiv preprint arXiv:2212.14052,
-
[15]
He, B., Yin, L., Zhen, H.-L., Liu, S., Wu, H., Zhang, X., Yuan, M., and Ma, C
Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...
-
[16]
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Google Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530,
work page internal anchor Pith review Pith/arXiv arXiv
-
[17]
Google Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report.arXiv preprint arXiv:2503.19786,
work page internal anchor Pith review Pith/arXiv arXiv
-
[18]
The zamba2 suite: Technical report
Paolo Glorioso, Quentin Anthony, Yury Tokpanov, Anna Golubeva, Vasudev Shyam, James Whittington, Jonathan Pilault, and Beren Millidge. The zamba2 suite: Technical report.arXiv preprint arXiv:2411.15242, 2024a. Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, and Beren Millidge. Zamba: A compact 7b ssm hybr...
-
[19]
Efficiently Modeling Long Sequences with Structured State Spaces
Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces.arXiv preprint arXiv:2111.00396, 2021a. Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher Ré. Combining recurrent, convolutional, and continuous-time models with linear state space layers.Advances in neural infor...
work page internal anchor Pith review Pith/arXiv arXiv
-
[20]
Yuxian Gu, Qinghao Hu, Shang Yang, Haocheng Xi, Junyu Chen, Song Han, and Han Cai. Jet-nemotron: Efficient language model with post neural architecture search.arXiv preprint arXiv:2508.15884,
-
[21]
Training Compute-Optimal Large Language Models
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556,
work page internal anchor Pith review Pith/arXiv arXiv
-
[22]
Rwkv-x: A linear complexity hybrid language model.arXiv preprint arXiv:2504.21463,
Haowen Hou, Zhiyi Huang, Kaifeng Tan, Rongchang Lu, and Fei Richard Yu. Rwkv-x: A linear complexity hybrid language model.arXiv preprint arXiv:2504.21463,
-
[23]
Tencent Hunyuan Team, Ao Liu, Botong Zhou, Can Xu, Chayse Zhou, ChenChen Zhang, Chengcheng Xu, Chenhao Wang, Decheng Wu, Dengpeng Wu, et al. Hunyuan-turbos: Advancing large language models through mamba- transformer synergy and adaptive chain-of-thought.arXiv preprint arXiv:2505.15431,
-
[24]
Jamba-1.5: Hybrid transformer-mamba models at scale
Ai2 Jamba Team, Barak Lenz, Alan Arazi, Amir Bergman, Avshalom Manevich, Barak Peleg, Ben Aviram, Chen Almagor, Clara Fridman, Dan Padnos, et al. Jamba-1.5: Hybrid transformer-mamba models at scale.arXiv preprint arXiv:2408.12570,
-
[25]
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, 12 Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023.https://arxiv....
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[26]
Scaling Laws for Neural Language Models
Accessed: 2023-10-31. Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361,
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[27]
Disen Lan, Weigao Sun, Jiaxi Hu, Jusen Du, and Yu Cheng. Liger: Linearizing large language models to gated recurrent structures.arXiv preprint arXiv:2503.01496,
-
[28]
MiniMax-01: Scaling Foundation Models with Lightning Attention
Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Congchao Guo, Da Chen, Dong Li, et al. Minimax-01: Scaling foundation models with lightning attention.arXiv preprint arXiv:2501.08313, 2025a. Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Yitzhak Gadre, Hritik Bansal, Etash Guha, Sedrick Scott Keh, K...
work page internal anchor Pith review Pith/arXiv arXiv
-
[29]
Transmamba: Flexibly switching between transformer and mamba.arXiv preprint arXiv:2503.24067, 2025b
Yixing Li, Ruobing Xie, Zhen Yang, Xingwu Sun, Shuaipeng Li, Weidong Han, Zhanhui Kang, Yu Cheng, Chengzhong Xu, Di Wang, et al. Transmamba: Flexibly switching between transformer and mamba.arXiv preprint arXiv:2503.24067, 2025b. Wanchao Liang, Tianyu Liu, Less Wright, Will Constable, Andrew Gu, Chien-Chin Huang, Iris Zhang, Wei Feng, Howard Huang, Junjie...
-
[30]
Jamba: A Hybrid Transformer-Mamba Language Model
Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, et al. Jamba: A hybrid transformer-mamba language model.arXiv preprint arXiv:2403.19887,
work page internal anchor Pith review Pith/arXiv arXiv
-
[31]
Lost in the Middle: How Language Models Use Long Contexts
Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.arXiv preprint arXiv:2307.03172,
work page internal anchor Pith review Pith/arXiv arXiv
-
[32]
Meta Llama Team. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,
work page internal anchor Pith review Pith/arXiv arXiv
-
[33]
Decoupled Weight Decay Regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101,
work page internal anchor Pith review Pith/arXiv arXiv
-
[34]
MoBA: Mixture of Block Attention for Long-Context LLMs
Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, et al. Moba: Mixture of block attention for long-context llms.arXiv preprint arXiv:2502.13189,
work page internal anchor Pith review Pith/arXiv arXiv
-
[35]
Microsoft Research. Phi-4 technical report.arXiv preprint arXiv:2412.08905,
work page internal anchor Pith review Pith/arXiv arXiv
-
[36]
OpenaAI, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720,
work page internal anchor Pith review Pith/arXiv arXiv
-
[37]
gpt-oss-120b & gpt-oss-20b Model Card
OpenAI, Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925,
work page internal anchor Pith review Pith/arXiv arXiv
-
[38]
arXiv preprint arXiv:2412.09871 , year=
Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, et al. Byte latent transformer: Patches scale better than tokens.arXiv preprint arXiv:2412.09871,
-
[39]
Rwkv-7" goose" with expressive dynamic state evolution
Bo Peng, Ruichong Zhang, Daniel Goldstein, Eric Alcaide, Xingjian Du, Haowen Hou, Jiaju Lin, Jiaxing Liu, Janna Lu, William Merrill, et al. Rwkv-7" goose" with expressive dynamic state evolution.arXiv preprint arXiv:2503.14456,
-
[40]
arXiv preprint arXiv:2403.17844 , year=
13 Michael Poli, Armin W Thomas, Eric Nguyen, Pragaash Ponnusamy, Björn Deiseroth, Kristian Kersting, Taiji Suzuki, Brian Hie, Stefano Ermon, Christopher Ré, et al. Mechanistic design and scaling of hybrid architectures.arXiv preprint arXiv:2403.17844,
-
[41]
Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
Ofir Press, Noah A Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables input length extrapolation.arXiv preprint arXiv:2108.12409,
work page internal anchor Pith review Pith/arXiv arXiv
-
[42]
The devil in linear transformer.arXiv preprint arXiv:2210.10340,
Zhen Qin, Xiaodong Han, Weixuan Sun, Dongxu Li, Lingpeng Kong, Nick Barnes, and Yiran Zhong. The devil in linear transformer.arXiv preprint arXiv:2210.10340,
-
[43]
Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, and Yiran Zhong. Lightning attention-2: A free lunch for handling unlimited sequence lengths in large language models.arXiv preprint arXiv:2401.04658,
-
[44]
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, et al. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free.arXiv preprint arXiv:2505.06708,
work page internal anchor Pith review Pith/arXiv arXiv
-
[45]
Qwen Team. Qwen3 technical report.arXiv preprint arXiv:2505.09388,
work page internal anchor Pith review Pith/arXiv arXiv
-
[46]
Compressive Transformers for Long-Range Sequence Modelling
Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, Chloe Hillier, and Timothy P Lillicrap. Compressive transformers for long-range sequence modelling.arXiv preprint, 2019a.https://arxiv.org/abs/1911.05507. Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, and Timothy P Lillicrap. Compressive transformers for long-range sequence modelling.arXiv preprint arX...
work page internal anchor Pith review arXiv 1911
-
[47]
InInternational Conference on Learning Representations (ICLR)
Liliang Ren, Yang Liu, Yadong Lu, Yelong Shen, Chen Liang, and Weizhu Chen. Samba: Simple hybrid state space models for efficient unlimited context language modeling.arXiv preprint arXiv:2406.07522,
-
[48]
Liliang Ren, Congcong Chen, Haoran Xu, Young Jin Kim, Adam Atkinson, Zheng Zhan, Jiankai Sun, Baolin Peng, Liyuan Liu, Shuohang Wang, et al. Decoder-hybrid-decoder architecture for efficient reasoning with long generation. arXiv preprint arXiv:2507.06607,
-
[49]
Differential mamba.arXiv preprint arXiv:2507.06204,
Nadav Schneider, Itamar Zimerman, and Eliya Nachmani. Differential mamba.arXiv preprint arXiv:2507.06204,
-
[50]
GLU Variants Improve Transformer
Noam Shazeer. GLU variants improve transformer.CoRR, abs/2002.05202, 2020.https://arxiv.org/abs/2002.05202. Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.arXiv preprint arXiv:1701.06538,
work page internal anchor Pith review Pith/arXiv arXiv 2002
-
[51]
Simplified State Space Layers for Sequence Modeling
Jimmy TH Smith, Andrew Warrington, and Scott W Linderman. Simplified state space layers for sequence modeling. arXiv preprint arXiv:2208.04933,
work page internal anchor Pith review Pith/arXiv arXiv
-
[52]
Speed Always Wins: a survey on efficient architectures for large language models
Weigao Sun, Jiaxi Hu, Yucheng Zhou, Jusen Du, Disen Lan, Kexin Wang, Tong Zhu, Xiaoye Qu, Yu Zhang, Xiaoyu Mo, et al. Speed always wins: A survey on efficient architectures for large language models.arXiv preprint arXiv:2508.09834,
-
[53]
Star: Synthesis of tailored architectures.arXiv preprint arXiv:2411.17800,
Armin W Thomas, Rom Parnichkun, Alexander Amini, Stefano Massaroli, and Michael Poli. Star: Synthesis of tailored architectures.arXiv preprint arXiv:2411.17800,
-
[54]
An Empirical Study of Mamba-based Language Models
Roger Waleffe, Wonmin Byeon, DuncanRiach, Brandon Norick, Vijay Korthikanti, Tri Dao, Albert Gu, Ali Hatamizadeh, Sudhakar Singh, Deepak Narayanan, et al. An empirical study of mamba-based language models.arXiv preprint arXiv:2406.07887,
work page internal anchor Pith review arXiv
-
[55]
Dustin Wang, Rui-Jie Zhu, Steven Abreu, Yong Shan, Taylor Kergan, Yuqi Pan, Yuhong Chou, Zheng Li, Ge Zhang, Wenhao Huang, et al. A systematic analysis of hybrid linear attention.arXiv preprint arXiv:2507.06457,
-
[56]
Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts
Junxiong Wang, Daniele Paliotta, Avner May, Alexander Rush, and Tri Dao. The mamba in the llama: Distilling and accelerating hybrid models.Advances in Neural Information Processing Systems, 37:62432–62457, 2024a. Lean Wang, Huazuo Gao, Chenggang Zhao, Xu Sun, and Damai Dai. Auxiliary-loss-free load balancing strategy for mixture-of-experts.arXiv preprint ...
work page internal anchor Pith review Pith/arXiv arXiv
-
[57]
Efficient Streaming Language Models with Attention Sinks
Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks.arXiv preprint arXiv:2309.17453,
work page internal anchor Pith review Pith/arXiv arXiv
-
[58]
Wuneng: Hybrid state with attention.arXiv preprint arXiv:2504.19191,
Liu Xiao, Li Zhiyuan, and Lin Yueyu. Wuneng: Hybrid state with attention.arXiv preprint arXiv:2504.19191,
-
[59]
Chen Xing, Devansh Arpit, Christos Tsirigotis, and Yoshua Bengio. A walk with sgd.arXiv preprint arXiv:1802.08770,
work page internal anchor Pith review Pith/arXiv arXiv
-
[60]
Zebra-llama: Towards extremely efficient hybrid models.arXiv preprint arXiv:2505.17272,
Mingyu Yang, Mehdi Rezagholizadeh, Guihong Li, Vikram Appia, and Emad Barsoum. Zebra-llama: Towards extremely efficient hybrid models.arXiv preprint arXiv:2505.17272,
-
[61]
Gated Delta Networks: Improving Mamba2 with Delta Rule
Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving mamba2 with delta rule.arXiv preprint arXiv:2412.06464,
work page internal anchor Pith review Pith/arXiv arXiv
-
[62]
Differential transformer, 2024
Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, and Furu Wei. Differential transformer.arXiv preprint arXiv:2410.05258,
-
[63]
Lolcats: On low- rank linearizing of large language models
Michael Zhang, Simran Arora, Rahul Chalamala, Alan Wu, Benjamin Spector, Aaryan Singhal, Krithik Ramesh, and Christopher Ré. Lolcats: On low-rank linearizing of large language models.arXiv preprint arXiv:2410.10254,
-
[64]
PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Pytorch fsdp: experiences on scaling fully sharded data parallel.arXiv preprint arXiv:2304.11277,
work page internal anchor Pith review Pith/arXiv arXiv
-
[65]
Transformers can achieve length generalization but not robustly.arXiv preprint arXiv:2402.09371,
Yongchao Zhou, Uri Alon, Xinyun Chen, Xuezhi Wang, Rishabh Agarwal, and Denny Zhou. Transformers can achieve length generalization but not robustly.arXiv preprint arXiv:2402.09371,
-
[66]
and Llama 4 to polish the overall writing after drafting the paper ourselves. Additionally, we utilized the same models for vibe coding when fixing bugs or when making the initial drafts of the figures. C Details for Computational and Memory Costs Comparison FLOPs per sample.Most parameters are linear weight matrices, so total FLOPs can be estimated by mu...
work page 2022
-
[67]
We primarily follow to the configurations of Llama 3.2 and Mamba 2 across various model scales
and the Mamba architectures (Dao and Gu, 2024). We primarily follow to the configurations of Llama 3.2 and Mamba 2 across various model scales. Table6 provides a summary of the detailed architectures for both the Transformer and Mamba models, which serve as the foundational computational primitives for our hybrid architecture variants. Base Configuration ...
work page 2024
-
[68]
and the PG19 datasets (Rae et al., 2019b). Additionally, following the settings of prior work (Gemini Team et al., 2024; Ho et al., 2024), we conduct evaluations on the Needle-In-a- Haystack long-contextbenchmark(Kamradt,2023;Kuratovetal., 2024). Wefurthermeasure few-shotaccuracy on five benchmarks using the Language Model Evaluation Harness (Gao et al., ...
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.