
arxiv: 2510.04800 · v3 · submitted 2025-10-06 · 💻 cs.CL

Hybrid Architectures for Language Models: Systematic Analysis and Design Insights

Pith reviewed 2026-05-18 10:14 UTC · model grok-4.3

classification 💻 cs.CL
keywords hybrid architectures · self-attention · state space models · Mamba · language modeling · long-context · model efficiency

The pith

Hybrid language models achieve better efficiency and long-context performance through targeted inter-layer or intra-layer fusion of self-attention and state space models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares two ways of hybridizing language models: stacking self-attention and Mamba-style state space layers in sequence (inter-layer fusion), or running them side by side inside the same layer (intra-layer fusion). It evaluates the resulting models on language modeling accuracy, downstream tasks, long-sequence handling, scaling trends, and training and inference speed. By analyzing each approach's basic computational building blocks, it identifies which features matter most and proposes concrete design recipes. These findings matter to readers who want models that keep quality high while using less compute on long contexts than standard transformers.
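
The difference between the two fusion styles can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the scalar "tokens", the function names, the fixed decay, and the choice to average the two branches in the parallel case are not taken from the paper, which may combine branches differently.

```python
import math


def causal_attention(xs):
    """Toy single-head causal self-attention over scalars (q = k = v = x)."""
    out = []
    for t in range(len(xs)):
        # Position t may only look at positions 0..t.
        scores = [xs[t] * xs[s] for s in range(t + 1)]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(sc - m) for sc in scores]
        z = sum(weights)
        out.append(sum(w * xs[s] for s, w in enumerate(weights)) / z)
    return out


def linear_recurrence(xs, decay=0.5):
    """Toy stand-in for a state space layer: h_t = decay * h_{t-1} + x_t."""
    h, out = 0.0, []
    for x in xs:
        h = decay * h + x
        out.append(h)
    return out


def inter_layer_hybrid(xs, pattern):
    """Sequential fusion: each layer applies ONE mixer, chosen by `pattern`."""
    for kind in pattern:
        mixer = causal_attention if kind == "attn" else linear_recurrence
        xs = [x + y for x, y in zip(xs, mixer(xs))]  # residual connection
    return xs


def intra_layer_hybrid(xs, num_layers):
    """Parallel fusion: both mixers read the same input in every layer.

    Averaging the two branches is an assumption of this sketch.
    """
    for _ in range(num_layers):
        attn_out = causal_attention(xs)
        ssm_out = linear_recurrence(xs)
        xs = [x + 0.5 * (a + s) for x, a, s in zip(xs, attn_out, ssm_out)]
    return xs


if __name__ == "__main__":
    tokens = [1.0, 2.0, 3.0]
    print(inter_layer_hybrid(tokens, ["ssm", "ssm", "attn"]))
    print(intra_layer_hybrid(tokens, 3))
```

In the sequential variant the attention layer only ever sees features already processed by the recurrent layers before it, while in the parallel variant both mixers see the same input at every depth.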

Core claim

Hybrid architectures achieve a compelling balance between modeling quality and computational efficiency by combining self-attention mechanisms with structured state space models. The analysis of inter-layer sequential fusion and intra-layer parallel fusion reveals the most critical elements for each strategy based on their computational primitives, leading to proposed optimal design recipes for hybrid models.

What carries the argument

Inter-layer sequential fusion and intra-layer parallel fusion between self-attention and structured state space models, which together determine the balance of quality and efficiency.

If this is right

  • Following the identified recipes produces hybrids with stronger long-context performance than either pure attention or pure state-space models.
  • Training and inference costs drop while quality on standard tasks stays the same or rises.
  • Different fusion choices prove more important for certain task families than for others.
  • The relative importance of each component's primitive can be used to predict which hybridization will scale best.
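
One commonly cited reason inference cost can drop in a hybrid is that attention layers keep a key/value cache that grows with context length, whereas a recurrent state space layer carries a fixed-size state. The back-of-envelope sketch below illustrates that scaling. The function names and all sizes are hypothetical, the short convolution state of Mamba-style layers is ignored, and none of this reproduces the paper's measurements.

```python
def kv_cache_bytes(seq_len, n_attn_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Key/value cache size: grows linearly with the number of cached tokens."""
    # Factor 2 accounts for storing both keys and values.
    return 2 * n_attn_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem


def ssm_state_bytes(n_ssm_layers, d_inner, d_state, bytes_per_elem=2):
    """Recurrent state size: independent of how many tokens were processed."""
    return n_ssm_layers * d_inner * d_state * bytes_per_elem


def hybrid_cache_bytes(seq_len, n_attn_layers, n_ssm_layers,
                       n_kv_heads, head_dim, d_inner, d_state, bytes_per_elem=2):
    """Total per-sequence decoding memory for a hybrid stack."""
    return (kv_cache_bytes(seq_len, n_attn_layers, n_kv_heads, head_dim, bytes_per_elem)
            + ssm_state_bytes(n_ssm_layers, d_inner, d_state, bytes_per_elem))


if __name__ == "__main__":
    # Hypothetical 16-layer models: all-attention vs. 4 attention + 12 SSM layers.
    for seq_len in (1024, 8192, 65536):
        full = kv_cache_bytes(seq_len, 16, 8, 64)
        hybrid = hybrid_cache_bytes(seq_len, 4, 12, 8, 64, 2048, 128)
        print(seq_len, full, hybrid)
```

Under these assumptions, only the attention layers contribute a term that grows with context length, so replacing attention layers with recurrent ones shrinks the part of the cache that dominates at long contexts.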

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same analysis method could be applied to hybrids that mix attention with other state-space variants or with linear attention.
  • Hardware-specific constraints might favor one fusion pattern over the other when memory bandwidth or compute type changes.
  • The recipes could be checked at much larger scales to see whether the critical elements remain stable.

Load-bearing premise

The five evaluation dimensions of language modeling, downstream tasks, long-context capabilities, scaling, and efficiency are sufficient to expose the decisive factors and support design recommendations that hold across model sizes and tasks.

What would settle it

A hybrid model constructed from the paper's optimal design recipes shows no improvement over baselines in long-context benchmarks or exhibits worse scaling efficiency when model size increases.

Original abstract

Recent progress in large language models demonstrates that hybrid architectures--combining self-attention mechanisms with structured state space models like Mamba--can achieve a compelling balance between modeling quality and computational efficiency, particularly for long-context tasks. While these hybrid models show promising performance, systematic comparisons of hybridization strategies and analyses on the key factors behind their effectiveness have not been clearly shared to the community. In this work, we present a holistic evaluation of hybrid architectures based on inter-layer (sequential) or intra-layer (parallel) fusion. We comprehensively evaluate these designs across multiple dimensions: language modeling and downstream task performance, long-context capabilities, scaling analysis, and training and inference efficiency. By investigating the core characteristics of their computational primitive, we identify the most critical elements for each hybridization strategy and further propose optimal design recipes for hybrid models. Our comprehensive analysis provides practical guidance and valuable insights for developing hybrid language models, facilitating the optimization of architectural configurations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents a holistic evaluation of hybrid architectures for language models that combine self-attention mechanisms with structured state space models such as Mamba. It systematically compares inter-layer (sequential) and intra-layer (parallel) fusion strategies across multiple dimensions: language modeling and downstream task performance, long-context capabilities, scaling analysis, and training and inference efficiency. By investigating the core characteristics of their computational primitives, the authors identify the most critical elements for each hybridization strategy and propose optimal design recipes for hybrid models, aiming to provide practical guidance for developing such architectures.

Significance. If the results and analyses hold up under scrutiny, this work would be significant for the field by filling a gap in systematic comparisons of hybridization strategies. It could offer actionable insights into balancing modeling quality and efficiency, particularly beneficial for long-context language modeling tasks. The multi-dimensional evaluation framework, if rigorously implemented with appropriate controls, represents a valuable contribution to architectural design in large language models.

major comments (2)
  1. [Abstract] Abstract: The abstract asserts comprehensive evaluation and identification of critical elements for each hybridization strategy, yet supplies no quantitative results, model sizes, datasets, or statistical details, leaving the central claims unsupported by visible evidence.
  2. [Evaluation] Evaluation dimensions: The assumption that the five listed evaluation axes (language modeling, downstream tasks, long-context capabilities, scaling, and efficiency) are jointly sufficient to isolate what drives effectiveness and support generalizable design recommendations is load-bearing for the strongest claim. Scaling curves and efficiency metrics can be dominated by implementation details (e.g., kernel fusion, memory layout) rather than the hybridization strategy itself; without explicit controls that hold total FLOPs or parameter count fixed while varying only the fusion topology, observed differences may not generalize beyond the tested model sizes and tasks.
minor comments (1)
  1. Clarify the exact definitions and implementation details of inter-layer versus intra-layer fusion, perhaps with pseudocode or additional diagrams, to aid reproducibility.
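
As a rough illustration of the kind of pseudocode this comment asks for, the two layouts could be described by a per-layer schedule. The helper names, the block labels, and the "one attention block every k layers" spacing rule are hypothetical; the paper's actual placement recipe is not reproduced here.

```python
def inter_layer_schedule(num_layers, attn_every):
    """Sequential layout: one attention block every `attn_every` layers, SSM elsewhere."""
    if num_layers <= 0 or attn_every <= 0:
        raise ValueError("num_layers and attn_every must be positive")
    return ["attn" if (i + 1) % attn_every == 0 else "ssm" for i in range(num_layers)]


def intra_layer_schedule(num_layers):
    """Parallel layout: every layer contains both an attention and an SSM branch."""
    if num_layers <= 0:
        raise ValueError("num_layers must be positive")
    return ["attn+ssm"] * num_layers


if __name__ == "__main__":
    print(inter_layer_schedule(8, 4))
    print(intra_layer_schedule(4))
```

Writing the layout down as data makes it easy to state exactly which layers differ between two models being compared.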

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and positive assessment of the work's potential significance. We address each major comment below with specific plans for revision where appropriate.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The abstract asserts comprehensive evaluation and identification of critical elements for each hybridization strategy, yet supplies no quantitative results, model sizes, datasets, or statistical details, leaving the central claims unsupported by visible evidence.

    Authors: We agree that the abstract would be strengthened by including key quantitative highlights to better ground the claims. In the revised version, we will incorporate specific details such as the range of model sizes evaluated (e.g., from 350M to 7B parameters), primary datasets and benchmarks (WikiText-103 for language modeling, standard downstream suites including MMLU and long-context tasks), and representative results (e.g., perplexity improvements and efficiency gains relative to pure attention and Mamba baselines).

    Revision: yes

  2. Referee: [Evaluation] Evaluation dimensions: The assumption that the five listed evaluation axes (language modeling, downstream tasks, long-context capabilities, scaling, and efficiency) are jointly sufficient to isolate what drives effectiveness and support generalizable design recommendations is load-bearing for the strongest claim. Scaling curves and efficiency metrics can be dominated by implementation details (e.g., kernel fusion, memory layout) rather than the hybridization strategy itself; without explicit controls that hold total FLOPs or parameter count fixed while varying only the fusion topology, observed differences may not generalize beyond the tested model sizes and tasks.

    Authors: We appreciate this methodological point. Our experimental design matched total parameter counts across inter-layer, intra-layer, and baseline models, and we report both theoretical and measured FLOPs for training and inference. To strengthen the manuscript, we will add an expanded subsection on experimental controls, explicitly describing how fusion topology was varied while holding parameter count and high-level computational budget fixed, along with details on kernel implementations and hardware to mitigate concerns about confounding factors. We note that complete isolation from all low-level optimizations remains difficult in practice, but the multi-axis evaluation and scaling trends provide evidence that the observed differences are driven primarily by the hybridization strategy rather than implementation artifacts alone.

    Revision: partial
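
A theoretical FLOP comparison of the kind discussed in this exchange can be sketched with standard approximations: about 2 FLOPs per linear-layer parameter per token in the forward pass, plus a token-mixing term that is quadratic in sequence length for attention and linear for a recurrence. The function names are hypothetical, and the per-state-element constant for the recurrent layer is a placeholder assumption, not a figure from the paper.

```python
def linear_forward_flops(n_linear_params, n_tokens):
    """Forward FLOPs of weight matrices: one multiply and one add per parameter per token."""
    return 2 * n_linear_params * n_tokens


def attention_mixing_flops(seq_len, d_model, n_attn_layers):
    """QK^T plus attention-weighted sum of V, ignoring savings from the causal mask."""
    # 2 * L * L * d for the scores, 2 * L * L * d for the weighted sum.
    return 4 * seq_len * seq_len * d_model * n_attn_layers


def recurrent_mixing_flops(seq_len, d_inner, d_state, n_ssm_layers,
                           flops_per_state_elem=6):
    """State update cost; `flops_per_state_elem` is an assumed placeholder constant."""
    return flops_per_state_elem * seq_len * d_inner * d_state * n_ssm_layers


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to show how the two terms scale.
    for seq_len in (2048, 16384):
        attn = attention_mixing_flops(seq_len, 2048, 4)
        ssm = recurrent_mixing_flops(seq_len, 4096, 128, 12)
        print(seq_len, attn, ssm)
```

Holding the linear-layer term fixed across models while varying only the mixing terms is one way to express the control the referee asks for.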

Circularity Check

0 steps flagged

Empirical evaluation with no circular derivation chain

Full rationale

The paper performs a systematic empirical study of hybrid architectures through direct experimentation on language modeling, downstream tasks, long-context, scaling, and efficiency. Claims about critical elements and design recipes are derived from observed performance trends rather than from equations, fitted parameters renamed as predictions, or self-referential citations. No load-bearing steps reduce to inputs by construction; the analysis remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmarking study; no mathematical axioms, free parameters, or invented entities are introduced or required by the abstract description.

pith-pipeline@v0.9.0 · 5708 in / 1169 out tokens · 34791 ms · 2026-05-18T10:14:20.893426+00:00 · methodology


Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

    cs.LG 2026-04 unverdicted novelty 7.0

    The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.

  2. SoK: Honeypots & LLMs, More Than the Sum of Their Parts?

    cs.CR 2025-10 unverdicted novelty 7.0

    A systematization of knowledge paper that taxonomizes honeypot detection vectors, synthesizes LLM-honeypot literature into canonical architecture and evaluation methods, and proposes a roadmap for autonomous deception...

  3. Priming: Hybrid State Space Models From Pre-trained Transformers

    cs.LG 2026-05 unverdicted novelty 6.0

    Priming transfers knowledge from pre-trained Transformers to hybrid SSM-attention models, recovering performance with minimal additional tokens and showing Gated KalmaNet outperforming Mamba-2 on long-context reasonin...

  4. Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression

    cs.LG 2025-11 unverdicted novelty 6.0

    Gated KalmaNet uses exact Kalman gain computation with adaptive gating and Chebyshev iteration to improve SSM performance on long-context tasks over prior approximations like DeltaNet.

  5. MDN: Parallelizing Stepwise Momentum for Delta Linear Attention

    cs.LG 2026-05 unverdicted novelty 5.0

    MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · cited by 5 Pith papers · 35 internal anchors

  1. [1]

    Composer: A search framework for hybrid neural architecture design

    Bilge Acun, Prasoon Sinha, Newsha Ardalani, Sangmin Bae, Alicia Golden, Chien-Yu Lin, Meghana Madhyastha, Fei Sun, Neeraja J Yadwadkar, and Carole-Jean Wu. Composer: A search framework for hybrid neural architecture design.arXiv preprint arXiv:2510.00379,

  2. [2]

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints.arXiv preprint arXiv:2305.13245,

  3. [3]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985,

  4. [4]

    NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model

    Aarti Basant, Abhijit Khairnar, Abhijit Paithankar, Abhinav Khattar, Adi Renduchintala, Adithya Renduchintala, Aditya Malte, Akhiad Bercovich, Akshay Hazare, Alejandra Rico, et al. Nvidia nemotron nano 2: An accurate and efficient hybrid mamba-transformer reasoning model.arXiv preprint arXiv:2508.14444,

  5. [5]

    Titans: Learning to Memorize at Test Time

    Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time.arXiv preprint arXiv:2501.00663,

  6. [6]

    Longformer: The Long-Document Transformer

    Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer.arXiv preprint arXiv:2004.05150,

  7. [7]

    Decimamba: Exploring the length extrapolation potential of mamba

    Assaf Ben-Kish, Itamar Zimerman, Shady Abu-Hussein, Nadav Cohen, Amir Globerson, Lior Wolf, and Raja Giryes. Decimamba: Exploring the length extrapolation potential of mamba.arXiv preprint arXiv:2406.14528,

  8. [8]

    Recurrentgemma: Moving past transformers for efficient open language models

    Aleksandar Botev, Soham De, Samuel L Smith, Anushan Fernando, George-Cristian Muraru, Ruba Haroun, Leonard Berrada, Razvan Pascanu, Pier Giuseppe Sessa, Robert Dadashi, et al. Recurrentgemma: Moving past transformers for efficient open language models.arXiv preprint arXiv:2404.07839,

  9. [9]

    Command a: An enterprise-ready large language model

    Team Cohere, Arash Ahmadian, Marwan Ahmed, Jay Alammar, Milad Alizadeh, Yazeed Alnumay, Sophia Althammer, Arkady Arkhangorodsky, Viraat Aryabumi, Dennis Aumiller, et al. Command a: An enterprise-ready large language model.arXiv preprint arXiv:2504.00698,

  10. [10]

    Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality.arXiv preprint arXiv:2405.21060,

  11. [11]

    Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

    Soham De, Samuel L Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, et al. Griffin: Mixing gated linear recurrences with local attention for efficient language models.arXiv preprint arXiv:2402.19427,

  12. [12]

    DeepSeek-V3 Technical Report

    DeepSeek-AI. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437,

  13. [13]

    Hymba: A hybrid-head architecture for small language models

    Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, et al. Hymba: A hybrid-head architecture for small language models.arXiv preprint arXiv:2411.13676,

  14. [14]

    Hungry hungry hippos: Towards language modeling with state space models

    Daniel Y Fu, Tri Dao, Khaled K Saab, Armin W Thomas, Atri Rudra, and Christopher Ré. Hungry hungry hippos: Towards language modeling with state space models.arXiv preprint arXiv:2212.14052,

  15. [15]

    He, B., Yin, L., Zhen, H.-L., Liu, S., Wu, H., Zhang, X., Yuan, M., and Ma, C

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...

  16. [16]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Google Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530,

  17. [17]

    Gemma 3 Technical Report

    Google Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report.arXiv preprint arXiv:2503.19786,

  18. [18]

    The zamba2 suite: Technical report

    Paolo Glorioso, Quentin Anthony, Yury Tokpanov, Anna Golubeva, Vasudev Shyam, James Whittington, Jonathan Pilault, and Beren Millidge. The zamba2 suite: Technical report.arXiv preprint arXiv:2411.15242, 2024a. Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, and Beren Millidge. Zamba: A compact 7b ssm hybr...

  19. [19]

    Efficiently Modeling Long Sequences with Structured State Spaces

    Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces.arXiv preprint arXiv:2111.00396, 2021a. Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher Ré. Combining recurrent, convolutional, and continuous-time models with linear state space layers.Advances in neural infor...

  20. [20]

    Jet-nemotron: Efficient language model with post neural architecture search

    Yuxian Gu, Qinghao Hu, Shang Yang, Haocheng Xi, Junyu Chen, Song Han, and Han Cai. Jet-nemotron: Efficient language model with post neural architecture search.arXiv preprint arXiv:2508.15884,

  21. [21]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556,

  22. [22]

    Rwkv-x: A linear complexity hybrid language model

    Haowen Hou, Zhiyi Huang, Kaifeng Tan, Rongchang Lu, and Fei Richard Yu. Rwkv-x: A linear complexity hybrid language model.arXiv preprint arXiv:2504.21463,

  23. [23]

    Hunyuan-TurboS: Advancing large language models through mamba-transformer synergy and adaptive chain-of-thought

    Tencent Hunyuan Team, Ao Liu, Botong Zhou, Can Xu, Chayse Zhou, ChenChen Zhang, Chengcheng Xu, Chenhao Wang, Decheng Wu, Dengpeng Wu, et al. Hunyuan-turbos: Advancing large language models through mamba-transformer synergy and adaptive chain-of-thought.arXiv preprint arXiv:2505.15431,

  24. [24]

    Jamba-1.5: Hybrid transformer-mamba models at scale

    Ai2 Jamba Team, Barak Lenz, Alan Arazi, Amir Bergman, Avshalom Manevich, Barak Peleg, Ben Aviram, Chen Almagor, Clara Fridman, Dan Padnos, et al. Jamba-1.5: Hybrid transformer-mamba models at scale.arXiv preprint arXiv:2408.12570,

  25. [25]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023. https://arxiv....

  26. [26]

    Scaling Laws for Neural Language Models

    Accessed: 2023-10-31. Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361,

  27. [27]

    Liger: Linearizing large language models to gated recurrent structures

    Disen Lan, Weigao Sun, Jiaxi Hu, Jusen Du, and Yu Cheng. Liger: Linearizing large language models to gated recurrent structures.arXiv preprint arXiv:2503.01496,

  28. [28]

    MiniMax-01: Scaling Foundation Models with Lightning Attention

    Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Congchao Guo, Da Chen, Dong Li, et al. Minimax-01: Scaling foundation models with lightning attention.arXiv preprint arXiv:2501.08313, 2025a. Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Yitzhak Gadre, Hritik Bansal, Etash Guha, Sedrick Scott Keh, K...

  29. [29]

    Transmamba: Flexibly switching between transformer and mamba

    Yixing Li, Ruobing Xie, Zhen Yang, Xingwu Sun, Shuaipeng Li, Weidong Han, Zhanhui Kang, Yu Cheng, Chengzhong Xu, Di Wang, et al. Transmamba: Flexibly switching between transformer and mamba.arXiv preprint arXiv:2503.24067, 2025b. Wanchao Liang, Tianyu Liu, Less Wright, Will Constable, Andrew Gu, Chien-Chin Huang, Iris Zhang, Wei Feng, Howard Huang, Junjie...

  30. [30]

    Jamba: A Hybrid Transformer-Mamba Language Model

    Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, et al. Jamba: A hybrid transformer-mamba language model.arXiv preprint arXiv:2403.19887,

  31. [31]

    Lost in the Middle: How Language Models Use Long Contexts

    Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.arXiv preprint arXiv:2307.03172,

  32. [32]

    The Llama 3 Herd of Models

    Meta Llama Team. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

  33. [33]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101,

  34. [34]

    MoBA: Mixture of Block Attention for Long-Context LLMs

    Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, et al. Moba: Mixture of block attention for long-context llms.arXiv preprint arXiv:2502.13189,

  35. [35]

    Phi-4 Technical Report

    Microsoft Research. Phi-4 technical report.arXiv preprint arXiv:2412.08905,

  36. [36]

    OpenAI o1 System Card

    OpenAI, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720,

  37. [37]

    gpt-oss-120b & gpt-oss-20b Model Card

    OpenAI, Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925,

  38. [38]

    Byte latent transformer: Patches scale better than tokens

    Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, et al. Byte latent transformer: Patches scale better than tokens.arXiv preprint arXiv:2412.09871,

  39. [39]

    Rwkv-7 "goose" with expressive dynamic state evolution

    Bo Peng, Ruichong Zhang, Daniel Goldstein, Eric Alcaide, Xingjian Du, Haowen Hou, Jiaju Lin, Jiaxing Liu, Janna Lu, William Merrill, et al. Rwkv-7 "goose" with expressive dynamic state evolution.arXiv preprint arXiv:2503.14456,

  40. [40]

    Mechanistic design and scaling of hybrid architectures

    Michael Poli, Armin W Thomas, Eric Nguyen, Pragaash Ponnusamy, Björn Deiseroth, Kristian Kersting, Taiji Suzuki, Brian Hie, Stefano Ermon, Christopher Ré, et al. Mechanistic design and scaling of hybrid architectures.arXiv preprint arXiv:2403.17844,

  41. [41]

    Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

    Ofir Press, Noah A Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables input length extrapolation.arXiv preprint arXiv:2108.12409,

  42. [42]

    The devil in linear transformer

    Zhen Qin, Xiaodong Han, Weixuan Sun, Dongxu Li, Lingpeng Kong, Nick Barnes, and Yiran Zhong. The devil in linear transformer.arXiv preprint arXiv:2210.10340,

  43. [43]

    Lightning attention-2: A free lunch for handling unlimited sequence lengths in large language models

    Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, and Yiran Zhong. Lightning attention-2: A free lunch for handling unlimited sequence lengths in large language models.arXiv preprint arXiv:2401.04658,

  44. [44]

    Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

    Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, et al. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free.arXiv preprint arXiv:2505.06708,

  45. [45]

    Qwen3 Technical Report

    Qwen Team. Qwen3 technical report.arXiv preprint arXiv:2505.09388,

  46. [46]

    Compressive Transformers for Long-Range Sequence Modelling

    Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, Chloe Hillier, and Timothy P Lillicrap. Compressive transformers for long-range sequence modelling.arXiv preprint, 2019a.https://arxiv.org/abs/1911.05507. Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, and Timothy P Lillicrap. Compressive transformers for long-range sequence modelling.arXiv preprint arX...

  47. [47]

    Samba: Simple hybrid state space models for efficient unlimited context language modeling

    Liliang Ren, Yang Liu, Yadong Lu, Yelong Shen, Chen Liang, and Weizhu Chen. Samba: Simple hybrid state space models for efficient unlimited context language modeling.arXiv preprint arXiv:2406.07522,

  48. [48]

    Decoder-hybrid-decoder architecture for efficient reasoning with long generation

    Liliang Ren, Congcong Chen, Haoran Xu, Young Jin Kim, Adam Atkinson, Zheng Zhan, Jiankai Sun, Baolin Peng, Liyuan Liu, Shuohang Wang, et al. Decoder-hybrid-decoder architecture for efficient reasoning with long generation. arXiv preprint arXiv:2507.06607,

  49. [49]

    Differential mamba

    Nadav Schneider, Itamar Zimerman, and Eliya Nachmani. Differential mamba.arXiv preprint arXiv:2507.06204,

  50. [50]

    GLU Variants Improve Transformer

    Noam Shazeer. GLU variants improve transformer.CoRR, abs/2002.05202, 2020.https://arxiv.org/abs/2002.05202. Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.arXiv preprint arXiv:1701.06538,

  51. [51]

    Simplified State Space Layers for Sequence Modeling

    Jimmy TH Smith, Andrew Warrington, and Scott W Linderman. Simplified state space layers for sequence modeling. arXiv preprint arXiv:2208.04933,

  52. [52]

    Speed Always Wins: a survey on efficient architectures for large language models

    Weigao Sun, Jiaxi Hu, Yucheng Zhou, Jusen Du, Disen Lan, Kexin Wang, Tong Zhu, Xiaoye Qu, Yu Zhang, Xiaoyu Mo, et al. Speed always wins: A survey on efficient architectures for large language models.arXiv preprint arXiv:2508.09834,

  53. [53]

    Star: Synthesis of tailored architectures

    Armin W Thomas, Rom Parnichkun, Alexander Amini, Stefano Massaroli, and Michael Poli. Star: Synthesis of tailored architectures.arXiv preprint arXiv:2411.17800,

  54. [54]

    An Empirical Study of Mamba-based Language Models

    Roger Waleffe, Wonmin Byeon, Duncan Riach, Brandon Norick, Vijay Korthikanti, Tri Dao, Albert Gu, Ali Hatamizadeh, Sudhakar Singh, Deepak Narayanan, et al. An empirical study of mamba-based language models.arXiv preprint arXiv:2406.07887,

  55. [55]

    A systematic analysis of hybrid linear attention

    Dustin Wang, Rui-Jie Zhu, Steven Abreu, Yong Shan, Taylor Kergan, Yuqi Pan, Yuhong Chou, Zheng Li, Ge Zhang, Wenhao Huang, et al. A systematic analysis of hybrid linear attention.arXiv preprint arXiv:2507.06457,

  56. [56]

    Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts

    Junxiong Wang, Daniele Paliotta, Avner May, Alexander Rush, and Tri Dao. The mamba in the llama: Distilling and accelerating hybrid models.Advances in Neural Information Processing Systems, 37:62432–62457, 2024a. Lean Wang, Huazuo Gao, Chenggang Zhao, Xu Sun, and Damai Dai. Auxiliary-loss-free load balancing strategy for mixture-of-experts.arXiv preprint ...

  57. [57]

    Efficient Streaming Language Models with Attention Sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks.arXiv preprint arXiv:2309.17453,

  58. [58]

    Wuneng: Hybrid state with attention

    Liu Xiao, Li Zhiyuan, and Lin Yueyu. Wuneng: Hybrid state with attention.arXiv preprint arXiv:2504.19191,

  59. [59]

    A Walk with SGD

    Chen Xing, Devansh Arpit, Christos Tsirigotis, and Yoshua Bengio. A walk with sgd.arXiv preprint arXiv:1802.08770,

  60. [60]

    Zebra-llama: Towards extremely efficient hybrid models

    Mingyu Yang, Mehdi Rezagholizadeh, Guihong Li, Vikram Appia, and Emad Barsoum. Zebra-llama: Towards extremely efficient hybrid models.arXiv preprint arXiv:2505.17272,

  61. [61]

    Gated Delta Networks: Improving Mamba2 with Delta Rule

    Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving mamba2 with delta rule.arXiv preprint arXiv:2412.06464,

  62. [62]

    Differential transformer

    Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, and Furu Wei. Differential transformer.arXiv preprint arXiv:2410.05258,

  63. [63]

    Lolcats: On low-rank linearizing of large language models

    Michael Zhang, Simran Arora, Rahul Chalamala, Alan Wu, Benjamin Spector, Aaryan Singhal, Krithik Ramesh, and Christopher Ré. Lolcats: On low-rank linearizing of large language models.arXiv preprint arXiv:2410.10254,

  64. [64]

    PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

    Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Pytorch fsdp: experiences on scaling fully sharded data parallel.arXiv preprint arXiv:2304.11277,

  65. [65]

    Transformers can achieve length generalization but not robustly

    Yongchao Zhou, Uri Alon, Xinyun Chen, Xuezhi Wang, Rishabh Agarwal, and Denny Zhou. Transformers can achieve length generalization but not robustly.arXiv preprint arXiv:2402.09371,

  66. [66]

    Additionally, we utilized the same models for vibe coding when fixing bugs or when making the initial drafts of the figures

    and Llama 4 to polish the overall writing after drafting the paper ourselves. Additionally, we utilized the same models for vibe coding when fixing bugs or when making the initial drafts of the figures. C Details for Computational and Memory Costs Comparison FLOPs per sample.Most parameters are linear weight matrices, so total FLOPs can be estimated by mu...

  67. [67]

    We primarily follow the configurations of Llama 3.2 and Mamba 2 across various model scales

    and the Mamba architectures (Dao and Gu, 2024). We primarily follow the configurations of Llama 3.2 and Mamba 2 across various model scales. Table 6 provides a summary of the detailed architectures for both the Transformer and Mamba models, which serve as the foundational computational primitives for our hybrid architecture variants. Base Configuration ...

  68. [68]

    and the PG19 datasets (Rae et al., 2019b). Additionally, following the settings of prior work (Gemini Team et al., 2024; Ho et al., 2024), we conduct evaluations on the Needle-In-a-Haystack long-context benchmark (Kamradt, 2023; Kuratov et al., 2024). We further measure few-shot accuracy on five benchmarks using the Language Model Evaluation Harness (Gao et al., ...