Hybrid Architectures for Language Models: Systematic Analysis and Design Insights

Bilge Acun; Carole-Jean Wu; Chien-Yu Lin; Haroun Habeeb; Junjie Wang; Liang Luo; Sangmin Bae; Seungyeon Kim

arxiv: 2510.04800 · v3 · submitted 2025-10-06 · 💻 cs.CL

Sangmin Bae, Bilge Acun, Chien-Yu Lin, Haroun Habeeb, Seungyeon Kim, Liang Luo, Junjie Wang, Carole-Jean Wu

Pith reviewed 2026-05-18 10:14 UTC · model grok-4.3

keywords: hybrid architectures · self-attention · state space models · Mamba · language modeling · long-context · model efficiency

The pith

Hybrid language models achieve better efficiency and long-context performance through targeted inter-layer or intra-layer fusion of self-attention and state space models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares two hybridization approaches for language models: placing self-attention and Mamba-style state space layers in sequence or running them side by side inside the same layer. It tests the resulting models on language modeling accuracy, downstream tasks, long-sequence handling, scaling trends, and training plus inference speed. The analysis of each approach's basic computational building blocks identifies which features matter most for success and produces concrete recipes for the best designs. These findings matter to readers who want to build models that keep high quality while using far less compute on extended contexts than standard transformers.
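As a rough illustration of why the compute gap widens on extended contexts, the sketch below compares back-of-envelope FLOP counts for the token-mixing step of causal self-attention and of a linear state-space recurrence. The function names, the constant factors, and the example sizes (`d_model`, `d_state`) are illustrative assumptions, not figures or code from the paper.

```python
def attention_mixing_flops(seq_len: int, d_model: int) -> int:
    """Approximate FLOPs for attention token mixing (projections ignored).

    Two matrix products, Q @ K^T and weights @ V, each cost about
    2 * L * L * d FLOPs. Causal masking and softmax are ignored.
    """
    return 4 * seq_len * seq_len * d_model


def ssm_mixing_flops(seq_len: int, d_model: int, d_state: int) -> int:
    """Approximate FLOPs for a per-channel linear recurrence.

    Assumed cost per token, channel, and state element: 3 FLOPs for the
    update h = a * h + b * x and 2 FLOPs for the readout y += c * h.
    """
    return 5 * seq_len * d_model * d_state


def mixing_cost_ratio(seq_len: int, d_model: int, d_state: int) -> float:
    """How many times more mixing FLOPs attention needs than the recurrence."""
    return attention_mixing_flops(seq_len, d_model) / ssm_mixing_flops(
        seq_len, d_model, d_state
    )


if __name__ == "__main__":
    # Hypothetical sizes chosen only to show the trend.
    for length in (2_048, 16_384, 131_072):
        ratio = mixing_cost_ratio(length, d_model=2_048, d_state=128)
        print(f"L={length:>7}: attention / SSM mixing cost ~ {ratio:.1f}x")
```

Under these assumptions the ratio grows linearly with sequence length, which is the intuition behind replacing some attention layers with state-space layers. Measured speedups depend on kernels and hardware, a point the referee report on this page also raises.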

Core claim

Hybrid architectures achieve a compelling balance between modeling quality and computational efficiency by combining self-attention mechanisms with structured state space models. The analysis of inter-layer sequential fusion and intra-layer parallel fusion reveals the most critical elements for each strategy based on their computational primitives, leading to proposed optimal design recipes for hybrid models.

What carries the argument

Inter-layer sequential fusion and intra-layer parallel fusion between self-attention and structured state space models, which together determine the balance of quality and efficiency.
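The two fusion topologies can be sketched with toy mixers. The following is a minimal pure-Python illustration, not the paper's implementation: `toy_attention` and `toy_ssm` are weight-free stand-ins, the layer pattern string and the 50/50 averaging used for parallel fusion are assumptions made here for clarity, and real blocks would add learned projections, normalization, and feed-forward layers.

```python
import math
from typing import List

Sequence = List[List[float]]  # tokens x channels


def toy_attention(x: Sequence) -> Sequence:
    """Causal softmax attention with queries = keys = values = x."""
    out = []
    for t, query in enumerate(x):
        scores = [sum(q * k for q, k in zip(query, x[s])) for s in range(t + 1)]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * x[s][c] for s, w in enumerate(weights)) / total
            for c in range(len(query))
        ])
    return out


def toy_ssm(x: Sequence, decay: float = 0.9) -> Sequence:
    """Per-channel recurrence h_t = decay * h_{t-1} + (1 - decay) * x_t."""
    state = [0.0] * len(x[0])
    out = []
    for token in x:
        state = [decay * h + (1.0 - decay) * v for h, v in zip(state, token)]
        out.append(list(state))
    return out


def residual(x: Sequence, y: Sequence) -> Sequence:
    """Element-wise residual connection x + y."""
    return [[a + b for a, b in zip(xr, yr)] for xr, yr in zip(x, y)]


def inter_layer_hybrid(x: Sequence, pattern: str) -> Sequence:
    """Sequential fusion: one mixer per layer ('A' = attention, 'S' = SSM)."""
    for kind in pattern:
        mixer = toy_attention if kind == "A" else toy_ssm
        x = residual(x, mixer(x))
    return x


def intra_layer_hybrid(x: Sequence, num_layers: int) -> Sequence:
    """Parallel fusion: both mixers read the same input in every layer."""
    for _ in range(num_layers):
        attn_out = toy_attention(x)
        ssm_out = toy_ssm(x)
        # Assumed fusion rule: plain average of the two branch outputs.
        fused = [
            [0.5 * (a + s) for a, s in zip(ar, sr)]
            for ar, sr in zip(attn_out, ssm_out)
        ]
        x = residual(x, fused)
    return x
```

For example, `inter_layer_hybrid(tokens, "SSSA")` stacks three state-space layers followed by one attention layer, while `intra_layer_hybrid(tokens, 4)` runs both mixers side by side in each of four layers. Which ratio, ordering, or fusion rule works best is exactly what the paper evaluates, so the choices above should be read as placeholders.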

If this is right

Following the identified recipes produces hybrids with stronger long-context performance than either pure attention or pure state-space models.
Training and inference costs drop while quality on standard tasks stays the same or rises.
Different fusion choices prove more important for certain task families than for others.
The relative importance of each component's primitive can be used to predict which hybridization will scale best.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same analysis method could be applied to hybrids that mix attention with other state-space variants or with linear attention.
Hardware-specific constraints might favor one fusion pattern over the other when memory bandwidth or compute type changes.
The recipes could be checked at much larger scales to see whether the critical elements remain stable.

Load-bearing premise

The five evaluation dimensions of language modeling, downstream tasks, long-context capabilities, scaling, and efficiency are sufficient to expose the decisive factors and support design recommendations that hold across model sizes and tasks.

What would settle it

A hybrid model constructed from the paper's optimal design recipes shows no improvement over baselines in long-context benchmarks or exhibits worse scaling efficiency when model size increases.

Original abstract

Recent progress in large language models demonstrates that hybrid architectures--combining self-attention mechanisms with structured state space models like Mamba--can achieve a compelling balance between modeling quality and computational efficiency, particularly for long-context tasks. While these hybrid models show promising performance, systematic comparisons of hybridization strategies and analyses on the key factors behind their effectiveness have not been clearly shared to the community. In this work, we present a holistic evaluation of hybrid architectures based on inter-layer (sequential) or intra-layer (parallel) fusion. We comprehensively evaluate these designs across multiple dimensions: language modeling and downstream task performance, long-context capabilities, scaling analysis, and training and inference efficiency. By investigating the core characteristics of their computational primitive, we identify the most critical elements for each hybridization strategy and further propose optimal design recipes for hybrid models. Our comprehensive analysis provides practical guidance and valuable insights for developing hybrid language models, facilitating the optimization of architectural configurations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper runs a side-by-side comparison of inter-layer and intra-layer hybrids of attention with Mamba-style models and pulls out some practical design pointers, but the evidence for general recipes is thinner than the claims suggest.

Colleague, the main takeaway is that this work organizes existing hybrid ideas into a clearer empirical comparison. They test sequential inter-layer fusion against parallel intra-layer fusion, then evaluate across language modeling, downstream tasks, long-context handling, scaling trends, and efficiency. From that they flag what seems to matter most in each case and offer some design recipes. That synthesis is the actual new piece here; the individual components are already in the literature they cite, but the organized head-to-head look is useful for people who just want to know what tends to work when mixing these primitives.

They do a reasonable job laying out the different fusion patterns and running them through the listed evaluation axes. If the full results include clean ablations and consistent trends across model sizes, the efficiency and scaling sections could give practitioners some concrete starting points without having to rerun everything themselves.

The soft spot is that the central claim about identifying the most critical elements and proposing optimal recipes rests on those five evaluation dimensions being enough to isolate the effects of the hybridization strategy itself. Efficiency numbers in particular are easy to move with kernel choices or memory layouts rather than the fusion topology, and without explicit controls that hold total FLOPs or parameter count fixed while varying only the layer arrangement, the differences may not generalize. The abstract does not surface any specific numbers, model scales, or statistical details, so it is hard to judge how robust the recipes actually are once you get past the high-level trends.

This is for engineers and researchers who are already building or tuning long-context models and want empirical guidance on hybrid configurations rather than a new theoretical primitive. A reader in that position could extract some actionable ideas from the comparison, even if they end up re-testing the recipes at their own scale. It is worth sending to peer review because the topic is practical and timely, and a referee can push on the controls and quantification without the work being incoherent on its own terms.

Referee Report

2 major / 1 minor

Summary. The paper presents a holistic evaluation of hybrid architectures for language models that combine self-attention mechanisms with structured state space models such as Mamba. It systematically compares inter-layer (sequential) and intra-layer (parallel) fusion strategies across multiple dimensions: language modeling and downstream task performance, long-context capabilities, scaling analysis, and training and inference efficiency. By investigating the core characteristics of their computational primitives, the authors identify the most critical elements for each hybridization strategy and propose optimal design recipes for hybrid models, aiming to provide practical guidance for developing such architectures.

Significance. If the results and analyses hold up under scrutiny, this work would be significant for the field by filling a gap in systematic comparisons of hybridization strategies. It could offer actionable insights into balancing modeling quality and efficiency, particularly beneficial for long-context language modeling tasks. The multi-dimensional evaluation framework, if rigorously implemented with appropriate controls, represents a valuable contribution to architectural design in large language models.

major comments (2)

[Abstract] Abstract: The abstract asserts comprehensive evaluation and identification of critical elements for each hybridization strategy, yet supplies no quantitative results, model sizes, datasets, or statistical details, leaving the central claims unsupported by visible evidence.
[Evaluation] Evaluation dimensions: The assumption that the five listed evaluation axes (language modeling, downstream tasks, long-context capabilities, scaling, and efficiency) are jointly sufficient to isolate what drives effectiveness and support generalizable design recommendations is load-bearing for the strongest claim. Scaling curves and efficiency metrics can be dominated by implementation details (e.g., kernel fusion, memory layout) rather than the hybridization strategy itself; without explicit controls that hold total FLOPs or parameter count fixed while varying only the fusion topology, observed differences may not generalize beyond the tested model sizes and tasks.
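The kind of control requested in the second major comment can be sketched as a simple budget check. The snippet below is an illustration only, not the authors' procedure: the per-block parameter formulas are simplified assumptions (attention counted as four d x d projections without biases, the state-space block as an input projection to twice the expanded width plus an output projection, with small terms ignored), and all names are hypothetical.

```python
from typing import List, Tuple


def mixer_params(kind: str, d_model: int, expand: int = 2) -> int:
    """Rough parameter count of one token-mixing block (assumed formulas)."""
    if kind == "A":
        # Q, K, V, and output projections, biases ignored.
        return 4 * d_model * d_model
    if kind == "S":
        # In-projection to 2 * expand * d_model plus out-projection;
        # convolution, gating, and state parameters are ignored.
        return 3 * expand * d_model * d_model
    raise ValueError(f"unknown block kind: {kind!r}")


def model_params(pattern: str, d_model: int) -> int:
    """Total mixer parameters for a layer pattern such as 'SSSA'."""
    return sum(mixer_params(kind, d_model) for kind in pattern)


def is_budget_matched(
    patterns: List[str], d_model: int, tolerance: float = 0.02
) -> Tuple[bool, List[int]]:
    """Check that all candidate patterns stay within a relative parameter gap."""
    counts = [model_params(p, d_model) for p in patterns]
    gap = (max(counts) - min(counts)) / max(counts)
    return gap <= tolerance, counts


if __name__ == "__main__":
    # Same block mix, different placement: topology is the only variable.
    print(is_budget_matched(["SSSASSSA", "SASSSASS"], d_model=512))
    # Different block mix: the budgets diverge and the comparison is confounded.
    print(is_budget_matched(["AAAA", "SSSS"], d_model=512))
```

A check of this sort only covers parameter count. Holding FLOPs fixed, or separating kernel effects from topology effects, would need measurements on the actual implementation.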

minor comments (1)

Clarify the exact definitions and implementation details of inter-layer versus intra-layer fusion, perhaps with pseudocode or additional diagrams, to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and positive assessment of the work's potential significance. We address each major comment below with specific plans for revision where appropriate.

Referee: [Abstract] Abstract: The abstract asserts comprehensive evaluation and identification of critical elements for each hybridization strategy, yet supplies no quantitative results, model sizes, datasets, or statistical details, leaving the central claims unsupported by visible evidence.

Authors: We agree that the abstract would be strengthened by including key quantitative highlights to better ground the claims. In the revised version, we will incorporate specific details such as the range of model sizes evaluated (e.g., from 350M to 7B parameters), primary datasets and benchmarks (WikiText-103 for language modeling, standard downstream suites including MMLU and long-context tasks), and representative results (e.g., perplexity improvements and efficiency gains relative to pure attention and Mamba baselines).

revision: yes

Referee: [Evaluation] Evaluation dimensions: The assumption that the five listed evaluation axes (language modeling, downstream tasks, long-context capabilities, scaling, and efficiency) are jointly sufficient to isolate what drives effectiveness and support generalizable design recommendations is load-bearing for the strongest claim. Scaling curves and efficiency metrics can be dominated by implementation details (e.g., kernel fusion, memory layout) rather than the hybridization strategy itself; without explicit controls that hold total FLOPs or parameter count fixed while varying only the fusion topology, observed differences may not generalize beyond the tested model sizes and tasks.

Authors: We appreciate this methodological point. Our experimental design matched total parameter counts across inter-layer, intra-layer, and baseline models, and we report both theoretical and measured FLOPs for training and inference. To strengthen the manuscript, we will add an expanded subsection on experimental controls, explicitly describing how fusion topology was varied while holding parameter count and high-level computational budget fixed, along with details on kernel implementations and hardware to mitigate concerns about confounding factors. We note that complete isolation from all low-level optimizations remains difficult in practice, but the multi-axis evaluation and scaling trends provide evidence that the observed differences are driven primarily by the hybridization strategy rather than implementation artifacts alone.

revision: partial

Circularity Check

0 steps flagged

Empirical evaluation with no circular derivation chain

The paper performs a systematic empirical study of hybrid architectures through direct experimentation on language modeling, downstream tasks, long-context, scaling, and efficiency. Claims about critical elements and design recipes are derived from observed performance trends rather than from equations, fitted parameters renamed as predictions, or self-referential citations. No load-bearing steps reduce to inputs by construction; the analysis remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmarking study; no mathematical axioms, free parameters, or invented entities are introduced or required by the abstract description.

pith-pipeline@v0.9.0 · 5708 in / 1169 out tokens · 34791 ms · 2026-05-18T10:14:20.893426+00:00 · methodology

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation
cs.LG · 2026-04 · unverdicted · novelty 7.0

The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.

SoK: Honeypots & LLMs, More Than the Sum of Their Parts?
cs.CR · 2025-10 · unverdicted · novelty 7.0

A systematization of knowledge paper that taxonomizes honeypot detection vectors, synthesizes LLM-honeypot literature into canonical architecture and evaluation methods, and proposes a roadmap for autonomous deception...

Priming: Hybrid State Space Models From Pre-trained Transformers
cs.LG · 2026-05 · unverdicted · novelty 6.0

Priming transfers knowledge from pre-trained Transformers to hybrid SSM-attention models, recovering performance with minimal additional tokens and showing Gated KalmaNet outperforming Mamba-2 on long-context reasonin...

Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression
cs.LG · 2025-11 · unverdicted · novelty 6.0

Gated KalmaNet uses exact Kalman gain computation with adaptive gating and Chebyshev iteration to improve SSM performance on long-context tasks over prior approximations like DeltaNet.

MDN: Parallelizing Stepwise Momentum for Delta Linear Attention
cs.LG · 2026-05 · unverdicted · novelty 5.0

MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · cited by 5 Pith papers · 35 internal anchors

[1]

Composer: A search framework for hybrid neural architecture design.arXiv preprint arXiv:2510.00379,

Bilge Acun, Prasoon Sinha, Newsha Ardalani, Sangmin Bae, Alicia Golden, Chien-Yu Lin, Meghana Madhyastha, Fei Sun, Neeraja J Yadwadkar, and Carole-Jean Wu. Composer: A search framework for hybrid neural architecture design.arXiv preprint arXiv:2510.00379,

work page arXiv
[2]

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints.arXiv preprint arXiv:2305.13245,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model

Aarti Basant, Abhijit Khairnar, Abhijit Paithankar, Abhinav Khattar, Adi Renduchintala, Adithya Renduchintala, Aditya Malte, Akhiad Bercovich, Akshay Hazare, Alejandra Rico, et al. Nvidia nemotron nano 2: An accurate and efficient hybrid mamba-transformer reasoning model.arXiv preprint arXiv:2508.14444,

work page internal anchor Pith review arXiv
[5]

Titans: Learning to Memorize at Test Time

Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time.arXiv preprint arXiv:2501.00663,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Longformer: The Long-Document Transformer

Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer.arXiv preprint arXiv:2004.05150,

work page internal anchor Pith review Pith/arXiv arXiv 2004
[7]

Decimamba: Exploring the length extrapolation potential of mamba.arXiv preprint arXiv:2406.14528,

Assaf Ben-Kish, Itamar Zimerman, Shady Abu-Hussein, Nadav Cohen, Amir Globerson, Lior Wolf, and Raja Giryes. Decimamba: Exploring the length extrapolation potential of mamba.arXiv preprint arXiv:2406.14528,

work page arXiv
[8]

L., Fernando, A., Muraru, G.- C., Haroun, R., Berrada, L., Pascanu, R., Sessa, P

Aleksandar Botev, Soham De, Samuel L Smith, Anushan Fernando, George-Cristian Muraru, Ruba Haroun, Leonard Berrada, Razvan Pascanu, Pier Giuseppe Sessa, Robert Dadashi, et al. Recurrentgemma: Moving past transformers for efficient open language models.arXiv preprint arXiv:2404.07839,

work page arXiv
[9]

Command a: An enterprise-ready large language model,

Team Cohere, Arash Ahmadian, Marwan Ahmed, Jay Alammar, Milad Alizadeh, Yazeed Alnumay, Sophia Althammer, Arkady Arkhangorodsky, Viraat Aryabumi, Dennis Aumiller, et al. Command a: An enterprise-ready large language model.arXiv preprint arXiv:2504.00698,

work page arXiv
[10]

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality.arXiv preprint arXiv:2405.21060,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

Soham De, Samuel L Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, et al. Griffin: Mixing gated linear recurrences with local attention for efficient language models.arXiv preprint arXiv:2402.19427,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

DeepSeek-V3 Technical Report

DeepSeek-AI. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Hymba: A hybrid-head architecture for small language models

Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, et al. Hymba: A hybrid-head architecture for small language models.arXiv preprint arXiv:2411.13676,

work page arXiv
[14]

Hungry hungry hippos: To- wards language modeling with state space models

11 Daniel Y Fu, Tri Dao, Khaled K Saab, Armin W Thomas, Atri Rudra, and Christopher Ré. Hungry hungry hippos: Towards language modeling with state space models.arXiv preprint arXiv:2212.14052,

work page arXiv
[15]

He, B., Yin, L., Zhen, H.-L., Liu, S., Wu, H., Zhang, X., Yuan, M., and Ma, C

Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...

work page arXiv 2024
[16]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Google Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530,

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Gemma 3 Technical Report

Google Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report.arXiv preprint arXiv:2503.19786,

work page internal anchor Pith review Pith/arXiv arXiv
[18]

The zamba2 suite: Technical report

Paolo Glorioso, Quentin Anthony, Yury Tokpanov, Anna Golubeva, Vasudev Shyam, James Whittington, Jonathan Pilault, and Beren Millidge. The zamba2 suite: Technical report.arXiv preprint arXiv:2411.15242, 2024a. Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, and Beren Millidge. Zamba: A compact 7b ssm hybr...

work page arXiv
[19]

Efficiently Modeling Long Sequences with Structured State Spaces

Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces.arXiv preprint arXiv:2111.00396, 2021a. Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher Ré. Combining recurrent, convolutional, and continuous-time models with linear state space layers.Advances in neural infor...

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Jet-nemotron: Efficient language model with post neural architecture search.arXiv preprint arXiv:2508.15884, 2025

Yuxian Gu, Qinghao Hu, Shang Yang, Haocheng Xi, Junyu Chen, Song Han, and Han Cai. Jet-nemotron: Efficient language model with post neural architecture search.arXiv preprint arXiv:2508.15884,

work page arXiv
[21]

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556,

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Rwkv-x: A linear complexity hybrid language model.arXiv preprint arXiv:2504.21463,

Haowen Hou, Zhiyi Huang, Kaifeng Tan, Rongchang Lu, and Fei Richard Yu. Rwkv-x: A linear complexity hybrid language model.arXiv preprint arXiv:2504.21463,

work page arXiv
[23]

Hunyuan-TurboS: Advancing large language models through mamba-transformer synergy and adaptive chain-of-thought,

Tencent Hunyuan Team, Ao Liu, Botong Zhou, Can Xu, Chayse Zhou, ChenChen Zhang, Chengcheng Xu, Chenhao Wang, Decheng Wu, Dengpeng Wu, et al. Hunyuan-turbos: Advancing large language models through mamba- transformer synergy and adaptive chain-of-thought.arXiv preprint arXiv:2505.15431,

work page arXiv
[24]

Jamba-1.5: Hybrid transformer-mamba models at scale

Ai2 Jamba Team, Barak Lenz, Alan Arazi, Amir Bergman, Avshalom Manevich, Barak Peleg, Ben Aviram, Chen Almagor, Clara Fridman, Dan Padnos, et al. Jamba-1.5: Hybrid transformer-mamba models at scale.arXiv preprint arXiv:2408.12570,

work page arXiv
[25]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, 12 Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023.https://arxiv....

work page internal anchor Pith review Pith/arXiv arXiv 2023
[26]

Scaling Laws for Neural Language Models

Accessed: 2023-10-31. Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361,

work page internal anchor Pith review Pith/arXiv arXiv 2023
[27]

Liger: Linearizing large language models to gated recurrent structures.arXiv preprint arXiv:2503.01496,

Disen Lan, Weigao Sun, Jiaxi Hu, Jusen Du, and Yu Cheng. Liger: Linearizing large language models to gated recurrent structures.arXiv preprint arXiv:2503.01496,

work page arXiv
[28]

MiniMax-01: Scaling Foundation Models with Lightning Attention

Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Congchao Guo, Da Chen, Dong Li, et al. Minimax-01: Scaling foundation models with lightning attention.arXiv preprint arXiv:2501.08313, 2025a. Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Yitzhak Gadre, Hritik Bansal, Etash Guha, Sedrick Scott Keh, K...

work page internal anchor Pith review Pith/arXiv arXiv
[29]

Transmamba: Flexibly switching between transformer and mamba.arXiv preprint arXiv:2503.24067, 2025b

Yixing Li, Ruobing Xie, Zhen Yang, Xingwu Sun, Shuaipeng Li, Weidong Han, Zhanhui Kang, Yu Cheng, Chengzhong Xu, Di Wang, et al. Transmamba: Flexibly switching between transformer and mamba.arXiv preprint arXiv:2503.24067, 2025b. Wanchao Liang, Tianyu Liu, Less Wright, Will Constable, Andrew Gu, Chien-Chin Huang, Iris Zhang, Wei Feng, Howard Huang, Junjie...

work page arXiv
[30]

Jamba: A Hybrid Transformer-Mamba Language Model

Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, et al. Jamba: A hybrid transformer-mamba language model.arXiv preprint arXiv:2403.19887,

work page internal anchor Pith review Pith/arXiv arXiv
[31]

Lost in the Middle: How Language Models Use Long Contexts

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.arXiv preprint arXiv:2307.03172,

work page internal anchor Pith review Pith/arXiv arXiv
[32]

The Llama 3 Herd of Models

Meta Llama Team. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv
[33]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101,

work page internal anchor Pith review Pith/arXiv arXiv
[34]

MoBA: Mixture of Block Attention for Long-Context LLMs

Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, et al. Moba: Mixture of block attention for long-context llms.arXiv preprint arXiv:2502.13189,

work page internal anchor Pith review Pith/arXiv arXiv
[35]

Phi-4 Technical Report

Microsoft Research. Phi-4 technical report.arXiv preprint arXiv:2412.08905,

work page internal anchor Pith review Pith/arXiv arXiv
[36]

OpenAI o1 System Card

OpenaAI, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720,

work page internal anchor Pith review Pith/arXiv arXiv
[37]

gpt-oss-120b & gpt-oss-20b Model Card

OpenAI, Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925,

work page internal anchor Pith review Pith/arXiv arXiv
[38]

arXiv preprint arXiv:2412.09871 , year=

Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, et al. Byte latent transformer: Patches scale better than tokens.arXiv preprint arXiv:2412.09871,

work page arXiv
[39]

Rwkv-7" goose" with expressive dynamic state evolution

Bo Peng, Ruichong Zhang, Daniel Goldstein, Eric Alcaide, Xingjian Du, Haowen Hou, Jiaju Lin, Jiaxing Liu, Janna Lu, William Merrill, et al. Rwkv-7" goose" with expressive dynamic state evolution.arXiv preprint arXiv:2503.14456,

work page arXiv
[40]

arXiv preprint arXiv:2403.17844 , year=

13 Michael Poli, Armin W Thomas, Eric Nguyen, Pragaash Ponnusamy, Björn Deiseroth, Kristian Kersting, Taiji Suzuki, Brian Hie, Stefano Ermon, Christopher Ré, et al. Mechanistic design and scaling of hybrid architectures.arXiv preprint arXiv:2403.17844,

work page arXiv
[41]

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

Ofir Press, Noah A Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables input length extrapolation.arXiv preprint arXiv:2108.12409,

work page internal anchor Pith review Pith/arXiv arXiv
[42]

The devil in linear transformer.arXiv preprint arXiv:2210.10340,

Zhen Qin, Xiaodong Han, Weixuan Sun, Dongxu Li, Lingpeng Kong, Nick Barnes, and Yiran Zhong. The devil in linear transformer.arXiv preprint arXiv:2210.10340,

work page arXiv
[43]

Lightning attention-2: A free lunch for handling unlimited sequence lengths in large language models.arXiv preprint arXiv:2401.04658,

Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, and Yiran Zhong. Lightning attention-2: A free lunch for handling unlimited sequence lengths in large language models.arXiv preprint arXiv:2401.04658,

work page arXiv
[44]

Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, et al. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free.arXiv preprint arXiv:2505.06708,

work page internal anchor Pith review Pith/arXiv arXiv
[45]

Qwen3 Technical Report

Qwen Team. Qwen3 technical report.arXiv preprint arXiv:2505.09388,

work page internal anchor Pith review Pith/arXiv arXiv
[46]

Compressive Transformers for Long-Range Sequence Modelling

Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, Chloe Hillier, and Timothy P Lillicrap. Compressive transformers for long-range sequence modelling.arXiv preprint, 2019a.https://arxiv.org/abs/1911.05507. Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, and Timothy P Lillicrap. Compressive transformers for long-range sequence modelling.arXiv preprint arX...

work page internal anchor Pith review arXiv 1911
[47]

InInternational Conference on Learning Representations (ICLR)

Liliang Ren, Yang Liu, Yadong Lu, Yelong Shen, Chen Liang, and Weizhu Chen. Samba: Simple hybrid state space models for efficient unlimited context language modeling.arXiv preprint arXiv:2406.07522,

work page arXiv
[48]

Decoder-hybrid-decoder architecture for efficient reasoning with long generation.arXiv preprint arXiv:2507.06607, 2025

Liliang Ren, Congcong Chen, Haoran Xu, Young Jin Kim, Adam Atkinson, Zheng Zhan, Jiankai Sun, Baolin Peng, Liyuan Liu, Shuohang Wang, et al. Decoder-hybrid-decoder architecture for efficient reasoning with long generation. arXiv preprint arXiv:2507.06607,

work page arXiv
[49]

Differential mamba

Nadav Schneider, Itamar Zimerman, and Eliya Nachmani. Differential mamba. arXiv preprint arXiv:2507.06204.

[50]

GLU Variants Improve Transformer

Noam Shazeer. GLU variants improve transformer. CoRR, abs/2002.05202, 2020. https://arxiv.org/abs/2002.05202. Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.

[51]

Simplified State Space Layers for Sequence Modeling

Jimmy TH Smith, Andrew Warrington, and Scott W Linderman. Simplified state space layers for sequence modeling. arXiv preprint arXiv:2208.04933.

[52]

Speed Always Wins: a survey on efficient architectures for large language models

Weigao Sun, Jiaxi Hu, Yucheng Zhou, Jusen Du, Disen Lan, Kexin Wang, Tong Zhu, Xiaoye Qu, Yu Zhang, Xiaoyu Mo, et al. Speed always wins: A survey on efficient architectures for large language models. arXiv preprint arXiv:2508.09834.

[53]

Star: Synthesis of tailored architectures

Armin W Thomas, Rom Parnichkun, Alexander Amini, Stefano Massaroli, and Michael Poli. Star: Synthesis of tailored architectures. arXiv preprint arXiv:2411.17800.

[54]

An Empirical Study of Mamba-based Language Models

Roger Waleffe, Wonmin Byeon, Duncan Riach, Brandon Norick, Vijay Korthikanti, Tri Dao, Albert Gu, Ali Hatamizadeh, Sudhakar Singh, Deepak Narayanan, et al. An empirical study of mamba-based language models. arXiv preprint arXiv:2406.07887.

[55]

A systematic analysis of hybrid linear attention

Dustin Wang, Rui-Jie Zhu, Steven Abreu, Yong Shan, Taylor Kergan, Yuqi Pan, Yuhong Chou, Zheng Li, Ge Zhang, Wenhao Huang, et al. A systematic analysis of hybrid linear attention. arXiv preprint arXiv:2507.06457.

[56]

Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts

Junxiong Wang, Daniele Paliotta, Avner May, Alexander Rush, and Tri Dao. The mamba in the llama: Distilling and accelerating hybrid models. Advances in Neural Information Processing Systems, 37:62432–62457, 2024a. Lean Wang, Huazuo Gao, Chenggang Zhao, Xu Sun, and Damai Dai. Auxiliary-loss-free load balancing strategy for mixture-of-experts. arXiv preprint ...

[57]

Efficient Streaming Language Models with Attention Sinks

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453.

[58]

Wuneng: Hybrid state with attention

Liu Xiao, Li Zhiyuan, and Lin Yueyu. Wuneng: Hybrid state with attention. arXiv preprint arXiv:2504.19191.

[59]

A Walk with SGD

Chen Xing, Devansh Arpit, Christos Tsirigotis, and Yoshua Bengio. A walk with SGD. arXiv preprint arXiv:1802.08770.

[60]

Zebra-llama: Towards extremely efficient hybrid models

Mingyu Yang, Mehdi Rezagholizadeh, Guihong Li, Vikram Appia, and Emad Barsoum. Zebra-llama: Towards extremely efficient hybrid models. arXiv preprint arXiv:2505.17272.

[61]

Gated Delta Networks: Improving Mamba2 with Delta Rule

Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving mamba2 with delta rule. arXiv preprint arXiv:2412.06464.

[62]

Differential transformer

Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, and Furu Wei. Differential transformer. arXiv preprint arXiv:2410.05258, 2024.

[63]

Lolcats: On low-rank linearizing of large language models

Michael Zhang, Simran Arora, Rahul Chalamala, Alan Wu, Benjamin Spector, Aaryan Singhal, Krithik Ramesh, and Christopher Ré. Lolcats: On low-rank linearizing of large language models. arXiv preprint arXiv:2410.10254.

[64]

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. PyTorch FSDP: Experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277.

[65]

Transformers can achieve length generalization but not robustly

Yongchao Zhou, Uri Alon, Xinyun Chen, Xuezhi Wang, Rishabh Agarwal, and Denny Zhou. Transformers can achieve length generalization but not robustly. arXiv preprint arXiv:2402.09371.

[66]

and Llama 4 to polish the overall writing after drafting the paper ourselves. Additionally, we utilized the same models for vibe coding when fixing bugs or when making the initial drafts of the figures.

C Details for Computational and Memory Costs Comparison

FLOPs per sample. Most parameters are linear weight matrices, so total FLOPs can be estimated by mu...

[67]

and the Mamba architectures (Dao and Gu, 2024). We primarily follow the configurations of Llama 3.2 and Mamba 2 across various model scales. Table 6 summarizes the detailed architectures of the Transformer and Mamba models, which serve as the foundational computational primitives for our hybrid architecture variants. Base Configuration ...

[68]

and the PG19 datasets (Rae et al., 2019b). Additionally, following the settings of prior work (Gemini Team et al., 2024; Ho et al., 2024), we conduct evaluations on the Needle-In-a-Haystack long-context benchmark (Kamradt, 2023; Kuratov et al., 2024). We further measure few-shot accuracy on five benchmarks using the Language Model Evaluation Harness (Gao et al., ...
