
arxiv: 2606.30562 · v1 · pith:7SV6WX6Snew · submitted 2026-06-29 · 💻 cs.CL

Morphing into Hybrid Attention Models

Pith reviewed 2026-06-30 05:54 UTC · model grok-4.3

classification 💻 cs.CL
keywords hybrid attention · layer selection · linear attention · long-context modeling · transformer conversion · attention morphing · FlashMorph

The pith

FlashMorph chooses which layers keep full attention by jointly training layerwise gates on synthetic data rather than relying on heuristics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formulates the choice of which Transformer layers keep full attention during conversion to hybrid models as a budget-constrained subset selection problem. It introduces FlashMorph, which equips each layer with a parallel linear-attention branch, freezes the weights, and optimizes layerwise gates on synthetic long-context retrieval tasks under a linearization penalty that pushes the model toward efficiency. After discretizing the gates to a fixed full-attention budget, the resulting hybrid undergoes standard distillation and finetuning. Experiments indicate the method identifies stronger layer mixes than prior heuristics while preserving recall on long-context tasks and reducing the computational expense of the selection step itself.
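
The budget-constrained subset selection problem mentioned above can be written compactly. The notation here is introduced only for illustration and is not taken from the paper: $N$ is the number of layers, $S$ is the set of layers that keep full attention, $B$ is the full-attention budget, and $\mathcal{L}(S)$ is the task loss of the hybrid defined by $S$. Whether the paper states the constraint as an inequality or an equality is not specified in the abstract; the inequality form is an assumption.

```latex
\min_{S \subseteq \{1,\dots,N\}} \; \mathcal{L}(S)
\quad \text{subject to} \quad |S| \le B
```

Because $\mathcal{L}(S)$ depends on the whole set $S$ rather than on each layer separately, scoring layers one at a time cannot in general recover the minimizer, which is the interdependence the paper argues heuristics overlook.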

Core claim

FlashMorph constructs a morphable model by adding a converted linear-attention branch to every full-attention layer. With all weights frozen, it jointly optimizes layerwise gates on synthetic long-context retrieval data together with a linearization regularization term that encourages reliance on the linear branch. The learned gates are discretized under a preset full-attention budget to produce the hybrid architecture, which is then refined by logits distillation and long-context finetuning. This procedure is shown to yield hybrid configurations that maintain strong long-context recall and general benchmark scores at substantially lower layer-selection cost than existing methods.
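
A minimal sketch of the gated construction described above, in plain Python. The exact gate parameterization and the form of the linearization term are not given in the abstract, so this sketch assumes a sigmoid gate per layer and a penalty equal to the sum of gate values; the names `morphable_layer`, `gate_objective`, and `reg_coeff` are hypothetical.

```python
import math

def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))

def morphable_layer(full_out, linear_out, gate_logit):
    """Blend the two branch outputs with a scalar gate g in (0, 1).

    g near 1 keeps the full-attention branch; g near 0 relies on the
    linear-attention branch. (Assumed form, not the paper's definition.)
    """
    g = sigmoid(gate_logit)
    return [g * f + (1.0 - g) * l for f, l in zip(full_out, linear_out)]

def gate_objective(task_loss, gate_logits, reg_coeff):
    """Task loss plus an assumed penalty on total full-attention gate mass."""
    penalty = sum(sigmoid(a) for a in gate_logits)
    return task_loss + reg_coeff * penalty

# Toy values standing in for branch outputs of one layer.
full_out = [1.0, 2.0]
linear_out = [0.0, 0.0]
mixed = morphable_layer(full_out, linear_out, gate_logit=0.0)  # g = 0.5
loss = gate_objective(task_loss=1.0, gate_logits=[0.0, 0.0], reg_coeff=0.1)
```

In this sketch only the gate logits would receive gradients; the branch weights that produce `full_out` and `linear_out` stay frozen, which matches the description of the gate-optimization stage.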

What carries the argument

Layerwise gates in a morphable model that are jointly optimized on synthetic retrieval data with linearization regularization before discretization under a budget constraint.

If this is right

  • Hybrid configurations discovered by FlashMorph outperform those from fixed patterns or isolated layer scoring.
  • Long-context recall and general benchmark performance remain comparable to the original full-attention model.
  • The computational cost of identifying the hybrid layer set drops substantially relative to prior selection techniques.
  • The same morphable-model construction and gate optimization can be applied at different full-attention budgets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The joint-optimization view of layer interdependencies could be reused for other architecture decisions such as choosing which layers to quantize or prune.
  • Because the method relies on synthetic data, it may enable rapid creation of task-specialized hybrids without access to large labeled corpora.
  • If the learned gates encode global layer interactions, similar differentiable selection could improve efficiency in non-attention components of large models.

Load-bearing premise

Optimizing the gates on synthetic long-context retrieval data with frozen weights and linearization regularization produces gates whose discretization yields a hybrid model that generalizes after distillation and finetuning.

What would settle it

If the hybrid architecture obtained by discretizing FlashMorph gates performs worse than a heuristic-selected hybrid on long-context recall benchmarks after identical distillation and finetuning, the claim of superior layer selection would be falsified.

Original abstract

Hybrid attention models improve long-context efficiency by retaining only a subset of full-attention layers and replacing the remaining layers with linear attention. However, the effectiveness of Transformer-to-hybrid conversion critically depends on which layers preserve full attention. Existing hybrid layer selection methods typically rely on heuristic strategies such as fixed placement patterns or layerwise scoring, implicitly treating layer importance as isolated and overlooking the interdependent layer effect under a global hybrid configuration. In this work, we formulate hybrid layer selection as a budget-constrained subset optimization problem. We further propose FlashMorph (Fast LAyer Selection for Hybrid MORPHing), an effective, efficient and scalable layer selection method for Transformer-to-hybrid conversion. FlashMorph first constructs a morphable model by equipping each full-attention layer with a converted linear-attention branch. It then freezes all model weights and jointly optimizes layerwise gates on synthetic long-context retrieval data, with a linearization regularization that encourages the model to rely on linear attention for efficiency. The learned gates are discretized under a preset full-attention budget to instantiate the hybrid architecture, followed by standard logits distillation and long-context finetuning. Extensive experiments show that FlashMorph discovers more effective hybrid configurations, preserves strong long-context recall and general benchmark performance while substantially reducing layer selection cost compared with existing layer selection methods, demonstrating its effectiveness, efficiency, and scalability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper formulates hybrid layer selection as a budget-constrained subset optimization problem and proposes FlashMorph: equip each layer with a parallel linear-attention branch, freeze weights, jointly optimize continuous layerwise gates on synthetic long-context retrieval data plus a linearization regularization term, discretize under a full-attention budget, then apply logits distillation and long-context finetuning. It claims the resulting hybrids outperform heuristic and scoring-based selections on long-context recall and general benchmarks while lowering selection cost.

Significance. If the central empirical claim holds, the work supplies a scalable, optimization-based alternative to heuristic layer selection that explicitly models inter-layer dependencies under a global budget. The use of synthetic data and explicit regularization is a methodological strength; successful generalization would meaningfully advance practical Transformer-to-hybrid conversion pipelines.

major comments (3)
  1. [Method description (abstract and §3)] The central claim requires that gates optimized on frozen weights and synthetic retrieval data remain superior after discretization, distillation, and long-context finetuning. The procedure described (freeze, optimize, discretize, then adapt) contains no guarantee or ablation that the synthetic optimum aligns with the post-adaptation optimum; the linearization regularizer could bias selections that finetuning later reverses. This is load-bearing for the superiority claim.
  2. [Abstract and §4 (Experiments)] Abstract asserts 'extensive experiments show' superiority and reduced cost, yet supplies no quantitative results, baselines, datasets, number of runs, or error bars. Without these, it is impossible to assess whether the reported gains survive multiple-testing correction or post-hoc configuration choices.
  3. [§3.3 (Discretization)] The discretization step under a preset budget is presented as producing the final hybrid, but no analysis shows that the continuous-gate optimum is stable to the discretization threshold or that alternative discretizations (e.g., top-k by gate value vs. learned threshold) yield materially different post-finetuning performance.
minor comments (2)
  1. [§3] Notation for the gate variables and the linearization regularization coefficient should be introduced with explicit symbols and ranges in the method section rather than only in prose.
  2. [§3.2] The synthetic data construction (retrieval examples) is described at high level; a short appendix table listing prompt length, number of examples, and retrieval accuracy of the frozen model before gate optimization would improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and outline revisions to strengthen the manuscript.

  1. Referee: [Method description (abstract and §3)] The central claim requires that gates optimized on frozen weights and synthetic retrieval data remain superior after discretization, distillation, and long-context finetuning. The procedure described (freeze, optimize, discretize, then adapt) contains no guarantee or ablation that the synthetic optimum aligns with the post-adaptation optimum; the linearization regularizer could bias selections that finetuning later reverses. This is load-bearing for the superiority claim.

    Authors: We agree there is no theoretical guarantee that the synthetic-data optimum will align with the post-adaptation optimum. The claim rests on empirical validation: the final hybrids, after discretization, distillation, and long-context finetuning, outperform the baselines on both long-context recall and general benchmarks. To make this evidence explicit, we will add an ablation that reports performance of the selected configurations immediately after discretization (pre-finetuning) versus after the full adaptation pipeline, and we will compare against the same baselines at both stages. revision: yes

  2. Referee: [Abstract and §4 (Experiments)] Abstract asserts 'extensive experiments show' superiority and reduced cost, yet supplies no quantitative results, baselines, datasets, number of runs, or error bars. Without these, it is impossible to assess whether the reported gains survive multiple-testing correction or post-hoc configuration choices.

    Authors: Section 4 already details the experimental protocol, including the synthetic retrieval datasets, baseline methods (heuristic patterns and layerwise scoring), number of runs, and error bars. The abstract follows the conventional practice of summarizing findings at a high level. We will revise the abstract to include a small number of key quantitative highlights (e.g., average recall improvement and selection-cost reduction) while keeping it concise. revision: yes

  3. Referee: [§3.3 (Discretization)] The discretization step under a preset budget is presented as producing the final hybrid, but no analysis shows that the continuous-gate optimum is stable to the discretization threshold or that alternative discretizations (e.g., top-k by gate value vs. learned threshold) yield materially different post-finetuning performance.

    Authors: We will add a dedicated analysis subsection that examines (i) sensitivity of final performance to small changes in the discretization threshold and (ii) a direct comparison of top-k versus threshold-based discretization, reporting post-finetuning metrics for each variant. This will quantify the stability of the selected configurations. revision: yes
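
The two discretization variants named in the exchange above can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: the gate values are made up, and the function names `topk_discretize` and `threshold_discretize` are hypothetical.

```python
def topk_discretize(gates, budget):
    """Keep full attention in the `budget` layers with the largest gates."""
    ranked = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)
    keep = set(ranked[:budget])
    return [i in keep for i in range(len(gates))]

def threshold_discretize(gates, threshold):
    """Keep full attention wherever the gate meets a fixed threshold."""
    return [g >= threshold for g in gates]

# Made-up gate values for a six-layer model (True = keep full attention).
gates = [0.9, 0.2, 0.55, 0.1, 0.7, 0.45]
by_topk = topk_discretize(gates, budget=2)
by_threshold = threshold_discretize(gates, threshold=0.5)
```

With these values, top-k keeps exactly two layers, while the 0.5 threshold keeps three. A fixed threshold therefore does not by itself enforce a preset budget, which is one reason the requested sensitivity analysis is informative.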

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper describes an empirical optimization procedure (gate learning on frozen weights over synthetic retrieval data, followed by discretization, distillation and finetuning) whose outputs are evaluated on independent benchmarks. No equations, definitions or self-citations reduce the reported performance numbers to quantities defined by the same fitted gates; the central claim rests on post-adaptation experimental results rather than any self-referential reduction.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The approach rests on the premise that synthetic retrieval data plus regularization can proxy real long-context behavior; several hyperparameters (budget, regularization coefficient, gate discretization threshold) are introduced without independent justification in the abstract.

free parameters (2)
  • full-attention budget
    Preset constraint on number of full-attention layers; chosen before optimization.
  • linearization regularization coefficient
    Controls how strongly the model is pushed to use linear branches during gate training.
axioms (2)
  • domain assumption: Layer importance under a hybrid configuration is interdependent and cannot be scored independently.
    Explicitly stated as the motivation for moving beyond layerwise scoring.
  • domain assumption: Synthetic long-context retrieval data is sufficient to learn useful gates.
    Used as the sole training signal for the gates before discretization.



Reference graph

Works this paper leans on

72 extracted references · 46 canonical work pages · 20 internal anchors

  1. [1]

    Language models enable simple systems for generating structured views of heterogeneous data lakes

    Simran Arora, Brandon Yang, Sabri Eyuboglu, Avanika Narayan, Andrew Hojel, Immanuel Trummer, and Christopher Ré. Language models enable simple systems for generating structured views of heterogeneous data lakes. arXiv preprint arXiv:2304.09433, 2023

  2. [2]

    Simple linear attention language models balance the recall-throughput tradeoff

    Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, Dylan Zinsley, James Zou, Atri Rudra, and Christopher Ré. Simple linear attention language models balance the recall-throughput tradeoff. arXiv preprint arXiv:2402.18668, 2024

  3. [3]

    Transformers to ssms: Distilling quadratic knowledge to subquadratic models.Advancesin neural information processing systems, 37:31788–31812, 2024

    Aviv Bick, Kevin Y Li, Eric P Xing, J Zico Kolter, and Albert Gu. Transformers to ssms: Distilling quadratic knowledge to subquadratic models.Advancesin neural information processing systems, 37:31788–31812, 2024

  4. [4]

    Llamba: Scaling distilled recurrent models for efficient language processing.arXiv preprint arXiv:2502.14458, 2025

    Aviv Bick, Tobias Katsch, Nimit Sohoni, Arjun Desai, and Albert Gu. Llamba: Scaling distilled recurrent models for efficient language processing.arXiv preprint arXiv:2502.14458, 2025

  5. [5]

    Retrieval-aware distillation for transformer-ssm hybrids.arXiv preprint arXiv:2602.11374, 2026

    Aviv Bick, Eric P Xing, and Albert Gu. Retrieval-aware distillation for transformer-ssm hybrids.arXiv preprint arXiv:2602.11374, 2026

  6. [6]

    Piqa: Reasoning about physical commonsense in natural language

    Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. InProceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439, 2020

  7. [7]

    Nemotron-h: A family of accurate and efficient hybrid mamba-transformer models.arXiv preprint arXiv:2504.03624, 2025

    Aaron Blakeman, Aarti Basant, Abhinav Khattar, Adithya Renduchintala, Akhiad Bercovich, Aleksander Ficek, Alexis Bjorlin, Ali Taghibakhshi, Amala Sanjay Deshmukh, Ameya Sunil Mahabaleshwarkar, et al. Nemotron-h: A family of accurate and efficient hybrid mamba-transformer models.arXiv preprint arXiv:2504.03624, 2025

  8. [8]

    InternLM2 Technical Report

    Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. Internlm2 technical report.arXiv preprint arXiv:2403.17297, 2024

  9. [9]

    MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, et al. Minimax-m1: Scaling test-time compute efficiently with lightning attention.arXiv preprint arXiv:2506.13585, 2025

  10. [10]

    Dijiang: Efficient large language models through compact kernelization.arXiv preprint arXiv:2403.19928, 2024

    Hanting Chen, Zhicheng Liu, Xutao Wang, Yuchuan Tian, and Yunhe Wang. Dijiang: Efficient large language models through compact kernelization.arXiv preprint arXiv:2403.19928, 2024

  11. [11]

    Hybrid linear attention done right: Efficient distillation and effective architectures for extremely long contexts.arXiv preprint arXiv:2601.22156,

    Yingfa Chen, Zhen Leng Thai, Zihan Zhou, Zhu Zhang, Xingyu Shen, Shuo Wang, Chaojun Xiao, Xu Han, and Zhiyuan Liu. Hybrid linear attention done right: Efficient distillation and effective architectures for extremely long contexts.arXiv preprint arXiv:2601.22156, 2026

  12. [12]

    Metala: Unified optimal linear approximation to softmax attention map.Advances in Neural Information Processing Systems, 37:71034–71067, 2024

    Yuhong Chou, Man Yao, Kexin Wang, Yuqi Pan, Ruijie Zhu, Yiran Zhong, Yu Qiao, Jibin Wu, Bo Xu, and Guoqi Li. Metala: Unified optimal linear approximation to softmax attention map.Advances in Neural Information Processing Systems, 37:71034–71067, 2024

  13. [13]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018

  14. [14]

    Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality.arXiv preprint arXiv:2405.21060, 2024

  15. [15]

    Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

    Soham De, Samuel L Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, et al. Griffin: Mixing gated linear recurrences with local attention for efficient language models.arXiv preprint arXiv:2402.19427, 2024

  16. [16]

    Native Hybrid Attention for Efficient Sequence Modeling

    Jusen Du, Jiaxi Hu, Tao Zhang, Weigao Sun, and Yu Cheng. Native hybrid attention for efficient sequence modeling. arXiv preprint arXiv:2510.07019, 2025

  17. [17]

    Mom: Linear sequence modeling with mixture-of- memories

    Jusen Du, Weigao Sun, Disen Lan, Jiaxi Hu, and Yu Cheng. Mom: Linear sequence modeling with mixture-of- memories. arXiv preprint arXiv:2502.13685, 2025

  18. [18]

    The language model evaluation harness, July 2024

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, 14 Kevin Wang, and Andy Zou. The lang...

  19. [19]

    Zamba: A compact 7b ssm hybrid model.arXiv preprint arXiv:2405.16712, 2024

    Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, and Beren Millidge. Zamba: A compact 7b ssm hybrid model.arXiv preprint arXiv:2405.16712, 2024

  20. [20]

    Radlads: Rapid attention distillation to linear attention decoders at scale.arXiv preprint arXiv:2505.03005, 2025

    Daniel Goldstein, Eric Alcaide, Janna Lu, and Eugene Cheah. Radlads: Rapid attention distillation to linear attention decoders at scale.arXiv preprint arXiv:2505.03005, 2025

  21. [21]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  22. [22]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces.arXiv preprint arXiv:2312.00752, 2023

  23. [23]

    Jet-nemotron: Efficient language model with post neural architecture search.arXiv preprint arXiv:2508.15884, 2025

    Yuxian Gu, Qinghao Hu, Shang Yang, Haocheng Xi, Junyu Chen, Song Han, and Han Cai. Jet-nemotron: Efficient language model with post neural architecture search.arXiv preprint arXiv:2508.15884, 2025

  24. [24]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  25. [25]

    RULER: What's the Real Context Size of Your Long-Context Language Models?

    Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. Ruler: What’s the real context size of your long-context language models?arXiv preprint arXiv:2404.06654, 2024

  26. [26]

    Comba: Improving bilinear rnns with closed-loop control.arXiv preprint arXiv:2506.02475, 2025

    Jiaxi Hu, Yongqi Pan, Jusen Du, Disen Lan, Xiaqiang Tang, Qingsong Wen, Yuxuan Liang, and Weigao Sun. Comba: Improving bilinear rnns with closed-loop control.arXiv preprint arXiv:2506.02475, 2025

  27. [27]

    MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

    Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, et al. Minicpm: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395, 2024

  28. [28]

    Finetuning pretrained transformers into rnns

    Jungo Kasai, Hao Peng, Yizhe Zhang, Dani Yogatama, Gabriel Ilharco, Nikolaos Pappas, Yi Mao, Weizhu Chen, and Noah A Smith. Finetuning pretrained transformers into rnns. InProceedings of the 2021 conference on empirical methods in natural language processing, pages 10630–10643, 2021

  29. [29]

    Transformers are rnns: Fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. InInternational conference on machine learning, pages 5156–

  30. [30]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pages 611–626, 2023

  31. [31]

    Mamba-3: Improved sequence modeling using state space principles.arXiv preprint arXiv:2603.15569, 2026

    Aakash Lahoti, Kevin Y Li, Berlin Chen, Caitlin Wang, Aviv Bick, J Zico Kolter, Tri Dao, and Albert Gu. Mamba-3: Improved sequence modeling using state space principles.arXiv preprint arXiv:2603.15569, 2026

  32. [32]

    Liger: Linearizing large language models to gated recurrent structures.arXiv preprint arXiv:2503.01496, 2025

    Disen Lan, Weigao Sun, Jiaxi Hu, Jusen Du, and Yu Cheng. Liger: Linearizing large language models to gated recurrent structures.arXiv preprint arXiv:2503.01496, 2025

  33. [33]

    Datacomp-lm: In search of the next generation of training sets for language models

    Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, et al. Datacomp-lm: In search of the next generation of training sets for language models. Advancesin Neural Information Processing Systems, 37:14200–14282, 2024

  34. [34]

    Distilling to hybrid attention models via kl-guided layer selection.arXiv preprint arXiv:2512.20569, 2025

    Yanhong Li, Songlin Yang, Shawn Tan, Mayank Mishra, Rameswar Panda, Jiawei Zhou, and Yoon Kim. Distilling to hybrid attention models via kl-guided layer selection.arXiv preprint arXiv:2512.20569, 2025

  35. [35]

    Jamba: A Hybrid Transformer-Mamba Language Model

    Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, et al. Jamba: A hybrid transformer-mamba language model.arXiv preprint arXiv:2403.19887, 2024

  36. [36]

    Lycheedecode: Accelerating long-context llm inference via hybrid-head sparse decoding.arXiv preprint arXiv:2602.04541, 2026

    Gang Lin, Dongfang Li, Zhuoen Chen, Yukun Shi, Xuhui Chen, Baotian Hu, and Min Zhang. Lycheedecode: Accelerating long-context llm inference via hybrid-head sparse decoding.arXiv preprint arXiv:2602.04541, 2026. 15

  37. [37]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

  38. [38]

    Openceres: When open information extraction meets the semi-structured web

    Colin Lockard, Prashant Shiralkar, and Xin Luna Dong. Openceres: When open information extraction meets the semi-structured web. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,Volume1 (Long and Short Papers), pages 3047–3056, 2019

  39. [39]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

  40. [40]

    Linearizing large language models.arXiv preprint arXiv:2405.06640, 2024

    Jean Mercat, Igor Vasiljevic, Sedrick Keh, Kushal Arora, Achal Dave, Adrien Gaidon, and Thomas Kollar. Linearizing large language models.arXiv preprint arXiv:2405.06640, 2024

  41. [41]

    Olmo Hybrid: From Theory to Practice and Back

    William Merrill, Yanhong Li, Tyler Romero, Anej Svete, Caia Costello, Pradeep Dasigi, Dirk Groeneveld, David Heineman, Bailey Kuehl, Nathan Lambert, et al. Olmo hybrid: From theory to practice and back.arXiv preprint arXiv:2604.03444, 2026

  42. [42]

    Thinking slow, fast: Scaling inference compute with distilled reasoners.arXiv preprint arXiv:2502.20339, 2025

    Daniele Paliotta, Junxiong Wang, Matteo Pagliardini, Kevin Y Li, Aviv Bick, J Zico Kolter, Albert Gu, François Fleuret, and Tri Dao. Thinking slow, fast: Scaling inference compute with distilled reasoners.arXiv preprint arXiv:2502.20339, 2025

  43. [43]

    Rwkv: Reinventing rnns for the transformer era

    Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Leon Derczynski, et al. Rwkv: Reinventing rnns for the transformer era. InFindings of the association for computational linguistics: EMNLP 2023, pages 14048–14077, 2023

  44. [44]

    Yarn: Efficient context window extension of large language models

    Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models. InInternationalConference on Learning Representations, volume 2024, pages 31932–31951, 2024

  45. [45]

    Hierarchically gated recurrent neural network for sequence modeling

    Zhen Qin, Songlin Yang, and Yiran Zhong. Hierarchically gated recurrent neural network for sequence modeling. Advancesin Neural Information Processing Systems, 36:33202–33221, 2023

  46. [46]

    Lightning attention-2: A free lunch for handling unlimited sequence lengths in large language models.arXiv preprint arXiv:2401.04658, 2024

    Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, and Yiran Zhong. Lightning attention-2: A free lunch for handling unlimited sequence lengths in large language models.arXiv preprint arXiv:2401.04658, 2024

  47. [47]

    Hgrn2: Gated linear rnns with state expansion.arXiv preprint arXiv:2404.07904, 2024

    Zhen Qin, Songlin Yang, Weixuan Sun, Xuyang Shen, Dong Li, Weigao Sun, and Yiran Zhong. Hgrn2: Gated linear rnns with state expansion.arXiv preprint arXiv:2404.07904, 2024

  48. [48]

    Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free.Advances in Neural Information Processing Systems, 38:100092–100118, 2026

    Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, et al. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free.Advances in Neural Information Processing Systems, 38:100092–100118, 2026

  49. [49]

    Qwen3-coder-next technical report

    Qwen Team. Qwen3-coder-next technical report. Technical report. URL https://github.com/QwenLM/ Qwen3-Coder/blob/main/qwen3_coder_next_tech_report.pdf. Accessed: 2026-02-03

  50. [50]

    Qwen3.5: Towards native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URLhttps://qwen.ai/blog?id= qwen3.5

  51. [51]

    Know what you don’t know: Unanswerable questions for squad

    Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable questions for squad. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, 2018

  52. [52]

    Samba: Simple hybrid state space models for efficient unlimited context language modeling

    Liliang Ren, Yang Liu, Yadong Lu, Chen Liang, Weizhu Chen, et al. Samba: Simple hybrid state space models for efficient unlimited context language modeling. InInternational Conference on Learning Representations, volume 2025, pages 53551–53575, 2025

  53. [53]

    Winogrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, 2021

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, 2021

  54. [54]

    Speed always wins: A survey on efficient architectures for large language models.arXiv preprint arXiv:2508.09834, 2025

    Weigao Sun, Jiaxi Hu, Yucheng Zhou, Jusen Du, Disen Lan, Kexin Wang, Tong Zhu, Xiaoye Qu, Yu Zhang, Xiaoyu Mo, et al. Speed always wins: A survey on efficient architectures for large language models.arXiv preprint arXiv:2508.09834, 2025

  55. [55]

    Linear-moe: Linear sequence modeling meets mixture-of-experts

    Weigao Sun, Disen Lan, Tong Zhu, Xiaoye Qu, and Yu Cheng. Linear-moe: Linear sequence modeling meets mixture-of-experts. arXiv preprint arXiv:2503.05447, 2025. 16

  56. [56]

    Kimi Linear: An Expressive, Efficient Attention Architecture

    Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, et al. Kimi linear: An expressive, efficient attention architecture.arXiv preprint arXiv:2510.26692, 2025

  57. [57]

    Attention is all you need.Advancesin neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advancesin neural information processing systems, 30, 2017

  58. [58]

    A Systematic Analysis of Hybrid Linear Attention

    Dustin Wang, Rui-Jie Zhu, Steven Abreu, Yong Shan, Taylor Kergan, Yuqi Pan, Yuhong Chou, Zheng Li, Ge Zhang, Wenhao Huang, et al. A systematic analysis of hybrid linear attention. arXiv preprint arXiv:2507.06457, 2025

  59. [59]

    The mamba in the llama: Distilling and accelerating hybrid models

    Junxiong Wang, Daniele Paliotta, Avner May, Alexander M Rush, and Tri Dao. The mamba in the llama: Distilling and accelerating hybrid models. Advances in Neural Information Processing Systems, 37:62432–62457, 2024

  60. [60]

    Rnns are not transformers (yet): The key bottleneck on in-context retrieval

    Kaiyue Wen, Xingyu Dang, and Kaifeng Lyu. Rnns are not transformers (yet): The key bottleneck on in-context retrieval. In International Conference on Learning Representations, volume 2025, pages 48813–48856, 2025

  61. [61]

    Duoattention: Efficient long-context llm inference with retrieval and streaming heads

    Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, and Song Han. Duoattention: Efficient long-context llm inference with retrieval and streaming heads. In International Conference on Learning Representations, volume 2025, pages 37228–37253, 2025

  62. [62]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  63. [63]

    Zebra-llama: Towards extremely efficient hybrid models

    Mingyu Yang, Mehdi Rezagholizadeh, Guihong Li, Vikram Appia, and Emad Barsoum. Zebra-llama: Towards extremely efficient hybrid models. Advances in Neural Information Processing Systems, 38:78167–78194, 2026

  64. [64]

    Fla: A triton-based library for hardware-efficient implementations of linear attention mechanism

    Songlin Yang and Yu Zhang. Fla: A triton-based library for hardware-efficient implementations of linear attention mechanism, January 2024. URL https://github.com/fla-org/flash-linear-attention

  65. [65]

    Gated Linear Attention Transformers with Hardware-Efficient Training

    Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training. arXiv preprint arXiv:2312.06635, 2023

  66. [66]

    Gated Delta Networks: Improving Mamba2 with Delta Rule

    Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving mamba2 with delta rule. arXiv preprint arXiv:2412.06464, 2024

  67. [67]

    Parallelizing linear transformers with the delta rule over sequence length

    Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing linear transformers with the delta rule over sequence length. Advances in neural information processing systems, 37:115491–115522, 2024

  68. [68]

    Hellaswag: Can a machine really finish your sentence?

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th annual meeting of the association for computational linguistics, pages 4791–4800, 2019

  69. [69]

    Lolcats: On low-rank linearizing of large language models

    Michael Zhang, Simran Arora, Rahul Chalamala, Alan Wu, Benjamin Spector, Aaryan Singhal, Krithik Ramesh, and Christopher Ré. Lolcats: On low-rank linearizing of large language models. arXiv preprint arXiv:2410.10254, 2024

  70. [70]

    The hedgehog & the porcupine: Expressive linear attentions with softmax mimicry

    Michael Zhang, Kush Bhatia, Hermann Kumbong, and Christopher Ré. The hedgehog & the porcupine: Expressive linear attentions with softmax mimicry. arXiv preprint arXiv:2402.04347, 2024

  71. [71]

    Gated slot attention for efficient linear-time sequence modeling

    Yu Zhang, Songlin Yang, Ruijie Zhu, Yue Zhang, Leyang Cui, Yiqiao Wang, Bolun Wang, Freda Shi, Bailin Wang, Wei Bi, et al. Gated slot attention for efficient linear-time sequence modeling. Advances in Neural Information Processing Systems, 37:116870–116898, 2024

  72. [72]

    Falcon-h1: A family of hybrid-head language models redefining efficiency and performance

    Jingwei Zuo, Maksim Velikanov, Ilyas Chahed, Younes Belkada, Dhia Eddine Rhayem, Guillaume Kunsch, Hakim Hacid, Hamza Yous, Brahim Farhat, Ibrahim Khadraoui, et al. Falcon-h1: A family of hybrid-head language models redefining efficiency and performance. arXiv preprint arXiv:2507.22448, 2025

Appendix A Model and Training Configuration

We report the com...