Morphing into Hybrid Attention Models
Pith reviewed 2026-06-30 05:54 UTC · model grok-4.3
The pith
FlashMorph chooses which layers keep full attention by jointly training layerwise gates on synthetic data, rather than relying on heuristics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FlashMorph constructs a morphable model by adding a converted linear-attention branch to every full-attention layer. With all weights frozen, it jointly optimizes layerwise gates on synthetic long-context retrieval data together with a linearization regularization term that encourages reliance on the linear branch. The learned gates are discretized under a preset full-attention budget to produce the hybrid architecture, which is then refined by logits distillation and long-context finetuning. This procedure is shown to yield hybrid configurations that maintain strong long-context recall and general benchmark scores at substantially lower layer-selection cost than existing methods.
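One way to write the pipeline down, as a sketch only: the symbols below (L layers, budget B, gates g_l, coefficient \lambda, branch outputs FA_l and LA_l) are assumed notation for illustration, and the convex-combination form of the gated layer is one plausible reading of "layerwise gates", not something the abstract specifies.

```latex
% Assumed notation, not taken from the paper:
%   L = number of layers, B = full-attention budget,
%   g_l in [0,1] = weight on the full-attention branch of layer l,
%   \lambda = linearization regularization coefficient.
\begin{align}
  S^\star &= \arg\min_{S \subseteq \{1,\dots,L\},\; |S| \le B} \mathcal{L}_{\text{retr}}(S)
    && \text{(budget-constrained subset selection)} \\
  h_l &= g_l \,\mathrm{FA}_l(x_l) + (1 - g_l)\,\mathrm{LA}_l(x_l)
    && \text{(gated morphable layer, one possible form)} \\
  g^\star &= \arg\min_{g \in [0,1]^L} \mathcal{L}_{\text{retr}}(g) + \lambda \sum_{l=1}^{L} g_l
    && \text{(relaxation with a linearization penalty)} \\
  \hat{S} &= \operatorname{Top}_B(g^\star)
    && \text{(discretization under the budget)}
\end{align}
```

Under this reading, the penalty term is what pushes gates toward the linear branch, and only layers whose removal hurts the retrieval loss enough keep a large gate.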
What carries the argument
Layerwise gates in a morphable model that are jointly optimized on synthetic retrieval data with linearization regularization before discretization under a budget constraint.
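A toy sketch of that mechanism follows. It assumes a sigmoid gate per layer and replaces the retrieval loss with a hypothetical quadratic surrogate, so the `importance` values, the loss form, and the function name `optimize_gates` are all illustrative inventions rather than the paper's setup. The surrogate is also separable across layers, so it shows the pull between task loss and linearization penalty but not the inter-layer interactions the paper emphasizes.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def optimize_gates(importance, lam, steps=2000, lr=0.5):
    """Gradient descent on gate logits for a toy surrogate objective.

    Toy loss (an assumption, not the paper's retrieval loss):
        sum_l importance[l] * (1 - g_l)^2  +  lam * sum_l g_l
    where g_l = sigmoid(logit_l) is the weight on the full-attention branch.
    The `importance` values play the role of frozen model weights.
    """
    logits = [0.0] * len(importance)  # every gate starts at 0.5
    for _ in range(steps):
        for l, w in enumerate(importance):
            g = sigmoid(logits[l])
            # Task term pulls g up; the linearization penalty pushes g down.
            dloss_dg = -2.0 * w * (1.0 - g) + lam
            # Chain rule through the sigmoid: dg/dlogit = g * (1 - g).
            logits[l] -= lr * dloss_dg * g * (1.0 - g)
    return [sigmoid(a) for a in logits]


# Hypothetical per-layer sensitivities: layers 0 and 2 matter for retrieval.
importance = [2.0, 0.05, 1.0, 0.01]
gates = optimize_gates(importance, lam=0.2)
```

In this toy, layers 0 and 2 end with gates close to 1 while layers 1 and 3 drift toward 0, so a budget of two full-attention layers would keep layers 0 and 2.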
If this is right
- Hybrid configurations discovered by FlashMorph outperform those from fixed patterns or isolated layer scoring.
- Long-context recall and general benchmark performance remain comparable to the original full-attention model.
- The computational cost of identifying the hybrid layer set drops substantially relative to prior selection techniques.
- The same morphable-model construction and gate optimization can be applied at different full-attention budgets.
Where Pith is reading between the lines
- The joint-optimization view of layer interdependencies could be reused for other architecture decisions such as choosing which layers to quantize or prune.
- Because the method relies on synthetic data, it may enable rapid creation of task-specialized hybrids without access to large labeled corpora.
- If the learned gates encode global layer interactions, similar differentiable selection could improve efficiency in non-attention components of large models.
Load-bearing premise
Optimizing the gates on synthetic long-context retrieval data with frozen weights and linearization regularization produces gates whose discretization yields a hybrid model that generalizes after distillation and finetuning.
What would settle it
If the hybrid architecture obtained by discretizing FlashMorph gates performs worse than a heuristic-selected hybrid on long-context recall benchmarks after identical distillation and finetuning, the claim of superior layer selection would be falsified.
Original abstract
Hybrid attention models improve long-context efficiency by retaining only a subset of full-attention layers and replacing the remaining layers with linear attention. However, the effectiveness of Transformer-to-hybrid conversion critically depends on which layers preserve full attention. Existing hybrid layer selection methods typically rely on heuristic strategies such as fixed placement patterns or layerwise scoring, implicitly treating layer importance as isolated and overlooking the interdependent layer effect under a global hybrid configuration. In this work, we formulate hybrid layer selection as a budget-constrained subset optimization problem. We further propose FlashMorph (Fast LAyer Selection for Hybrid MORPHing), an effective, efficient and scalable layer selection method for Transformer-to-hybrid conversion. FlashMorph first constructs a morphable model by equipping each full-attention layer with a converted linear-attention branch. It then freezes all model weights and jointly optimizes layerwise gates on synthetic long-context retrieval data, with a linearization regularization that encourages the model to rely on linear attention for efficiency. The learned gates are discretized under a preset full-attention budget to instantiate the hybrid architecture, followed by standard logits distillation and long-context finetuning. Extensive experiments show that FlashMorph discovers more effective hybrid configurations, preserves strong long-context recall and general benchmark performance while substantially reducing layer selection cost compared with existing layer selection methods, demonstrating its effectiveness, efficiency, and scalability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formulates hybrid layer selection as a budget-constrained subset optimization problem and proposes FlashMorph: equip each layer with a parallel linear-attention branch, freeze weights, jointly optimize continuous layerwise gates on synthetic long-context retrieval data plus a linearization regularization term, discretize under a full-attention budget, then apply logits distillation and long-context finetuning. It claims the resulting hybrids outperform heuristic and scoring-based selections on long-context recall and general benchmarks while lowering selection cost.
Significance. If the central empirical claim holds, the work supplies a scalable, optimization-based alternative to heuristic layer selection that explicitly models inter-layer dependencies under a global budget. The use of synthetic data and explicit regularization is a methodological strength; successful generalization would meaningfully advance practical Transformer-to-hybrid conversion pipelines.
major comments (3)
- [Method description (abstract and §3)] The central claim requires that gates optimized on frozen weights and synthetic retrieval data remain superior after discretization, distillation, and long-context finetuning. The procedure described (freeze, optimize, discretize, then adapt) contains no guarantee or ablation that the synthetic optimum aligns with the post-adaptation optimum; the linearization regularizer could bias selections that finetuning later reverses. This is load-bearing for the superiority claim.
- [Abstract and §4 (Experiments)] Abstract asserts 'extensive experiments show' superiority and reduced cost, yet supplies no quantitative results, baselines, datasets, number of runs, or error bars. Without these, it is impossible to assess whether the reported gains survive multiple-testing correction or post-hoc configuration choices.
- [§3.3 (Discretization)] The discretization step under a preset budget is presented as producing the final hybrid, but no analysis shows that the continuous-gate optimum is stable to the discretization threshold or that alternative discretizations (e.g., top-k by gate value vs. learned threshold) yield materially different post-finetuning performance.
minor comments (2)
- [§3] Notation for the gate variables and the linearization regularization coefficient should be introduced with explicit symbols and ranges in the method section rather than only in prose.
- [§3.2] The synthetic data construction (retrieval examples) is described at high level; a short appendix table listing prompt length, number of examples, and retrieval accuracy of the frozen model before gate optimization would improve reproducibility.
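For readers unfamiliar with synthetic retrieval data, a generic key-value lookup generator gives the flavor. This is a minimal sketch under assumptions: the paper's actual construction is not described here, and the function name `make_retrieval_example`, the prompt wording, and the five-digit keys are all hypothetical.

```python
import random


def make_retrieval_example(num_pairs, seed):
    """Build one key-value retrieval prompt (hypothetical format).

    Context length grows with `num_pairs`; the answer is the value of one
    randomly chosen key that appears somewhere in the context.
    """
    rng = random.Random(seed)  # local RNG so examples are reproducible
    keys = rng.sample(range(10_000, 100_000), num_pairs)  # distinct keys
    values = [rng.randrange(10_000, 100_000) for _ in keys]
    lines = [f"key {k} maps to value {v}." for k, v in zip(keys, values)]
    target = rng.randrange(num_pairs)
    prompt = " ".join(lines) + f" What value does key {keys[target]} map to?"
    return {"prompt": prompt, "answer": str(values[target])}


example = make_retrieval_example(num_pairs=50, seed=0)
```

Reporting `num_pairs`, the resulting prompt length, and the frozen model's accuracy on such examples would cover the table the referee asks for.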
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below and outline revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [Method description (abstract and §3)] The central claim requires that gates optimized on frozen weights and synthetic retrieval data remain superior after discretization, distillation, and long-context finetuning. The procedure described (freeze, optimize, discretize, then adapt) contains no guarantee or ablation that the synthetic optimum aligns with the post-adaptation optimum; the linearization regularizer could bias selections that finetuning later reverses. This is load-bearing for the superiority claim.
  Authors: We agree there is no theoretical guarantee that the synthetic-data optimum will align with the post-adaptation optimum. The claim rests on empirical validation: the final hybrids, after discretization, distillation, and long-context finetuning, outperform the baselines on both long-context recall and general benchmarks. To make this evidence explicit, we will add an ablation that reports performance of the selected configurations immediately after discretization (pre-finetuning) versus after the full adaptation pipeline, and we will compare against the same baselines at both stages.
  Revision: yes
- Referee: [Abstract and §4 (Experiments)] Abstract asserts 'extensive experiments show' superiority and reduced cost, yet supplies no quantitative results, baselines, datasets, number of runs, or error bars. Without these, it is impossible to assess whether the reported gains survive multiple-testing correction or post-hoc configuration choices.
  Authors: Section 4 already details the experimental protocol, including the synthetic retrieval datasets, baseline methods (heuristic patterns and layerwise scoring), number of runs, and error bars. The abstract follows the conventional practice of summarizing findings at a high level. We will revise the abstract to include a small number of key quantitative highlights (e.g., average recall improvement and selection-cost reduction) while keeping it concise.
  Revision: yes
- Referee: [§3.3 (Discretization)] The discretization step under a preset budget is presented as producing the final hybrid, but no analysis shows that the continuous-gate optimum is stable to the discretization threshold or that alternative discretizations (e.g., top-k by gate value vs. learned threshold) yield materially different post-finetuning performance.
  Authors: We will add a dedicated analysis subsection that examines (i) sensitivity of final performance to small changes in the discretization threshold and (ii) a direct comparison of top-k versus threshold-based discretization, reporting post-finetuning metrics for each variant. This will quantify the stability of the selected configurations.
  Revision: yes
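The two discretization rules under discussion, and a simple stability check, can be sketched as follows. The gate values and the helper names (`topk_select`, `threshold_select`, `topk_is_stable`) are hypothetical, and the stability test is a generic margin argument, not an analysis from the paper.

```python
def topk_select(gates, budget):
    """Keep full attention in the `budget` layers with the largest gates."""
    ranked = sorted(range(len(gates)), key=lambda l: gates[l], reverse=True)
    return sorted(ranked[:budget])


def threshold_select(gates, tau):
    """Keep full attention wherever the gate is at least `tau`."""
    return [l for l, g in enumerate(gates) if g >= tau]


def topk_is_stable(gates, budget, eps):
    """True if the top-k set cannot change when each gate moves by <= eps.

    Two gates can swap rank only if they start within 2 * eps of each other,
    so the gap at the budget boundary must exceed 2 * eps.
    """
    if budget <= 0 or budget >= len(gates):
        return True  # nothing to swap across the boundary
    ordered = sorted(gates, reverse=True)
    return ordered[budget - 1] - ordered[budget] > 2 * eps


# Hypothetical learned gates for a six-layer model.
gates = [0.95, 0.01, 0.90, 0.52, 0.48, 0.03]

print(topk_select(gates, budget=3))             # [0, 2, 3]
print(threshold_select(gates, tau=0.5))         # [0, 2, 3]
print(threshold_select(gates, tau=0.45))        # [0, 2, 3, 4], over a budget of 3
print(topk_is_stable(gates, budget=3, eps=0.05))  # False: 0.52 vs 0.48 is too close
print(topk_is_stable(gates, budget=2, eps=0.05))  # True: 0.90 vs 0.52 is a wide gap
```

Top-k always meets the budget exactly, whereas a threshold can overshoot or undershoot it; the margin check flags budgets where the selected set hinges on nearly tied gates, which is where the requested sensitivity analysis would matter most.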
Circularity Check
No circularity in derivation chain
The paper describes an empirical optimization procedure (gate learning on frozen weights over synthetic retrieval data, followed by discretization, distillation and finetuning) whose outputs are evaluated on independent benchmarks. No equations, definitions or self-citations reduce the reported performance numbers to quantities defined by the same fitted gates; the central claim rests on post-adaptation experimental results rather than any self-referential reduction.
Axiom & Free-Parameter Ledger
free parameters (2)
- full-attention budget
- linearization regularization coefficient
axioms (2)
- Domain assumption: layer importance under a hybrid configuration is interdependent and cannot be scored independently.
- Domain assumption: synthetic long-context retrieval data is sufficient to learn useful gates.