Morphing into Hybrid Attention Models

Disen Lan; Jianbin Zheng; Xin Xia; Xipeng Qiu; Xuanda Wang; Xuefeng Xiao; Yu Cheng; Yuxi Ren

arxiv: 2606.30562 · v1 · pith:7SV6WX6Snew · submitted 2026-06-29 · 💻 cs.CL

Morphing into Hybrid Attention Models

Disen Lan , Jianbin Zheng , Yuxi Ren , Xin Xia , Xuanda Wang , Xuefeng Xiao , Xipeng Qiu , Yu Cheng This is my paper

Pith reviewed 2026-06-30 05:54 UTC · model grok-4.3

classification 💻 cs.CL

keywords hybrid attentionlayer selectionlinear attentionlong-context modelingtransformer conversionattention morphingFlashMorph

0 comments

The pith

FlashMorph optimizes hybrid attention layer selection by jointly training gates on synthetic data instead of heuristics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formulates the choice of which Transformer layers keep full attention during conversion to hybrid models as a budget-constrained subset selection problem. It introduces FlashMorph, which equips each layer with a parallel linear-attention branch, freezes the weights, and optimizes layerwise gates on synthetic long-context retrieval tasks under a linearization penalty that pushes the model toward efficiency. After discretizing the gates to a fixed full-attention budget, the resulting hybrid undergoes standard distillation and finetuning. Experiments indicate the method identifies stronger layer mixes than prior heuristics while preserving recall on long-context tasks and reducing the computational expense of the selection step itself.

Core claim

FlashMorph constructs a morphable model by adding a converted linear-attention branch to every full-attention layer. With all weights frozen, it jointly optimizes layerwise gates on synthetic long-context retrieval data together with a linearization regularization term that encourages reliance on the linear branch. The learned gates are discretized under a preset full-attention budget to produce the hybrid architecture, which is then refined by logits distillation and long-context finetuning. This procedure is shown to yield hybrid configurations that maintain strong long-context recall and general benchmark scores at substantially lower layer-selection cost than existing methods.

What carries the argument

Layerwise gates in a morphable model that are jointly optimized on synthetic retrieval data with linearization regularization before discretization under a budget constraint.

If this is right

Hybrid configurations discovered by FlashMorph outperform those from fixed patterns or isolated layer scoring.
Long-context recall and general benchmark performance remain comparable to the original full-attention model.
The computational cost of identifying the hybrid layer set drops substantially relative to prior selection techniques.
The same morphable-model construction and gate optimization can be applied at different full-attention budgets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The joint-optimization view of layer interdependencies could be reused for other architecture decisions such as choosing which layers to quantize or prune.
Because the method relies on synthetic data, it may enable rapid creation of task-specialized hybrids without access to large labeled corpora.
If the learned gates encode global layer interactions, similar differentiable selection could improve efficiency in non-attention components of large models.

Load-bearing premise

Optimizing the gates on synthetic long-context retrieval data with frozen weights and linearization regularization produces gates whose discretization yields a hybrid model that generalizes after distillation and finetuning.

What would settle it

If the hybrid architecture obtained by discretizing FlashMorph gates performs worse than a heuristic-selected hybrid on long-context recall benchmarks after identical distillation and finetuning, the claim of superior layer selection would be falsified.

read the original abstract

Hybrid attention models improve long-context efficiency by retaining only a subset of full-attention layers and replacing the remaining layers with linear attention. However, the effectiveness of Transformer-to-hybrid conversion critically depends on which layers preserve full attention. Existing hybrid layer selection methods typically rely on heuristic strategies such as fixed placement patterns or layerwise scoring, implicitly treating layer importance as isolated and overlooking the interdependent layer effect under a global hybrid configuration. In this work, we formulate hybrid layer selection as a budget-constrained subset optimization problem. We further propose FlashMorph (Fast LAyer Selection for Hybrid MORPHing), an effective, efficient and scalable layer selection method for Transformer-to-hybrid conversion. FlashMorph first constructs a morphable model by equipping each full-attention layer with a converted linear-attention branch. It then freezes all model weights and jointly optimizes layerwise gates on synthetic long-context retrieval data, with a linearization regularization that encourages the model to rely on linear attention for efficiency. The learned gates are discretized under a preset full-attention budget to instantiate the hybrid architecture, followed by standard logits distillation and long-context finetuning. Extensive experiments show that FlashMorph discovers more effective hybrid configurations, preserves strong long-context recall and general benchmark performance while substantially reducing layer selection cost compared with existing layer selection methods, demonstrating its effectiveness, efficiency, and scalability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FlashMorph frames hybrid layer selection as joint gate optimization on synthetic data, which is a cleaner formulation than prior heuristics, but the abstract gives no numbers so the performance claims stay unverified.

read the letter

The main thing here is a shift from fixed patterns or per-layer scores to treating the choice of full-attention layers as a single budget-constrained subset problem. They build a morphable model with parallel linear branches, freeze the weights, train continuous gates on synthetic retrieval examples plus a linearization penalty, then discretize and distill. That joint optimization step plus the regularization is the concrete difference from the methods cited in the abstract.

The approach is straightforward and the synthetic-data stage looks cheap enough to be practical. If the gates really do pick better subsets than scoring baselines once the model is allowed to adapt, it would be a useful recipe for people converting existing long-context models.

The soft spot is exactly the one in the stress-test note. Optimizing gates while everything is frozen on synthetic data does not automatically guarantee that the same subset stays best after logits distillation and long-context finetuning. The regularization could push toward linear layers in ways that later training undoes or compensates for. Without seeing the actual tables, ablations, or error bars, there is no way to tell whether the claimed gains survive that second stage or whether they are artifacts of the synthetic setup.

This is the kind of paper that belongs in a reading group for people working on efficient Transformers. It deserves a serious referee because the formulation is explicit and the procedure is reproducible in principle, even if the current evidence is thin. I would send it out rather than desk-reject, but I would ask the authors for the missing experimental details before deciding how much weight to give the results.

Referee Report

3 major / 2 minor

Summary. The paper formulates hybrid layer selection as a budget-constrained subset optimization problem and proposes FlashMorph: equip each layer with a parallel linear-attention branch, freeze weights, jointly optimize continuous layerwise gates on synthetic long-context retrieval data plus a linearization regularization term, discretize under a full-attention budget, then apply logits distillation and long-context finetuning. It claims the resulting hybrids outperform heuristic and scoring-based selections on long-context recall and general benchmarks while lowering selection cost.

Significance. If the central empirical claim holds, the work supplies a scalable, optimization-based alternative to heuristic layer selection that explicitly models inter-layer dependencies under a global budget. The use of synthetic data and explicit regularization is a methodological strength; successful generalization would meaningfully advance practical Transformer-to-hybrid conversion pipelines.

major comments (3)

[Method description (abstract and §3)] The central claim requires that gates optimized on frozen weights and synthetic retrieval data remain superior after discretization, distillation, and long-context finetuning. The procedure described (freeze, optimize, discretize, then adapt) contains no guarantee or ablation that the synthetic optimum aligns with the post-adaptation optimum; the linearization regularizer could bias selections that finetuning later reverses. This is load-bearing for the superiority claim.
[Abstract and §4 (Experiments)] Abstract asserts 'extensive experiments show' superiority and reduced cost, yet supplies no quantitative results, baselines, datasets, number of runs, or error bars. Without these, it is impossible to assess whether the reported gains survive multiple-testing correction or post-hoc configuration choices.
[§3.3 (Discretization)] The discretization step under a preset budget is presented as producing the final hybrid, but no analysis shows that the continuous-gate optimum is stable to the discretization threshold or that alternative discretizations (e.g., top-k by gate value vs. learned threshold) yield materially different post-finetuning performance.

minor comments (2)

[§3] Notation for the gate variables and the linearization regularization coefficient should be introduced with explicit symbols and ranges in the method section rather than only in prose.
[§3.2] The synthetic data construction (retrieval examples) is described at high level; a short appendix table listing prompt length, number of examples, and retrieval accuracy of the frozen model before gate optimization would improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and outline revisions to strengthen the manuscript.

read point-by-point responses

Referee: [Method description (abstract and §3)] The central claim requires that gates optimized on frozen weights and synthetic retrieval data remain superior after discretization, distillation, and long-context finetuning. The procedure described (freeze, optimize, discretize, then adapt) contains no guarantee or ablation that the synthetic optimum aligns with the post-adaptation optimum; the linearization regularizer could bias selections that finetuning later reverses. This is load-bearing for the superiority claim.

Authors: We agree there is no theoretical guarantee that the synthetic-data optimum will align with the post-adaptation optimum. The claim rests on empirical validation: the final hybrids, after discretization, distillation, and long-context finetuning, outperform the baselines on both long-context recall and general benchmarks. To make this evidence explicit, we will add an ablation that reports performance of the selected configurations immediately after discretization (pre-finetuning) versus after the full adaptation pipeline, and we will compare against the same baselines at both stages. revision: yes
Referee: [Abstract and §4 (Experiments)] Abstract asserts 'extensive experiments show' superiority and reduced cost, yet supplies no quantitative results, baselines, datasets, number of runs, or error bars. Without these, it is impossible to assess whether the reported gains survive multiple-testing correction or post-hoc configuration choices.

Authors: Section 4 already details the experimental protocol, including the synthetic retrieval datasets, baseline methods (heuristic patterns and layerwise scoring), number of runs, and error bars. The abstract follows the conventional practice of summarizing findings at a high level. We will revise the abstract to include a small number of key quantitative highlights (e.g., average recall improvement and selection-cost reduction) while keeping it concise. revision: yes
Referee: [§3.3 (Discretization)] The discretization step under a preset budget is presented as producing the final hybrid, but no analysis shows that the continuous-gate optimum is stable to the discretization threshold or that alternative discretizations (e.g., top-k by gate value vs. learned threshold) yield materially different post-finetuning performance.

Authors: We will add a dedicated analysis subsection that examines (i) sensitivity of final performance to small changes in the discretization threshold and (ii) a direct comparison of top-k versus threshold-based discretization, reporting post-finetuning metrics for each variant. This will quantify the stability of the selected configurations. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper describes an empirical optimization procedure (gate learning on frozen weights over synthetic retrieval data, followed by discretization, distillation and finetuning) whose outputs are evaluated on independent benchmarks. No equations, definitions or self-citations reduce the reported performance numbers to quantities defined by the same fitted gates; the central claim rests on post-adaptation experimental results rather than any self-referential reduction.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The approach rests on the premise that synthetic retrieval data plus regularization can proxy real long-context behavior; several hyperparameters (budget, regularization coefficient, gate discretization threshold) are introduced without independent justification in the abstract.

free parameters (2)

full-attention budget
Preset constraint on number of full-attention layers; chosen before optimization.
linearization regularization coefficient
Controls how strongly the model is pushed to use linear branches during gate training.

axioms (2)

domain assumption Layer importance under hybrid configuration is interdependent and cannot be scored independently
Explicitly stated as the motivation for moving beyond layerwise scoring.
domain assumption Synthetic long-context retrieval data is sufficient to learn useful gates
Used as the sole training signal for the gates before discretization.

pith-pipeline@v0.9.1-grok · 5778 in / 1372 out tokens · 25755 ms · 2026-06-30T05:54:59.459661+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

72 extracted references · 46 canonical work pages · 20 internal anchors

[1]

Language models enable simple systems for generating structured views of heterogeneous data lakes

Simran Arora, Brandon Yang, Sabri Eyuboglu, Avanika Narayan, Andrew Hojel, Immanuel Trummer, and Christopher Ré. Language models enable simple systems for generating structured views of heterogeneous data lakes. arXiv preprint arXiv:2304.09433, 2023

work page arXiv 2023
[2]

Simple linear attention language models balance the recall-throughput tradeoff

Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, Dylan Zinsley, James Zou, Atri Rudra, and Christopher Ré. Simple linear attention language models balance the recall-throughput tradeoff. arXiv preprint arXiv:2402.18668, 2024

work page arXiv 2024
[3]

Transformers to ssms: Distilling quadratic knowledge to subquadratic models.Advancesin neural information processing systems, 37:31788–31812, 2024

Aviv Bick, Kevin Y Li, Eric P Xing, J Zico Kolter, and Albert Gu. Transformers to ssms: Distilling quadratic knowledge to subquadratic models.Advancesin neural information processing systems, 37:31788–31812, 2024

2024
[4]

Llamba: Scaling distilled recurrent models for efficient language processing.arXiv preprint arXiv:2502.14458, 2025

Aviv Bick, Tobias Katsch, Nimit Sohoni, Arjun Desai, and Albert Gu. Llamba: Scaling distilled recurrent models for efficient language processing.arXiv preprint arXiv:2502.14458, 2025

work page arXiv 2025
[5]

Retrieval-aware distillation for transformer-ssm hybrids.arXiv preprint arXiv:2602.11374, 2026

Aviv Bick, Eric P Xing, and Albert Gu. Retrieval-aware distillation for transformer-ssm hybrids.arXiv preprint arXiv:2602.11374, 2026

work page arXiv 2026
[6]

Piqa: Reasoning about physical commonsense in natural language

Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. InProceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439, 2020

2020
[7]

Nemotron-h: A family of accurate and efficient hybrid mamba-transformer models.arXiv preprint arXiv:2504.03624, 2025

Aaron Blakeman, Aarti Basant, Abhinav Khattar, Adithya Renduchintala, Akhiad Bercovich, Aleksander Ficek, Alexis Bjorlin, Ali Taghibakhshi, Amala Sanjay Deshmukh, Ameya Sunil Mahabaleshwarkar, et al. Nemotron-h: A family of accurate and efficient hybrid mamba-transformer models.arXiv preprint arXiv:2504.03624, 2025

work page arXiv 2025
[8]

InternLM2 Technical Report

Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. Internlm2 technical report.arXiv preprint arXiv:2403.17297, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, et al. Minimax-m1: Scaling test-time compute efficiently with lightning attention.arXiv preprint arXiv:2506.13585, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Dijiang: Efficient large language models through compact kernelization.arXiv preprint arXiv:2403.19928, 2024

Hanting Chen, Zhicheng Liu, Xutao Wang, Yuchuan Tian, and Yunhe Wang. Dijiang: Efficient large language models through compact kernelization.arXiv preprint arXiv:2403.19928, 2024

work page arXiv 2024
[11]

Hybrid linear attention done right: Efficient distillation and effective architectures for extremely long contexts.arXiv preprint arXiv:2601.22156, 2026

Yingfa Chen, Zhen Leng Thai, Zihan Zhou, Zhu Zhang, Xingyu Shen, Shuo Wang, Chaojun Xiao, Xu Han, and Zhiyuan Liu. Hybrid linear attention done right: Efficient distillation and effective architectures for extremely long contexts.arXiv preprint arXiv:2601.22156, 2026

work page arXiv 2026
[12]

Metala: Unified optimal linear approximation to softmax attention map.Advances in Neural Information Processing Systems, 37:71034–71067, 2024

Yuhong Chou, Man Yao, Kexin Wang, Yuqi Pan, Ruijie Zhu, Yiran Zhong, Yu Qiao, Jibin Wu, Bo Xu, and Guoqi Li. Metala: Unified optimal linear approximation to softmax attention map.Advances in Neural Information Processing Systems, 37:71034–71067, 2024

2024
[13]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[14]

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality.arXiv preprint arXiv:2405.21060, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

Soham De, Samuel L Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, et al. Griffin: Mixing gated linear recurrences with local attention for efficient language models.arXiv preprint arXiv:2402.19427, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

Native Hybrid Attention for Efficient Sequence Modeling

Jusen Du, Jiaxi Hu, Tao Zhang, Weigao Sun, and Yu Cheng. Native hybrid attention for efficient sequence modeling. arXiv preprint arXiv:2510.07019, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Mom: Linear sequence modeling with mixture-of- memories

Jusen Du, Weigao Sun, Disen Lan, Jiaxi Hu, and Yu Cheng. Mom: Linear sequence modeling with mixture-of- memories. arXiv preprint arXiv:2502.13685, 2025

work page arXiv 2025
[18]

The language model evaluation harness, 07 2024

Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, 14 Kevin Wang, and Andy Zou. The lang...

work page arXiv 2024
[19]

Zamba: A compact 7b ssm hybrid model.arXiv preprint arXiv:2405.16712, 2024

Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, and Beren Millidge. Zamba: A compact 7b ssm hybrid model.arXiv preprint arXiv:2405.16712, 2024

work page arXiv 2024
[20]

Radlads: Rapid attention distillation to linear attention decoders at scale.arXiv preprint arXiv:2505.03005, 2025

Daniel Goldstein, Eric Alcaide, Janna Lu, and Eugene Cheah. Radlads: Rapid attention distillation to linear attention decoders at scale.arXiv preprint arXiv:2505.03005, 2025

work page arXiv 2025
[21]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[22]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces.arXiv preprint arXiv:2312.00752, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

Jet-nemotron: Efficient language model with post neural architecture search.arXiv preprint arXiv:2508.15884, 2025

Yuxian Gu, Qinghao Hu, Shang Yang, Haocheng Xi, Junyu Chen, Song Han, and Han Cai. Jet-nemotron: Efficient language model with post neural architecture search.arXiv preprint arXiv:2508.15884, 2025

work page arXiv 2025
[24]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

RULER: What's the Real Context Size of Your Long-Context Language Models?

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. Ruler: What’s the real context size of your long-context language models?arXiv preprint arXiv:2404.06654, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

Comba: Improving bilinear rnns with closed-loop control.arXiv preprint arXiv:2506.02475, 2025

Jiaxi Hu, Yongqi Pan, Jusen Du, Disen Lan, Xiaqiang Tang, Qingsong Wen, Yuxuan Liang, and Weigao Sun. Comba: Improving bilinear rnns with closed-loop control.arXiv preprint arXiv:2506.02475, 2025

work page arXiv 2025
[27]

MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, et al. Minicpm: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

Finetuning pretrained transformers into rnns

Jungo Kasai, Hao Peng, Yizhe Zhang, Dani Yogatama, Gabriel Ilharco, Nikolaos Pappas, Yi Mao, Weizhu Chen, and Noah A Smith. Finetuning pretrained transformers into rnns. InProceedings of the 2021 conference on empirical methods in natural language processing, pages 10630–10643, 2021

2021
[29]

Transformers are rnns: Fast autoregressive transformers with linear attention

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. InInternational conference on machine learning, pages 5156–
[30]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pages 611–626, 2023

2023
[31]

Mamba-3: Improved sequence modeling using state space principles.arXiv preprint arXiv:2603.15569, 2026

Aakash Lahoti, Kevin Y Li, Berlin Chen, Caitlin Wang, Aviv Bick, J Zico Kolter, Tri Dao, and Albert Gu. Mamba-3: Improved sequence modeling using state space principles.arXiv preprint arXiv:2603.15569, 2026

work page arXiv 2026
[32]

Liger: Linearizing large language models to gated recurrent structures.arXiv preprint arXiv:2503.01496, 2025

Disen Lan, Weigao Sun, Jiaxi Hu, Jusen Du, and Yu Cheng. Liger: Linearizing large language models to gated recurrent structures.arXiv preprint arXiv:2503.01496, 2025

work page arXiv 2025
[33]

Datacomp-lm: In search of the next generation of training sets for language models

Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, et al. Datacomp-lm: In search of the next generation of training sets for language models. Advancesin Neural Information Processing Systems, 37:14200–14282, 2024

2024
[34]

Distilling to hybrid attention models via kl-guided layer selection.arXiv preprint arXiv:2512.20569, 2025

Yanhong Li, Songlin Yang, Shawn Tan, Mayank Mishra, Rameswar Panda, Jiawei Zhou, and Yoon Kim. Distilling to hybrid attention models via kl-guided layer selection.arXiv preprint arXiv:2512.20569, 2025

work page arXiv 2025
[35]

Jamba: A Hybrid Transformer-Mamba Language Model

Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, et al. Jamba: A hybrid transformer-mamba language model.arXiv preprint arXiv:2403.19887, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[36]

Lycheedecode: Accelerating long-context llm inference via hybrid-head sparse decoding.arXiv preprint arXiv:2602.04541, 2026

Gang Lin, Dongfang Li, Zhuoen Chen, Yukun Shi, Xuhui Chen, Baotian Hu, and Min Zhang. Lycheedecode: Accelerating long-context llm inference via hybrid-head sparse decoding.arXiv preprint arXiv:2602.04541, 2026. 15

work page arXiv 2026
[37]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

Openceres: When open information extraction meets the semi-structured web

Colin Lockard, Prashant Shiralkar, and Xin Luna Dong. Openceres: When open information extraction meets the semi-structured web. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,Volume1 (Long and Short Papers), pages 3047–3056, 2019

2019
[39]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[40]

Linearizing large language models.arXiv preprint arXiv:2405.06640, 2024

Jean Mercat, Igor Vasiljevic, Sedrick Keh, Kushal Arora, Achal Dave, Adrien Gaidon, and Thomas Kollar. Linearizing large language models.arXiv preprint arXiv:2405.06640, 2024

work page arXiv 2024
[41]

Olmo Hybrid: From Theory to Practice and Back

William Merrill, Yanhong Li, Tyler Romero, Anej Svete, Caia Costello, Pradeep Dasigi, Dirk Groeneveld, David Heineman, Bailey Kuehl, Nathan Lambert, et al. Olmo hybrid: From theory to practice and back.arXiv preprint arXiv:2604.03444, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[42]

Thinking slow, fast: Scaling inference compute with distilled reasoners.arXiv preprint arXiv:2502.20339, 2025

Daniele Paliotta, Junxiong Wang, Matteo Pagliardini, Kevin Y Li, Aviv Bick, J Zico Kolter, Albert Gu, François Fleuret, and Tri Dao. Thinking slow, fast: Scaling inference compute with distilled reasoners.arXiv preprint arXiv:2502.20339, 2025

work page arXiv 2025
[43]

Rwkv: Reinventing rnns for the transformer era

Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Leon Derczynski, et al. Rwkv: Reinventing rnns for the transformer era. InFindings of the association for computational linguistics: EMNLP 2023, pages 14048–14077, 2023

2023
[44]

Yarn: Efficient context window extension of large language models

Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models. InInternationalConference on Learning Representations, volume 2024, pages 31932–31951, 2024

2024
[45]

Hierarchically gated recurrent neural network for sequence modeling

Zhen Qin, Songlin Yang, and Yiran Zhong. Hierarchically gated recurrent neural network for sequence modeling. Advancesin Neural Information Processing Systems, 36:33202–33221, 2023

2023
[46]

Lightning attention-2: A free lunch for handling unlimited sequence lengths in large language models.arXiv preprint arXiv:2401.04658, 2024

Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, and Yiran Zhong. Lightning attention-2: A free lunch for handling unlimited sequence lengths in large language models.arXiv preprint arXiv:2401.04658, 2024

work page arXiv 2024
[47]

Hgrn2: Gated linear rnns with state expansion.arXiv preprint arXiv:2404.07904, 2024

Zhen Qin, Songlin Yang, Weixuan Sun, Xuyang Shen, Dong Li, Weigao Sun, and Yiran Zhong. Hgrn2: Gated linear rnns with state expansion.arXiv preprint arXiv:2404.07904, 2024

work page arXiv 2024
[48]

Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free.Advances in Neural Information Processing Systems, 38:100092–100118, 2026

Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, et al. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free.Advances in Neural Information Processing Systems, 38:100092–100118, 2026

2026
[49]

Qwen3-coder-next technical report

Qwen Team. Qwen3-coder-next technical report. Technical report. URL https://github.com/QwenLM/ Qwen3-Coder/blob/main/qwen3_coder_next_tech_report.pdf. Accessed: 2026-02-03

2026
[50]

Qwen3.5: Towards native multimodal agents, February 2026

Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URLhttps://qwen.ai/blog?id= qwen3.5

2026
[51]

Know what you don’t know: Unanswerable questions for squad

Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable questions for squad. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, 2018

2018
[52]

Samba: Simple hybrid state space models for efficient unlimited context language modeling

Liliang Ren, Yang Liu, Yadong Lu, Chen Liang, Weizhu Chen, et al. Samba: Simple hybrid state space models for efficient unlimited context language modeling. InInternational Conference on Learning Representations, volume 2025, pages 53551–53575, 2025

2025
[53]

Winogrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, 2021

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, 2021

2021
[54]

Speed always wins: A survey on efficient architectures for large language models.arXiv preprint arXiv:2508.09834, 2025

Weigao Sun, Jiaxi Hu, Yucheng Zhou, Jusen Du, Disen Lan, Kexin Wang, Tong Zhu, Xiaoye Qu, Yu Zhang, Xiaoyu Mo, et al. Speed always wins: A survey on efficient architectures for large language models.arXiv preprint arXiv:2508.09834, 2025

work page arXiv 2025
[55]

Linear-moe: Linear sequence modeling meets mixture-of-experts

Weigao Sun, Disen Lan, Tong Zhu, Xiaoye Qu, and Yu Cheng. Linear-moe: Linear sequence modeling meets mixture-of-experts. arXiv preprint arXiv:2503.05447, 2025. 16

work page arXiv 2025
[56]

Kimi Linear: An Expressive, Efficient Attention Architecture

Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, et al. Kimi linear: An expressive, efficient attention architecture.arXiv preprint arXiv:2510.26692, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[57]

Attention is all you need.Advancesin neural information processing systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advancesin neural information processing systems, 30, 2017

2017
[58]

A Systematic Analysis of Hybrid Linear Attention

Dustin Wang, Rui-Jie Zhu, Steven Abreu, Yong Shan, Taylor Kergan, Yuqi Pan, Yuhong Chou, Zheng Li, Ge Zhang, Wenhao Huang, et al. A systematic analysis of hybrid linear attention.arXiv preprint arXiv:2507.06457, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[59]

The mamba in the llama: Distilling and accelerating hybrid models.Advancesin Neural Information Processing Systems, 37:62432–62457, 2024

Junxiong Wang, Daniele Paliotta, Avner May, Alexander M Rush, and Tri Dao. The mamba in the llama: Distilling and accelerating hybrid models.Advancesin Neural Information Processing Systems, 37:62432–62457, 2024

2024
[60]

Rnns are not transformers (yet): The key bottleneck on in-context retrieval

Kaiyue Wen, Xingyu Dang, and Kaifeng Lyu. Rnns are not transformers (yet): The key bottleneck on in-context retrieval. InInternational Conference on Learning Representations, volume 2025, pages 48813–48856, 2025

2025
[61]

Duoattention: Efficient long-context llm inference with retrieval and streaming heads

Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, and Song Han. Duoattention: Efficient long-context llm inference with retrieval and streaming heads. InInternational Conference on Learning Representations, volume 2025, pages 37228–37253, 2025

2025
[62]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[63]

Zebra-llama: Towards extremely efficient hybrid models.Advancesin Neural Information Processing Systems, 38:78167–78194, 2026

Mingyu Yang, Mehdi Rezagholizadeh, Guihong Li, Vikram Appia, and Emad Barsoum. Zebra-llama: Towards extremely efficient hybrid models.Advancesin Neural Information Processing Systems, 38:78167–78194, 2026

2026
[64]

Fla: A triton-based library for hardware-efficient implementations of linear attention mechanism, January 2024

Songlin Yang and Yu Zhang. Fla: A triton-based library for hardware-efficient implementations of linear attention mechanism, January 2024. URLhttps://github.com/fla-org/flash-linear-attention

2024
[65]

Gated Linear Attention Transformers with Hardware-Efficient Training

Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training.arXiv preprint arXiv:2312.06635, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[66]

Gated Delta Networks: Improving Mamba2 with Delta Rule

Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving mamba2 with delta rule.arXiv preprint arXiv:2412.06464, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[67]

Parallelizing linear transformers with the delta rule over sequence length.Advancesin neural information processing systems, 37:115491–115522, 2024

Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing linear transformers with the delta rule over sequence length.Advancesin neural information processing systems, 37:115491–115522, 2024

2024
[68]

Hellaswag: Can a machine really finish your sentence? InProceedings of the 57th annual meeting of the association for computational linguistics, pages 4791–4800, 2019

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? InProceedings of the 57th annual meeting of the association for computational linguistics, pages 4791–4800, 2019

2019
[69]

Lolcats: On low-rank linearizing of large language models.arXiv preprint arXiv:2410.10254, 2024

Michael Zhang, Simran Arora, Rahul Chalamala, Alan Wu, Benjamin Spector, Aaryan Singhal, Krithik Ramesh, and Christopher Ré. Lolcats: On low-rank linearizing of large language models.arXiv preprint arXiv:2410.10254, 2024

work page arXiv 2024
[70]

The hedgehog & the porcupine: Expressive linear attentions with softmax mimicry.arXiv preprint arXiv:2402.04347, 2024

Michael Zhang, Kush Bhatia, Hermann Kumbong, and Christopher Ré. The hedgehog & the porcupine: Expressive linear attentions with softmax mimicry.arXiv preprint arXiv:2402.04347, 2024

work page arXiv 2024
[71]

Gated slot attention for efficient linear-time sequence modeling.Advancesin Neural Information Processing Systems, 37:116870–116898, 2024

Yu Zhang, Songlin Yang, Ruijie Zhu, Yue Zhang, Leyang Cui, Yiqiao Wang, Bolun Wang, Freda Shi, Bailin Wang, Wei Bi, et al. Gated slot attention for efficient linear-time sequence modeling.Advancesin Neural Information Processing Systems, 37:116870–116898, 2024

2024
[72]

Falcon-h1: A family of hybrid-head language models redefining efficiency and performance.arXiv preprint arXiv:2507.22448, 2025

Jingwei Zuo, Maksim Velikanov, Ilyas Chahed, Younes Belkada, Dhia Eddine Rhayem, Guillaume Kunsch, Hakim Hacid, Hamza Yous, Brahim Farhat, Ibrahim Khadraoui, et al. Falcon-h1: A family of hybrid-head language models redefining efficiency and performance.arXiv preprint arXiv:2507.22448, 2025. 17 Appendix A Model and Training Configuration We report the com...

work page arXiv 2025

[1] [1]

Language models enable simple systems for generating structured views of heterogeneous data lakes

Simran Arora, Brandon Yang, Sabri Eyuboglu, Avanika Narayan, Andrew Hojel, Immanuel Trummer, and Christopher Ré. Language models enable simple systems for generating structured views of heterogeneous data lakes. arXiv preprint arXiv:2304.09433, 2023

work page arXiv 2023

[2] [2]

Simple linear attention language models balance the recall-throughput tradeoff

Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, Dylan Zinsley, James Zou, Atri Rudra, and Christopher Ré. Simple linear attention language models balance the recall-throughput tradeoff. arXiv preprint arXiv:2402.18668, 2024

work page arXiv 2024

[3] [3]

Transformers to ssms: Distilling quadratic knowledge to subquadratic models.Advancesin neural information processing systems, 37:31788–31812, 2024

Aviv Bick, Kevin Y Li, Eric P Xing, J Zico Kolter, and Albert Gu. Transformers to ssms: Distilling quadratic knowledge to subquadratic models.Advancesin neural information processing systems, 37:31788–31812, 2024

2024

[4] [4]

Llamba: Scaling distilled recurrent models for efficient language processing.arXiv preprint arXiv:2502.14458, 2025

Aviv Bick, Tobias Katsch, Nimit Sohoni, Arjun Desai, and Albert Gu. Llamba: Scaling distilled recurrent models for efficient language processing.arXiv preprint arXiv:2502.14458, 2025

work page arXiv 2025

[5] [5]

Retrieval-aware distillation for transformer-ssm hybrids.arXiv preprint arXiv:2602.11374, 2026

Aviv Bick, Eric P Xing, and Albert Gu. Retrieval-aware distillation for transformer-ssm hybrids.arXiv preprint arXiv:2602.11374, 2026

work page arXiv 2026

[6] [6]

Piqa: Reasoning about physical commonsense in natural language

Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. InProceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439, 2020

2020

[7] [7]

Nemotron-h: A family of accurate and efficient hybrid mamba-transformer models.arXiv preprint arXiv:2504.03624, 2025

Aaron Blakeman, Aarti Basant, Abhinav Khattar, Adithya Renduchintala, Akhiad Bercovich, Aleksander Ficek, Alexis Bjorlin, Ali Taghibakhshi, Amala Sanjay Deshmukh, Ameya Sunil Mahabaleshwarkar, et al. Nemotron-h: A family of accurate and efficient hybrid mamba-transformer models.arXiv preprint arXiv:2504.03624, 2025

work page arXiv 2025

[8] [8]

InternLM2 Technical Report

Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. Internlm2 technical report.arXiv preprint arXiv:2403.17297, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[9] [9]

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, et al. Minimax-m1: Scaling test-time compute efficiently with lightning attention.arXiv preprint arXiv:2506.13585, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[10] [10]

Dijiang: Efficient large language models through compact kernelization.arXiv preprint arXiv:2403.19928, 2024

Hanting Chen, Zhicheng Liu, Xutao Wang, Yuchuan Tian, and Yunhe Wang. Dijiang: Efficient large language models through compact kernelization.arXiv preprint arXiv:2403.19928, 2024

work page arXiv 2024

[11] [11]

Hybrid linear attention done right: Efficient distillation and effective architectures for extremely long contexts.arXiv preprint arXiv:2601.22156, 2026

Yingfa Chen, Zhen Leng Thai, Zihan Zhou, Zhu Zhang, Xingyu Shen, Shuo Wang, Chaojun Xiao, Xu Han, and Zhiyuan Liu. Hybrid linear attention done right: Efficient distillation and effective architectures for extremely long contexts.arXiv preprint arXiv:2601.22156, 2026

work page arXiv 2026

[12] [12]

Metala: Unified optimal linear approximation to softmax attention map.Advances in Neural Information Processing Systems, 37:71034–71067, 2024

Yuhong Chou, Man Yao, Kexin Wang, Yuqi Pan, Ruijie Zhu, Yiran Zhong, Yu Qiao, Jibin Wu, Bo Xu, and Guoqi Li. Metala: Unified optimal linear approximation to softmax attention map.Advances in Neural Information Processing Systems, 37:71034–71067, 2024

2024

[13] [13]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[14] [14]

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality.arXiv preprint arXiv:2405.21060, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[15] [15]

Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

Soham De, Samuel L Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, et al. Griffin: Mixing gated linear recurrences with local attention for efficient language models.arXiv preprint arXiv:2402.19427, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[16] [16]

Native Hybrid Attention for Efficient Sequence Modeling

Jusen Du, Jiaxi Hu, Tao Zhang, Weigao Sun, and Yu Cheng. Native hybrid attention for efficient sequence modeling. arXiv preprint arXiv:2510.07019, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[17] [17]

Mom: Linear sequence modeling with mixture-of- memories

Jusen Du, Weigao Sun, Disen Lan, Jiaxi Hu, and Yu Cheng. Mom: Linear sequence modeling with mixture-of- memories. arXiv preprint arXiv:2502.13685, 2025

work page arXiv 2025

[18] [18]

The language model evaluation harness, 07 2024

Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, 14 Kevin Wang, and Andy Zou. The lang...

work page arXiv 2024

[19] [19]

Zamba: A compact 7b ssm hybrid model.arXiv preprint arXiv:2405.16712, 2024

Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, and Beren Millidge. Zamba: A compact 7b ssm hybrid model.arXiv preprint arXiv:2405.16712, 2024

work page arXiv 2024

[20] [20]

Radlads: Rapid attention distillation to linear attention decoders at scale.arXiv preprint arXiv:2505.03005, 2025

Daniel Goldstein, Eric Alcaide, Janna Lu, and Eugene Cheah. Radlads: Rapid attention distillation to linear attention decoders at scale.arXiv preprint arXiv:2505.03005, 2025

work page arXiv 2025

[21] [21]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[22] [22]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces.arXiv preprint arXiv:2312.00752, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[23] [23]

Jet-nemotron: Efficient language model with post neural architecture search.arXiv preprint arXiv:2508.15884, 2025

Yuxian Gu, Qinghao Hu, Shang Yang, Haocheng Xi, Junyu Chen, Song Han, and Han Cai. Jet-nemotron: Efficient language model with post neural architecture search.arXiv preprint arXiv:2508.15884, 2025

work page arXiv 2025

[24] [24]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[25] [25]

RULER: What's the Real Context Size of Your Long-Context Language Models?

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. Ruler: What’s the real context size of your long-context language models?arXiv preprint arXiv:2404.06654, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[26] [26]

Comba: Improving bilinear rnns with closed-loop control.arXiv preprint arXiv:2506.02475, 2025

Jiaxi Hu, Yongqi Pan, Jusen Du, Disen Lan, Xiaqiang Tang, Qingsong Wen, Yuxuan Liang, and Weigao Sun. Comba: Improving bilinear rnns with closed-loop control.arXiv preprint arXiv:2506.02475, 2025

work page arXiv 2025

[27] [27]

MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, et al. Minicpm: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[28] [28]

Finetuning pretrained transformers into rnns

Jungo Kasai, Hao Peng, Yizhe Zhang, Dani Yogatama, Gabriel Ilharco, Nikolaos Pappas, Yi Mao, Weizhu Chen, and Noah A Smith. Finetuning pretrained transformers into rnns. InProceedings of the 2021 conference on empirical methods in natural language processing, pages 10630–10643, 2021

2021

[29] [29]

Transformers are rnns: Fast autoregressive transformers with linear attention

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. InInternational conference on machine learning, pages 5156–

[30] [30]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pages 611–626, 2023

2023

[31] [31]

Mamba-3: Improved sequence modeling using state space principles.arXiv preprint arXiv:2603.15569, 2026

Aakash Lahoti, Kevin Y Li, Berlin Chen, Caitlin Wang, Aviv Bick, J Zico Kolter, Tri Dao, and Albert Gu. Mamba-3: Improved sequence modeling using state space principles.arXiv preprint arXiv:2603.15569, 2026

work page arXiv 2026

[32] [32]

Liger: Linearizing large language models to gated recurrent structures.arXiv preprint arXiv:2503.01496, 2025

Disen Lan, Weigao Sun, Jiaxi Hu, Jusen Du, and Yu Cheng. Liger: Linearizing large language models to gated recurrent structures.arXiv preprint arXiv:2503.01496, 2025

work page arXiv 2025

[33] [33]

Datacomp-lm: In search of the next generation of training sets for language models

Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, et al. Datacomp-lm: In search of the next generation of training sets for language models. Advancesin Neural Information Processing Systems, 37:14200–14282, 2024

2024

[34] [34]

Distilling to hybrid attention models via kl-guided layer selection.arXiv preprint arXiv:2512.20569, 2025

Yanhong Li, Songlin Yang, Shawn Tan, Mayank Mishra, Rameswar Panda, Jiawei Zhou, and Yoon Kim. Distilling to hybrid attention models via kl-guided layer selection.arXiv preprint arXiv:2512.20569, 2025

work page arXiv 2025

[35] [35]

Jamba: A Hybrid Transformer-Mamba Language Model

Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, et al. Jamba: A hybrid transformer-mamba language model.arXiv preprint arXiv:2403.19887, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[36] [36]

Lycheedecode: Accelerating long-context llm inference via hybrid-head sparse decoding.arXiv preprint arXiv:2602.04541, 2026

Gang Lin, Dongfang Li, Zhuoen Chen, Yukun Shi, Xuhui Chen, Baotian Hu, and Min Zhang. Lycheedecode: Accelerating long-context llm inference via hybrid-head sparse decoding.arXiv preprint arXiv:2602.04541, 2026. 15

work page arXiv 2026

[37] [37]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[38] [38]

Openceres: When open information extraction meets the semi-structured web

Colin Lockard, Prashant Shiralkar, and Xin Luna Dong. Openceres: When open information extraction meets the semi-structured web. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,Volume1 (Long and Short Papers), pages 3047–3056, 2019

2019

[39] [39]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[40] [40]

Linearizing large language models.arXiv preprint arXiv:2405.06640, 2024

Jean Mercat, Igor Vasiljevic, Sedrick Keh, Kushal Arora, Achal Dave, Adrien Gaidon, and Thomas Kollar. Linearizing large language models.arXiv preprint arXiv:2405.06640, 2024

work page arXiv 2024

[41] [41]

Olmo Hybrid: From Theory to Practice and Back

William Merrill, Yanhong Li, Tyler Romero, Anej Svete, Caia Costello, Pradeep Dasigi, Dirk Groeneveld, David Heineman, Bailey Kuehl, Nathan Lambert, et al. Olmo hybrid: From theory to practice and back.arXiv preprint arXiv:2604.03444, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[42] [42]

Thinking slow, fast: Scaling inference compute with distilled reasoners.arXiv preprint arXiv:2502.20339, 2025

Daniele Paliotta, Junxiong Wang, Matteo Pagliardini, Kevin Y Li, Aviv Bick, J Zico Kolter, Albert Gu, François Fleuret, and Tri Dao. Thinking slow, fast: Scaling inference compute with distilled reasoners.arXiv preprint arXiv:2502.20339, 2025

work page arXiv 2025

[43] [43]

Rwkv: Reinventing rnns for the transformer era

Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Leon Derczynski, et al. Rwkv: Reinventing rnns for the transformer era. InFindings of the association for computational linguistics: EMNLP 2023, pages 14048–14077, 2023

2023

[44] [44]

Yarn: Efficient context window extension of large language models

Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models. InInternationalConference on Learning Representations, volume 2024, pages 31932–31951, 2024

2024

[45] [45]

Hierarchically gated recurrent neural network for sequence modeling

Zhen Qin, Songlin Yang, and Yiran Zhong. Hierarchically gated recurrent neural network for sequence modeling. Advancesin Neural Information Processing Systems, 36:33202–33221, 2023

2023

[46] [46]

Lightning attention-2: A free lunch for handling unlimited sequence lengths in large language models.arXiv preprint arXiv:2401.04658, 2024

Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, and Yiran Zhong. Lightning attention-2: A free lunch for handling unlimited sequence lengths in large language models.arXiv preprint arXiv:2401.04658, 2024

work page arXiv 2024

[47] [47]

Hgrn2: Gated linear rnns with state expansion.arXiv preprint arXiv:2404.07904, 2024

Zhen Qin, Songlin Yang, Weixuan Sun, Xuyang Shen, Dong Li, Weigao Sun, and Yiran Zhong. Hgrn2: Gated linear rnns with state expansion.arXiv preprint arXiv:2404.07904, 2024

work page arXiv 2024

[48] [48]

Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free.Advances in Neural Information Processing Systems, 38:100092–100118, 2026

Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, et al. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free.Advances in Neural Information Processing Systems, 38:100092–100118, 2026

2026

[49] [49]

Qwen3-coder-next technical report

Qwen Team. Qwen3-coder-next technical report. Technical report. URL https://github.com/QwenLM/ Qwen3-Coder/blob/main/qwen3_coder_next_tech_report.pdf. Accessed: 2026-02-03

2026

[50] [50]

Qwen3.5: Towards native multimodal agents, February 2026

Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URLhttps://qwen.ai/blog?id= qwen3.5

2026

[51] [51]

Know what you don’t know: Unanswerable questions for squad

Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable questions for squad. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, 2018

2018

[52] [52]

Samba: Simple hybrid state space models for efficient unlimited context language modeling

Liliang Ren, Yang Liu, Yadong Lu, Chen Liang, Weizhu Chen, et al. Samba: Simple hybrid state space models for efficient unlimited context language modeling. InInternational Conference on Learning Representations, volume 2025, pages 53551–53575, 2025

2025

[53] [53]

Winogrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, 2021

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, 2021

2021

[54] [54]

Speed always wins: A survey on efficient architectures for large language models.arXiv preprint arXiv:2508.09834, 2025

Weigao Sun, Jiaxi Hu, Yucheng Zhou, Jusen Du, Disen Lan, Kexin Wang, Tong Zhu, Xiaoye Qu, Yu Zhang, Xiaoyu Mo, et al. Speed always wins: A survey on efficient architectures for large language models.arXiv preprint arXiv:2508.09834, 2025

work page arXiv 2025

[55] [55]

Linear-moe: Linear sequence modeling meets mixture-of-experts

Weigao Sun, Disen Lan, Tong Zhu, Xiaoye Qu, and Yu Cheng. Linear-moe: Linear sequence modeling meets mixture-of-experts. arXiv preprint arXiv:2503.05447, 2025. 16

work page arXiv 2025

[56] [56]

Kimi Linear: An Expressive, Efficient Attention Architecture

Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, et al. Kimi linear: An expressive, efficient attention architecture.arXiv preprint arXiv:2510.26692, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[57] [57]

Attention is all you need.Advancesin neural information processing systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advancesin neural information processing systems, 30, 2017

2017

[58] [58]

A Systematic Analysis of Hybrid Linear Attention

Dustin Wang, Rui-Jie Zhu, Steven Abreu, Yong Shan, Taylor Kergan, Yuqi Pan, Yuhong Chou, Zheng Li, Ge Zhang, Wenhao Huang, et al. A systematic analysis of hybrid linear attention.arXiv preprint arXiv:2507.06457, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[59] [59]

The mamba in the llama: Distilling and accelerating hybrid models.Advancesin Neural Information Processing Systems, 37:62432–62457, 2024

Junxiong Wang, Daniele Paliotta, Avner May, Alexander M Rush, and Tri Dao. The mamba in the llama: Distilling and accelerating hybrid models.Advancesin Neural Information Processing Systems, 37:62432–62457, 2024

2024

[60] [60]

Rnns are not transformers (yet): The key bottleneck on in-context retrieval

Kaiyue Wen, Xingyu Dang, and Kaifeng Lyu. Rnns are not transformers (yet): The key bottleneck on in-context retrieval. InInternational Conference on Learning Representations, volume 2025, pages 48813–48856, 2025

2025

[61] [61]

Duoattention: Efficient long-context llm inference with retrieval and streaming heads

Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, and Song Han. Duoattention: Efficient long-context llm inference with retrieval and streaming heads. InInternational Conference on Learning Representations, volume 2025, pages 37228–37253, 2025

2025

[62] [62]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[63] [63]

Zebra-llama: Towards extremely efficient hybrid models.Advancesin Neural Information Processing Systems, 38:78167–78194, 2026

Mingyu Yang, Mehdi Rezagholizadeh, Guihong Li, Vikram Appia, and Emad Barsoum. Zebra-llama: Towards extremely efficient hybrid models.Advancesin Neural Information Processing Systems, 38:78167–78194, 2026

2026

[64] [64]

Fla: A triton-based library for hardware-efficient implementations of linear attention mechanism, January 2024

Songlin Yang and Yu Zhang. Fla: A triton-based library for hardware-efficient implementations of linear attention mechanism, January 2024. URLhttps://github.com/fla-org/flash-linear-attention

2024

[65] [65]

Gated Linear Attention Transformers with Hardware-Efficient Training

Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training.arXiv preprint arXiv:2312.06635, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[66] [66]

Gated Delta Networks: Improving Mamba2 with Delta Rule

Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving mamba2 with delta rule.arXiv preprint arXiv:2412.06464, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[67] [67]

Parallelizing linear transformers with the delta rule over sequence length.Advancesin neural information processing systems, 37:115491–115522, 2024

Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing linear transformers with the delta rule over sequence length.Advancesin neural information processing systems, 37:115491–115522, 2024

2024

[68] [68]

Hellaswag: Can a machine really finish your sentence? InProceedings of the 57th annual meeting of the association for computational linguistics, pages 4791–4800, 2019

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? InProceedings of the 57th annual meeting of the association for computational linguistics, pages 4791–4800, 2019

2019

[69] [69]

Lolcats: On low-rank linearizing of large language models.arXiv preprint arXiv:2410.10254, 2024

Michael Zhang, Simran Arora, Rahul Chalamala, Alan Wu, Benjamin Spector, Aaryan Singhal, Krithik Ramesh, and Christopher Ré. Lolcats: On low-rank linearizing of large language models.arXiv preprint arXiv:2410.10254, 2024

work page arXiv 2024

[70] [70]

The hedgehog & the porcupine: Expressive linear attentions with softmax mimicry.arXiv preprint arXiv:2402.04347, 2024

Michael Zhang, Kush Bhatia, Hermann Kumbong, and Christopher Ré. The hedgehog & the porcupine: Expressive linear attentions with softmax mimicry.arXiv preprint arXiv:2402.04347, 2024

work page arXiv 2024

[71] [71]

Gated slot attention for efficient linear-time sequence modeling.Advancesin Neural Information Processing Systems, 37:116870–116898, 2024

Yu Zhang, Songlin Yang, Ruijie Zhu, Yue Zhang, Leyang Cui, Yiqiao Wang, Bolun Wang, Freda Shi, Bailin Wang, Wei Bi, et al. Gated slot attention for efficient linear-time sequence modeling.Advancesin Neural Information Processing Systems, 37:116870–116898, 2024

2024

[72] [72]

Falcon-h1: A family of hybrid-head language models redefining efficiency and performance.arXiv preprint arXiv:2507.22448, 2025

Jingwei Zuo, Maksim Velikanov, Ilyas Chahed, Younes Belkada, Dhia Eddine Rhayem, Guillaume Kunsch, Hakim Hacid, Hamza Yous, Brahim Farhat, Ibrahim Khadraoui, et al. Falcon-h1: A family of hybrid-head language models redefining efficiency and performance.arXiv preprint arXiv:2507.22448, 2025. 17 Appendix A Model and Training Configuration We report the com...

work page arXiv 2025