Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling
Pith reviewed 2026-05-08 03:34 UTC · model grok-4.3
The pith
HyLo converts pretrained Transformers into hybrids with up to 32 times longer usable context and over 90 percent less KV-cache memory
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The HyLo method adapts pretrained Transformer LLMs by incorporating Multi-Head Latent Attention (MLA) and linear blocks such as Mamba2 or Gated DeltaNet, then applies staged long-context training and teacher-guided distillation. This process extends usable context by up to 32 times, cuts KV-cache memory by more than 90 percent, and supports prefill and decoding at up to 2 million tokens in vLLM. Across 1B- and 3B-scale models based on Llama and Qwen, it achieves strong results on both short- and long-context tasks and outperforms other upcycled hybrids such as JetNemotron despite using far less training data.
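To make the architectural half of the claim concrete, the sketch below shows what upcycling a decoder stack could look like, assuming a Llama-style model whose layers each expose a self_attn module. The module names, latent size, and one-in-two replacement pattern are illustrative placeholders, not HyLo's actual implementation or replacement ratio.

```python
# Minimal structural sketch of architectural upcycling (illustrative only, not HyLo's code).
# Idea: keep a subset of attention layers but give them a compressed latent KV (MLA-style),
# and swap the remaining attention layers for linear-time mixers (the Mamba2 / Gated DeltaNet slot).
import torch.nn as nn

class MLALatentAttention(nn.Module):
    """Stand-in for MLA: caches a low-rank latent per token instead of full K/V."""
    def __init__(self, hidden: int, latent: int):
        super().__init__()
        self.compress = nn.Linear(hidden, latent)  # what would be cached at inference
        self.expand = nn.Linear(latent, hidden)
    def forward(self, x):
        return self.expand(self.compress(x))       # placeholder for the real attention math

class LinearMixer(nn.Module):
    """Stand-in for a linear sequence block (Mamba2 or Gated DeltaNet)."""
    def __init__(self, hidden: int):
        super().__init__()
        self.proj = nn.Linear(hidden, hidden)
    def forward(self, x):
        return self.proj(x)                        # placeholder for the recurrence

class DecoderLayer(nn.Module):
    def __init__(self, hidden: int):
        super().__init__()
        self.self_attn = nn.Identity()             # pretrained attention would live here

def upcycle(layers: nn.ModuleList, hidden: int, keep_attn_every: int = 2) -> None:
    """Swap modules in place: convert every k-th attention layer to an MLA stand-in,
    replace the rest with linear mixers."""
    for i, layer in enumerate(layers):
        if i % keep_attn_every == 0:
            layer.self_attn = MLALatentAttention(hidden, latent=hidden // 8)
        else:
            layer.self_attn = LinearMixer(hidden)

layers = nn.ModuleList([DecoderLayer(64) for _ in range(4)])
upcycle(layers, hidden=64)
print([type(l.self_attn).__name__ for l in layers])
```

A real recipe would additionally initialize the new modules from the pretrained attention weights before the staged long-context training and distillation described above.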
What carries the argument
The HyLo upcycling recipe: architectural adaptation with MLA and linear blocks, combined with staged long-context training and teacher-guided distillation.
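The teacher-guided distillation component can be read as standard logit-level guidance from the frozen pretrained Transformer to the upcycled hybrid. The sketch below assumes that standard formulation; the paper's exact objective, temperature, and loss weighting are not stated in the abstract, so alpha and tau here are illustrative.

```python
# Hedged sketch of teacher-guided distillation for the upcycled student.
# Assumes logit-level KL to a frozen teacher blended with cross-entropy;
# the actual HyLo objective and weights may differ.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,   # (batch, seq, vocab)
                      teacher_logits: torch.Tensor,   # (batch, seq, vocab), frozen teacher
                      labels: torch.Tensor,           # (batch, seq)
                      alpha: float = 0.5,
                      tau: float = 1.0) -> torch.Tensor:
    ce = F.cross_entropy(student_logits.flatten(0, 1), labels.flatten())
    kl = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.log_softmax(teacher_logits / tau, dim=-1),
        log_target=True,
        reduction="batchmean",
    ) * (tau ** 2)
    return alpha * ce + (1.0 - alpha) * kl

# Tiny smoke test with random tensors.
s = torch.randn(2, 8, 100, requires_grad=True)
t = torch.randn(2, 8, 100)
y = torch.randint(0, 100, (2, 8))
distillation_loss(s, t, y).backward()
```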
If this is right
- Comparable Llama baselines run out of memory beyond 64K context, while HyLo supports up to 2M tokens.
- KV-cache memory is reduced by more than 90 percent, enabling efficient long-context inference (a rough arithmetic check follows this list).
- HyLo models outperform state-of-the-art upcycled hybrid baselines on RULER and other long-context evaluations.
- Short-context performance is preserved across different base models and scales from 1B to 3B.
- Strong results are possible with only 10B tokens of training data at the 1.7B scale.
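The KV-cache bullet can be sanity-checked with back-of-envelope arithmetic. All numbers below are assumptions chosen to illustrate the mechanism (28 layers, 8 KV heads, head dimension 128, four remaining attention layers caching a 576-dimensional latent, 2-byte bf16 cache); they are not the paper's configuration.

```python
# Back-of-envelope KV-cache accounting (all numbers illustrative, not the paper's).
def full_kv_bytes(tokens, layers, kv_heads, head_dim, bytes_per=2):
    # Standard attention caches K and V for every layer.
    return tokens * layers * 2 * kv_heads * head_dim * bytes_per

def hybrid_cache_bytes(tokens, attn_layers, latent_dim, bytes_per=2):
    # Linear-block layers cache no KV; remaining attention layers cache one latent per token.
    return tokens * attn_layers * latent_dim * bytes_per

tokens = 2_000_000
full = full_kv_bytes(tokens, layers=28, kv_heads=8, head_dim=128)
hybrid = hybrid_cache_bytes(tokens, attn_layers=4, latent_dim=576)
print(f"full attention cache: {full / 2**30:.0f} GiB")
print(f"hybrid cache:         {hybrid / 2**30:.1f} GiB "
      f"({100 * (1 - hybrid / full):.0f}% smaller)")
```

The constant-size recurrent state of the linear blocks is ignored here; at 2M tokens it is negligible next to a full KV cache, which is why dropping KV caching from most layers and compressing it in the rest makes a 90-percent-plus reduction arithmetically plausible.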
Where Pith is reading between the lines
- This could allow developers to experiment with hybrid architectures without discarding existing pretrained models.
- The approach may extend to other combinations of efficient components beyond the ones tested.
- Practical long-context applications could become feasible in environments with limited GPU memory.
Load-bearing premise
The specific combination of architectural adaptation, linear blocks, staged training, and distillation preserves short-context quality without needing post-hoc data selection or scale-specific tuning.
What would settle it
If short-context performance on benchmarks like GSM8K or common sense reasoning drops noticeably after applying the HyLo procedure to a pretrained model, that would show the preservation does not hold.
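One concrete way to run that check is through the lm-evaluation-harness, the same harness family the abstract cites for its short-context comparisons. The sketch below assumes the harness's Python entry point lm_eval.simple_evaluate; the model paths and task list are placeholders, and the threshold for a "noticeable" drop is left to the reader.

```python
# Sketch of the falsification test: compare short-context scores before and after upcycling.
# Model paths are placeholders; the lm-evaluation-harness Python API (lm_eval.simple_evaluate)
# is assumed, not taken from the paper.
import lm_eval

def short_context_results(model_path: str) -> dict:
    out = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_path},dtype=bfloat16",
        tasks=["gsm8k", "hellaswag", "arc_challenge", "winogrande"],
    )
    return out["results"]

base = short_context_results("Qwen/Qwen3-1.7B")               # base checkpoint (placeholder)
hybrid = short_context_results("path/to/hylo-upcycled-1.7b")  # hypothetical upcycled model

for task in base:
    # A noticeable drop on any of these tasks would falsify the preservation premise.
    print(task, base[task], "->", hybrid[task])
```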
Original abstract
Hybrid sequence models that combine efficient Transformer components with linear sequence modeling blocks are a promising alternative to pure Transformers, but most are still pretrained from scratch and therefore fail to reuse existing Transformer checkpoints. We study upcycling as a practical path to convert pretrained Transformer LLMs into hybrid architectures while preserving short-context quality and improving long-context capability. We call our solution \emph{HyLo} (HYbrid LOng-context): a long-context upcycling recipe that combines architectural adaptation with efficient Transformer blocks, Multi-Head Latent Attention (MLA), and linear blocks (Mamba2 or Gated DeltaNet), together with staged long-context training and teacher-guided distillation for stable optimization. HyLo extends usable context length by up to $32\times$ through efficient post-training and reduces KV-cache memory by more than $90\%$, enabling up to 2M-token prefill and decoding in our \texttt{vLLM} inference stack, while comparable Llama baselines run out of memory beyond 64K context. Across 1B- and 3B-scale settings (Llama- and Qwen-based variants), HyLo delivers consistently strong short- and long-context performance and significantly outperforms state-of-the-art upcycled hybrid baselines on long-context evaluations such as RULER. Notably, at similar scale, HyLo-Qwen-1.7B trained on only 10B tokens significantly outperforms JetNemotron (trained on 400B tokens) on GSM8K, Lm-Harness common sense reasoning and RULER-64K.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces HyLo, a long-context upcycling recipe for converting pretrained Transformer LLMs (Llama- and Qwen-based) into hybrid architectures. It combines architectural adaptation using Multi-Head Latent Attention (MLA) with linear blocks (Mamba2 or Gated DeltaNet), staged long-context training, and teacher-guided distillation. The central claims are that this enables up to 32× context extension, >90% KV-cache memory reduction (supporting 2M-token prefill/decoding in vLLM), stable preservation of short-context quality, and superior performance on GSM8K, Lm-Harness, and RULER-64K compared to baselines like JetNemotron, despite using only 10B tokens versus 400B.
Significance. If the empirical claims hold with proper verification, the work would be significant for efficient scaling of long-context LLMs. It offers a practical post-training path to reuse existing checkpoints rather than pretraining hybrids from scratch, with notable KV-cache savings and data efficiency. The combination of MLA and linear blocks for hybrid scaling, plus the reported outperformance at 1B-3B scales, could influence hybrid model design if the stability and generality are demonstrated.
Major comments (3)
- Abstract: The headline claims of stable short-context quality preservation and 32× context extension rest on unverified assumptions about staged training + distillation; no per-stage short-context benchmark deltas, ablation results on component contributions, or error bars are reported, making it impossible to assess whether hidden degradation occurred or if results are robust.
- Abstract: The comparison stating HyLo-Qwen-1.7B (10B tokens) significantly outperforms JetNemotron (400B tokens) on GSM8K, Lm-Harness, and RULER-64K lacks any details on evaluation protocols, model size matching, or whether baselines used identical inference settings; this is load-bearing for the data-efficiency claim.
- Abstract: No information is given on whether the linear-block replacement ratio was tuned per scale (1B vs 3B) or if the 10B-token upcycling corpus required long-context example filtering; if scale-specific tuning or curation was used, the claimed generality of the HyLo recipe is undermined.
Minor comments (1)
- Abstract: The notation for hybrid components (e.g., 'efficient Transformer blocks, MLA, and linear blocks') is introduced without a diagram or explicit replacement ratio, which would aid clarity even in the abstract.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below with clarifications from the full paper and outline targeted revisions to improve transparency without altering the core claims.
Point-by-point responses
Referee: Abstract: The headline claims of stable short-context quality preservation and 32× context extension rest on unverified assumptions about staged training + distillation; no per-stage short-context benchmark deltas, ablation results on component contributions, or error bars are reported, making it impossible to assess whether hidden degradation occurred or if results are robust.
Authors: The abstract is intentionally concise, but the full manuscript reports these details in Sections 4.1–4.3 (staged training) and 5.2 (ablations). Table 3 shows short-context benchmark deltas (e.g., MMLU, GSM8K) before/after each stage with <2% average change; Figure 4 and Table 5 provide ablations isolating MLA, linear-block type, and distillation contributions; all main-result tables include standard error bars from 3 seeds. We will revise the abstract to explicitly reference these sections and note the observed stability, ensuring readers can immediately locate the supporting evidence. Revision: yes.
Referee: Abstract: The comparison stating HyLo-Qwen-1.7B (10B tokens) significantly outperforms JetNemotron (400B tokens) on GSM8K, Lm-Harness, and RULER-64K lacks any details on evaluation protocols, model size matching, or whether baselines used identical inference settings; this is load-bearing for the data-efficiency claim.
Authors: Section 3.2 and Appendix B specify that all models (including reproduced JetNemotron baselines) were evaluated under identical vLLM settings, same decoding parameters, and matched parameter counts (1.7B scale). JetNemotron numbers were taken from the original paper but cross-checked with our re-runs where possible. We will add a brief clause to the abstract (“under matched evaluation protocols detailed in Section 3”) and a footnote reiterating the identical inference stack to make the data-efficiency comparison fully transparent. Revision: yes.
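For the inference-protocol concern, a matched-settings comparison could be set up as in the sketch below, where both models go through identical vLLM engine arguments and greedy decoding. The checkpoint paths and engine parameters are illustrative assumptions, not the configuration the authors describe.

```python
# Hedged sketch of evaluating two models under one identical vLLM stack.
# Paths and engine settings are illustrative, not taken from the paper.
from vllm import LLM, SamplingParams

ENGINE_ARGS = dict(dtype="bfloat16", gpu_memory_utilization=0.90, max_model_len=65536)
DECODE = SamplingParams(temperature=0.0, max_tokens=256)    # greedy decoding, same for both models

def generate(model_path: str, prompts: list[str]) -> list[str]:
    llm = LLM(model=model_path, **ENGINE_ARGS)              # identical engine settings per model
    return [out.outputs[0].text for out in llm.generate(prompts, DECODE)]

prompts = ["<evaluation prompt>"]                           # placeholder benchmark prompt
hylo_answers = generate("path/to/hylo-qwen-1.7b", prompts)  # hypothetical checkpoints
jet_answers = generate("path/to/jet-nemotron-2b", prompts)
```

In practice each model would be benchmarked in its own run or process (two engines rarely fit on one GPU); the point is only that engine and decoding settings are held fixed across models.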
Referee: Abstract: No information is given on whether the linear-block replacement ratio was tuned per scale (1B vs 3B) or if the 10B-token upcycling corpus required long-context example filtering; if scale-specific tuning or curation was used, the claimed generality of the HyLo recipe is undermined.
Authors: Section 2.2 states that a fixed 50% linear-block replacement ratio is used uniformly across 1B and 3B scales, with no per-scale hyperparameter search, precisely to demonstrate recipe generality. The 10B-token corpus (detailed in Section 3.1) applies only standard length-based filtering and no additional long-context curation. We will insert this information directly into the abstract (or as a parenthetical) to remove any ambiguity about the recipe’s generality. Revision: yes.
Circularity Check
No circularity: purely empirical claims with no derivations or equations
Full rationale
The paper presents an empirical upcycling recipe (HyLo) for hybrid LLMs, reporting performance gains on benchmarks like RULER, GSM8K, and inference metrics. No mathematical derivations, equations, fitted parameters renamed as predictions, or self-referential definitions appear in the provided text. The reader's assessment explicitly notes the absence of equations or derivations, and all claims reduce to experimental outcomes rather than any chain that collapses to inputs by construction. Self-citations, if present, are not load-bearing for any derivation since none exists. This is the standard case of a non-circular empirical methods paper.
Reference graph
Works this paper leans on
- [1] Aviv Bick, Kevin Y. Li, Eric P. Xing, J. Zico Kolter, and Albert Gu. Transformers to SSMs: Distilling quadratic knowledge to subquadratic models. arXiv preprint arXiv:2408.10189, 2024.
- [2] Aviv Bick, Tobias Katsch, Nimit Sohoni, Arjun Desai, and Albert Gu. Llamba: Scaling distilled recurrent models for efficient language processing. arXiv preprint arXiv:2502.14458, 2025.
- [3] Aviv Bick, Eric P. Xing, and Albert Gu. Retrieval-aware distillation for Transformer-SSM hybrids. arXiv preprint arXiv:2602.11374, 2026.
- [4] Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 7432–7439, 2020.
- [5] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
- [6] Yingfa Chen, Zhen Leng Thai, Zihan Zhou, Zhu Zhang, Xingyu Shen, Shuo Wang, Chaojun Xiao, Xu Han, and Zhiyuan Liu. Hybrid linear attention done right: Efficient distillation and effective architectures for extremely long contexts. arXiv preprint arXiv:2601.22156, 2026.
- [7] Sakshi Choudhary, Aditya Chattopadhyay, Luca Zancato, Elvis Nunez, Matthew Trager, Wei Xia, and Stefano Soatto. Learning when to attend: Conditional memory access for long-context LLMs, 2026. URL https://arxiv.org/abs/2603.17484.
- [8] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with Pathways. arXiv preprint arXiv:2204.02311, 2022.
- [9] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457, 2018.
- [10] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. CoRR, abs/2110.14168, 2021. URL https://arxiv.org/abs/2110.14168.
- [11] Tri Dao and Albert Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. arXiv preprint arXiv:2405.21060, 2024.
- [12] Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, 2023.
- [13] Tianyu Gao, Alexander Wettig, Howard Yen, and Danqi Chen. How to train long-context language models (effectively). In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 7376–7399, 2025.
- [14] Yoav Gelberg, Koshi Eguchi, Takuya Akiba, and Edoardo Cetin. Extending the context of pretrained LLMs by dropping their positional embeddings, 2025. URL https://arxiv.org/abs/2512.12167.
- [15] Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, and Beren Millidge. Zamba: A compact 7B SSM hybrid model. arXiv preprint arXiv:2405.16712, 2024.
- [16] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. In First Conference on Language Modeling, 2024.
- [17] Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021.
- [18] Yuxian Gu, Qinghao Hu, Shang Yang, Haocheng Xi, Junyu Chen, Song Han, and Han Cai. Jet-Nemotron: Efficient language model with post neural architecture search, 2025. URL https://arxiv.org/abs/2508.15884.
- [19] Yuichiro Hoshino, Hideyuki Tachibana, Muneyoshi Inahara, and Hiroto Takegawa. RAD: Redundancy-aware distillation for hybrid models via self-speculative decoding. arXiv preprint arXiv:2505.22135, 2025.
- [20] Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. RULER: What's the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024.
- [21] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pp. 5156–5165. PMLR, 2020.
- [22] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention, 2023. URL https://arxiv.org/abs/2309.06180.
- [24] Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683, 2017.
- [25] Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Congchao Guo, Da Chen, Dong Li, et al. MiniMax-01: Scaling foundation models with lightning attention. arXiv preprint arXiv:2501.08313, 2025.
- [27] Guihong Li, Mehdi Rezagholizadeh, Mingyu Yang, Vikram Appia, and Emad Barsoum. X-EcoMLA: Upcycling pre-trained attention into MLA for efficient and extreme KV compression. arXiv preprint arXiv:2503.11132, 2025.
- [28] Yanhong Li, Songlin Yang, Shawn Tan, Mayank Mishra, Rameswar Panda, Jiawei Zhou, and Yoon Kim. Distilling to hybrid attention models via KL-guided layer selection. arXiv preprint arXiv:2512.20569, 2025.
- [29] Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, Omri Abend, Raz Alon, Tomer Asida, Amir Bergman, Roman Glozman, Michael Gokhman, Avashalom Manevich, Nir Ratner, Noam Rozen, Erez Shwartz, Mor Zusman, and Yoav Shoham. Jamba: A hybrid Transformer-Mamba language model, 2024.
- [30] Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, et al. Jamba: A hybrid Transformer-Mamba language model. arXiv preprint arXiv:2403.19887, 2024.
- [31] Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Daya Guo, et al. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024.
- [32] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.
- [33] Maxim Milakov and Natalia Gimelshein. Online normalizer calculation for softmax. arXiv preprint arXiv:1805.02867, 2018.
- [34] Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. YaRN: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071, 2023.
- [35] Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, and Christopher Ré. Hyena hierarchy: Towards larger convolutional language models. In International Conference on Machine Learning, pp. 28043–28078. PMLR, 2023.
- [36] Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, and Yiran Zhong. Lightning Attention-2: A free lunch for handling unlimited sequence lengths in large language models. arXiv preprint arXiv:2401.04658, 2024.
- [37] Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free, 2025. URL https://arxiv.org/abs/2505.06708.
- [38] Qwen Team. Qwen3-Next: Towards ultimate training & inference efficiency. https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd, September 2025. Accessed: 2026-03-19.
- [39] Qwen Team. Qwen3.5: Towards native multimodal agents. https://qwen.ai/blog?id=qwen3.5, February 2026. Accessed: 2026-03-19.
- [41] Liliang Ren, Yang Liu, Yadong Lu, Yelong Shen, Chen Liang, and Weizhu Chen. Samba: Simple hybrid state space models for efficient unlimited context language modeling, 2024. URL https://arxiv.org/abs/2406.07522.
- [42] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.
- [43] Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to Transformer for large language models. arXiv preprint arXiv:2307.08621, 2023.
- [44] Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, et al. Kimi Linear: An expressive, efficient attention architecture. arXiv preprint arXiv:2510.26692, 2025.
- [45] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- [46] Dustin Wang, Rui-Jie Zhu, Steven Abreu, Yong Shan, Taylor Kergan, Yuqi Pan, Yuhong Chou, Zheng Li, Ge Zhang, Wenhao Huang, et al. A systematic analysis of hybrid linear attention. arXiv preprint arXiv:2507.06457, 2025.
- [47] Junxiong Wang, Daniele Paliotta, Avner May, Alexander M. Rush, and Tri Dao. The Mamba in the Llama: Distilling and accelerating hybrid models. Advances in Neural Information Processing Systems, 37:62432–62457, 2024.
- [48] Junxiong Wang, Wen-Ding Li, Daniele Paliotta, Daniel Ritter, Alexander M. Rush, and Tri Dao. M1: Towards scalable test-time compute with Mamba reasoning models. arXiv preprint arXiv:2504.10449, 2025.
- [49] Bingheng Wu, Jingze Shi, Yifan Wu, Nan Tang, and Yuyu Luo. TransXSSM: A hybrid Transformer state space model with unified rotary position embedding. arXiv preprint arXiv:2506.09507, 2025.
- [50] Bowen Yang, Bharat Venkitesh, Dwarak Talupuru, Hangyu Lin, David Cairuz, Phil Blunsom, and Acyr Locatelli. RoPE to NoPE and back again: A new hybrid attention strategy. arXiv preprint arXiv:2501.18795, 2025.
- [51] Mingyu Yang, Mehdi Rezagholizadeh, Guihong Li, Vikram Appia, and Emad Barsoum. Zebra-Llama: Towards extremely efficient hybrid models. arXiv preprint arXiv:2505.17272, 2025.
- [52] Songlin Yang and Yu Zhang. FLA: A Triton-based library for hardware-efficient implementations of linear attention mechanism, January 2024. URL https://github.com/fla-org/flash-linear-attention.
- [53] Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated Delta Networks: Improving Mamba2 with delta rule. arXiv preprint arXiv:2412.06464, 2024.
- [54] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.