pith. machine review for the scientific record.

arxiv: 2605.08738 · v1 · submitted 2026-05-09 · 💻 cs.LG · cs.AI · cs.CL

Recognition: no theorem link

SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:24 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords mixture of experts · structured pruning · knowledge distillation · model compression · pretraining · large language models · MoE efficiency · progressive pruning

The pith

Pruning a pretrained large MoE consistently outperforms training the smaller target architecture from scratch under the same training budget.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines how structured pruning and knowledge distillation should be applied when compressing large mixture-of-experts models at pretraining scale. It shows that pruning an already-trained MoE yields better final results than initializing the smaller model randomly and training it for the same total tokens. Different one-shot ways of reducing the number of experts reach nearly identical performance once enough continued pretraining is done; motivated by that convergence, a simple partial-preservation expert merging approach still improves downstream accuracy. Adding knowledge distillation on top of the usual language-modeling objective, especially with a multi-token prediction variant, strengthens results on knowledge-heavy tasks. Progressive pruning over multiple stages beats abrupt one-shot compression and produces a compressed model that stays competitive.

Core claim

Across depth, width, and expert compression, pruning a pretrained MoE consistently outperforms training the target architecture from scratch under the same training budget. Different one-shot expert compression methods converge to similar final performance after large-scale continual pretraining, and a simple partial-preservation expert merging strategy improves downstream performance across most benchmarks. Combining KD with the language modeling loss outperforms KD alone, particularly on knowledge-intensive tasks, and multi-token prediction distillation yields consistent gains. Progressive pruning schedules outperform one-shot compression, enabling effective reduction of large MoE models.
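
To make the merging idea concrete, here is a minimal sketch of one plausible reading of partial-preservation expert merging: the most-used experts are kept verbatim, and each dropped expert is folded into its most similar retained expert as a frequency-weighted average rather than being discarded outright. The function name, the cosine-similarity assignment, and the toy sizes are illustrative assumptions, not the paper's published procedure.

```python
# Hypothetical sketch of partial-preservation expert merging (not the paper's
# exact algorithm): keep the top-n experts by routing frequency unchanged and
# fold each dropped expert into its nearest kept expert via a frequency-
# weighted running average.
import numpy as np

def partial_preservation_merge(experts, route_freq, n_keep):
    """experts: (E, D) flattened expert weights; route_freq: (E,) routing counts."""
    order = np.argsort(route_freq)[::-1]           # most-used experts first
    keep, drop = order[:n_keep], order[n_keep:]
    merged = {k: experts[k].copy() for k in keep}  # preserved part
    mass = {k: float(route_freq[k]) for k in keep}
    for d in drop:
        # assign the dropped expert to the kept expert it is most similar to
        sims = [float(experts[d] @ experts[k]) /
                (np.linalg.norm(experts[d]) * np.linalg.norm(experts[k]) + 1e-8)
                for k in keep]
        k = keep[int(np.argmax(sims))]
        w = float(route_freq[d])
        # the kept expert dominates; the dropped expert contributes in
        # proportion to how often the router actually selected it
        merged[k] = (mass[k] * merged[k] + w * experts[d]) / (mass[k] + w)
        mass[k] += w
    return np.stack([merged[k] for k in keep])

# toy usage: 8 experts of width 16, compressed to 4
rng = np.random.default_rng(0)
pruned = partial_preservation_merge(rng.normal(size=(8, 16)),
                                    rng.integers(1, 100, size=8).astype(float), 4)
print(pruned.shape)  # (4, 16)
```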

What carries the argument

Progressive pruning schedules applied to pretrained MoE models during continued pretraining, together with partial-preservation expert merging and multi-token prediction knowledge distillation.
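
As a rough picture of what such a schedule looks like, the sketch below shrinks the per-layer expert count over several stages and splits the continued-pretraining token budget across them; one-shot compression is the single-stage special case. The stage count, the geometric interpolation, and the expert and token numbers are assumptions for illustration, not the paper's recipe.

```python
# Illustrative progressive pruning schedule: interpolate the per-layer expert
# count from the teacher's size down to the target over n_stages, training on
# a slice of the token budget after each pruning step. n_stages=1 recovers
# one-shot compression.
def progressive_schedule(experts_start, experts_end, n_stages, total_tokens):
    stages = []
    for s in range(1, n_stages + 1):
        frac = s / n_stages                               # position in the schedule
        n_experts = round(experts_start * (experts_end / experts_start) ** frac)
        stages.append({"stage": s,
                       "experts_per_layer": n_experts,
                       "tokens_this_stage": total_tokens // n_stages})
    return stages

for stage in progressive_schedule(512, 128, n_stages=4, total_tokens=400_000_000_000):
    print(stage)
# stage 1: 362 experts, stage 2: 256, stage 3: 181, stage 4: 128 (the target)
```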

If this is right

  • Pruning a pretrained MoE supplies a stronger initialization than random weights for the target size under equal compute.
  • One-shot expert compression methods become interchangeable after sufficient continued pretraining.
  • Partial-preservation expert merging raises accuracy on downstream tasks relative to standard merging.
  • Pairing knowledge distillation with the language modeling loss, especially via multi-token prediction, improves results over distillation alone (a loss sketch follows this list).
  • Gradual architecture changes during pruning produce better optimization paths than sudden compression.
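
Below is a minimal PyTorch-style sketch of the combined objective referenced in the distillation bullet above: the ordinary next-token cross-entropy on data, a KL distillation term against the teacher's next-token distribution, and an additional KL term on a multi-token-prediction head. The loss weights, the temperature, and the existence of a separate MTP head with a teacher counterpart are assumptions; the paper's exact formulation may differ.

```python
# Sketch of LM loss + KD + MTP distillation (weights and temperature are
# illustrative, not the paper's settings).
import torch
import torch.nn.functional as F

def slim_objective(student_logits, mtp_logits, teacher_logits, teacher_mtp_logits,
                   labels, lm_w=1.0, kd_w=1.0, mtp_w=0.5, tau=1.0):
    # logits: (B, T, V); labels: (B, T) token ids
    vocab = student_logits.size(-1)

    # 1) standard next-token language-modeling loss on the data
    lm = F.cross_entropy(student_logits.reshape(-1, vocab), labels.reshape(-1))

    # 2) distillation toward the teacher's next-token distribution
    kd = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                  F.softmax(teacher_logits / tau, dim=-1),
                  reduction="batchmean") * tau ** 2

    # 3) the same KL, applied to the multi-token-prediction head's logits
    mtp = F.kl_div(F.log_softmax(mtp_logits / tau, dim=-1),
                   F.softmax(teacher_mtp_logits / tau, dim=-1),
                   reduction="batchmean") * tau ** 2

    return lm_w * lm + kd_w * kd + mtp_w * mtp
```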

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The convergence of different compression methods implies that the continued pretraining phase matters more than the precise initial pruning choice.
  • These schedules could lower the cost of creating families of smaller MoE models tailored to specific domains.
  • Similar progressive transitions might improve compression results for dense transformer models as well.
  • The extra gains on knowledge-intensive tasks suggest the approach could help build more efficient models for reasoning workloads.

Load-bearing premise

The performance advantages seen for pruning over scratch training and for progressive over one-shot schedules will continue to appear in other large MoE architectures and pretraining datasets.

What would settle it

A direct side-by-side comparison, on the same downstream benchmarks, of a smaller MoE trained entirely from scratch for a fixed total token count versus the identical smaller architecture obtained by pruning a larger pretrained MoE and then continuing pretraining for the remaining tokens.
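
Written as a protocol, the settling experiment is two arms that share the target architecture, the benchmark suite, and the token accounting, differing only in initialization. The token count and benchmark names below are placeholders, and how much of the teacher's pretraining is charged against the pruned arm's budget is left explicit because the budget definition is exactly the ambiguity at issue.

```python
# Minimal sketch of the head-to-head comparison; all figures are placeholders.
TOTAL_TOKENS = 500_000_000_000       # fixed total budget for the scratch arm
CHARGED_TO_TEACHER = 0               # teacher-pretraining tokens charged to the
                                     # pruned arm (0 under a purely token-matched-
                                     # from-initialization reading)

arms = [
    {"name": "scratch", "init": "random",
     "tokens": TOTAL_TOKENS},
    {"name": "pruned",  "init": "prune(pretrained_large_moe)",
     "tokens": TOTAL_TOKENS - CHARGED_TO_TEACHER},   # "the remaining tokens"
]
benchmarks = ["MMLU", "GSM8K", "HumanEval", "BBH"]   # same suite for both arms

for arm in arms:
    print(f"{arm['name']}: init={arm['init']}, train for {arm['tokens']:,} tokens, "
          f"evaluate on {benchmarks}")
```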

Figures

Figures reproduced from arXiv: 2605.08738 by Bo Zheng, Dayiheng Liu, Liangyu Wang, Rui Men, Shengkun Tang, Siqi Zhang, Xiulong Yuan, Zekun Wang, Zhiqiang Shen, Zihan Qiu.

Figure 1: Overview of the SlimQwen. We first perform structured pruning on a teacher MoE model, …
Figure 2: Training loss curves under different initialization and training objectives. Models initialized from …
original abstract

Structured pruning and knowledge distillation (KD) are typical techniques for compressing large language models, but it remains unclear how they should be applied at pretraining scale, especially to recent mixture-of-experts (MoE) models. In this work, we systematically study MoE compression in large-scale pretraining, focusing on three key questions: whether pruning provides a better initialization than training from scratch, how expert compression choices affect the final model after continued training, and which training strategy is most effective. We have the following findings: First, across depth, width, and expert compression, pruning a pretrained MoE consistently outperforms training the target architecture from scratch under the same training budget. Second, different one-shot expert compression methods converge to similar final performance after large-scale continual pretraining. Motivated by this, we introduce a simple partial-preservation expert merging strategy that improves downstream performance across most benchmarks. Third, combining KD with the language modeling loss outperforms KD alone, particularly on knowledge-intensive tasks. We further propose multi-token prediction (MTP) distillation, which yields consistent gains. Finally, given the same training tokens, progressive pruning schedules outperform one-shot compression, suggesting that gradual architecture transitions lead to better optimization trajectories. Putting it all together, we compress Qwen3-Next-80A3B to a 23A2B model that retains competitive performance. These results offer practical guidance for efficient MoE compression at scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper systematically studies structured pruning and knowledge distillation for compressing large MoE models during pretraining. It claims that pruning a pretrained MoE outperforms training the target architecture from scratch under the same training budget across depth, width, and expert compression; that different one-shot expert compression methods converge to similar performance after continual pretraining (motivating a partial-preservation expert merging strategy); that combining KD with language modeling loss (plus multi-token prediction distillation) improves results; and that progressive pruning outperforms one-shot compression for the same training tokens. The authors compress Qwen3-Next-80A3B to a 23A2B model with competitive downstream performance.

Significance. If the trends hold under rigorous verification, the work supplies practical guidance for efficient large-scale MoE compression, highlighting the value of progressive schedules and initialization from pruning. This could reduce pretraining costs while preserving performance on knowledge-intensive tasks.

major comments (2)
  1. [Abstract] The primary claim that 'pruning a pretrained MoE consistently outperforms training the target architecture from scratch under the same training budget' is load-bearing, yet the budget metric is unspecified. Expert pruning reduces active parameters and per-token FLOPs; equating budgets by token count (rather than total compute or wall-clock time) would allocate strictly less computation to pruned models, confounding whether the advantage arises from better initialization. This ambiguity must be resolved with explicit FLOPs reporting in the experimental protocol (a back-of-envelope sketch of the token-versus-compute distinction follows the major comments).
  2. [Experimental results] The reported convergence of one-shot expert compression methods to similar final performance, and the superiority of progressive schedules, lack error bars, multiple-run statistics, or ablations isolating benchmark selection and hyperparameter effects. Without these controls, it is impossible to confirm the trends are not artifacts of post-hoc choices or the specific Qwen3-Next-80A3B setup, weakening the generalization of the central pruning-vs-scratch finding.
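
To make major comment 1 concrete, the usual ~6·N·T approximation for training FLOPs (N active parameters, T tokens) shows how a token-matched budget diverges from a compute-matched one whenever active-parameter counts differ. The parameter and token figures below are placeholders, and reading "A3B"/"A2B" as roughly 3B and 2B active parameters is an assumption about the naming convention.

```python
# Back-of-envelope token-matched vs compute-matched budgets using the
# standard ~6 * N_active * tokens approximation for training FLOPs.
def train_flops(active_params: float, tokens: float) -> float:
    return 6.0 * active_params * tokens

TOKENS = 400e9                        # shared, token-matched budget (placeholder)
larger_active = 3e9                   # e.g. an "A3B" model, ~3B active (assumed)
smaller_active = 2e9                  # e.g. an "A2B" model, ~2B active (assumed)

for name, n in [("larger-active model", larger_active),
                ("smaller-active model", smaller_active)]:
    print(f"{name}: {train_flops(n, TOKENS):.2e} FLOPs at {TOKENS:.0e} tokens")

# compute-matching instead would give the smaller model proportionally more tokens
print(f"compute-matched tokens for the smaller model: "
      f"{TOKENS * larger_active / smaller_active:.2e}")
```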
minor comments (3)
  1. The notation for MoE sizes (80A3B, 23A2B) should be defined explicitly in the introduction or a dedicated notation section.
  2. Figure captions and legends for pruning schedule comparisons should be self-contained, highlighting key metrics without requiring reference to the main text.
  3. [Abstract] The abstract would be strengthened by naming the specific downstream benchmarks on which the 23A2B model retains competitive performance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below, providing clarifications on the training budget definition and the statistical aspects of our results while committing to revisions where appropriate.

point-by-point responses
  1. Referee: [Abstract] The primary claim that 'pruning a pretrained MoE consistently outperforms training the target architecture from scratch under the same training budget' is load-bearing, yet the budget metric is unspecified. Expert pruning reduces active parameters and per-token FLOPs; equating budgets by token count (rather than total compute or wall-clock time) would allocate strictly less computation to pruned models, confounding whether the advantage arises from better initialization. This ambiguity must be resolved with explicit FLOPs reporting in the experimental protocol.

    Authors: We appreciate this observation regarding the need for explicit definition. In the manuscript, the training budget is equated by the number of tokens processed during continued pretraining, which is the conventional metric in large-scale pretraining studies. We acknowledge that this results in lower total FLOPs for pruned models owing to fewer active parameters. The performance advantage is therefore achieved despite reduced per-token compute, which we interpret as evidence for the benefit of the pruning-derived initialization. We will revise the abstract and methods to explicitly state the token-based budget and incorporate detailed FLOPs calculations comparing the two settings. revision: yes

  2. Referee: [Experimental results] The reported convergence of one-shot expert compression methods to similar final performance, and the superiority of progressive schedules, lack error bars, multiple-run statistics, or ablations isolating benchmark selection and hyperparameter effects. Without these controls, it is impossible to confirm the trends are not artifacts of post-hoc choices or the specific Qwen3-Next-80A3B setup, weakening the generalization of the central pruning-vs-scratch finding.

    Authors: We recognize that additional statistical controls would increase robustness. However, the computational cost of full-scale pretraining on models of this size precludes multiple independent runs. Results are reported from single runs with fixed seeds and hyperparameters. The observed convergence and progressive pruning advantages appear consistently across depth, width, and expert compression axes as well as across diverse benchmarks, which provides supporting evidence against isolated artifacts. In revision we will add an explicit limitations discussion of the single-run design and include hyperparameter sensitivity results from smaller-scale ablations. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical comparisons with no derivations

full rationale

The paper reports experimental results on pruning, merging, and distillation strategies for MoE models during pretraining. All central claims (pruning outperforms scratch training under matched budget, one-shot methods converge, progressive schedules win, KD+MTP helps) are supported by direct measurements on Qwen3-Next-80A3B and downstream benchmarks. No equations, uniqueness theorems, or first-principles derivations are invoked that could reduce to fitted parameters or self-citations by construction. The work contains no load-bearing self-citations for theoretical premises, no ansatz smuggling, and no renaming of known results as novel organization. The budget-metric ambiguity noted by the skeptic is a methodological limitation, not a circularity in any derivation chain. This is the expected outcome for an empirical compression study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is purely empirical; no mathematical axioms or invented physical entities are introduced. The central claims rest on standard machine-learning assumptions that training dynamics are stable under the chosen optimizers and that downstream benchmarks are representative proxies for capability.

axioms (1)
  • domain assumption: Continued pretraining after compression allows different initial compression choices to converge to similar performance
    Invoked to explain why one-shot methods become equivalent after large-scale training

pith-pipeline@v0.9.0 · 5591 in / 1247 out tokens · 45631 ms · 2026-05-12T03:24:19.975833+00:00 · methodology

discussion (0)

