arxiv: 2605.08738 · v2 · pith:B7XDWRI4new · submitted 2026-05-09 · 💻 cs.LG · cs.AI · cs.CL

SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

Pith reviewed 2026-05-20 23:18 UTC · model grok-4.3

classification: 💻 cs.LG · cs.AI · cs.CL
keywords: mixture of experts, pruning, knowledge distillation, continual pretraining, model compression, MoE, large language models

The pith

Pruning a pretrained MoE and continuing training outperforms building the smaller model from scratch with the same budget.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how to apply pruning and distillation to compress large mixture-of-experts models during pretraining. It establishes that pruning a pretrained model and then continuing to train it outperforms training the compressed architecture from scratch when given the same training budget. Different expert compression techniques end up performing similarly after sufficient continued training, but a partial preservation merging method boosts results on downstream tasks. Combining distillation with standard language modeling loss works better than distillation alone, and multi-token prediction adds further benefits. Progressive pruning over multiple stages also improves the final model compared to pruning everything at once.

Core claim

Pruning a pretrained MoE across depth, width, and expert compression consistently outperforms training the target architecture from scratch under the same training budget. Different one-shot expert compression methods converge to similar performance after continued pretraining. A partial-preservation expert merging strategy improves downstream performance across most benchmarks. Combining knowledge distillation with language modeling loss outperforms distillation alone, and multi-token prediction distillation yields gains. Progressive pruning schedules outperform one-shot compression.

What carries the argument

Structured pruning of pretrained MoE models combined with continual pretraining and knowledge distillation using multi-token prediction.

If this is right

  • Progressive pruning schedules outperform one-shot compression.
  • Different one-shot expert compression methods converge to similar final performance after large-scale continual pretraining.
  • A partial-preservation expert merging strategy improves downstream performance.
  • Combining KD with the language modeling loss outperforms KD alone, particularly on knowledge-intensive tasks.
  • Multi-token prediction distillation yields consistent gains.
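
The combined objective in the fourth bullet can be sketched for a single token position. This is a minimal illustration, not the paper's implementation: the mixing weight `kd_weight`, the forward-KL form of the distillation term, the temperature parameter, and the toy logits are all assumptions made here.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def lm_loss(student_logits, target_id):
    """Cross-entropy of the student against the ground-truth next token."""
    probs = softmax(student_logits)
    return -math.log(probs[target_id])


def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """Forward KL(teacher || student) over the vocabulary."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def combined_loss(student_logits, teacher_logits, target_id, kd_weight=0.5):
    """Assumed mix: kd_weight * KD + (1 - kd_weight) * LM (hypothetical weighting)."""
    kd = kd_loss(student_logits, teacher_logits)
    lm = lm_loss(student_logits, target_id)
    return kd_weight * kd + (1.0 - kd_weight) * lm


# Toy three-token vocabulary; values are illustrative only.
student = [2.0, 0.5, -1.0]
teacher = [3.0, 0.0, -2.0]
print(combined_loss(student, teacher, target_id=0, kd_weight=0.5))
```

Setting `kd_weight=1.0` recovers distillation alone and `kd_weight=0.0` recovers plain language modeling, which are the two endpoints the bullet compares against.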

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This suggests that leveraging pretrained large MoE checkpoints can accelerate the development of efficient smaller variants.
  • The convergence of different compression methods after continued training may indicate that the optimization landscape allows multiple paths to similar solutions.
  • These techniques could be tested on other model families to see if the advantage of pruning over from-scratch training holds more generally.

Load-bearing premise

The argument assumes that the chosen continual pretraining budget and data mixture are sufficient for the pruned models to recover performance, and that the specific Qwen3-Next-80A3B architecture and downstream benchmarks are representative of general MoE behavior.

What would settle it

Training the target architecture from scratch with the same training budget and checking whether it matches or exceeds the pruned model on the evaluation tasks; if it does, the central claim fails.

Figures

Figures reproduced from arXiv: 2605.08738 by Bo Zheng, Dayiheng Liu, Liangyu Wang, Rui Men, Shengkun Tang, Siqi Zhang, Xiulong Yuan, Zekun Wang, Zhiqiang Shen, Zihan Qiu.

Figure 1: Overview of the SlimQwen. We first perform structured pruning on a teacher MoE model, …
Figure 2: Training loss curves under different initialization and training objectives. Models initialized from …
Original abstract

Structured pruning and knowledge distillation (KD) are typical techniques for compressing large language models, but it remains unclear how they should be applied at pretraining scale, especially to recent mixture-of-experts (MoE) models. In this work, we systematically study MoE compression in large-scale pretraining, focusing on three key questions: whether pruning provides a better initialization than training from scratch, how expert compression choices affect the final model after continued training, and which training strategy is most effective. We have the following findings: First, across depth, width, and expert compression, pruning a pretrained MoE consistently outperforms training the target architecture from scratch under the same training budget. Second, different one-shot expert compression methods converge to similar final performance after large-scale continual pretraining. Motivated by this, we introduce a simple partial-preservation expert merging strategy that improves downstream performance across most benchmarks. Third, combining KD with the language modeling loss outperforms KD alone, particularly on knowledge-intensive tasks. We further propose multi-token prediction (MTP) distillation, which yields consistent gains. Finally, given the same training tokens, progressive pruning schedules outperform one-shot compression, suggesting that gradual architecture transitions lead to better optimization trajectories. Putting it all together, we compress Qwen3-Next-80A3B to a 23A2B model that retains competitive performance. These results offer practical guidance for efficient MoE compression at scale.
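
The abstract's final finding compares progressive and one-shot pruning given the same training tokens. The sketch below shows one way such a comparison could be set up so that both schedules consume an identical budget. The linear interpolation of expert counts, the even token split, and the toy numbers (16 experts reduced to 4, 300 token units) are assumptions for illustration; the abstract does not specify the schedule the authors used.

```python
def pruning_schedule(start_experts, target_experts, num_stages, total_tokens):
    """Return [(experts_after_stage, tokens_for_stage), ...].

    Expert counts are linearly interpolated from start to target, and the
    token budget is split as evenly as possible, so every schedule consumes
    exactly total_tokens. Both choices are hypothetical.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be >= 1")
    if target_experts > start_experts:
        raise ValueError("target_experts must not exceed start_experts")
    base, remainder = divmod(total_tokens, num_stages)
    schedule = []
    for stage in range(1, num_stages + 1):
        removed = (start_experts - target_experts) * stage // num_stages
        tokens = base + (1 if stage <= remainder else 0)
        schedule.append((start_experts - removed, tokens))
    return schedule


# One-shot pruning is the single-stage special case of the same function.
one_shot = pruning_schedule(16, 4, num_stages=1, total_tokens=300)
progressive = pruning_schedule(16, 4, num_stages=3, total_tokens=300)
print(one_shot)
print(progressive)
```

Treating one-shot compression as the `num_stages=1` case keeps the comparison controlled: the final architecture and total tokens match, and only the path between them differs.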

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript empirically investigates structured pruning and knowledge distillation for compressing large Mixture-of-Experts (MoE) models at pretraining scale. Using the Qwen3-Next-80A3B model as the base, it examines depth, width, and expert compression. The main findings are that pruning a pretrained MoE outperforms training the target architecture from scratch under identical training budgets; different one-shot expert compression methods reach similar final performance after continual pretraining; a newly introduced partial-preservation expert merging strategy improves downstream results; combining KD with the language modeling loss (plus multi-token prediction distillation) is effective; and progressive pruning schedules outperform one-shot compression. The work ends by producing a 23A2B compressed model with competitive performance.

Significance. If the empirical patterns hold under broader conditions, the results supply actionable guidance for cost-effective MoE compression during pretraining, potentially reducing the resources needed to develop smaller yet capable models. The systematic head-to-head comparison of pruning versus from-scratch training and the progressive-schedule finding are the most transferable contributions. The large-scale experiments are a strength, though the absence of error bars and the limited ablation over training budgets weaken confidence in the robustness of the results.

major comments (2)
  1. [§4.2, Continual Pretraining Setup] The central claim that pruning outperforms from-scratch training under the same budget assumes the fixed token count and data mixture suffice for the smaller target architecture to recover. No ablation is reported that extends the from-scratch baseline or adapts the mixture specifically for the 23A2B model; if the baseline remains under-optimized, the observed gap may reflect optimization trajectory rather than a general pruning advantage.
  2. [Tables 2–4, main results] Performance deltas between pruned and from-scratch models are presented without error bars, standard deviations across seeds, or statistical significance tests. This makes it impossible to judge whether the reported outperformance is reliable or sensitive to random variation, directly affecting confidence in the first key finding.
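
The reporting requested in the second major comment could look roughly like the sketch below: a mean and sample standard deviation per configuration, plus Welch's t statistic for the gap between the two. The per-seed scores are placeholder values invented for illustration and are not results from the paper; the function names are likewise hypothetical.

```python
import math
import statistics


def summarize(scores):
    """Mean and sample standard deviation of per-seed benchmark scores."""
    return statistics.mean(scores), statistics.stdev(scores)


def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances."""
    var_a = statistics.variance(a)
    var_b = statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    if standard_error == 0.0:
        raise ValueError("both samples have zero variance")
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


# Placeholder scores for three seeds each (NOT taken from the paper).
pruned = [61.2, 60.8, 61.5]
scratch = [58.9, 59.4, 58.6]

for name, scores in (("pruned", pruned), ("from scratch", scratch)):
    mean, std = summarize(scores)
    print(f"{name}: {mean:.2f} +/- {std:.2f} over {len(scores)} seeds")
print(f"Welch t = {welch_t(pruned, scratch):.2f}")
```

Even two or three seeds per configuration would let a reader see whether a reported delta is large relative to run-to-run variation.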
minor comments (3)
  1. [§3.1] The partial-preservation merging procedure is described only at a high level; short pseudocode or an explicit formula for how expert weights are combined would improve reproducibility.
  2. [Figure 3] Axis labels and legend entries for the progressive versus one-shot curves are difficult to distinguish at the printed size; increasing the font size or adding a clearer caption would aid readability.
  3. [Related Work] Several recent papers on MoE pruning (post-2023) are not cited; adding them would better situate the contribution.
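
To make the first minor comment concrete, here is one possible shape such pseudocode could take. It is a guess at what "partial preservation" might mean and not the authors' procedure, which this page does not describe: the most-used experts are kept unchanged, and the remaining experts are grouped and collapsed by usage-weighted averaging. The ranking by routing frequency, the round-robin grouping, and all names are assumptions.

```python
def merge_experts(experts, usage, num_preserved, num_merged):
    """Compress len(experts) experts down to num_preserved + num_merged.

    experts: list of equal-length weight vectors (lists of floats)
    usage:   routing frequency per expert, used as importance (assumed)
    """
    order = sorted(range(len(experts)), key=lambda i: usage[i], reverse=True)
    # Preserved experts are copied through untouched.
    kept = [list(experts[i]) for i in order[:num_preserved]]
    rest = order[num_preserved:]
    if not rest or num_merged == 0:
        return kept
    # Round-robin grouping of the remaining experts (hypothetical choice).
    groups = [rest[g::num_merged] for g in range(num_merged)]
    merged = []
    for group in groups:
        if not group:
            continue
        total = sum(usage[i] for i in group)
        # Fall back to a uniform average if the group was never routed to.
        weights = [usage[i] / total if total else 1.0 / len(group) for i in group]
        dim = len(experts[group[0]])
        merged.append([
            sum(w * experts[i][d] for w, i in zip(weights, group))
            for d in range(dim)
        ])
    return kept + merged


# Toy example: four 2-dimensional "experts" reduced to two.
experts = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
usage = [10, 1, 3, 1]
compressed = merge_experts(experts, usage, num_preserved=1, num_merged=1)
print(compressed)
```

A real implementation would also have to remap or merge the router's output rows so that tokens are routed to the surviving experts; that step is omitted here.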

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our work exploring pruning and distillation for large MoE models. We address each major comment in detail below and indicate the revisions made to the manuscript.

Point-by-point responses
  1. Referee: [§4.2] the central claim that pruning outperforms from-scratch training under the same budget assumes the fixed token count and data mixture suffice for the smaller target architecture to recover. No ablation is reported that extends the from-scratch baseline or adapts the mixture specifically for the 23A2B model; if the baseline remains under-optimized, the observed gap may reflect optimization trajectory rather than a general pruning advantage.

    Authors: We appreciate this point. Our experimental design focuses on comparing pruning and from-scratch training under a fixed and identical training budget, as this reflects realistic constraints in model development. The consistent superiority of pruned models across multiple compression dimensions (depth, width, and experts) supports that the advantage stems from better initialization rather than solely optimization differences. Nevertheless, we acknowledge the value of further ablations. In the revised version, we will include additional discussion on this limitation and suggest adapting data mixtures as future work.
    Revision: partial

  2. Referee: [Tables 2–4] performance deltas between pruned and from-scratch models are presented without error bars, standard deviations across seeds, or statistical significance tests. This makes it impossible to judge whether the reported outperformance is reliable or sensitive to random variation, directly affecting confidence in the first key finding.

    Authors: We agree that the absence of error bars and statistical tests limits the ability to assess variability. Due to the high computational cost of large-scale pretraining experiments involving hundreds of billions of tokens, performing multiple runs with different seeds is impractical in our setting. We have strived for consistency by evaluating across different model scales and compression methods. In the revision, we will add a dedicated limitations section discussing this aspect and the implications for result reliability.
    Revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical comparisons with no self-referential derivations

The paper reports results from controlled pretraining experiments comparing pruned MoE models against from-scratch baselines under fixed token budgets, along with ablation studies on expert merging and distillation strategies. No equations, closed-form derivations, or parameter-fitting steps are presented that could reduce claimed performance gains to quantities defined on the same evaluation data. All findings rest on observable training outcomes and downstream benchmarks rather than any self-definition, fitted-input prediction, or load-bearing self-citation chain. The work is therefore self-contained against external replication.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on standard machine-learning assumptions about optimization trajectories and benchmark validity rather than new axioms or invented entities. No free parameters are explicitly fitted to produce the headline result; hyperparameters such as pruning ratios and learning rates are chosen but not presented as load-bearing fitted constants.

axioms (1)
  • Domain assumption: Continued pretraining on the same data distribution allows pruned models to recover most capability.
    Invoked when claiming that pruning plus continued training beats training from scratch under equal token budget.

pith-pipeline@v0.9.0 · 5822 in / 1290 out tokens · 25978 ms · 2026-05-20T23:18:25.084291+00:00 · methodology

