pith. machine review for the scientific record.

arxiv: 2605.08738 · v1 · submitted 2026-05-09 · 💻 cs.LG · cs.AI · cs.CL

Recognition: no theorem link

SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:24 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords mixture of experts · structured pruning · knowledge distillation · model compression · pretraining · large language models · MoE efficiency · progressive pruning

The pith

Pruning a pretrained large MoE consistently outperforms training the smaller target architecture from scratch under the same training budget.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines how structured pruning and knowledge distillation should be applied when compressing large mixture-of-experts models at pretraining scale. It shows that pruning an already-trained MoE yields better final results than initializing the smaller model randomly and training it for the same total tokens. Different one-shot ways of reducing the number of experts reach nearly identical performance once enough continued pretraining is done; motivated by that convergence, a simple partial-preservation expert merging approach still improves downstream accuracy. Adding knowledge distillation on top of the usual language-modeling objective, especially with a multi-token prediction variant, strengthens results on knowledge-heavy tasks. Progressive pruning over multiple stages beats abrupt one-shot compression and produces a compressed model that stays competitive.

Core claim

Across depth, width, and expert compression, pruning a pretrained MoE consistently outperforms training the target architecture from scratch under the same training budget. Different one-shot expert compression methods converge to similar final performance after large-scale continual pretraining, and a simple partial-preservation expert merging strategy improves downstream performance across most benchmarks. Combining KD with the language modeling loss outperforms KD alone, particularly on knowledge-intensive tasks, and multi-token prediction distillation yields consistent gains. Progressive pruning schedules outperform one-shot compression, enabling effective reduction of large MoE models.
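
To make the merging idea concrete, here is a minimal sketch of one plausible reading of partial-preservation expert merging: the most-used experts are kept verbatim, and each dropped expert is folded into its most similar retained expert as a frequency-weighted average rather than being discarded outright. The function name, the cosine-similarity assignment, and the toy sizes are illustrative assumptions, not the paper's published procedure.

```python
# Hypothetical sketch of partial-preservation expert merging (not the paper's
# exact algorithm): keep the top-n experts by routing frequency unchanged and
# fold each dropped expert into its nearest kept expert via a frequency-
# weighted running average.
import numpy as np

def partial_preservation_merge(experts, route_freq, n_keep):
    """experts: (E, D) flattened expert weights; route_freq: (E,) routing counts."""
    order = np.argsort(route_freq)[::-1]           # most-used experts first
    keep, drop = order[:n_keep], order[n_keep:]
    merged = {k: experts[k].copy() for k in keep}  # preserved part
    mass = {k: float(route_freq[k]) for k in keep}
    for d in drop:
        # assign the dropped expert to the kept expert it is most similar to
        sims = [float(experts[d] @ experts[k]) /
                (np.linalg.norm(experts[d]) * np.linalg.norm(experts[k]) + 1e-8)
                for k in keep]
        k = keep[int(np.argmax(sims))]
        w = float(route_freq[d])
        # the kept expert dominates; the dropped expert contributes in
        # proportion to how often the router actually selected it
        merged[k] = (mass[k] * merged[k] + w * experts[d]) / (mass[k] + w)
        mass[k] += w
    return np.stack([merged[k] for k in keep])

# toy usage: 8 experts of width 16, compressed to 4
rng = np.random.default_rng(0)
pruned = partial_preservation_merge(rng.normal(size=(8, 16)),
                                    rng.integers(1, 100, size=8).astype(float), 4)
print(pruned.shape)  # (4, 16)
```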

What carries the argument

Progressive pruning schedules applied to pretrained MoE models during continued pretraining, together with partial-preservation expert merging and multi-token prediction knowledge distillation.
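
As a rough picture of what such a schedule looks like, the sketch below shrinks the per-layer expert count over several stages and splits the continued-pretraining token budget across them; one-shot compression is the single-stage special case. The stage count, the geometric interpolation, and the expert and token numbers are assumptions for illustration, not the paper's recipe.

```python
# Illustrative progressive pruning schedule: interpolate the per-layer expert
# count from the teacher's size down to the target over n_stages, training on
# a slice of the token budget after each pruning step. n_stages=1 recovers
# one-shot compression.
def progressive_schedule(experts_start, experts_end, n_stages, total_tokens):
    stages = []
    for s in range(1, n_stages + 1):
        frac = s / n_stages                               # position in the schedule
        n_experts = round(experts_start * (experts_end / experts_start) ** frac)
        stages.append({"stage": s,
                       "experts_per_layer": n_experts,
                       "tokens_this_stage": total_tokens // n_stages})
    return stages

for stage in progressive_schedule(512, 128, n_stages=4, total_tokens=400_000_000_000):
    print(stage)
# stage 1: 362 experts, stage 2: 256, stage 3: 181, stage 4: 128 (the target)
```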

If this is right

  • Pruning a pretrained MoE supplies a stronger initialization than random weights for the target size under equal compute.
  • One-shot expert compression methods become interchangeable after sufficient continued pretraining.
  • Partial-preservation expert merging raises accuracy on downstream tasks relative to standard merging.
  • Pairing knowledge distillation with the language modeling loss, especially via multi-token prediction, improves results over distillation alone (a loss sketch follows this list).
  • Gradual architecture changes during pruning produce better optimization paths than sudden compression.
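
Below is a minimal PyTorch-style sketch of the combined objective referenced in the distillation bullet above: the ordinary next-token cross-entropy on data, a KL distillation term against the teacher's next-token distribution, and an additional KL term on a multi-token-prediction head. The loss weights, the temperature, and the existence of a separate MTP head with a teacher counterpart are assumptions; the paper's exact formulation may differ.

```python
# Sketch of LM loss + KD + MTP distillation (weights and temperature are
# illustrative, not the paper's settings).
import torch
import torch.nn.functional as F

def slim_objective(student_logits, mtp_logits, teacher_logits, teacher_mtp_logits,
                   labels, lm_w=1.0, kd_w=1.0, mtp_w=0.5, tau=1.0):
    # logits: (B, T, V); labels: (B, T) token ids
    vocab = student_logits.size(-1)

    # 1) standard next-token language-modeling loss on the data
    lm = F.cross_entropy(student_logits.reshape(-1, vocab), labels.reshape(-1))

    # 2) distillation toward the teacher's next-token distribution
    kd = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                  F.softmax(teacher_logits / tau, dim=-1),
                  reduction="batchmean") * tau ** 2

    # 3) the same KL, applied to the multi-token-prediction head's logits
    mtp = F.kl_div(F.log_softmax(mtp_logits / tau, dim=-1),
                   F.softmax(teacher_mtp_logits / tau, dim=-1),
                   reduction="batchmean") * tau ** 2

    return lm_w * lm + kd_w * kd + mtp_w * mtp
```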

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The convergence of different compression methods implies that the continued pretraining phase matters more than the precise initial pruning choice.
  • These schedules could lower the cost of creating families of smaller MoE models tailored to specific domains.
  • Similar progressive transitions might improve compression results for dense transformer models as well.
  • The extra gains on knowledge-intensive tasks suggest the approach could help build more efficient models for reasoning workloads.

Load-bearing premise

The performance advantages seen for pruning over scratch training and for progressive over one-shot schedules will continue to appear in other large MoE architectures and pretraining datasets.

What would settle it

A direct side-by-side comparison, on the same downstream benchmarks, of a smaller MoE trained entirely from scratch for a fixed total token count versus the identical smaller architecture obtained by pruning a larger pretrained MoE and then continuing pretraining for the remaining tokens.
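
Written as a protocol, the settling experiment is two arms that share the target architecture, the benchmark suite, and the token accounting, differing only in initialization. The token count and benchmark names below are placeholders, and how much of the teacher's pretraining is charged against the pruned arm's budget is left explicit because the budget definition is exactly the ambiguity at issue.

```python
# Minimal sketch of the head-to-head comparison; all figures are placeholders.
TOTAL_TOKENS = 500_000_000_000       # fixed total budget for the scratch arm
CHARGED_TO_TEACHER = 0               # teacher-pretraining tokens charged to the
                                     # pruned arm (0 under a purely token-matched-
                                     # from-initialization reading)

arms = [
    {"name": "scratch", "init": "random",
     "tokens": TOTAL_TOKENS},
    {"name": "pruned",  "init": "prune(pretrained_large_moe)",
     "tokens": TOTAL_TOKENS - CHARGED_TO_TEACHER},   # "the remaining tokens"
]
benchmarks = ["MMLU", "GSM8K", "HumanEval", "BBH"]   # same suite for both arms

for arm in arms:
    print(f"{arm['name']}: init={arm['init']}, train for {arm['tokens']:,} tokens, "
          f"evaluate on {benchmarks}")
```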

Figures

Figures reproduced from arXiv: 2605.08738 by Bo Zheng, Dayiheng Liu, Liangyu Wang, Rui Men, Shengkun Tang, Siqi Zhang, Xiulong Yuan, Zekun Wang, Zhiqiang Shen, Zihan Qiu.

Figure 1: Overview of the SlimQwen. We first perform structured pruning on a teacher MoE model, …
Figure 2: Training loss curves under different initialization and training objectives. Models initialized from …
original abstract

Structured pruning and knowledge distillation (KD) are typical techniques for compressing large language models, but it remains unclear how they should be applied at pretraining scale, especially to recent mixture-of-experts (MoE) models. In this work, we systematically study MoE compression in large-scale pretraining, focusing on three key questions: whether pruning provides a better initialization than training from scratch, how expert compression choices affect the final model after continued training, and which training strategy is most effective. We have the following findings: First, across depth, width, and expert compression, pruning a pretrained MoE consistently outperforms training the target architecture from scratch under the same training budget. Second, different one-shot expert compression methods converge to similar final performance after large-scale continual pretraining. Motivated by this, we introduce a simple partial-preservation expert merging strategy that improves downstream performance across most benchmarks. Third, combining KD with the language modeling loss outperforms KD alone, particularly on knowledge-intensive tasks. We further propose multi-token prediction (MTP) distillation, which yields consistent gains. Finally, given the same training tokens, progressive pruning schedules outperform one-shot compression, suggesting that gradual architecture transitions lead to better optimization trajectories. Putting it all together, we compress Qwen3-Next-80A3B to a 23A2B model that retains competitive performance. These results offer practical guidance for efficient MoE compression at scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper systematically studies structured pruning and knowledge distillation for compressing large MoE models during pretraining. It claims that pruning a pretrained MoE outperforms training the target architecture from scratch under the same training budget across depth, width, and expert compression; that different one-shot expert compression methods converge to similar performance after continual pretraining (motivating a partial-preservation expert merging strategy); that combining KD with language modeling loss (plus multi-token prediction distillation) improves results; and that progressive pruning outperforms one-shot compression for the same training tokens. The authors compress Qwen3-Next-80A3B to a 23A2B model with competitive downstream performance.

Significance. If the trends hold under rigorous verification, the work supplies practical guidance for efficient large-scale MoE compression, highlighting the value of progressive schedules and initialization from pruning. This could reduce pretraining costs while preserving performance on knowledge-intensive tasks.

major comments (2)
  1. [Abstract] The primary claim that 'pruning a pretrained MoE consistently outperforms training the target architecture from scratch under the same training budget' is load-bearing, yet the budget metric is unspecified. Expert pruning reduces active parameters and per-token FLOPs; equating budgets by token count (rather than total compute or wall-clock time) would allocate strictly less computation to pruned models, confounding whether the advantage arises from better initialization. This ambiguity must be resolved with explicit FLOPs reporting in the experimental protocol (a back-of-envelope sketch of the token-versus-compute distinction follows the major comments).
  2. [Experimental results] The reported convergence of one-shot expert compression methods to similar final performance, and the superiority of progressive schedules, lack error bars, multiple-run statistics, or ablations isolating benchmark selection and hyperparameter effects. Without these controls, it is impossible to confirm the trends are not artifacts of post-hoc choices or the specific Qwen3-Next-80A3B setup, weakening the generalization of the central pruning-vs-scratch finding.
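
To make major comment 1 concrete, the usual ~6·N·T approximation for training FLOPs (N active parameters, T tokens) shows how a token-matched budget diverges from a compute-matched one whenever active-parameter counts differ. The parameter and token figures below are placeholders, and reading "A3B"/"A2B" as roughly 3B and 2B active parameters is an assumption about the naming convention.

```python
# Back-of-envelope token-matched vs compute-matched budgets using the
# standard ~6 * N_active * tokens approximation for training FLOPs.
def train_flops(active_params: float, tokens: float) -> float:
    return 6.0 * active_params * tokens

TOKENS = 400e9                        # shared, token-matched budget (placeholder)
larger_active = 3e9                   # e.g. an "A3B" model, ~3B active (assumed)
smaller_active = 2e9                  # e.g. an "A2B" model, ~2B active (assumed)

for name, n in [("larger-active model", larger_active),
                ("smaller-active model", smaller_active)]:
    print(f"{name}: {train_flops(n, TOKENS):.2e} FLOPs at {TOKENS:.0e} tokens")

# compute-matching instead would give the smaller model proportionally more tokens
print(f"compute-matched tokens for the smaller model: "
      f"{TOKENS * larger_active / smaller_active:.2e}")
```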
minor comments (3)
  1. The notation for MoE sizes (80A3B, 23A2B) should be defined explicitly in the introduction or a dedicated notation section.
  2. Figure captions and legends for pruning schedule comparisons should be self-contained, highlighting key metrics without requiring reference to the main text.
  3. [Abstract] The abstract would be strengthened by naming the specific downstream benchmarks on which the 23A2B model retains competitive performance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below, providing clarifications on the training budget definition and the statistical aspects of our results while committing to revisions where appropriate.

point-by-point responses
  1. Referee: [Abstract] The primary claim that 'pruning a pretrained MoE consistently outperforms training the target architecture from scratch under the same training budget' is load-bearing, yet the budget metric is unspecified. Expert pruning reduces active parameters and per-token FLOPs; equating budgets by token count (rather than total compute or wall-clock time) would allocate strictly less computation to pruned models, confounding whether the advantage arises from better initialization. This ambiguity must be resolved with explicit FLOPs reporting in the experimental protocol.

    Authors: We appreciate this observation regarding the need for explicit definition. In the manuscript, the training budget is equated by the number of tokens processed during continued pretraining, which is the conventional metric in large-scale pretraining studies. We acknowledge that this results in lower total FLOPs for pruned models owing to fewer active parameters. The performance advantage is therefore achieved despite reduced per-token compute, which we interpret as evidence for the benefit of the pruning-derived initialization. We will revise the abstract and methods to explicitly state the token-based budget and incorporate detailed FLOPs calculations comparing the two settings. revision: yes

  2. Referee: [Experimental results] The reported convergence of one-shot expert compression methods to similar final performance, and the superiority of progressive schedules, lack error bars, multiple-run statistics, or ablations isolating benchmark selection and hyperparameter effects. Without these controls, it is impossible to confirm the trends are not artifacts of post-hoc choices or the specific Qwen3-Next-80A3B setup, weakening the generalization of the central pruning-vs-scratch finding.

    Authors: We recognize that additional statistical controls would increase robustness. However, the computational cost of full-scale pretraining on models of this size precludes multiple independent runs. Results are reported from single runs with fixed seeds and hyperparameters. The observed convergence and progressive pruning advantages appear consistently across depth, width, and expert compression axes as well as across diverse benchmarks, which provides supporting evidence against isolated artifacts. In revision we will add an explicit limitations discussion of the single-run design and include hyperparameter sensitivity results from smaller-scale ablations. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical comparisons with no derivations

full rationale

The paper reports experimental results on pruning, merging, and distillation strategies for MoE models during pretraining. All central claims (pruning outperforms scratch training under matched budget, one-shot methods converge, progressive schedules win, KD+MTP helps) are supported by direct measurements on Qwen3-Next-80A3B and downstream benchmarks. No equations, uniqueness theorems, or first-principles derivations are invoked that could reduce to fitted parameters or self-citations by construction. The work contains no load-bearing self-citations for theoretical premises, no ansatz smuggling, and no renaming of known results as novel organization. The budget-metric ambiguity noted by the skeptic is a methodological limitation, not a circularity in any derivation chain. This is the expected outcome for an empirical compression study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is purely empirical; no mathematical axioms or invented physical entities are introduced. The central claims rest on standard machine-learning assumptions that training dynamics are stable under the chosen optimizers and that downstream benchmarks are representative proxies for capability.

axioms (1)
  • domain assumption: Continued pretraining after compression allows different initial compression choices to converge to similar performance
    Invoked to explain why one-shot methods become equivalent after large-scale training

pith-pipeline@v0.9.0 · 5591 in / 1247 out tokens · 45631 ms · 2026-05-12T03:24:19.975833+00:00 · methodology

discussion (0)

