SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

Bo Zheng; Dayiheng Liu; Liangyu Wang; Rui Men; Shengkun Tang; Siqi Zhang; Xiulong Yuan; Zekun Wang; Zhiqiang Shen; Zihan Qiu

arXiv: 2605.08738 · v2 · pith:B7XDWRI4new · submitted 2026-05-09 · cs.LG · cs.AI · cs.CL

Pith reviewed 2026-05-20 23:18 UTC · model grok-4.3

classification: cs.LG · cs.AI · cs.CL

keywords: mixture of experts · pruning · knowledge distillation · continual pretraining · model compression · MoE · large language models

The pith

Pruning a pretrained MoE and continuing training outperforms building the smaller model from scratch with the same budget.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how to apply pruning and distillation to compress large mixture-of-experts models during pretraining. It establishes that pruning a pretrained model and then continuing to train it outperforms training the compressed architecture from scratch when given the same training budget. Different expert compression techniques end up performing similarly after sufficient continued training, but a partial preservation merging method boosts results on downstream tasks. Combining distillation with standard language modeling loss works better than distillation alone, and multi-token prediction adds further benefits. Progressive pruning over multiple stages also improves the final model compared to pruning everything at once.

Core claim

Pruning a pretrained MoE across depth, width, and expert compression consistently outperforms training the target architecture from scratch under the same training budget. Different one-shot expert compression methods converge to similar performance after continued pretraining. A partial-preservation expert merging strategy improves downstream performance across most benchmarks. Combining knowledge distillation with language modeling loss outperforms distillation alone, and multi-token prediction distillation yields gains. Progressive pruning schedules outperform one-shot compression.
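The combined objective in the claim above can be illustrated with a minimal, per-token sketch. It is not the paper's implementation: the forward-KL divergence, the `kd_weight` interpolation, and the `temperature` parameter are all assumptions made for illustration, since the summary does not state how the two terms are weighted.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def lm_loss(student_logits, target_id):
    """Cross-entropy of the student against the ground-truth next token."""
    return -log_softmax(student_logits)[target_id]


def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """Forward KL(teacher || student) over the vocabulary (assumed divergence)."""
    t_logp = log_softmax([x / temperature for x in teacher_logits])
    s_logp = log_softmax([x / temperature for x in student_logits])
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))


def combined_loss(student_logits, teacher_logits, target_id, kd_weight=0.5):
    """Interpolate LM and KD terms; kd_weight=1.0 recovers distillation alone."""
    lm = lm_loss(student_logits, target_id)
    kd = kd_loss(student_logits, teacher_logits)
    return (1.0 - kd_weight) * lm + kd_weight * kd
```

Setting `kd_weight=1.0` gives the distillation-only baseline that the combined objective is compared against, and `kd_weight=0.0` gives plain language modeling.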

What carries the argument

Structured pruning of pretrained MoE models combined with continual pretraining and knowledge distillation using multi-token prediction.

If this is right

Progressive pruning schedules outperform one-shot compression.
Different one-shot expert compression methods converge to similar final performance after large-scale continual pretraining.
A partial-preservation expert merging strategy improves downstream performance.
Combining KD with the language modeling loss outperforms KD alone, particularly on knowledge-intensive tasks.
Multi-token prediction distillation yields consistent gains.
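To make the contrast between progressive and one-shot compression concrete, here is a small sketch of a stage schedule under a fixed token budget. The function name, the linear reduction in expert count, and the even token split are hypothetical choices for illustration; the summary does not describe the schedule the authors used.

```python
def progressive_schedule(start_experts, target_experts, num_stages, total_tokens):
    """Return a list of (experts_after_stage, tokens_for_stage) tuples.

    Assumptions (not from the paper): experts are removed linearly across
    stages and the token budget is split evenly, with any remainder given
    to the final stage. num_stages=1 corresponds to one-shot compression.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    if not 0 < target_experts <= start_experts:
        raise ValueError("need 0 < target_experts <= start_experts")

    to_remove = start_experts - target_experts
    base_tokens, leftover = divmod(total_tokens, num_stages)

    stages = []
    for stage in range(1, num_stages + 1):
        # Integer arithmetic guarantees the last stage lands exactly on target.
        experts = start_experts - (to_remove * stage) // num_stages
        tokens = base_tokens + (leftover if stage == num_stages else 0)
        stages.append((experts, tokens))
    return stages
```

With illustrative inputs, `progressive_schedule(64, 16, 3, 300)` yields three stages that end at 48, 32, and 16 experts, while `progressive_schedule(64, 16, 1, 300)` spends the whole budget after a single cut to 16.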

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This suggests that leveraging pretrained large MoE checkpoints can accelerate the development of efficient smaller variants.
The convergence of different compression methods after continued training may indicate that the optimization landscape allows multiple paths to similar solutions.
These techniques could be tested on other model families to see if the advantage of pruning over from-scratch training holds more generally.

Load-bearing premise

The chosen continual pretraining budget and data mixture are assumed to be sufficient for the pruned models to recover performance, and the specific Qwen3-Next-80A3B architecture and downstream benchmarks are assumed to be representative of general MoE behavior.

What would settle it

Training the target architecture from scratch with the same training budget and checking whether it matches or exceeds the pruned model on the evaluation tasks. If it does, the central claim fails.

Figures

Figures reproduced from arXiv: 2605.08738 by Bo Zheng, Dayiheng Liu, Liangyu Wang, Rui Men, Shengkun Tang, Siqi Zhang, Xiulong Yuan, Zekun Wang, Zhiqiang Shen, Zihan Qiu.

**Figure 2.** Training loss curves under different initialization and training objectives. Models initialized from …

Original abstract

Structured pruning and knowledge distillation (KD) are typical techniques for compressing large language models, but it remains unclear how they should be applied at pretraining scale, especially to recent mixture-of-experts (MoE) models. In this work, we systematically study MoE compression in large-scale pretraining, focusing on three key questions: whether pruning provides a better initialization than training from scratch, how expert compression choices affect the final model after continued training, and which training strategy is most effective. We have the following findings: First, across depth, width, and expert compression, pruning a pretrained MoE consistently outperforms training the target architecture from scratch under the same training budget. Second, different one-shot expert compression methods converge to similar final performance after large-scale continual pretraining. Motivated by this, we introduce a simple partial-preservation expert merging strategy that improves downstream performance across most benchmarks. Third, combining KD with the language modeling loss outperforms KD alone, particularly on knowledge-intensive tasks. We further propose multi-token prediction (MTP) distillation, which yields consistent gains. Finally, given the same training tokens, progressive pruning schedules outperform one-shot compression, suggesting that gradual architecture transitions lead to better optimization trajectories. Putting it all together, we compress Qwen3-Next-80A3B to a 23A2B model that retains competitive performance. These results offer practical guidance for efficient MoE compression at scale.
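One way to read "MTP distillation" is as the standard distillation term extended from the next token to several future offsets. The formula below is a sketch under that reading only: the number of offsets $K$, the weight $\lambda$, the use of forward KL, and the notation $p^{(k)}$ (the predicted distribution over token $x_{t+k}$ for the teacher $T$ or student $S$) are assumptions, not definitions taken from the paper.

```latex
\mathcal{L}
  = \underbrace{-\log p_S^{(1)}\!\left(x_{t+1} \mid x_{\le t}\right)}_{\text{language modeling}}
  \;+\; \lambda \sum_{k=1}^{K}
    \underbrace{\mathrm{KL}\!\left(
      p_T^{(k)}\!\left(\cdot \mid x_{\le t}\right)
      \,\Big\|\,
      p_S^{(k)}\!\left(\cdot \mid x_{\le t}\right)
    \right)}_{\text{distillation at offset } k}
```

With $K = 1$ this reduces to ordinary next-token distillation combined with the language modeling loss.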

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript empirically investigates structured pruning and knowledge distillation for compressing large Mixture-of-Experts (MoE) models at pretraining scale. Using the Qwen3-Next-80A3B model as base, it examines depth, width, and expert compression. Main findings are that pruning a pretrained MoE outperforms training the target architecture from scratch under identical training budgets; different one-shot expert compression methods reach similar final performance after continual pretraining; a partial-preservation expert merging strategy is introduced and improves downstream results; combining KD with language modeling loss (plus multi-token prediction distillation) is effective; and progressive pruning schedules outperform one-shot compression. The work ends by producing a 23A2B compressed model with competitive performance.

Significance. If the empirical patterns hold under broader conditions, the results supply actionable guidance for cost-effective MoE compression during pretraining, potentially reducing the resources needed to develop smaller yet capable models. The systematic head-to-head comparison of pruning versus from-scratch training and the progressive-schedule finding are the most transferable contributions. The large-scale experiments constitute a strength, though absence of error bars and limited ablation on training budgets reduce immediate robustness.

major comments (2)

§4.2 (Continual Pretraining Setup): the central claim that pruning outperforms from-scratch training under the same budget assumes the fixed token count and data mixture suffice for the smaller target architecture to recover. No ablation is reported that extends the from-scratch baseline or adapts the mixture specifically for the 23A2B model; if the baseline remains under-optimized, the observed gap may reflect optimization trajectory rather than a general pruning advantage.
Tables 2–4 (main results): performance deltas between pruned and from-scratch models are presented without error bars, standard deviations across seeds, or statistical significance tests. This makes it impossible to judge whether the reported outperformance is reliable or sensitive to random variation, directly affecting confidence in the first key finding.
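The statistics the second major comment asks for are cheap to compute once several seeds exist. The sketch below uses only Python's standard library; the helper names `mean_std` and `welch_t` are hypothetical, and Welch's t statistic is one reasonable choice of test rather than anything the paper or the report prescribes.

```python
import math
import statistics


def mean_std(scores):
    """Mean and sample standard deviation of one benchmark across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def welch_t(a, b):
    """Welch's t statistic for two groups with possibly unequal variances.

    Each group needs at least two scores (one per seed).
    """
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    diff = statistics.mean(a) - statistics.mean(b)
    denom = math.sqrt(var_a / len(a) + var_b / len(b))
    if denom == 0.0:
        # Degenerate case: no variation within either group.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / denom
```

Calling `welch_t(pruned_scores, scratch_scores)` on per-seed accuracies gives a value whose magnitude indicates how large the gap is relative to seed-to-seed noise; converting it to a p-value additionally requires the Welch–Satterthwaite degrees of freedom.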

minor comments (3)

§3.1: the partial-preservation merging procedure is described at a high level; a short pseudocode or explicit formula for how expert weights are combined would improve reproducibility.
Figure 3: axis labels and legend entries for the progressive versus one-shot curves are difficult to distinguish at the printed size; increasing font size or adding a clearer caption would aid readability.
Related Work: several recent papers on MoE pruning (post-2023) are not cited; adding them would better situate the contribution.
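As an illustration of the kind of pseudocode the first minor comment requests, here is one plausible reading of "partial preservation": keep the most-used experts untouched and fold the remainder into a few merged experts. Everything in this sketch is an assumption, including the function name, ranking by routing usage, round-robin grouping, and usage-weighted averaging; the paper's actual procedure may differ.

```python
def partial_preservation_merge(experts, usage, num_preserved, num_merged):
    """Compress a list of experts to num_preserved + num_merged experts.

    experts: list of flattened weight vectors (lists of floats, equal length).
    usage:   routing frequency per expert, used as a hypothetical importance score.
    """
    if len(experts) != len(usage):
        raise ValueError("experts and usage must have the same length")
    if num_preserved < 0 or num_merged < 0:
        raise ValueError("counts must be non-negative")
    if num_preserved + num_merged > len(experts):
        raise ValueError("cannot produce more experts than were given")

    # Rank experts from most to least used.
    order = sorted(range(len(experts)), key=lambda i: usage[i], reverse=True)
    preserved = [list(experts[i]) for i in order[:num_preserved]]
    rest = order[num_preserved:]
    if not rest or num_merged == 0:
        return preserved  # nothing left to merge (remaining experts are dropped)

    # Round-robin assignment of the remaining experts to merge groups.
    groups = [rest[g::num_merged] for g in range(num_merged)]
    merged = []
    for group in groups:
        total = sum(usage[i] for i in group)
        if total > 0:
            weights = [usage[i] / total for i in group]
        else:
            weights = [1.0 / len(group)] * len(group)
        dim = len(experts[group[0]])
        merged.append([
            sum(w * experts[i][d] for w, i in zip(weights, group))
            for d in range(dim)
        ])
    return preserved + merged
```

The preserved experts pass through unchanged, so any routing behavior tied to them is kept intact, while the merged experts retain a usage-weighted summary of what was removed.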

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our work exploring pruning and distillation for large MoE models. We address each major comment in detail below and indicate the revisions made to the manuscript.

Referee: [§4.2] the central claim that pruning outperforms from-scratch training under the same budget assumes the fixed token count and data mixture suffice for the smaller target architecture to recover. No ablation is reported that extends the from-scratch baseline or adapts the mixture specifically for the 23A2B model; if the baseline remains under-optimized, the observed gap may reflect optimization trajectory rather than a general pruning advantage.

Authors: We appreciate this point. Our experimental design focuses on comparing pruning and from-scratch training under a fixed and identical training budget, as this reflects realistic constraints in model development. The consistent superiority of pruned models across multiple compression dimensions (depth, width, and experts) supports that the advantage stems from better initialization rather than solely optimization differences. Nevertheless, we acknowledge the value of further ablations. In the revised version, we will include additional discussion on this limitation and suggest adapting data mixtures as future work.
revision: partial

Referee: [Tables 2–4] performance deltas between pruned and from-scratch models are presented without error bars, standard deviations across seeds, or statistical significance tests. This makes it impossible to judge whether the reported outperformance is reliable or sensitive to random variation, directly affecting confidence in the first key finding.

Authors: We agree that the absence of error bars and statistical tests limits the ability to assess variability. Due to the high computational cost of large-scale pretraining experiments involving hundreds of billions of tokens, performing multiple runs with different seeds is impractical in our setting. We have strived for consistency by evaluating across different model scales and compression methods. In the revision, we will add a dedicated limitations section discussing this aspect and the implications for result reliability.
revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical comparisons with no self-referential derivations

The paper reports results from controlled pretraining experiments comparing pruned MoE models against from-scratch baselines under fixed token budgets, along with ablation studies on expert merging and distillation strategies. No equations, closed-form derivations, or parameter-fitting steps are presented that could reduce claimed performance gains to quantities defined on the same evaluation data. All findings rest on observable training outcomes and downstream benchmarks rather than any self-definition, fitted-input prediction, or load-bearing self-citation chain. The work is therefore self-contained against external replication.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on standard machine-learning assumptions about optimization trajectories and benchmark validity rather than new axioms or invented entities. No free parameters are explicitly fitted to produce the headline result; hyperparameters such as pruning ratios and learning rates are chosen but not presented as load-bearing fitted constants.

axioms (1)

Domain assumption: Continued pretraining on the same data distribution allows pruned models to recover most capability.
Invoked when claiming that pruning plus continued training beats training from scratch under equal token budget.

pith-pipeline@v0.9.0 · 5822 in / 1290 out tokens · 25978 ms · 2026-05-20T23:18:25.084291+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: across depth, width, and expert compression, pruning a pretrained MoE consistently outperforms training the target architecture from scratch under the same training budget

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
Paper passage: we propose a simple partial-preservation expert merging strategy

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

96 extracted references · 96 canonical work pages · 16 internal anchors

[1]

2019 , eprint=

HellaSwag: Can a Machine Really Finish Your Sentence? , author=. 2019 , eprint=

work page 2019
[2]

2021 , eprint=

Measuring Massive Multitask Language Understanding , author=. 2021 , eprint=

work page 2021
[3]

2021 , eprint=

Training Verifiers to Solve Math Word Problems , author=. 2021 , eprint=

work page 2021
[4]

2023 , eprint=

C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models , author=. 2023 , eprint=

work page 2023
[5]

2024 , eprint=

CMMLU: Measuring massive multitask language understanding in Chinese , author=. 2024 , eprint=

work page 2024
[6]

2023 , eprint=

Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation , author=. 2023 , eprint=

work page 2023
[7]

2024 , eprint=

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark , author=. 2024 , eprint=

work page 2024
[8]

2025 , eprint=

Are We Done with MMLU? , author=. 2025 , eprint=

work page 2025
[9]

2022 , eprint=

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them , author=. 2022 , eprint=

work page 2022
[10]

2025 , eprint=

Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free , author=. 2025 , eprint=

work page 2025
[11]

2024 , eprint=

Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning , author=. 2024 , eprint=

work page 2024
[12]

2024 , eprint=

SliceGPT: Compress Large Language Models by Deleting Rows and Columns , author=. 2024 , eprint=

work page 2024
[13]

2024 , eprint=

ShortGPT: Layers in Large Language Models are More Redundant Than You Expect , author=. 2024 , eprint=

work page 2024
[14]

2024 , eprint=

LaCo: Large Language Model Pruning via Layer Collapse , author=. 2024 , eprint=

work page 2024
[15]

2024 , eprint=

Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods , author=. 2024 , eprint=

work page 2024
[16]

2024 , eprint=

Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models , author=. 2024 , eprint=

work page 2024
[17]

2025 , eprint=

REAP the Experts: Why Pruning Prevails for One-Shot MoE compression , author=. 2025 , eprint=

work page 2025
[18]

2024 , eprint=

Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy , author=. 2024 , eprint=

work page 2024
[19]

2024 , eprint=

Compact Language Models via Pruning and Knowledge Distillation , author=. 2024 , eprint=

work page 2024
[20]

2025 , eprint=

DarwinLM: Evolutionary Structured Pruning of Large Language Models , author=. 2025 , eprint=

work page 2025
[21]

2025 , eprint=

SlimMoE: Structured Compression of Large MoE Models via Expert Slimming and Distillation , author=. 2025 , eprint=

work page 2025
[22]

2025 , eprint=

Gated Delta Networks: Improving Mamba2 with Delta Rule , author=. 2025 , eprint=

work page 2025
[23]

Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan , year =

work page
[24]

2025 , eprint=

Finding Fantastic Experts in MoEs: A Unified Study for Expert Dropping Strategies and Observations , author=. 2025 , eprint=

work page 2025
[25]

Slicegpt: Compress large language models by deleting rows and columns

Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman. Slicegpt: Compress large language models by deleting rows and columns, 2024. URL https://arxiv.org/abs/2401.15024

work page arXiv 2024
[26]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models, 2021. URL https://arxiv.org/abs/2108.07732

work page internal anchor Pith review Pith/arXiv arXiv 2021
[27]

Condense, Don't Just Prune: Enhancing Efficiency and Performance in MoE Layer Pruning

Mingyu Cao, Gen Li, Jie Ji, Jiaqi Zhang, Xiaolong Ma, Shiwei Liu, and Lu Yin. Condense, don't just prune: Enhancing efficiency and performance in moe layer pruning, 2025. URL https://arxiv.org/abs/2412.00069

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

Mtbench: A multimodal time series benchmark for temporal reasoning and question answering.arXiv preprint arXiv:2503.16858, 2025

Jialin Chen, Aosong Feng, Ziyu Zhao, Juan Garza, Gaukhar Nurbek, Cheng Qin, Ali Maatouk, Leandros Tassiulas, Yifeng Gao, and Rex Ying. Mtbench: A multimodal time series benchmark for temporal reasoning and question answering, 2026. URL https://arxiv.org/abs/2503.16858

work page arXiv 2026
[29]

Icleval: Evaluating in-context learning ability of large language models, 2024

Wentong Chen, Yankai Lin, ZhenHao Zhou, HongYun Huang, Yantao Jia, Zhao Cao, and Ji-Rong Wen. Icleval: Evaluating in-context learning ability of large language models, 2024. URL https://arxiv.org/abs/2406.14955

work page arXiv 2024
[30]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021. URL https://arxiv.org/abs/2110.14168

work page internal anchor Pith review Pith/arXiv arXiv 2021
[31]

Are we done with mmlu? CoRR, abs/2406.04127,

Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, Claire Barale, Robert McHardy, Joshua Harris, Jean Kaddour, Emile van Krieken, and Pasquale Minervini. Are we done with mmlu?, 2025. URL https://arxiv.org/abs/2406.04127

work page arXiv 2025
[32]

Better & faster large language models via multi-token prediction

Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozi \` e re, David Lopez - Paz, and Gabriel Synnaeve. Better & faster large language models via multi-token prediction. In Forty-first International Conference on Machine Learning, 2024

work page 2024
[33]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021. URL https://arxiv.org/abs/2009.03300

work page internal anchor Pith review Pith/arXiv arXiv 2021
[34]

C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models

Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, Yao Fu, Maosong Sun, and Junxian He. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models, 2023. URL https://arxiv.org/abs/2305.08322

work page arXiv 2023
[35]

Finding fantastic experts in moes: A unified study for expert dropping strategies and observations, 2025

Ajay Jaiswal, Jianyu Wang, Yixiao Li, Pingzhi Li, Tianlong Chen, Zhangyang Wang, Chong Wang, Ruoming Pang, and Xianzhi Du. Finding fantastic experts in moes: A unified study for expert dropping strategies and observations, 2025. URL https://arxiv.org/abs/2504.05586

work page arXiv 2025
[36]

Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, L \' e lio Renard Lavaud, Lucile Saulnier, Marie - Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

arXiv preprint arXiv:2402.02834 , volume=

Bo-Kyeong Kim, Geonmin Kim, Tae-Ho Kim, Thibault Castells, Shinkook Choi, Junho Shin, and Hyoung-Kyu Song. Shortened llama: Depth pruning for large language models with comparison of retraining methods, 2024. URL https://arxiv.org/abs/2402.02834

work page arXiv 2024
[38]

Findings of the 2022 conference on machine translation ( WMT 22)

Tom Kocmi, Rachel Bawden, Ond r ej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Thamme Gowda, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Rebecca Knowles, Philipp Koehn, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Michal Nov \'a k, Martin Popel, and Maja Popovi \'c . Findings of the 2022 conference on machine tra...

work page 2022
[40]

REAP the Experts: Why Pruning Prevails for One-Shot MoE compression

Mike Lasby, Ivan Lazarevich, Nish Sinnadurai, Sean Lie, Yani Ioannou, and Vithursan Thangarasa. Reap the experts: Why pruning prevails for one-shot moe compression, 2025 b . URL https://arxiv.org/abs/2510.13999

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

CMMLU: Measuring massive multitask language understanding in Chinese

Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, and Timothy Baldwin. Cmmlu: Measuring massive multitask language understanding in chinese, 2024 a . URL https://arxiv.org/abs/2306.09212

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

org/CorpusID:220265858

Pingzhi Li, Zhenyu Zhang, Prateek Yadav, Yi-Lin Sung, Yu Cheng, Mohit Bansal, and Tianlong Chen. Merge, then compress: Demystify efficient smoe with hints from its routing policy, 2024 b . URL https://arxiv.org/abs/2310.01334

work page arXiv 2024
[43]

Slimmoe: Structured compression of large moe models via expert slimming and distillation, 2025

Zichong Li, Chen Liang, Zixuan Zhang, Ilgee Hong, Young Jin Kim, Weizhu Chen, and Tuo Zhao. Slimmoe: Structured compression of large moe models via expert slimming and distillation, 2025. URL https://arxiv.org/abs/2506.18349

work page arXiv 2025
[44]

Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation, 2023. URL https://arxiv.org/abs/2305.01210

work page internal anchor Pith review Pith/arXiv arXiv 2023
[45]

Repoqa: Evaluating long context code understanding

Jiawei Liu, Jia Le Tian, Vijay Daita, Yuxiang Wei, Yifeng Ding, Yuhan Katherine Wang, Jun Yang, and Lingming Zhang. Repoqa: Evaluating long context code understanding, 2024. URL https://arxiv.org/abs/2406.06025

work page arXiv 2024
[46]

Not all experts are equal: Efficient expert pruning and skipping for mixture-of-experts large lan- guage models.arXiv preprint arXiv:2402.14800,

Xudong Lu, Qi Liu, Yuhui Xu, Aojun Zhou, Siyuan Huang, Bo Zhang, Junchi Yan, and Hongsheng Li. Not all experts are equal: Efficient expert pruning and skipping for mixture-of-experts large language models, 2024. URL https://arxiv.org/abs/2402.14800

work page arXiv 2024
[47]

Kor-bench: Benchmarking language models on knowledge-orthogonal reasoning tasks.arXiv preprint arXiv:2410.06526, 2024

Kaijing Ma, Xinrun Du, Yunran Wang, Haoran Zhang, Zhoufutu Wen, Xingwei Qu, Jian Yang, Jiaheng Liu, Minghao Liu, Xiang Yue, Wenhao Huang, and Ge Zhang. Kor-bench: Benchmarking language models on knowledge-orthogonal reasoning tasks, 2025. URL https://arxiv.org/abs/2410.06526

work page arXiv 2025
[48]

Llm-pruner: On the structural pruning of large language models

Xinyin Ma, Gongfan Fang, and Xinchao Wang. Llm-pruner: On the structural pruning of large language models. In Advances in Neural Information Processing Systems, 2023

work page 2023
[49]

Shortgpt: Layers in large language models are more redundant than you expect

Xin Men, Mingyu Xu, Qingyu Zhang, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, and Weipeng Chen. Shortgpt: Layers in large language models are more redundant than you expect, 2024. URL https://arxiv.org/abs/2403.03853

work page arXiv 2024
[50]

Compact language models via pruning and knowledge distillation, 2024

Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, and Pavlo Molchanov. Compact language models via pruning and knowledge distillation, 2024. URL https://arxiv.org/abs/2407.14679

work page arXiv 2024
[51]

Pre-training distillation for large language models: A design space exploration, 2024

Hao Peng, Xin Lv, Yushi Bai, Zijun Yao, Jiajie Zhang, Lei Hou, and Juanzi Li. Pre-training distillation for large language models: A design space exploration, 2024. URL https://arxiv.org/abs/2410.16215

work page arXiv 2024
[52]

Demons in the detail: On implementing load balancing loss for training specialized mixture-of-expert models, 2025 a

Zihan Qiu, Zeyu Huang, Bo Zheng, Kaiyue Wen, Zekun Wang, Rui Men, Ivan Titov, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Demons in the detail: On implementing load balancing loss for training specialized mixture-of-expert models, 2025 a . URL https://arxiv.org/abs/2501.11873

work page arXiv 2025
[53]

Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free, 2025 b . URL https://arxiv.org/abs/2505.06708

work page internal anchor Pith review Pith/arXiv arXiv 2025
[55]

Le, Geoffrey E

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In 5th International Conference on Learning Representations, 2017

work page 2017
[56]

Language models are multilingual chain-of-thought reasoners, 2022

Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. Language models are multilingual chain-of-thought reasoners, 2022

work page 2022
[57]

The curse of depth in large language models.arXiv preprint arXiv:2502.05795,

Wenfang Sun, Xinyuan Song, Pengxiang Li, Lu Yin, Yefeng Zheng, and Shiwei Liu. The curse of depth in large language models, 2026. URL https://arxiv.org/abs/2502.05795

work page arXiv 2026
[58]

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, and Jason Wei. Challenging big-bench tasks and whether chain-of-thought can solve them, 2022. URL https://arxiv.org/abs/2210.09261

work page internal anchor Pith review Pith/arXiv arXiv 2022
[59]

arXiv preprint arXiv:2502.07780 , year=

Shengkun Tang, Oliver Sieberling, Eldar Kurtic, Zhiqiang Shen, and Dan Alistarh. Darwinlm: Evolutionary structured pruning of large language models, 2025. URL https://arxiv.org/abs/2502.07780

work page arXiv 2025
[60]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gemini Team. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. CoRR, abs/2507.06261, 2025 a . doi:10.48550/ARXIV.2507.06261

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2507.06261 2025
[61]

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

P Team, Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, King Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, Chujie Zheng, Kaixin Deng, Shawn Gavin, Shian Jia, Sichao Jiang, Yiyan Liao, Rui Li, Qinrui Li, Sirun Li, Yizhi Li, Yunwen Li, David Ma, Yuansheng Ni, Haoran Que, Qiyao Wang, Zhoufutu Wen, Siwei Wu, Tyshawn Hsing, Ming Xu, Z...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[62]

Qwen2.5: A party of foundation models, September 2024

Qwen Team. Qwen2.5: A party of foundation models, September 2024. URL https://qwenlm.github.io/blog/qwen2.5/

work page 2024
[63]

Qwen3-next: Towards ultimate training & inference efficiency, 2025 b

Qwen Team. Qwen3-next: Towards ultimate training & inference efficiency, 2025 b

work page 2025
[64]

Qwen3.5: Accelerating productivity with native multimodal agents, February 2026

Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5

work page 2026
[65]

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark, 2024. URL https://arxiv.org/abs/2406.01574

work page internal anchor Pith review Pith/arXiv arXiv 2024
[66]

CFSP: an efficient structured pruning framework for llms with coarse-to-fine activation information

Yuxin Wang, Minghua Ma, Zekun Wang, Jingchang Chen, Liping Shan, Qing Yang, Dongliang Xu, Ming Liu, and Bing Qin. CFSP: an efficient structured pruning framework for llms with coarse-to-fine activation information. In Proceedings of the 31st International Conference on Computational Linguistics, 2025

[67]

Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding

Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, and Zhifang Sui. Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding. In Findings of the Association for Computational Linguistics, 2024a. URL https://aclanthology.org/2024.findings-acl.456

[68]

Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning

Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, and Danqi Chen. Sheared LLaMA: Accelerating language model pre-training via structured pruning, 2024b. URL https://arxiv.org/abs/2310.06694

[69]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

[70]

Gated Delta Networks: Improving Mamba2 with Delta Rule

Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving Mamba2 with delta rule, 2025b. URL https://arxiv.org/abs/2412.06464

[71]

LaCo: Large Language Model Pruning via Layer Collapse

Yifei Yang, Zouying Cao, and Hai Zhao. LaCo: Large language model pruning via layer collapse, 2024. URL https://arxiv.org/abs/2402.11187

[73]

Root Mean Square Layer Normalization

Biao Zhang and Rico Sennrich. Root mean square layer normalization. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems, 2019

[74]

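Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In 5th International Conference on Learning Representations, 2017.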

[75]

Mixtral of Experts

Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, et al. Mixtral of Experts.

[76]

Qwen3 Technical Report

Qwen3 Technical Report, 2025.

[77]

Qwen2.5: A Party of Foundation Models

Qwen Team. Qwen2.5: A Party of Foundation Models.

[78]

Qwen3.5: Accelerating Productivity with Native Multimodal Agents

Qwen Team. Qwen3.5: Accelerating Productivity with Native Multimodal Agents.

[79]

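Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gemini Team. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. CoRR, abs/2507.06261, 2025. doi:10.48550/ARXIV.2507.06261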

[80]

Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models

Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models, 2025.

[81]

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines, 2025.

[82]

KOR-Bench: Benchmarking Language Models on Knowledge-Orthogonal Reasoning Tasks

KOR-Bench: Benchmarking Language Models on Knowledge-Orthogonal Reasoning Tasks, 2025.

[83]

ICLEval: Evaluating In-Context Learning Ability of Large Language Models

ICLEval: Evaluating In-Context Learning Ability of Large Language Models, 2024.

