SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training
Pith reviewed 2026-05-12 03:24 UTC · model grok-4.3
The pith
Pruning a pretrained large MoE consistently outperforms training the smaller target architecture from scratch under the same training budget.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across depth, width, and expert compression, pruning a pretrained MoE consistently outperforms training the target architecture from scratch under the same training budget. Different one-shot expert compression methods converge to similar final performance after large-scale continual pretraining, and a simple partial-preservation expert merging strategy improves downstream performance across most benchmarks. Combining knowledge distillation (KD) with the language modeling loss outperforms KD alone, particularly on knowledge-intensive tasks, and multi-token prediction distillation yields consistent gains. Progressive pruning schedules outperform one-shot compression, enabling effective reduction of large MoE models.
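As a concrete illustration of the training-objective finding, here is a minimal sketch of a combined distillation-plus-language-modeling loss: a weighted sum of next-token cross-entropy and a KL term against the teacher. The weighting, temperature, and tensor shapes are illustrative assumptions rather than the paper's recipe, and the paper's multi-token prediction distillation would additionally distill extra future-token heads in the same fashion.

# Hedged sketch: KD combined with the language-modeling loss.
# The 0.5 weighting and temperature are assumptions, not the paper's values.
import torch
import torch.nn.functional as F

def kd_plus_lm_loss(student_logits, teacher_logits, targets,
                    lm_weight: float = 0.5, temperature: float = 1.0):
    """
    student_logits, teacher_logits: (batch, seq, vocab)
    targets: (batch, seq) next-token ids
    Returns lm_weight * LM cross-entropy + (1 - lm_weight) * KL(teacher || student).
    """
    vocab = student_logits.size(-1)
    lm = F.cross_entropy(student_logits.reshape(-1, vocab), targets.reshape(-1))
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.log_softmax(teacher_logits / temperature, dim=-1),
        log_target=True, reduction="batchmean",
    ) * temperature ** 2
    return lm_weight * lm + (1.0 - lm_weight) * kd

if __name__ == "__main__":
    B, S, V = 2, 5, 11
    student = torch.randn(B, S, V, requires_grad=True)
    teacher = torch.randn(B, S, V)
    tokens = torch.randint(0, V, (B, S))
    loss = kd_plus_lm_loss(student, teacher, tokens)
    loss.backward()
    print(float(loss))

Setting lm_weight to 0 recovers pure distillation, which the review above reports as the weaker option on knowledge-intensive tasks.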
What carries the argument
Progressive pruning schedules applied to pretrained MoE models during continued pretraining, together with partial-preservation expert merging and multi-token prediction knowledge distillation.
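To picture what a progressive schedule could look like, the sketch below shrinks the retained expert count in stages over continued pretraining, dropping the least-utilized experts at each stage boundary rather than compressing in one shot. The stage count, linear schedule, and utilization scores are placeholder assumptions; the paper's actual schedule and pruning criterion are not specified here.

# Hedged sketch of a progressive expert-pruning schedule (illustrative only).

def experts_to_keep(step: int, total_steps: int, start: int, target: int, stages: int) -> int:
    """Interpolate the retained expert count from `start` to `target` over a fixed number of stages."""
    stage = min(stages, (step + 1) * stages // max(total_steps, 1))
    frac = stage / stages
    return round(start - frac * (start - target))

def progressive_prune(expert_utilization: list[float], start: int, target: int,
                      total_steps: int, stages: int = 4):
    """Yield the kept expert indices at each step of continued pretraining."""
    kept = list(range(start))
    for step in range(total_steps):
        n_keep = experts_to_keep(step, total_steps, start, target, stages)
        if n_keep < len(kept):
            # Drop the currently least-utilized experts until the schedule is met.
            kept = sorted(kept, key=lambda e: expert_utilization[e], reverse=True)[:n_keep]
            kept.sort()
        yield step, list(kept)

if __name__ == "__main__":
    # Toy example: shrink a 16-expert layer to 4 experts over 8 "training" steps.
    utilization = [0.9, 0.1, 0.7, 0.3, 0.8, 0.2, 0.6, 0.4,
                   0.85, 0.15, 0.75, 0.25, 0.65, 0.35, 0.55, 0.45]
    for step, kept in progressive_prune(utilization, start=16, target=4, total_steps=8):
        print(step, len(kept), kept)

In the paper's framing, the contrast is with one-shot compression, where the target count would be applied immediately at step 0.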
If this is right
- Pruning a pretrained MoE supplies a stronger initialization than random weights for the target size under equal compute.
- One-shot expert compression methods become interchangeable after sufficient continued pretraining.
- Partial-preservation expert merging raises accuracy on downstream tasks relative to standard merging (see the sketch after this list).
- Pairing knowledge distillation with the language modeling loss, especially via multi-token prediction, improves results over distillation alone.
- Gradual architecture changes during pruning produce better optimization paths than sudden compression.
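Regarding the partial-preservation merging bullet above, here is a minimal sketch under an assumed form: each kept expert retains most of its own weights (coefficient alpha) and absorbs a small fraction of the pruned experts assigned to it. The assignment rule, the value of alpha, and the flattened weight representation are guesses for illustration, not the paper's formulation.

# Hedged sketch of one plausible "partial-preservation" expert merge.
import numpy as np

def partial_preservation_merge(experts: np.ndarray, keep_idx: list[int],
                               assign: dict[int, int], alpha: float = 0.9) -> np.ndarray:
    """
    experts: (E, D) array, one weight vector per expert (stand-in for full expert weights).
    keep_idx: indices of experts that survive pruning.
    assign: maps each pruned expert to the kept expert that absorbs it.
    alpha: preservation coefficient; alpha = 1 reduces to plain pruning.
    """
    merged = experts[keep_idx].copy()
    pos = {e: i for i, e in enumerate(keep_idx)}   # kept expert -> row in `merged`
    absorbed = {e: [] for e in keep_idx}           # pruned experts collected per kept target
    for pruned, target in assign.items():
        absorbed[target].append(experts[pruned])
    for target, group in absorbed.items():
        if group:
            # Blend: keep alpha of the original expert, (1 - alpha) of the pruned group's mean.
            merged[pos[target]] = alpha * experts[target] + (1 - alpha) * np.mean(group, axis=0)
    return merged

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    E, D = 8, 4
    experts = rng.normal(size=(E, D))
    keep = [0, 2, 4, 6]
    # Assign each pruned expert to its nearest kept expert by L2 distance.
    assign = {p: min(keep, key=lambda k: np.linalg.norm(experts[p] - experts[k]))
              for p in range(E) if p not in keep}
    print(partial_preservation_merge(experts, keep, assign).shape)  # (4, 4)

With alpha = 1 this reduces to plain expert dropping and with alpha = 0 to full merging, so the "partial preservation" sits between the two.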
Where Pith is reading between the lines
- The convergence of different compression methods implies that the continued pretraining phase matters more than the precise initial pruning choice.
- These schedules could lower the cost of creating families of smaller MoE models tailored to specific domains.
- Similar progressive transitions might improve compression results for dense transformer models as well.
- The extra gains on knowledge-intensive tasks suggest the approach could help build more efficient models for reasoning workloads.
Load-bearing premise
The performance advantages seen for pruning over scratch training and for progressive over one-shot schedules will continue to appear in other large MoE architectures and pretraining datasets.
What would settle it
A direct side-by-side comparison, on the same downstream benchmarks, of a smaller MoE trained entirely from scratch for a fixed total token count versus the identical smaller architecture obtained by pruning a larger pretrained MoE and then continuing pretraining for the remaining tokens.
Original abstract
Structured pruning and knowledge distillation (KD) are typical techniques for compressing large language models, but it remains unclear how they should be applied at pretraining scale, especially to recent mixture-of-experts (MoE) models. In this work, we systematically study MoE compression in large-scale pretraining, focusing on three key questions: whether pruning provides a better initialization than training from scratch, how expert compression choices affect the final model after continued training, and which training strategy is most effective. We have the following findings: First, across depth, width, and expert compression, pruning a pretrained MoE consistently outperforms training the target architecture from scratch under the same training budget. Second, different one-shot expert compression methods converge to similar final performance after large-scale continual pretraining. Motivated by this, we introduce a simple partial-preservation expert merging strategy that improves downstream performance across most benchmarks. Third, combining KD with the language modeling loss outperforms KD alone, particularly on knowledge-intensive tasks. We further propose multi-token prediction (MTP) distillation, which yields consistent gains. Finally, given the same training tokens, progressive pruning schedules outperform one-shot compression, suggesting that gradual architecture transitions lead to better optimization trajectories. Putting it all together, we compress Qwen3-Next-80A3B to a 23A2B model that retains competitive performance. These results offer practical guidance for efficient MoE compression at scale.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper systematically studies structured pruning and knowledge distillation for compressing large MoE models during pretraining. It claims that pruning a pretrained MoE outperforms training the target architecture from scratch under the same training budget across depth, width, and expert compression; that different one-shot expert compression methods converge to similar performance after continual pretraining (motivating a partial-preservation expert merging strategy); that combining KD with language modeling loss (plus multi-token prediction distillation) improves results; and that progressive pruning outperforms one-shot compression for the same training tokens. The authors compress Qwen3-Next-80A3B to a 23A2B model with competitive downstream performance.
Significance. If the trends hold under rigorous verification, the work supplies practical guidance for efficient large-scale MoE compression, highlighting the value of progressive schedules and initialization from pruning. This could reduce pretraining costs while preserving performance on knowledge-intensive tasks.
Major comments (2)
- [Abstract] The primary claim that 'pruning a pretrained MoE consistently outperforms training the target architecture from scratch under the same training budget' is load-bearing, yet the budget metric is unspecified. Expert pruning reduces active parameters and per-token FLOPs; equating budgets by token count (rather than by total compute or wall-clock time) would allocate strictly less computation to pruned models, confounding whether the advantage arises from better initialization. This ambiguity must be resolved with explicit FLOPs reporting in the experimental protocol.
- [Experimental results] The reported convergence of one-shot expert compression methods to similar final performance under continual pretraining, and the superiority of progressive schedules, lack error bars, multiple-run statistics, or ablations isolating benchmark selection and hyperparameter effects. Without these controls, it is impossible to confirm that the trends are not artifacts of post-hoc choices or of the specific Qwen3-Next-80A3B setup, which weakens the generalization of the central pruning-vs-scratch finding.
Minor comments (3)
- The notation for MoE sizes (80A3B, 23A2B) should be defined explicitly in the introduction or a dedicated notation section.
- Figure captions and legends for pruning schedule comparisons should be self-contained, highlighting key metrics without requiring reference to the main text.
- [Abstract] The abstract would be strengthened by naming the specific downstream benchmarks on which the 23A2B model retains competitive performance.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment point by point below, providing clarifications on the training budget definition and the statistical aspects of our results while committing to revisions where appropriate.
Point-by-point responses
Referee: [Abstract] The primary claim that 'pruning a pretrained MoE consistently outperforms training the target architecture from scratch under the same training budget' is load-bearing, yet the budget metric is unspecified. Expert pruning reduces active parameters and per-token FLOPs; equating budgets by token count (rather than by total compute or wall-clock time) would allocate strictly less computation to pruned models, confounding whether the advantage arises from better initialization. This ambiguity must be resolved with explicit FLOPs reporting in the experimental protocol.
Authors: We appreciate this observation regarding the need for an explicit budget definition. In the manuscript, the training budget is equated by the number of tokens processed during continued pretraining, which is the conventional metric in large-scale pretraining studies. We acknowledge that this results in lower total FLOPs for pruned models owing to fewer active parameters. The performance advantage is therefore achieved despite reduced per-token compute, which we interpret as evidence for the benefit of the pruning-derived initialization. We will revise the abstract and methods to explicitly state the token-based budget and incorporate detailed FLOPs calculations comparing the two settings. revision: yes
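For intuition on the budget-metric distinction raised above, a back-of-envelope comparison under the common ~6 x N x T estimate of training FLOPs (an assumption here, as are the parameter and token figures): a token-matched budget coincides with a FLOPs-matched budget only when the compared models have the same active-parameter count.

# Hedged sketch: token-matched vs FLOPs-matched training budgets.
def train_flops(active_params: float, tokens: float) -> float:
    # Rough rule of thumb: ~6 FLOPs per active parameter per training token.
    return 6.0 * active_params * tokens

TOKENS = 1e11                        # same token budget for both models
model_a = train_flops(3e9, TOKENS)   # hypothetical model with 3B active parameters
model_b = train_flops(2e9, TOKENS)   # hypothetical model with 2B active parameters
print(f"FLOPs ratio at equal tokens: {model_b / model_a:.2f}")  # 0.67, not 1.0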
Referee: [Experimental results] The reported convergence of one-shot expert compression methods to similar final performance under continual pretraining, and the superiority of progressive schedules, lack error bars, multiple-run statistics, or ablations isolating benchmark selection and hyperparameter effects. Without these controls, it is impossible to confirm that the trends are not artifacts of post-hoc choices or of the specific Qwen3-Next-80A3B setup, which weakens the generalization of the central pruning-vs-scratch finding.
Authors: We recognize that additional statistical controls would increase robustness. However, the computational cost of full-scale pretraining on models of this size precludes multiple independent runs. Results are reported from single runs with fixed seeds and hyperparameters. The observed convergence and progressive pruning advantages appear consistently across depth, width, and expert compression axes as well as across diverse benchmarks, which provides supporting evidence against isolated artifacts. In revision we will add an explicit limitations discussion of the single-run design and include hyperparameter sensitivity results from smaller-scale ablations. revision: partial
Circularity Check
No circularity: purely empirical comparisons with no derivations
Full rationale
The paper reports experimental results on pruning, merging, and distillation strategies for MoE models during pretraining. All central claims (pruning outperforms scratch training under matched budget, one-shot methods converge, progressive schedules win, KD+MTP helps) are supported by direct measurements on Qwen3-Next-80A3B and downstream benchmarks. No equations, uniqueness theorems, or first-principles derivations are invoked that could reduce to fitted parameters or self-citations by construction. The work contains no load-bearing self-citations for theoretical premises, no ansatz smuggling, and no renaming of known results as novel organization. The budget-metric ambiguity noted by the skeptic is a methodological limitation, not a circularity in any derivation chain. This is the expected outcome for an empirical compression study.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: continued pretraining after compression allows different initial compression choices to converge to similar performance.