Recognition: 2 Lean theorem links
EvoESAP: Non-Uniform Expert Pruning for Sparse MoE
Pith reviewed 2026-05-15 14:31 UTC · model grok-4.3
The pith
Evolutionary search finds non-uniform layer budgets that improve pruned sparse MoE generation performance
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Decoupling within-layer expert ranking from across-layer budget allocation allows an evolutionary search, guided by the bounded Expected Speculative Acceptance Proxy, to locate non-uniform sparsity patterns that preserve more of the original model's open-ended generation behavior than uniform patterns at identical global sparsity.
What carries the argument
EvoESAP evolutionary search over layer-wise sparsity allocations, which ranks candidates using the Expected Speculative Acceptance Proxy (ESAP) while keeping the within-layer expert order fixed.
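A minimal sketch of how such a search could look. This is illustrative, not the authors' implementation: the population size, mutation scheme, and the `esap_score` stub are assumptions; the only constraint carried over from the abstract is that the global number of pruned experts stays fixed while its distribution across layers varies, with the within-layer expert order held fixed inside the scoring function.

```python
import random

def random_allocation(num_layers, experts_per_layer, total_pruned):
    """Random layer-wise budget: how many experts to prune in each layer,
    summing to the fixed global budget."""
    alloc = [0] * num_layers
    for _ in range(total_pruned):
        open_layers = [i for i in range(num_layers) if alloc[i] < experts_per_layer - 1]
        alloc[random.choice(open_layers)] += 1
    return alloc

def mutate(alloc, experts_per_layer):
    """Move one unit of budget between two layers, preserving the global total."""
    alloc = alloc[:]
    src = random.choice([i for i, a in enumerate(alloc) if a > 0])
    dst_candidates = [i for i, a in enumerate(alloc)
                      if a < experts_per_layer - 1 and i != src]
    if not dst_candidates:
        return alloc
    alloc[src] -= 1
    alloc[random.choice(dst_candidates)] += 1
    return alloc

def evolve(esap_score, num_layers, experts_per_layer, total_pruned,
           pop_size=16, generations=50):
    """Evolutionary search over layer-wise budgets. `esap_score(alloc)` is assumed
    to prune the model according to `alloc` (using the fixed within-layer expert
    ranking) and return the bounded ESAP fitness, higher being better."""
    population = [random_allocation(num_layers, experts_per_layer, total_pruned)
                  for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=esap_score, reverse=True)
        survivors = scored[: pop_size // 2]
        population = survivors + [mutate(random.choice(survivors), experts_per_layer)
                                  for _ in range(pop_size - len(survivors))]
    return max(population, key=esap_score)
```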
If this is right
- At fixed total sparsity the model retains higher accuracy on math and open-ended generation than uniform pruning allows.
- Multiple-choice accuracy remains competitive, so the gains do not trade off one capability for another.
- Any existing within-layer pruning criterion can be paired with the non-uniform allocation search without modification.
- Search cost stays low because ESAP is stable and bounded and does not require running full autoregressive decoding for every candidate.
Where Pith is reading between the lines
- The same proxy-guided search could be applied at finer granularity to choose both which experts and how many per layer in a single joint optimization.
- Allocations discovered on general data might be further tuned on task-specific validation sets for additional targeted gains.
- Similar bounded proxies could be derived for other MoE compression axes such as expert capacity or routing temperature.
Load-bearing premise
The ESAP proxy computed from teacher-forced speculative acceptance ranks pruning allocations in the same order as their actual performance under full autoregressive open-ended generation.
What would settle it
Full autoregressive evaluation on the allocations returned by EvoESAP shows no gain or a loss relative to uniform pruning on the same open-ended generation tasks and model sizes.
read the original abstract
Sparse Mixture-of-Experts (SMoE) language models achieve strong capability at low per-token compute, yet deployment remains constrained by memory footprint and throughput because the full expert pool must still be stored and served. Post-training expert pruning reduces this cost, but most methods focus on which experts to prune within each layer and default to a uniform layer-wise sparsity allocation, even though the layer-wise allocation can strongly affect performance. We decouple pruning into within-layer expert ranking and across-layer budget allocation, and introduce Expected Speculative Acceptance Proxy (ESAP), a speculative-decoding-inspired, teacher-forced metric that measures how well a pruned model matches the full model without costly autoregressive decoding. ESAP is bounded and stable, enabling cheap comparison of many candidates. Building on ESAP, we propose EvoESAP, an evolutionary search framework that finds an improved non-uniform layer-wise sparsity allocation under a fixed global budget while holding the within-layer pruning order fixed, making it a plug-and-play method for criteria such as Frequency, EAN, SEER, and REAP. Across 7B–30B SMoE LLMs at 25% and 50% sparsity, EvoESAP consistently discovers non-uniform allocations that improve open-ended generation (up to +19.6% on MATH-500 at 50% sparsity) while preserving competitive multiple-choice accuracy compared with uniform pruning at the same sparsity. Code is available at https://github.com/ZongfangLiu/EvoESAP.
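The abstract does not spell out the ESAP formula, but a speculative-decoding-style acceptance rate computed under teacher forcing would look roughly like the sketch below: a token drafted by the pruned model is accepted with probability min(1, p_full/p_pruned), and the expectation of that over the pruned model's distribution is sum_x min(p_full(x), p_pruned(x)), which is bounded in [0, 1] and needs only teacher-forced forward passes of both models rather than autoregressive decoding. The exact definition in the paper may differ; the tensor shapes and function name here are assumptions.

```python
import torch
import torch.nn.functional as F

def esap_like_proxy(full_logits: torch.Tensor, pruned_logits: torch.Tensor) -> float:
    """Speculative-acceptance-style proxy under teacher forcing (illustrative).

    full_logits, pruned_logits: (seq_len, vocab_size) next-token logits from the
    full and the pruned model on the SAME teacher-forced token sequence.
    Per position, sum_x min(p_full(x), p_pruned(x)) equals the expected
    speculative acceptance probability (one minus the total-variation distance),
    so every per-position score, and hence the mean, lies in [0, 1]."""
    p_full = F.softmax(full_logits, dim=-1)
    p_pruned = F.softmax(pruned_logits, dim=-1)
    per_position = torch.minimum(p_full, p_pruned).sum(dim=-1)  # (seq_len,)
    return per_position.mean().item()
```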
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript decouples expert pruning in Sparse Mixture-of-Experts (SMoE) models into within-layer ranking and across-layer budget allocation. It introduces the Expected Speculative Acceptance Proxy (ESAP), a teacher-forced, speculative-decoding-inspired metric for cheaply ranking pruning allocations, and EvoESAP, an evolutionary search that optimizes non-uniform layer-wise sparsity budgets under a fixed global sparsity target while holding within-layer expert order fixed. The central claim is that EvoESAP yields allocations that improve open-ended generation (up to +19.6% on MATH-500 at 50% sparsity) relative to uniform pruning across 7B–30B models while preserving competitive multiple-choice accuracy; the method is presented as plug-and-play for existing ranking criteria such as Frequency, EAN, SEER, and REAP.
Significance. If the reported gains are robust, the work provides a practical, low-cost way to improve post-training pruning of deployed SMoE models by optimizing layer-wise budgets rather than defaulting to uniformity. The bounded, stable ESAP proxy and code release are concrete strengths that lower the barrier to adoption. The decoupling insight is useful even if the specific search algorithm is later refined.
major comments (3)
- [Abstract] Abstract and results section: the headline +19.6% MATH-500 gain at 50% sparsity is presented without reported run counts, standard deviations, or multiple-testing correction; because this number is the primary evidence for the superiority of non-uniform allocations, the absence of these controls makes the central empirical claim difficult to assess.
- [ESAP definition] ESAP definition and validation: the manuscript defines ESAP via teacher-forced speculative acceptance but provides no quantitative correlation (e.g., rank correlation or scatter plot) between ESAP scores and full autoregressive open-ended metrics; if this correlation is weak, the evolutionary search may be optimizing a misaligned objective.
- [Method] Separability assumption in the method: the paper fixes within-layer expert ranking order and searches only over layer-wise budgets, yet offers no ablation comparing this to joint optimization of ranking and allocation; the reported gains could be capped by this design choice.
minor comments (2)
- [Notation] Clarify the precise mathematical definition of ESAP (including any temperature or acceptance threshold) in the main text rather than relying solely on the high-level abstract description.
- [Experiments] Add a brief table or figure caption note on the exact sparsity targets (25% and 50%) and model sizes used for each reported number.
Simulated Author's Rebuttal
Thank you for the detailed and constructive review. We address each major comment point by point below and will revise the manuscript to strengthen the empirical claims and add requested analyses.
read point-by-point responses
- Referee: [Abstract] Abstract and results section: the headline +19.6% MATH-500 gain at 50% sparsity is presented without reported run counts, standard deviations, or multiple-testing correction; because this number is the primary evidence for the superiority of non-uniform allocations, the absence of these controls makes the central empirical claim difficult to assess.
Authors: We agree that reporting run counts, standard deviations, and addressing multiple-testing concerns would strengthen the central claim. In the revised manuscript we will rerun the key open-ended generation experiments (including MATH-500 at 50% sparsity) over at least three independent random seeds, report means with standard deviations, and update both the abstract and results section. We will also note the total number of configurations evaluated to contextualize the headline figure. revision: yes
- Referee: [ESAP definition] ESAP definition and validation: the manuscript defines ESAP via teacher-forced speculative acceptance but provides no quantitative correlation (e.g., rank correlation or scatter plot) between ESAP scores and full autoregressive open-ended metrics; if this correlation is weak, the evolutionary search may be optimizing a misaligned objective.
Authors: We acknowledge that explicit validation of the ESAP proxy against full autoregressive metrics is important. In the revision we will add a new analysis (in Section 3 and the appendix) that reports Spearman rank correlation and a scatter plot between ESAP scores and actual open-ended generation performance (MATH-500 and GSM8K) across a diverse set of pruning allocations on the 7B model; a minimal sketch of such a correlation check appears after this list. This will quantify the alignment of the proxy objective. revision: yes
- Referee: [Method] Separability assumption in the method: the paper fixes within-layer expert ranking order and searches only over layer-wise budgets, yet offers no ablation comparing this to joint optimization of ranking and allocation; the reported gains could be capped by this design choice.
Authors: The separability assumption is intentional to preserve plug-and-play compatibility with any existing within-layer ranking criterion. We will add an ablation study in the revised manuscript that compares the current EvoESAP approach against a joint optimization of both ranking and allocation on the 7B model, reporting the additional performance delta (if any) to quantify the potential cap on gains. revision: yes
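The second rebuttal point proposes validating ESAP against full autoregressive metrics via rank correlation; the sketch referenced there follows. It assumes paired scores have already been collected for a set of candidate allocations; the numbers are placeholders to make the snippet runnable, not results from the paper, and the scipy dependency is a choice of convenience.

```python
from scipy.stats import spearmanr

# Placeholder paired scores for the same candidate allocations (hypothetical values):
# esap_scores[i]    = teacher-forced ESAP value of allocation i
# math500_scores[i] = that allocation's accuracy under full autoregressive decoding
esap_scores    = [0.81, 0.77, 0.74, 0.69, 0.66, 0.62]
math500_scores = [0.43, 0.41, 0.42, 0.35, 0.33, 0.30]

rho, p_value = spearmanr(esap_scores, math500_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")

# A rho near 1 would support the load-bearing premise that ESAP orders allocations
# the same way full decoding does; a low or negative rho would mean the evolutionary
# search is optimizing a misaligned proxy.
```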
Circularity Check
No significant circularity: ESAP proxy and final benchmarks are independent
full rationale
The paper defines ESAP as a standalone teacher-forced speculative acceptance metric, employs it only to guide evolutionary search over layer-wise budgets (with within-layer order held fixed), and then measures the resulting allocations on external benchmarks (MATH-500, multiple-choice accuracy). These benchmark numbers are not computed from the ESAP objective or any fitted parameter; they are direct evaluations. No self-citation chain, ansatz smuggling, or renaming of known results appears in the derivation. The central claim therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- Foundation.RealityFromDistinction · reality_from_one_distinction (tagged unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "Across 7B–30B SMoE LLMs at 25% and 50% sparsity, EvoESAP consistently discovers non-uniform allocations"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts
  HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-wei...
- Model Compression with Exact Budget Constraints via Riemannian Manifolds
  The budget constraint in discrete model compression defines a Riemannian manifold allowing exact-constraint first-order optimization via Riemannian Constrained Optimization (RCO) without extra hyperparameters.