Pruning and Distilling Mixture-of-Experts into Dense Language Models

Gyeongman Kim; Haechan Kim; Jaewoong Cho; Jihun Yun; Joonghyun Bae; Junhyuck Kim

arXiv: 2605.28207 · v2 · pith:SLIAF6AQnew · submitted 2026-05-27 · 💻 cs.CL · cs.AI · cs.LG

Pith reviewed 2026-06-29 12:36 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords mixture of experts · model compression · knowledge distillation · dense models · expert selection · language model pruning

The pith

Converting a trained Mixture-of-Experts model to a dense model by scoring, grouping and distilling experts outperforms pruning a dense model by 6.3 percentage points in average downstream accuracy at matched parameter count.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a systematic way to turn a trained MoE language model into an ordinary dense model. Experts are scored, the highest-scoring ones are selected and grouped, their weights are concatenated into a single feed-forward layer, and the whole network is refined by distilling knowledge from the original MoE. The authors test seven scoring methods, five grouping strategies and two magnitude-scaling approaches across a range of selected expert counts on Qwen3-30B-A3B, for 350 configurations in total. Scoring turns out to be the most impactful choice, and the proposed diversity-aware scoring also outperforms prior methods on DeepSeek-V2-Lite and GPT-OSS-20B. Under a controlled matched-parameter comparison, the distilled dense model reaches 6.3 points higher average downstream accuracy than a dense model obtained by pruning another dense model, at 1.6 times faster training wall-clock speed.
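The concatenation step rests on a simple identity for gated feed-forward layers: stacking the gate and up projections of several experts row-wise and joining their down projections column-wise gives one wider FFN whose output is the unweighted sum of those experts' outputs. The sketch below checks this numerically in plain Python. It assumes SwiGLU-style experts without biases; the helper names (`gated_ffn`, `concat_experts`, `random_expert`) and the tiny dimensions are illustrative and not taken from the paper, whose magnitude scaling and distillation stages are not modeled here.

```python
import math
import random


def silu(z):
    return z / (1.0 + math.exp(-z))


def matvec(W, x):
    # W is a list of rows; returns W @ x as a list.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def gated_ffn(W_gate, W_up, W_down, x):
    """y = W_down @ (silu(W_gate @ x) * (W_up @ x)), no biases (assumption)."""
    g = matvec(W_gate, x)
    u = matvec(W_up, x)
    h = [silu(gi) * ui for gi, ui in zip(g, u)]
    return matvec(W_down, h)


def concat_experts(experts):
    """Stack gate/up rows and join down columns of the chosen experts."""
    W_gate = [row for e in experts for row in e["gate"]]
    W_up = [row for e in experts for row in e["up"]]
    d_model = len(experts[0]["down"])
    W_down = [[v for e in experts for v in e["down"][r]] for r in range(d_model)]
    return W_gate, W_up, W_down


def random_expert(rng, d_model, d_expert):
    # Hypothetical stand-in for a trained expert's weights.
    def rand(rows, cols):
        return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]
    return {
        "gate": rand(d_expert, d_model),
        "up": rand(d_expert, d_model),
        "down": rand(d_model, d_expert),
    }


rng = random.Random(0)
d_model, d_expert = 4, 3  # toy sizes, not from the paper
experts = [random_expert(rng, d_model, d_expert) for _ in range(2)]
x = [rng.uniform(-1, 1) for _ in range(d_model)]

dense_out = gated_ffn(*concat_experts(experts), x)
per_expert = [gated_ffn(e["gate"], e["up"], e["down"], x) for e in experts]
sum_out = [sum(vals) for vals in zip(*per_expert)]
max_gap = max(abs(a - b) for a, b in zip(dense_out, sum_out))
print(max_gap)  # difference is at floating-point rounding level
```

Since the router is discarded, this plain sum differs from the MoE's routed, weighted output; that mismatch is presumably what the scaling and distillation stages are meant to absorb.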

Core claim

A trained MoE can be converted into a standard dense model by scoring experts, selecting and grouping them, concatenating their parameters into a dense FFN and distilling from the MoE teacher; the resulting dense model exceeds the downstream accuracy of a dense model produced by pruning a dense teacher at the same parameter count.
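The distillation stage can be illustrated with the classic temperature-softened KL objective from Hinton et al. (2015). This is a minimal sketch under the assumption that the student is trained to match the teacher's output distribution; the source text does not specify the paper's exact loss, and the function names `log_softmax` and `kd_loss` are illustrative.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_z for z in scaled]


def kd_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return kl * temperature ** 2


# Toy logits for a single token position (made up for illustration).
teacher = [2.0, 0.0, -1.0]
print(kd_loss(teacher, teacher))          # identical distributions give zero loss
print(kd_loss(teacher, [0.0, 0.0, 0.0]))  # a uniform student is penalized
```

In practice this per-token loss would be averaged over a batch of positions, with the teacher's logits computed by the frozen MoE.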

What carries the argument

Diversity-aware scoring of experts, which ranks them to maximize coverage of distinct knowledge before grouping and distillation into a dense feed-forward network.
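To make the idea of trading importance against coverage concrete, here is a generic greedy selection heuristic. It is not the paper's scorer, which this page does not describe in detail: the per-expert feature vectors, the `importance` values, the Euclidean distance, and the `diversity_weight` parameter are all assumptions made for illustration.

```python
import math


def greedy_diverse_select(features, importance, k, diversity_weight=1.0):
    """Pick k expert indices, rewarding distance from experts already chosen."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    remaining = set(range(len(features)))
    # Seed with the most important expert (lowest index wins ties).
    first = max(remaining, key=lambda i: (importance[i], -i))
    selected = [first]
    remaining.remove(first)

    while remaining and len(selected) < k:
        def gain(i):
            nearest = min(dist(features[i], features[j]) for j in selected)
            return importance[i] + diversity_weight * nearest

        best = max(sorted(remaining), key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical experts: 0 and 1 are near-duplicates, 2 and 3 are distinct.
features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [0.0, 4.0]]
importance = [1.0, 0.9, 0.5, 0.4]

print(greedy_diverse_select(features, importance, k=2))                        # [0, 2]
print(greedy_diverse_select(features, importance, k=2, diversity_weight=0.0))  # [0, 1]
```

With the diversity term switched off the heuristic keeps the two near-duplicate experts; with it on, the second slot goes to an expert that covers a different region.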

If this is right

Frontier MoE models become usable in memory-limited settings without having to load every expert at inference time.
Distillation in the MoE-to-dense setting runs at 1.6x faster training wall-clock speed than in the dense-to-dense pruning baseline.
Scoring method is the most impactful of the tested design choices, and diversity-aware scoring outperforms prior methods on Qwen3-30B-A3B, DeepSeek-V2-Lite and GPT-OSS-20B.
The conversion leaves a fully dense model whose inference cost and memory footprint match those of any standard transformer of the same width.
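The memory argument can be made concrete with a rough per-layer parameter count. The sketch assumes gated FFNs with three weight matrices and no biases, plus a linear router of size `d_model * n_experts`; all dimensions are hypothetical and are not the configuration of any model named on this page.

```python
def gated_ffn_params(d_model, d_ff):
    # Gate, up and down projections, no biases (assumption).
    return 3 * d_model * d_ff


def moe_layer_params(d_model, d_expert, n_experts):
    # All experts must be resident in memory, plus a linear router.
    return n_experts * gated_ffn_params(d_model, d_expert) + d_model * n_experts


def dense_layer_params(d_model, d_expert, k_selected):
    # Concatenating k experts gives one FFN of width k * d_expert.
    return gated_ffn_params(d_model, k_selected * d_expert)


# Hypothetical dimensions for illustration only.
d_model, d_expert, n_experts, k_selected = 1024, 512, 64, 8

moe = moe_layer_params(d_model, d_expert, n_experts)
dense = dense_layer_params(d_model, d_expert, k_selected)
print(moe, dense, round(dense / moe, 4))
```

Under these assumptions the dense layer holds roughly `k_selected / n_experts` of the MoE layer's FFN parameters, which is the quantity that has to fit in memory at inference time.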

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same pipeline could let practitioners run large MoE checkpoints on edge devices by shipping only the distilled dense version.
Task-specific re-scoring of experts after initial conversion might recover additional accuracy without retraining the entire student.
If the diversity signal generalizes, future MoE training runs could deliberately encourage expert specialization knowing that excess experts can later be folded into dense layers.

Load-bearing premise

The expert selection and grouping choices identified on the three evaluated models will produce comparable gains on other MoE architectures or larger scales without additional hyper-parameter search.

What would settle it

Apply the same scoring-grouping-distillation pipeline to a fourth, previously unseen MoE model. If the final dense student underperforms a matched-size dense-to-dense pruned baseline by more than three points on the same downstream suite, the premise fails.

Original abstract

Mixture-of-Experts (MoE) is now the dominant architecture for frontier language models, yet it requires all expert parameters to be loaded in memory, making it less preferable for memory-constrained deployment. Existing compression methods reduce the number of experts but the output remains an MoE model with the same fundamental limitation. We present the first systematic framework for converting a trained MoE into a standard fully dense architecture: experts are scored, selected, and grouped, then concatenated into a dense FFN and refined by knowledge distillation from the MoE teacher. We evaluate 7 scoring, 5 grouping, and 2 magnitude scaling methods across a range of selected expert counts on Qwen3-30B-A3B, yielding 350 configurations. We find that the choice of scoring method is the most impactful, with our novel diversity-aware scoring consistently outperforming prior methods on Qwen3-30B-A3B, DeepSeek-V2-Lite, and GPT-OSS-20B. Under a controlled comparison at matched parameter count, MoE-to-dense outperforms dense-to-dense pruning by +6.3 pp in average downstream accuracy after ~4B-token distillation at 1.6x faster training wall-clock speed.
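The configuration count in the abstract can be sanity-checked with a little arithmetic. Assuming the 350 runs form a full factorial grid (the abstract does not say so explicitly), 7 scoring methods times 5 grouping methods times 2 scaling methods gives 70 combinations, which would imply 5 distinct selected-expert counts. The index ranges below are placeholders, not the paper's actual settings.

```python
from itertools import product

n_scoring, n_grouping, n_scaling = 7, 5, 2  # counts stated in the abstract
total_configs = 350                         # total stated in the abstract

per_expert_count = n_scoring * n_grouping * n_scaling
implied_expert_counts, remainder = divmod(total_configs, per_expert_count)

# Enumerate the grid by index; real method names and expert counts are not given here.
grid = list(product(
    range(n_scoring),
    range(n_grouping),
    range(n_scaling),
    range(implied_expert_counts),
))

print(per_expert_count, implied_expert_counts, remainder, len(grid))
```

A zero remainder is consistent with a full grid but does not prove one; the paper's tables would be needed to confirm the exact expert counts.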

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

The paper gives a practical pipeline for turning MoE models into dense ones via expert scoring, grouping, and distillation, with a new diversity-aware scorer that beats earlier methods on the tested models, but the headline gain over dense pruning may not be cleanly isolated.

The core contribution is a systematic way to convert a trained MoE into a standard dense model: score the experts, pick and group them, concatenate into an FFN, then distill from the original MoE teacher. They run 350 configurations across scoring, grouping, and scaling choices on Qwen3-30B-A3B and check the best scorers on two other models. The diversity-aware scoring is new and consistently better than prior options. They also report 1.6x faster training wall-clock and a +6.3 pp downstream gain over dense-to-dense pruning at matched parameter count after roughly 4B tokens of distillation.

The setup is useful for anyone who wants to deploy high-capacity MoE models on hardware that cannot hold all experts. The scale of the ablation (350 runs) and the cross-model check are the parts that stand out as solid empirical work.

The main uncertainty is whether the dense-to-dense pruning baseline receives the same ~4B-token distillation from the MoE teacher. The abstract ties the reported gain to the full MoE-to-dense pipeline but does not explicitly state that the pruning baseline gets identical distillation treatment. If the baseline is evaluated right after pruning without that step, the delta cannot be attributed to the expert selection and grouping alone. That needs a clear statement in the paper. The abstract also gives no error bars or dataset-level breakdowns, which makes the 6.3 pp figure harder to interpret without the full tables.

This is aimed at people working on model compression and efficient inference. It is empirical, addresses a real deployment pain point, and shows enough controlled experiments to merit referee time even if the baseline comparison needs tightening.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a framework for converting trained MoE models to dense architectures: experts are scored (including a novel diversity-aware method), selected, grouped, concatenated into a dense FFN, and refined via knowledge distillation from the MoE teacher. Across 350 configurations on Qwen3-30B-A3B (plus evaluations on DeepSeek-V2-Lite and GPT-OSS-20B), scoring method is identified as most impactful. Under matched parameter count, MoE-to-dense yields +6.3 pp higher average downstream accuracy than dense-to-dense pruning after ~4B-token distillation, at 1.6x faster wall-clock training time.

Significance. If the comparison is properly controlled, the work offers a practical route to deploy MoE-derived capabilities in memory-constrained dense models. The scale of 350 configurations evaluated provides substantial empirical coverage of design choices, which is a strength for an applied compression study.

major comments (2)

[Abstract] Abstract: the central claim of a '+6.3 pp' gain 'under a controlled comparison at matched parameter count' after '~4B-token distillation' does not state whether the dense-to-dense pruning baseline receives equivalent distillation from the MoE teacher for the same token budget. Without this, the delta cannot be unambiguously attributed to the expert scoring/grouping steps rather than differences in the distillation protocol.
[Experiments] Experiments section (description of 350 configurations and downstream results): no error bars, standard deviations, or number of random seeds/runs are reported for any accuracy numbers, including the headline +6.3 pp delta. This undermines assessment of whether the reported gains exceed typical run-to-run variance in language-model fine-tuning.

minor comments (2)

[Abstract] Abstract and experimental protocol: downstream task names, dataset splits, and evaluation settings are not listed, hindering direct reproduction of the accuracy numbers.
The paper evaluates three MoE models but does not discuss whether the identified scoring/grouping hyperparameters transfer without retuning to other MoE families or larger scales.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback. We address each major comment below and indicate the corresponding revisions.

Referee: [Abstract] Abstract: the central claim of a '+6.3 pp' gain 'under a controlled comparison at matched parameter count' after '~4B-token distillation' does not state whether the dense-to-dense pruning baseline receives equivalent distillation from the MoE teacher for the same token budget. Without this, the delta cannot be unambiguously attributed to the expert scoring/grouping steps rather than differences in the distillation protocol.

Authors: We confirm that the dense-to-dense pruning baseline receives identical knowledge distillation from the MoE teacher using the same ~4B-token budget, ensuring the comparison is controlled. The reported gain is therefore attributable to the MoE-to-dense pipeline. We will revise the abstract to state this explicitly.
revision: yes

Referee: [Experiments] Experiments section (description of 350 configurations and downstream results): no error bars, standard deviations, or number of random seeds/runs are reported for any accuracy numbers, including the headline +6.3 pp delta. This undermines assessment of whether the reported gains exceed typical run-to-run variance in language-model fine-tuning.

Authors: We agree that variance reporting would improve assessment of the results. The scale of 350 configurations made multiple seeds per run computationally prohibitive. In the revision we will add standard deviations from repeated runs on the primary configurations and the headline comparison.
revision: yes
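The kind of variance reporting requested here can be sketched in a few lines: given per-seed accuracies for two methods, report the mean difference together with the standard error of that difference. The numbers below are placeholders invented for the example, not results from the paper, and the helper name `delta_with_stderr` is illustrative.

```python
import math
import statistics


def delta_with_stderr(runs_a, runs_b):
    """Mean difference (a - b) and its standard error for unpaired runs."""
    mean_a, mean_b = statistics.mean(runs_a), statistics.mean(runs_b)
    var_a, var_b = statistics.variance(runs_a), statistics.variance(runs_b)
    stderr = math.sqrt(var_a / len(runs_a) + var_b / len(runs_b))
    return mean_a - mean_b, stderr


# Placeholder per-seed accuracies in percent (NOT from the paper).
placeholder_a = [61.0, 62.0, 63.0]
placeholder_b = [55.0, 56.0, 57.0]

delta, stderr = delta_with_stderr(placeholder_a, placeholder_b)
print(f"delta = {delta:.2f} pp, standard error = {stderr:.2f} pp")
```

A gap of several standard errors is far easier to interpret than a single-run difference, which is why repeated runs on the headline comparison matter more than on every grid cell.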

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with no derivations or fitted predictions

The paper describes an empirical pipeline of scoring, grouping, concatenation, and distillation evaluated across 350 configurations on three models. No equations, uniqueness theorems, ansatzes, or predictions are presented that could reduce to inputs by construction. All reported gains (e.g., +6.3 pp) are direct experimental outcomes at matched parameter counts after fixed-token distillation. No self-citations are load-bearing for any central claim, and the work contains no mathematical derivation chain. This is a standard non-finding for an empirical methods paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical axioms or invented entities; the work consists of empirical comparisons of pruning and distillation heuristics.

