
arxiv: 2605.28207 · v2 · pith:SLIAF6AQnew · submitted 2026-05-27 · 💻 cs.CL · cs.AI · cs.LG

Pruning and Distilling Mixture-of-Experts into Dense Language Models

Pith reviewed 2026-06-29 12:36 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords mixture of experts · model compression · knowledge distillation · dense models · expert selection · language model pruning

The pith

Converting a trained Mixture-of-Experts model to a dense model by scoring, grouping and distilling experts outperforms pruning a dense model by 6.3 percentage points in average downstream accuracy at matched parameter count.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a systematic way to turn a trained MoE language model into an ordinary dense model. Experts receive scores, the best ones are chosen and grouped, their weights are concatenated into a single feed-forward layer, and the whole network is refined by distilling knowledge from the original MoE. The authors test seven scoring methods, five grouping strategies and two magnitude-scaling approaches across a range of selected expert counts on Qwen3-30B-A3B, for 350 configurations in total, and check the scoring result on DeepSeek-V2-Lite and GPT-OSS-20B as well. Diversity-aware scoring proves most effective, and under a controlled matched-parameter comparison the distilled dense model reaches 6.3 points higher average downstream accuracy than a dense model obtained by pruning another dense model, while training at 1.6 times faster wall-clock speed.
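
As an editorial illustration of the concatenation step (not code from the paper), the sketch below assumes each expert is a gated, SwiGLU-style FFN with three weight matrices. Under that assumption, stacking the selected experts along the hidden dimension yields one dense FFN whose output equals the unweighted sum of the individual expert outputs. Router weights and the paper's magnitude scaling are not modeled here, and all names (`gated_ffn`, `concat_experts`) and sizes are hypothetical.

```python
import numpy as np

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def gated_ffn(x, w_gate, w_up, w_down):
    """Gated FFN: (silu(x @ w_gate) * (x @ w_up)) @ w_down."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def concat_experts(experts):
    """Stack experts along the hidden dimension to form one dense FFN."""
    w_gate = np.concatenate([e[0] for e in experts], axis=1)  # (d, k*h)
    w_up = np.concatenate([e[1] for e in experts], axis=1)    # (d, k*h)
    w_down = np.concatenate([e[2] for e in experts], axis=0)  # (k*h, d)
    return w_gate, w_up, w_down

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
d_model, d_expert, n_selected = 8, 4, 3
experts = [
    (rng.normal(size=(d_model, d_expert)),
     rng.normal(size=(d_model, d_expert)),
     rng.normal(size=(d_expert, d_model)))
    for _ in range(n_selected)
]
x = rng.normal(size=(5, d_model))

dense = concat_experts(experts)
dense_out = gated_ffn(x, *dense)
sum_out = sum(gated_ffn(x, *e) for e in experts)
print(np.allclose(dense_out, sum_out))
```

The dense layer's hidden width is `n_selected * d_expert`, so the number of selected experts directly sets the student's FFN size. This is why the selection step also controls the parameter count of the resulting model.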

Core claim

A trained MoE can be converted into a standard dense model by scoring experts, selecting and grouping them, concatenating their parameters into a dense FFN and distilling from the MoE teacher; the resulting dense model exceeds the downstream accuracy of a dense model produced by pruning a dense teacher at the same parameter count.
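
The distillation objective is not spelled out on this page. As a minimal sketch, assuming the classic temperature-softened formulation of knowledge distillation, the student can be trained to minimize the KL divergence from the teacher's token distribution to its own. The function names and the `temperature` parameter are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """Mean KL(teacher || student) over tokens, on softened distributions."""
    t_logp = log_softmax(teacher_logits, temperature)
    s_logp = log_softmax(student_logits, temperature)
    kl = (np.exp(t_logp) * (t_logp - s_logp)).sum(axis=-1)
    return float(kl.mean())

# Toy logits: 4 token positions, vocabulary of 10 (illustrative only).
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 10))
student = rng.normal(size=(4, 10))
print(kd_loss(student, teacher))
```

The loss is zero when the student reproduces the teacher's distribution exactly and positive otherwise.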

What carries the argument

Diversity-aware scoring of experts, which ranks them to maximize coverage of distinct knowledge before grouping and distillation into a dense feed-forward network.
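
The paper's actual diversity-aware score is not given on this page. To make the idea concrete, here is a hypothetical greedy selection that discounts an expert's importance by its cosine similarity to experts already chosen. The `signatures` (one vector per expert, for example a mean activation), the `importance` values and the gain formula are all assumptions made for illustration.

```python
import numpy as np

def greedy_diverse_select(signatures, importance, k):
    """Pick k experts, trading importance against similarity to those already picked."""
    signatures = np.asarray(signatures, dtype=float)
    importance = np.asarray(importance, dtype=float)
    unit = signatures / np.linalg.norm(signatures, axis=1, keepdims=True)
    sim = unit @ unit.T  # pairwise cosine similarity
    chosen = [int(np.argmax(importance))]
    while len(chosen) < k:
        redundancy = sim[:, chosen].max(axis=1)  # closeness to the chosen set
        gain = importance * (1.0 - redundancy)
        gain[chosen] = -np.inf                   # never pick an expert twice
        chosen.append(int(np.argmax(gain)))
    return chosen

# Experts 0 and 1 are near-duplicates; expert 2 is less important but distinct.
signatures = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
importance = [1.0, 0.9, 0.5]
selected = greedy_diverse_select(signatures, importance, k=2)
print(selected)  # -> [0, 2]
```

Ranking by importance alone would keep experts 0 and 1. The diversity term swaps the near-duplicate for expert 2, which is the kind of coverage effect this section describes.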

If this is right

  • Frontier MoE models become usable in memory-limited settings without having to load every expert at inference time.
  • At matched parameter count, distillation of the MoE-derived student trains at 1.6x faster wall-clock speed than the dense-to-dense pruning baseline.
  • Scoring method dominates performance among the tested design choices across Qwen3-30B-A3B, DeepSeek-V2-Lite and GPT-OSS-20B.
  • The conversion leaves a fully dense model whose inference cost and memory footprint match those of any standard transformer of the same width.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pipeline could let practitioners run large MoE checkpoints on edge devices by shipping only the distilled dense version.
  • Task-specific re-scoring of experts after initial conversion might recover additional accuracy without retraining the entire student.
  • If the diversity signal generalizes, future MoE training runs could deliberately encourage expert specialization knowing that excess experts can later be folded into dense layers.

Load-bearing premise

The expert selection and grouping choices identified on the three evaluated models will produce comparable gains on other MoE architectures or larger scales without additional hyper-parameter search.

What would settle it

Apply the same scoring-grouping-distillation pipeline to a fourth, previously unseen MoE model and observe that the final dense student underperforms a matched-size dense-to-dense pruned baseline by more than three points on the same downstream suite.

Original abstract

Mixture-of-Experts (MoE) is now the dominant architecture for frontier language models, yet it requires all expert parameters to be loaded in memory, making it less preferable for memory-constrained deployment. Existing compression methods reduce the number of experts but the output remains an MoE model with the same fundamental limitation. We present the first systematic framework for converting a trained MoE into a standard fully dense architecture: experts are scored, selected, and grouped, then concatenated into a dense FFN and refined by knowledge distillation from the MoE teacher. We evaluate 7 scoring, 5 grouping, and 2 magnitude scaling methods across a range of selected expert counts on Qwen3-30B-A3B, yielding 350 configurations. We find that the choice of scoring method is the most impactful, with our novel diversity-aware scoring consistently outperforming prior methods on Qwen3-30B-A3B, DeepSeek-V2-Lite, and GPT-OSS-20B. Under a controlled comparison at matched parameter count, MoE-to-dense outperforms dense-to-dense pruning by +6.3 pp in average downstream accuracy after ~4B-token distillation at 1.6x faster training wall-clock speed.
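
A quick consistency check on the configuration count: 7 scoring, 5 grouping and 2 scaling methods give 70 combinations, so 350 configurations would correspond to five selected-expert counts if the grid is fully factorial. The abstract does not state the number of expert counts, so the value below is an inference under that assumption, and the method labels are placeholders.

```python
from itertools import product

# Placeholder labels; the abstract gives only the number of methods per axis.
scoring = [f"score_{i}" for i in range(7)]
grouping = [f"group_{i}" for i in range(5)]
scaling = [f"scale_{i}" for i in range(2)]

per_count = len(list(product(scoring, grouping, scaling)))
implied_expert_counts = 350 // per_count  # assumes a full factorial grid

print(per_count, implied_expert_counts)  # -> 70 5
```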

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a framework for converting trained MoE models to dense architectures: experts are scored (including a novel diversity-aware method), selected, grouped, concatenated into a dense FFN, and refined via knowledge distillation from the MoE teacher. Across 350 configurations on Qwen3-30B-A3B (plus evaluations on DeepSeek-V2-Lite and GPT-OSS-20B), scoring method is identified as most impactful. Under matched parameter count, MoE-to-dense yields +6.3 pp higher average downstream accuracy than dense-to-dense pruning after ~4B-token distillation, at 1.6x faster wall-clock training time.

Significance. If the comparison is properly controlled, the work offers a practical route to deploy MoE-derived capabilities in memory-constrained dense models. The scale of 350 configurations evaluated provides substantial empirical coverage of design choices, which is a strength for an applied compression study.

major comments (2)
  1. [Abstract] Abstract: the central claim of a '+6.3 pp' gain 'under a controlled comparison at matched parameter count' after '~4B-token distillation' does not state whether the dense-to-dense pruning baseline receives equivalent distillation from the MoE teacher for the same token budget. Without this, the delta cannot be unambiguously attributed to the expert scoring/grouping steps rather than differences in the distillation protocol.
  2. [Experiments] Experiments section (description of 350 configurations and downstream results): no error bars, standard deviations, or number of random seeds/runs are reported for any accuracy numbers, including the headline +6.3 pp delta. This undermines assessment of whether the reported gains exceed typical run-to-run variance in language-model fine-tuning.
minor comments (2)
  1. [Abstract] Abstract and experimental protocol: downstream task names, dataset splits, and evaluation settings are not listed, hindering direct reproduction of the accuracy numbers.
  2. The paper evaluates three MoE models but does not discuss whether the identified scoring/grouping hyperparameters transfer without retuning to other MoE families or larger scales.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback. We address each major comment below and indicate the corresponding revisions.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim of a '+6.3 pp' gain 'under a controlled comparison at matched parameter count' after '~4B-token distillation' does not state whether the dense-to-dense pruning baseline receives equivalent distillation from the MoE teacher for the same token budget. Without this, the delta cannot be unambiguously attributed to the expert scoring/grouping steps rather than differences in the distillation protocol.

    Authors: We confirm that the dense-to-dense pruning baseline receives identical knowledge distillation from the MoE teacher using the same ~4B-token budget, ensuring the comparison is controlled. The reported gain is therefore attributable to the MoE-to-dense pipeline. We will revise the abstract to state this explicitly.

    Revision: yes

  2. Referee: [Experiments] Experiments section (description of 350 configurations and downstream results): no error bars, standard deviations, or number of random seeds/runs are reported for any accuracy numbers, including the headline +6.3 pp delta. This undermines assessment of whether the reported gains exceed typical run-to-run variance in language-model fine-tuning.

    Authors: We agree that variance reporting would improve assessment of the results. The scale of 350 configurations made multiple seeds per run computationally prohibitive. In the revision we will add standard deviations from repeated runs on the primary configurations and the headline comparison.

    Revision: yes
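
The variance reporting discussed in this exchange could take a form like the following sketch, which summarizes per-seed accuracies and applies a crude check of whether the gap between two methods exceeds run-to-run noise. The helper names, the `n_sigma` threshold and the numbers are hypothetical placeholders, not results from the paper.

```python
from statistics import mean, stdev

def summarize(accuracies):
    """Return (mean, sample standard deviation) for per-seed accuracies."""
    return mean(accuracies), stdev(accuracies)

def gap_exceeds_noise(a, b, n_sigma=2.0):
    """Crude check: is the mean gap larger than n_sigma times the larger std?"""
    mean_a, std_a = summarize(a)
    mean_b, std_b = summarize(b)
    return (mean_a - mean_b) > n_sigma * max(std_a, std_b)

# Placeholder numbers for illustration only; NOT results from the paper.
moe_to_dense = [0.60, 0.62, 0.61]
dense_to_dense = [0.55, 0.54, 0.56]

print(summarize(moe_to_dense))
print(gap_exceeds_noise(moe_to_dense, dense_to_dense))
```

A proper analysis would use a significance test suited to the number of seeds. The point here is only that a headline delta needs a spread to be compared against.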

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with no derivations or fitted predictions

Full rationale

The paper describes an empirical pipeline of scoring, grouping, concatenation, and distillation evaluated across 350 configurations on three models. No equations, uniqueness theorems, ansatzes, or predictions are presented that could reduce to inputs by construction. All reported gains (e.g., +6.3 pp) are direct experimental outcomes at matched parameter counts after fixed-token distillation. No self-citations are load-bearing for any central claim, and the work contains no mathematical derivation chain. This is a standard non-finding for an empirical methods paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical axioms or invented entities; the work consists of empirical comparisons of pruning and distillation heuristics.

pith-pipeline@v0.9.1-grok · 5770 in / 1110 out tokens · 43181 ms · 2026-06-29T12:36:13.279509+00:00 · methodology

