REAP the Experts: Why Pruning Prevails for One-Shot MoE compression

Mike Lasby, Ivan Lazarevich, Nish Sinnadurai, Sean Lie, Yani Ioannou, Vithursan Thangarasa · 2025 · cs.LG · arXiv 2510.13999

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

open full Pith review browse 5 citing papers arXiv PDF

abstract

Sparsely-activated Mixture-of-Experts (SMoE) models offer efficient pre-training and low latency but their large parameter counts create significant memory overhead, motivating research into expert compression. Contrary to recent findings favouring expert merging on discriminative benchmarks, we find that expert pruning is a superior strategy for generative tasks. We demonstrate that existing merging techniques introduce an irreducible error due to the loss of fine-grained routing control over experts. Leveraging this insight, we propose Router-weighted Expert Activation Pruning (REAP), a novel pruning criterion that considers both router gate-values and expert activation norms to minimize the reconstruction error bound. Across a diverse set of SMoE models ranging from 20B to 1T parameters, REAP consistently outperforms merging and other pruning methods on generative benchmarks, especially at 50% compression. Notably, our method achieves near-lossless compression on code generation tasks with Qwen3-Coder-480B and Kimi-K2, even after pruning 50% of experts.

citation-role summary

method 1

citation-polarity summary

use method 1

representative citing papers

HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.

Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Star Elastic trains N nested submodels in a single post-training job on a parent reasoning LLM, supporting elastic budget control that matches or exceeds independent baselines while cutting training compute by up to 360x.

Model Compression with Exact Budget Constraints via Riemannian Manifolds

cs.LG · 2026-05-01 · unverdicted · novelty 7.0

The budget constraint in discrete model compression defines a Riemannian manifold allowing exact-constraint first-order optimization via Riemannian Constrained Optimization (RCO) without extra hyperparameters.

EvoESAP: Non-Uniform Expert Pruning for Sparse MoE

cs.LG · 2026-03-06 · conditional · novelty 7.0

EvoESAP uses evolutionary search guided by a speculative-decoding-inspired ESAP metric to discover non-uniform layer-wise sparsity allocations for MoE expert pruning, improving generation accuracy up to 19.6% at 50% sparsity.

SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

cs.LG · 2026-05-09 · unverdicted · novelty 5.0 · 2 refs

Pruning pretrained MoE models outperforms training from scratch under fixed budget, different expert compression methods converge after continued training, and progressive pruning plus multi-token KD improves the final 23A2B model.

citing papers explorer

Showing 4 of 4 citing papers after filters.

HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts cs.LG · 2026-05-13 · unverdicted · none · ref 31 · internal anchor
HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.
Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control cs.LG · 2026-05-08 · unverdicted · none · ref 17 · internal anchor
Star Elastic trains N nested submodels in a single post-training job on a parent reasoning LLM, supporting elastic budget control that matches or exceeds independent baselines while cutting training compute by up to 360x.
Model Compression with Exact Budget Constraints via Riemannian Manifolds cs.LG · 2026-05-01 · unverdicted · none · ref 4 · internal anchor
The budget constraint in discrete model compression defines a Riemannian manifold allowing exact-constraint first-order optimization via Riemannian Constrained Optimization (RCO) without extra hyperparameters.
SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training cs.LG · 2026-05-09 · unverdicted · none · ref 40 · 2 links · internal anchor
Pruning pretrained MoE models outperforms training from scratch under fixed budget, different expert compression methods converge after continued training, and progressive pruning plus multi-token KD improves the final 23A2B model.

REAP the Experts: Why Pruning Prevails for One-Shot MoE compression

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer