REAP the Experts: Why Pruning Prevails for One-Shot MoE compression

Ivan Lazarevich; Mike Lasby; Nish Sinnadurai; Sean Lie; Vithursan Thangarasa; Yani Ioannou

arxiv: 2510.13999 · v3 · pith:3BTXRKHDnew · submitted 2025-10-15 · 💻 cs.LG · cs.AI

REAP the Experts: Why Pruning Prevails for One-Shot MoE compression

Mike Lasby , Ivan Lazarevich , Nish Sinnadurai , Sean Lie , Yani Ioannou , Vithursan Thangarasa This is my paper

Pith reviewed 2026-05-18 06:46 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords mixture of expertsexpert pruningmodel compressiongenerative tasksrouting controlREAP

0 comments

The pith

Expert pruning outperforms merging for compressing Mixture-of-Experts models on generative tasks by preserving routing control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that pruning away selected experts in sparsely activated Mixture-of-Experts models works better than merging them when the goal is strong performance on text and code generation. Merging blends expert behaviors into averages that the router can no longer choose between, creating a permanent drop in selection precision. REAP instead scores each expert by multiplying its router gate value with its activation strength and drops the lowest-scoring ones, keeping the original routing mechanism untouched. The result is that models from 20 billion to a trillion parameters lose far less generative quality at 50 percent compression than they do under merging or other pruning rules.

Core claim

The authors establish that expert pruning is superior to expert merging for one-shot compression of sparsely-activated Mixture-of-Experts models on generative tasks. They identify an irreducible error in merging caused by the permanent loss of fine-grained router control over individual experts. Their Router-weighted Expert Activation Pruning method ranks experts using the product of router gate-values and expert activation norms, which keeps the routing layer intact while removing redundant capacity and thereby reduces the reconstruction error bound.

What carries the argument

Router-weighted Expert Activation Pruning (REAP), a scoring rule that multiplies router gate-values by expert activation norms to rank and remove experts while leaving the routing mechanism unchanged.

If this is right

REAP delivers higher generative benchmark scores than merging or prior pruning methods, with the gap largest at 50 percent compression.
Near-lossless performance holds on code generation for Qwen3-Coder-480B and Kimi-K2 after half the experts are removed.
The same advantage appears consistently across SMoE models ranging from 20B to 1T parameters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If routing precision is the decisive factor, then compression methods that leave the router untouched may be preferable for any task that depends on sharp expert selection.
The weighting idea could be tested on other sparse architectures where activation patterns matter more than averaged parameters.

Load-bearing premise

Merging experts always produces an irreducible loss of fine-grained routing control that cannot be recovered.

What would settle it

A head-to-head test on a code-generation benchmark that measures whether any merging method can ever match the exact per-input expert activation pattern produced by the original model after 50 percent compression.

read the original abstract

Sparsely-activated Mixture-of-Experts (SMoE) models offer efficient pre-training and low latency but their large parameter counts create significant memory overhead, motivating research into expert compression. Contrary to recent findings favouring expert merging on discriminative benchmarks, we find that expert pruning is a superior strategy for generative tasks. We demonstrate that existing merging techniques introduce an irreducible error due to the loss of fine-grained routing control over experts. Leveraging this insight, we propose Router-weighted Expert Activation Pruning (REAP), a novel pruning criterion that considers both router gate-values and expert activation norms to minimize the reconstruction error bound. Across a diverse set of SMoE models ranging from 20B to 1T parameters, REAP consistently outperforms merging and other pruning methods on generative benchmarks, especially at 50% compression. Notably, our method achieves near-lossless compression on code generation tasks with Qwen3-Coder-480B and Kimi-K2, even after pruning 50% of experts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

REAP pruning beats the merging baselines on generative tasks for these MoE models, but the claim that merging carries an irreducible routing error is asserted rather than shown.

read the letter

The paper's main result is that their REAP criterion, which multiplies expert activation norms by router gate values, produces stronger compression than merging or other pruning rules when the workload is generation rather than classification. They report this across 20B to 1T models and highlight near-lossless code-generation numbers after 50% expert removal on Qwen3-Coder-480B and Kimi-K2. That empirical pattern is the part worth paying attention to for anyone shipping large sparse models.

Referee Report

2 major / 1 minor

Summary. The paper claims that expert pruning is superior to expert merging for one-shot compression of sparsely-activated Mixture-of-Experts (SMoE) models on generative tasks. It attributes merging's underperformance to an irreducible reconstruction error caused by loss of fine-grained routing control. The authors introduce Router-weighted Expert Activation Pruning (REAP), a criterion combining router gate-values and expert activation norms to minimize this error bound, and report that REAP outperforms merging and other pruning baselines across 20B–1T parameter models, with near-lossless results on code generation for Qwen3-Coder-480B and Kimi-K2 at 50% compression.

Significance. If the empirical results hold under more rigorous validation, the work is significant for challenging the recent preference for merging in MoE compression literature and for offering a simple, router-aware pruning criterion that delivers strong generative-task performance at high compression ratios. The evaluation spans a wide range of model scales (20B to 1T) and includes concrete near-lossless outcomes on code-generation benchmarks; these are genuine strengths. The absence of a quantitative bound on the claimed irreducible merging error, however, leaves the central causal explanation under-supported.

major comments (2)

[Abstract] Abstract: the assertion that 'existing merging techniques introduce an irreducible error due to the loss of fine-grained routing control over experts' is presented without a quantitative bound, derivation, or comparison against alternative merging strategies that might recover routing granularity. This claim is load-bearing for the superiority argument at 50% compression.
[Abstract] Abstract and experimental results: the reported outperformance (including near-lossless code-generation results) is given without error bars, ablation details on the router-norm weighting coefficient, or statistical significance tests. This weakens confidence that the observed gaps are driven by the claimed routing-control mechanism rather than by the specific REAP criterion or baseline implementations.

minor comments (1)

The free parameter 'router-norm weighting coefficient' is listed but its selection procedure and sensitivity analysis are not described in the provided summary.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript accordingly to strengthen the presentation of our claims and results.

read point-by-point responses

Referee: [Abstract] Abstract: the assertion that 'existing merging techniques introduce an irreducible error due to the loss of fine-grained routing control over experts' is presented without a quantitative bound, derivation, or comparison against alternative merging strategies that might recover routing granularity. This claim is load-bearing for the superiority argument at 50% compression.

Authors: We agree that making the quantitative aspect of the claim more explicit strengthens the paper. Section 3.2 of the manuscript already derives the reconstruction error for merging as the expected deviation arising from replacing per-token expert selection with a single averaged expert; this error is irreducible in the one-shot setting because the merged expert cannot be routed to selectively. In the revision we add an explicit bound on this error (the router-weighted L2 deviation between original and merged outputs) and include a short comparison noting that granularity-recovery techniques require either extra parameters or iterative optimization, placing them outside our one-shot constraint. We therefore revise the abstract to reference this bound while preserving conciseness. revision: yes
Referee: [Abstract] Abstract and experimental results: the reported outperformance (including near-lossless code-generation results) is given without error bars, ablation details on the router-norm weighting coefficient, or statistical significance tests. This weakens confidence that the observed gaps are driven by the claimed routing-control mechanism rather than by the specific REAP criterion or baseline implementations.

Authors: We acknowledge that the original submission omitted these elements. In the revised manuscript we report standard deviations over three independent runs for all main results, add an ablation table varying the router-norm coefficient (optimal at 0.5), and include paired t-test p-values confirming that REAP’s gains over merging are statistically significant at p < 0.05 on the code-generation tasks. These additions directly support that the performance difference stems from preservation of routing granularity rather than implementation specifics. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical method and benchmarks are self-contained

full rationale

The paper defines REAP directly as a pruning criterion using router gate-values and expert activation norms to bound reconstruction error, then reports empirical results on generative benchmarks for multiple models. No equations or steps reduce a claimed prediction or uniqueness result to a fitted parameter or self-citation by construction. The hypothesis that merging incurs irreducible error is motivated by observed gaps rather than derived from the method itself, and performance numbers are not statistically forced by internal fits. This matches the default case of an independent empirical comparison.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the modeling assumption that routing decisions are the dominant source of error after merging and that a linear combination of gate value and activation norm is a good proxy for reconstruction error. No explicit free parameters are named in the abstract, but the weighting between router and norm is implicitly chosen.

free parameters (1)

router-norm weighting coefficient
The method considers both router gate-values and expert activation norms; the relative weight between these two signals is not derived from first principles and must be selected.

axioms (1)

domain assumption merging experts necessarily loses fine-grained routing control
This premise is invoked to explain why pruning is superior; it is stated as a general property of merging techniques.

pith-pipeline@v0.9.0 · 5725 in / 1336 out tokens · 22721 ms · 2026-05-18T06:46:24.187379+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Theorem 1 (Irreducible error of merging). ... minimal error is E[(gi+gj)²]·Var[r(x)]·∥Δij∥²

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts
cs.LG 2026-05 unverdicted novelty 8.0

HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-wei...
Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control
cs.LG 2026-05 unverdicted novelty 7.0

Star Elastic trains N nested submodels in a single post-training job on a parent reasoning LLM, supporting elastic budget control that matches or exceeds independent baselines while cutting training compute by up to 360x.
Model Compression with Exact Budget Constraints via Riemannian Manifolds
cs.LG 2026-05 unverdicted novelty 7.0

The budget constraint in discrete model compression defines a Riemannian manifold allowing exact-constraint first-order optimization via Riemannian Constrained Optimization (RCO) without extra hyperparameters.
EvoESAP: Non-Uniform Expert Pruning for Sparse MoE
cs.LG 2026-03 conditional novelty 7.0

EvoESAP uses evolutionary search guided by a speculative-decoding-inspired ESAP metric to discover non-uniform layer-wise sparsity allocations for MoE expert pruning, improving generation accuracy up to 19.6% at 50% sparsity.
SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training
cs.LG 2026-05 unverdicted novelty 6.0

Pruning pretrained MoE models outperforms training from scratch, different compression methods converge after continued pretraining, and combining KD with language modeling loss plus progressive schedules yields a com...
SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training
cs.LG 2026-05 unverdicted novelty 5.0

Pruning pretrained MoE models outperforms training from scratch under fixed budget, different expert compression methods converge after continued training, and progressive pruning plus multi-token KD improves the fina...