ConMoE: Expert-Pool Consolidation via Prototype Reassignment for MoE Compression

Elsie Dai; Jiaming Pan; Peizhuang Cong; Tong Yang; Yaoming Li; Yilun Yao

arxiv: 2605.29350 · v1 · pith:ZI5R65QGnew · submitted 2026-05-28 · 💻 cs.AI

Yilun Yao, Jiaming Pan, Elsie Dai, Peizhuang Cong, Yaoming Li, Tong Yang

Pith reviewed 2026-06-29 07:53 UTC · model grok-4.3

classification 💻 cs.AI

keywords: mixture of experts, model compression, expert consolidation, prototype reassignment, train-free compression, post-training methods, language model deployment

The pith

MoE compression succeeds by keeping fewer experts as reusable prototypes and remapping the rest deterministically without weight updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that post-training compression of Mixture-of-Experts language models can be achieved by consolidating the expert pool into a smaller set of retained prototypes and then deterministically remapping every original expert slot to one of those prototypes. This formulation keeps the router unchanged and permits prototypes to be shared within local layer scopes, while avoiding any weight updates or fine-tuning after compression. If the claim holds, it offers a direct way to cut the memory cost of storing and serving all experts in large MoE models. Prototypes are selected using contribution and replaceability signals computed from a small calibration set.

Core claim

We formulate post-training MoE compression as expert-pool consolidation: retaining a smaller set of pretrained experts as reusable prototypes and deterministically remapping each original expert reference to one selected prototype. This view separates the reduced expert pool from the reuse structure that represents the original expert slots, and allows prototype sharing within local layer scopes while preserving the original router interface. We propose ConMoE, a train-free prototype remapping framework that selects retained experts using calibration-based contribution and replaceability signals, then redirects original expert calls to the selected prototypes without weight updates or post-compression fine-tuning.

What carries the argument

expert-pool consolidation via prototype reassignment, which uses calibration signals to choose retained experts and then applies deterministic remapping of all original expert calls
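
The two steps named above, prototype selection and deterministic remapping, can be illustrated with a minimal sketch for a single layer. Everything concrete here is an assumption made for illustration: the function names `build_remap` and `resolve`, the use of one scalar contribution score per expert, and the use of cosine similarity between hypothetical expert "signature" vectors as a stand-in for replaceability. The text above does not give the paper's actual formulas.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def build_remap(contribution, signatures, keep):
    """Pick `keep` prototypes and map every expert slot to one of them.

    contribution: hypothetical per-expert score from a calibration set.
    signatures:   hypothetical per-expert vectors used to judge similarity.
    """
    n = len(contribution)
    # Highest contribution first; ties broken by index so the result is deterministic.
    order = sorted(range(n), key=lambda e: (-contribution[e], e))
    prototypes = sorted(order[:keep])
    remap = {}
    for e in range(n):
        if e in prototypes:
            remap[e] = e  # retained experts map to themselves
        else:
            # Dropped slots point at the most similar retained prototype.
            remap[e] = max(
                prototypes,
                key=lambda p: (cosine(signatures[e], signatures[p]), -p),
            )
    return prototypes, remap

def resolve(router_topk, remap):
    """The router still emits original expert ids; only the lookup changes."""
    return [remap[e] for e in router_topk]

contribution = [0.9, 0.1, 0.7, 0.2]
signatures = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
prototypes, remap = build_remap(contribution, signatures, keep=2)
routed = resolve([1, 3], remap)
```

In this toy layer, experts 0 and 2 are kept, and slots 1 and 3 are redirected to them. No weights are modified and the router's output space is untouched, which is the property the claim depends on.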

If this is right

ConMoE matches or outperforms strong pruning and merging baselines in several settings on three pretrained MoE models.
It achieves the best average score on deepseek-moe-16b-base at both 25 percent and 50 percent routed-expert reduction.
It remains competitive on Qwen3-30B-A3B and OLMoE-1B-7B-0125 under the same reductions.
Deterministic reassignment is the most stable component of the method.
Broader cross-layer sharing and post-hoc weight fusion show model-dependent effects.
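
As a rough guide to what the 25 percent and 50 percent reduction settings mean for pool size, the sketch below computes the number of retained routed experts per layer. The count of 64 routed experts is a hypothetical figure chosen for illustration, and the rounding rule is an assumption; neither is stated in the text above.

```python
def retained_experts(num_routed, reduction):
    """Routed experts kept per layer after removing a fraction `reduction`."""
    return num_routed - round(num_routed * reduction)

# Hypothetical layer with 64 routed experts (assumption, not from the paper).
kept_25 = retained_experts(64, 0.25)
kept_50 = retained_experts(64, 0.50)
```

Under these assumptions the layer keeps 48 and 32 routed experts respectively, while all 64 original slots remain addressable through the remapping.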

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The results imply that many expert behaviors in these models overlap enough to be captured by a smaller prototype pool selected from the original set.
If the calibration signals prove reliable across more models, the same selection step could be reused for other compression ratios or for combining with quantization.
The separation of pool size from reuse structure opens a path to layer-specific sharing ratios without changing the router.
Applying the same remapping logic at inference time on hardware with limited expert cache could reduce loading costs.
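
The idea of sharing prototypes within a local layer scope can be sketched as a constraint on the remap table: a slot in one layer may only point at a prototype whose layer falls in the same scope. The sketch assumes scopes are contiguous, equal-sized groups of layers; that partitioning rule and the function names are illustrative assumptions, not details taken from the paper.

```python
def layer_scopes(num_layers, scope_size):
    """Split layers into contiguous groups of at most `scope_size` layers."""
    return [
        list(range(start, min(start + scope_size, num_layers)))
        for start in range(0, num_layers, scope_size)
    ]

def scope_of(layer, scope_size):
    """Index of the scope a layer belongs to."""
    return layer // scope_size

def is_valid_remap(remap, scope_size):
    """Check that every (layer, expert) slot maps to a prototype in its own scope."""
    return all(
        scope_of(layer, scope_size) == scope_of(proto_layer, scope_size)
        for (layer, _), (proto_layer, _) in remap.items()
    )

scopes = layer_scopes(5, 2)
ok = is_valid_remap({(0, 3): (1, 7), (2, 1): (2, 1)}, scope_size=2)
bad = is_valid_remap({(1, 3): (2, 7)}, scope_size=2)
```

With a scope size of 1 this reduces to per-layer sharing; larger scopes correspond to the broader cross-layer sharing that the ablations describe as model-dependent.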

Load-bearing premise

Calibration-based contribution and replaceability signals from a small dataset are enough to pick a reusable prototype set whose deterministic remapping preserves model behavior without weight updates or fine-tuning.

What would settle it

A clear performance drop on a held-out evaluation set after applying the remapping to a new pretrained MoE model, compared with the original model and with pruning baselines, would show the method does not preserve behavior.

Figures

Figures reproduced from arXiv: 2605.29350 by Elsie Dai, Jiaming Pan, Peizhuang Cong, Tong Yang, Yaoming Li, Yilun Yao.

**Figure 2.** Effect of scope size on Qwen3-30B-A3B at …

**Figure 3.** Cross-layer nearest-neighbor analysis on Qwen3-30B-A3B. Left: source-layer to nearest-neighbor-layer …

Original abstract

Mixture-of-Experts (MoE) language models reduce per-token computation but still require storing and serving all experts, making deployment memory-intensive. Existing post-training compression methods mainly shrink this cost by pruning experts or merging their weights. We formulate post-training MoE compression as expert-pool consolidation: retaining a smaller set of pretrained experts as reusable prototypes and deterministically remapping each original expert reference to one selected prototype. This view separates the reduced expert pool from the reuse structure that represents the original expert slots, and allows prototype sharing within local layer scopes while preserving the original router interface. We propose ConMoE, a train-free prototype remapping framework that selects retained experts using calibration-based contribution and replaceability signals, then redirects original expert calls to the selected prototypes without weight updates or post-compression fine-tuning. Experiments on three pretrained MoE language models show that ConMoE matches or outperforms strong pruning and merging baselines in several settings, achieving the best average score on deepseek-moe-16b-base at both 25% and 50% routed-expert reduction, while remaining competitive on Qwen3-30B-A3B and OLMoE-1B-7B-0125. Ablations indicate that deterministic reassignment is the most stable component, whereas broader cross-layer sharing and post-hoc weight fusion are model-dependent.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ConMoE frames MoE compression as train-free prototype selection and deterministic remapping, with competitive summary results on DeepSeek-MoE but thin evidence overall.

The main takeaway is that this paper treats post-training MoE compression as expert-pool consolidation: pick a smaller set of prototypes using calibration signals for contribution and replaceability, then remap the rest deterministically while keeping the original router and allowing local-layer sharing. No weight updates or fine-tuning required.

What stands out is the clean separation between the reduced pool and the reuse structure. That framing lets them avoid the usual merging or pruning trade-offs and stay fully train-free, which is useful for quick deployment on existing models. The experiments cover three pretrained MoEs and report that the method matches or beats strong baselines, with the best average on deepseek-moe-16b-base at both 25% and 50% reduction.

The soft spot is the evidence level. Everything rests on high-level claims, with no tables, exact numbers, error bars, or dataset details visible. The central assumption, that small calibration sets produce reliable replaceability signals, remains untested in the summary we have, and expert activations can be input-dependent enough that this could break in practice. A calibration set that misses rare contexts is the obvious stress case.

This is aimed at people working on efficient serving of production MoE models. A reader focused on post-training compression would get value from the formulation and the train-free angle. It deserves a serious referee to examine the full experimental setup and check whether the calibration signals actually deliver the claimed preservation of behavior.

Referee Report

3 major / 2 minor

Summary. The paper proposes ConMoE, a train-free post-training compression framework for Mixture-of-Experts (MoE) language models. It reformulates compression as expert-pool consolidation: a smaller set of pretrained experts is retained as prototypes based on calibration-derived contribution and replaceability signals, after which each original expert reference is deterministically remapped to one of the prototypes. This preserves the original router interface, permits local layer-scope prototype sharing, and requires no weight updates or fine-tuning. Experiments on DeepSeek-MoE-16B-Base, Qwen3-30B-A3B, and OLMoE-1B-7B-0125 report that ConMoE matches or exceeds strong pruning and merging baselines at 25% and 50% routed-expert reduction, attaining the best average score on DeepSeek-MoE-16B-Base at both ratios; ablations indicate deterministic reassignment is the most stable component while cross-layer sharing and weight fusion are model-dependent.

Significance. If the empirical claims hold under detailed scrutiny, the work supplies a lightweight, training-free compression technique that decouples the reduced expert pool from the reuse mapping and keeps the router unchanged. The explicit separation of prototype selection from remapping structure, together with the reported stability of the deterministic component, constitutes a practical contribution for memory-constrained MoE deployment. The purely empirical, calibration-driven nature and the component-wise ablations are strengths that allow direct comparison with existing pruning/merging methods.

major comments (3)

[Abstract and §4 (Experiments)] The central claim that ConMoE 'matches or outperforms' baselines and achieves 'the best average score' on DeepSeek-MoE-16B-Base rests on high-level summary only; no numerical tables, exact baseline scores, dataset sizes, number of runs, or error bars are referenced, so the magnitude and statistical reliability of the reported gains cannot be assessed.
[§3 (Method)] The replaceability and contribution signals are computed on an unspecified small calibration corpus; because expert activations are known to be input-dependent, the manuscript must demonstrate that these proxies remain predictive of functional equivalence under deterministic remapping on held-out data, or the train-free preservation claim is unsupported.
[§4.3 (Ablations)] The statement that 'deterministic reassignment is the most stable component' is presented without quantitative comparison of performance variance across random seeds or across different calibration-set sizes; this directly affects the reliability of the core design choice.
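
One inexpensive way to probe the input-dependence concern in the second comment is to compute prototype selections from two disjoint calibration splits and measure how much they agree. The sketch below uses Jaccard overlap of the top-k experts; the scores are made-up placeholders, and selecting by a single scalar score per expert is an assumption rather than the paper's procedure.

```python
def top_k(scores, k):
    """Indices of the k highest-scoring experts, ties broken by index."""
    order = sorted(range(len(scores)), key=lambda e: (-scores[e], e))
    return sorted(order[:k])

def jaccard(a, b):
    """Overlap between two selections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Placeholder per-expert scores from two hypothetical calibration splits.
split_a = [0.9, 0.1, 0.7, 0.2, 0.5, 0.3]
split_b = [0.8, 0.2, 0.4, 0.1, 0.6, 0.5]

overlap = jaccard(top_k(split_a, 3), top_k(split_b, 3))
```

A low overlap would not by itself show that end-to-end quality degrades, since different prototype sets can behave similarly, so it complements rather than replaces a held-out evaluation.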

minor comments (2)

[§3] Notation for the contribution and replaceability scores should be defined once with explicit formulas rather than described only in prose.
[§4] The manuscript should state the exact calibration corpus (size, domain, number of tokens) used for all reported runs.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will incorporate revisions to improve clarity, specificity, and empirical support in the manuscript.

Referee: [Abstract and §4 (Experiments)] The central claim that ConMoE 'matches or outperforms' baselines and achieves 'the best average score' on DeepSeek-MoE-16B-Base rests on high-level summary only; no numerical tables, exact baseline scores, dataset sizes, number of runs, or error bars are referenced, so the magnitude and statistical reliability of the reported gains cannot be assessed.

Authors: We agree that the abstract and opening of §4 would benefit from explicit references to numerical results. The full results appear in Tables 1–3 of §4 (including exact scores for ConMoE and all baselines on the three models, with dataset details in §4.1). In the revision we will (i) insert key numerical values and table references into the abstract and §4 introduction, (ii) state that all scores are averaged over three independent evaluation runs, and (iii) ensure standard deviations are reported in the tables where they were previously omitted from the summary text.

revision: yes

Referee: [§3 (Method)] The replaceability and contribution signals are computed on an unspecified small calibration corpus; because expert activations are known to be input-dependent, the manuscript must demonstrate that these proxies remain predictive of functional equivalence under deterministic remapping on held-out data, or the train-free preservation claim is unsupported.

Authors: We will revise §3.2 to explicitly state the calibration corpus details (size, source, and sampling procedure). To address input dependence, we will add a controlled experiment that recomputes prototype selection on a held-out portion of the same data distribution and reports the resulting end-to-end performance after deterministic remapping. This will directly test whether the signals remain predictive outside the calibration set.

revision: yes

Referee: [§4.3 (Ablations)] The statement that 'deterministic reassignment is the most stable component' is presented without quantitative comparison of performance variance across random seeds or across different calibration-set sizes; this directly affects the reliability of the core design choice.

Authors: We accept that the current ablation lacks quantitative variance metrics. In the revised §4.3 we will report mean and standard deviation of downstream scores across five random seeds for deterministic reassignment versus the compared alternatives, and we will repeat the ablation for calibration-set sizes of 256, 512, and 1024 samples. These additions will provide the requested quantitative support for the stability claim.

revision: yes
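
The variance report promised in the last response could be tabulated along the following lines: one list of downstream scores per calibration-set size, one entry per random seed, summarized as mean and sample standard deviation. The numbers below are invented placeholders, not results from the paper, and the `summarize` helper is a hypothetical name.

```python
from statistics import mean, stdev

def summarize(scores_by_size):
    """Map each calibration-set size to (mean, sample std) over seeds."""
    return {
        size: (mean(scores), stdev(scores))
        for size, scores in scores_by_size.items()
    }

# Placeholder scores for five seeds per calibration-set size (not real data).
runs = {
    256: [60.0, 61.0, 59.0, 60.5, 59.5],
    512: [60.5, 60.5, 60.0, 61.0, 60.5],
    1024: [61.0, 61.0, 61.0, 61.0, 61.0],
}

summary = summarize(runs)
for size, (m, s) in sorted(summary.items()):
    print(f"calib={size:>4}  mean={m:.2f}  std={s:.2f}")
```

`statistics.stdev` computes the sample (n-1) standard deviation, which is the usual choice when the seeds are treated as a sample of possible runs.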

Circularity Check

0 steps flagged

No significant circularity; method is purely empirical

The paper describes a train-free empirical procedure for expert consolidation using calibration signals and deterministic remapping, evaluated via experiments on external pretrained models. No equations, derivations, fitted parameters presented as predictions, or self-citation chains appear in the provided text. All load-bearing claims rest on external benchmarks rather than internal reduction to inputs. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivation, free parameters, axioms, or new postulated entities are introduced; the approach operates entirely on existing pretrained weights and calibration data.

