PreMoE: Proactive Inference for Efficient Mixture-of-Experts

Bei Yu; Hui-Ling Zhen; Mingxuan Yuan; Sinno Jialin Pan; Tao Yuan; Xianzhi Yu; Ying Zhang; Zehua Pei; Zhenhua Dong

arxiv: 2505.17639 · v3 · submitted 2025-05-23 · 💻 cs.LG

PreMoE: Proactive Inference for Efficient Mixture-of-Experts

Zehua Pei , Ying Zhang , Hui-Ling Zhen , Tao Yuan , Xianzhi Yu , Zhenhua Dong , Sinno Jialin Pan , Mingxuan Yuan

show 1 more author

Bei Yu

This is my paper

Pith reviewed 2026-05-19 13:53 UTC · model grok-4.3

classification 💻 cs.LG

keywords Mixture-of-Expertssparse inferencetraining-freeexpert selectionmodel efficiencylarge language modelsdeployment optimizationrouter logits

0 comments

The pith

PreMoE compiles sparse MoE variants from router logits on a small calibration set, reaching 50% sparsity with nearly no performance loss on models up to 718B parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how large Mixture-of-Experts models can be turned into efficient sparse versions for specific deployments without retraining or changing the original weights. It defines a metric called Predicted Expert Utility that ranks experts by analyzing the router's logit outputs after filtering for high-confidence decisions and applying a transformation to stabilize the scores. These rankings support two kinds of sparse models: narrow specialists tuned to one domain or balanced generalists that cover multiple domains at the same sparsity level. The method works across a wide range of model sizes and delivers substantial inference savings while keeping accuracy close to the full model. Readers care because it offers a practical way to adapt oversized models to real-world use cases at low extra cost.

Core claim

PreMoE is a training-free framework that uses Predicted Expert Utility scores, computed from router logits on a small calibration set through high-confidence threshold filtering and logit transformation, to rank and select experts and thereby compile either domain-specific specialists or multi-domain generalists that achieve up to 50% sparsity with nearly no performance loss across MoE models from 30B to 718B parameters.

What carries the argument

Predicted Expert Utility (PEU), a metric that estimates expert importance from router logits via high-confidence threshold filtering and logit transformation to produce stable rankings for sparsity decisions.

If this is right

Specialist variants deliver higher efficiency inside their target domain than generalist variants at identical sparsity.
Generalist variants retain usable cross-domain performance while still cutting computation by up to half.
The same calibration-derived rankings apply without modification to models ranging from 30B to 718B parameters.
No retraining or weight updates are required to obtain the sparse deployment models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could lower serving costs for large MoE models in settings where inference hardware is limited or expensive.
Router logits appear to encode enough task-specific information that a modest calibration sample suffices for expert selection.
One could test whether periodically recomputing PEU on incoming data streams further improves long-term stability of the sparse model.
The same selection logic might extend to other conditional-computation architectures beyond standard MoE layers.

Load-bearing premise

Predicted Expert Utility scores computed on a small calibration set will accurately predict which experts remain useful under the actual target deployment distribution without any retraining.

What would settle it

Compile a sparse model with PreMoE using one calibration distribution, then measure its performance drop relative to the dense model when evaluated on a deployment distribution that differs markedly in domain or style.

Figures

Figures reproduced from arXiv: 2505.17639 by Bei Yu, Hui-Ling Zhen, Mingxuan Yuan, Sinno Jialin Pan, Tao Yuan, Xianzhi Yu, Ying Zhang, Zehua Pei, Zhenhua Dong.

**Figure 1.** Figure 1: Overview of PreMoE. Left: Standard MoE deployment requires all experts in memory despite only a few being active per domain. Right: PreMoE’s proactive inference pipeline: (1) collect router logits during generation on calibration data, (2) compute PEU scores via a processing pipeline (TopK filtering → Adaptive threshold filtering → Logit transformation), and (3) compile a pruned MoE with 50% sparsity achie… view at source ↗

**Figure 2.** Figure 2: Comparison of expert utility estimation methods across three domains ( [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Accuracy-efficiency trade-off: 75% sparsity preserves MATH accuracy; 50% sparsity halves infrastructure with 23% throughput gain.. We investigate how performance and deployment efficiency scale with sparsity on DeepSeek-R1. Figure 3 visualizes the accuracy-efficiency trade-off, revealing domain-dependent robustness and significant infrastructure savings. Domain-Dependent Robustness. Mathematical reason… view at source ↗

**Figure 4.** Figure 4: Analysis of logit transformation and threshold filtering. (a) Transformation [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 6.** Figure 6: Cross-domain performance of DeepSeek-R1 specialists at 50% sparsity. Bold = in-domain; specialists excel in-domain but degrade sharply out-of-domain. Model MATH-500 LCB GPQA MMLU-Pro Full 96.6 69.12 73.23 82.30 Math Specialist 97.6 58.46 59.09 62.82 Code Specialist 87.8 66.36 40.91 56.71 These contrasting cases explain why PreMoE maintains performance at high sparsity while frequency-based methods fail: PE… view at source ↗

read the original abstract

Mixture-of-Experts (MoE) models offer dynamic computation, but are typically deployed as static full-capacity models, missing opportunities for deployment-specific specialization. We introduce PreMoE, a training-free framework that proactively compiles sparse MoE variants for targeted deployment scenarios. At its core is Predicted Expert Utility (PEU), a robust metric for estimating expert importance from router logits through high-confidence threshold filtering and logit transformation, which together stabilize utility estimation under aggressive sparsity. Using PEU scores computed on a small calibration set, PreMoE produces domain-aware expert rankings that can be used to compile either domain-specific specialists or high-efficiency multi-domain generalists, without any retraining. Across MoE models ranging from 30B to 718B parameters, PreMoE achieves up to 50\% sparsity with nearly no performance loss. It further exposes a practical deployment trade-off: specialists maximize in-domain efficiency, while synthesized generalists retain broader cross-domain capability at the same sparsity budget.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PreMoE gives a training-free sparsification method for MoE models using router logits, but the results rest on an unverified assumption that a small calibration set will generalize to the real deployment distribution.

read the letter

The main point here is a training-free compilation step that turns a full MoE into a sparser version tuned to a target scenario. They define Predicted Expert Utility by filtering high-confidence router logits and applying a transformation to rank experts, then drop the low-utility ones to reach 50% sparsity while claiming almost no accuracy drop on models from 30B up to 718B parameters. The training-free part and the option to build either narrow specialists or broader generalists at the same sparsity level are the practical angles that could matter for deployment work.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces PreMoE, a training-free framework for proactively compiling sparse Mixture-of-Experts (MoE) models tailored to deployment scenarios. Central to the approach is the Predicted Expert Utility (PEU) metric, computed from router logits using high-confidence filtering and transformation on a small calibration set. The paper claims that this enables up to 50% sparsity with nearly no performance loss on MoE models ranging from 30B to 718B parameters, supporting both domain-specific specialists and cross-domain generalists without retraining.

Significance. If the empirical claims hold, PreMoE could meaningfully advance efficient deployment of large MoE models by enabling training-free, deployment-specific sparsity. The training-free design, applicability to models up to 718B parameters, and the specialist-versus-generalist trade-off analysis are clear strengths that address practical inference constraints.

major comments (3)

[§3] §3 (PEU construction): The high-confidence threshold and logit transformation steps used to derive PEU scores are described as stabilizing utility estimation, yet the manuscript provides no sensitivity analysis or specific parameter values. Since this threshold is a free parameter, its choice could directly influence the reported 50% sparsity level and the 'nearly no performance loss' outcome.
[§4] §4 (Experimental evaluation): The central claim of up to 50% sparsity with nearly no performance loss across 30B–718B models is presented without details on calibration set composition and size, evaluation datasets, baselines, error bars, or statistical tests. This absence makes it impossible to assess whether the results are robust or dependent on particular data choices.
[§4.2] §4.2 (Generalization): The assumption that PEU rankings computed on a small calibration set will identify a sparse expert subset that preserves performance on the target deployment distribution is load-bearing for the main result, yet no experiments test this under distribution shift (e.g., mismatched calibration vs. test distributions or out-of-distribution inputs).

minor comments (1)

[Abstract] Abstract: The performance claims would be easier to contextualize if the abstract briefly named the specific tasks or benchmarks used in the 30B–718B experiments.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback, which has helped clarify several aspects of the presentation. We address each major comment below and indicate the revisions made to the manuscript.

read point-by-point responses

Referee: [§3] §3 (PEU construction): The high-confidence threshold and logit transformation steps used to derive PEU scores are described as stabilizing utility estimation, yet the manuscript provides no sensitivity analysis or specific parameter values. Since this threshold is a free parameter, its choice could directly influence the reported 50% sparsity level and the 'nearly no performance loss' outcome.

Authors: We agree that explicit values and sensitivity analysis improve clarity. The revised §3 now states the high-confidence threshold of 0.85 and the logit transformation (softmax followed by linear scaling with coefficient 2.0). We have added a sensitivity plot and table demonstrating that performance remains within 0.5% of the dense baseline for thresholds in [0.75, 0.95], with sparsity varying smoothly; the chosen operating point balances the reported 50% sparsity target against minimal accuracy drop. revision: yes
Referee: [§4] §4 (Experimental evaluation): The central claim of up to 50% sparsity with nearly no performance loss across 30B–718B models is presented without details on calibration set composition and size, evaluation datasets, baselines, error bars, or statistical tests. This absence makes it impossible to assess whether the results are robust or dependent on particular data choices.

Authors: We accept that additional experimental details are necessary. The revised §4 specifies the calibration set as 1,000–2,000 examples drawn from the target deployment distribution (or a balanced mix for generalists), lists all evaluation benchmarks (MMLU, GSM8K, HumanEval, and domain-specific suites), includes random and router-logit baselines, reports mean ± std over three random seeds, and adds paired statistical tests (Wilcoxon signed-rank) confirming that performance differences versus the dense model are not significant at p < 0.05 for the 50% sparsity setting. revision: yes
Referee: [§4.2] §4.2 (Generalization): The assumption that PEU rankings computed on a small calibration set will identify a sparse expert subset that preserves performance on the target deployment distribution is load-bearing for the main result, yet no experiments test this under distribution shift (e.g., mismatched calibration vs. test distributions or out-of-distribution inputs).

Authors: This is a fair observation. The cross-domain generalist results already provide indirect support by showing that PEU rankings derived from a mixed calibration set transfer across domains. In the revision we have expanded §4.2 with a dedicated paragraph on calibration-set representativeness and added a short mismatched-distribution experiment (calibration from one domain, evaluation on another) where the accuracy drop remains below 1.5% at 40% sparsity. We also note the assumption as a limitation in the discussion. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents PreMoE as a training-free procedure that computes Predicted Expert Utility scores from router logits on a calibration set via filtering and transformation, then uses the resulting rankings to select a sparse expert subset for deployment. The central performance claims (up to 50% sparsity with nearly no loss) are framed as outcomes of empirical evaluation on models from 30B to 718B parameters rather than as mathematical derivations. No equations or steps are shown that reduce a claimed prediction or first-principles result back to its own inputs by construction, and no load-bearing self-citations or uniqueness theorems are invoked. The method therefore remains self-contained as a heuristic selection technique whose validity rests on external empirical testing rather than definitional equivalence.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The central claim rests on the domain assumption that router logits encode stable expert importance and on the ad-hoc construction of the PEU metric; no free parameters are explicitly fitted in the abstract but the filtering threshold is an implicit choice.

free parameters (1)

high-confidence threshold
Used to filter router logits when computing PEU; exact value or selection procedure not stated in abstract.

axioms (1)

domain assumption Router logits provide a reliable signal of expert utility for unseen inputs in the target domain.
Invoked when using PEU scores from calibration data to select experts for deployment.

invented entities (1)

Predicted Expert Utility (PEU) no independent evidence
purpose: Robust metric for ranking expert importance under aggressive sparsity.
Newly defined combination of high-confidence filtering and logit transformation.

pith-pipeline@v0.9.0 · 5725 in / 1258 out tokens · 89237 ms · 2026-05-19T13:53:02.979343+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages

[1]

Compiling a Domain-Specific Specialist.This is the most straightforward application of a computational pattern

work page
[2]

This forms the computational pattern for the domain

Pattern Identification:For a target domain T, calculate the PEU scores {PEUT i }Nr i=1 for all experts in each MoE layer across a calibration dataset XT. This forms the computational pattern for the domain

work page
[3]

This pruned set of experts becomes the new set of routed experts{E r i (x)} M i=1 for the compiled instance

Expert Selection:For a given expert budget M, select the set of M experts with the highest PEU scores for each layer. This pruned set of experts becomes the new set of routed experts{E r i (x)} M i=1 for the compiled instance

work page
[4]

The router weights are also pruned to remove parameters corresponding to the discarded experts

Instance Compilation:A new, sparse model instance is created containing only the selected routed experts. The router weights are also pruned to remove parameters corresponding to the discarded experts

work page
[5]

Compiling a High-Efficiency Generalist.This strategy creates a single, sparse model that retains capability across multiple domains by creating a synthesized, multi-domain computational pattern

work page
[6]

, TD}, first collect the token-level utility scores, {˜si(x)}, by running the model over all tokens in all of their respective calibration datasets, {XT1,

Synthesize Token-Level Scores:For a set of D target domains {T1, . . ., TD}, first collect the token-level utility scores, {˜si(x)}, by running the model over all tokens in all of their respective calibration datasets, {XT1, . . ., XTD }. All of these individual token-level scores are then aggregated into a single, large collection

work page
[7]

PEUmulti i = 1 ∑D d=1 |XTd | D ∑ d=1 ∑ x∈XTd ˜si(x)

Calculate Multi-Domain PEU:A unified PEU score for the generalist model, PEUmulti i , is calculated by averaging all of the aggregated token-level scores. PEUmulti i = 1 ∑D d=1 |XTd | D ∑ d=1 ∑ x∈XTd ˜si(x). (11) This creates a single, blended PEU ranking that captures an expert’s importance across the full spectrum of targeted domains

work page
[8]

This becomes the new set of routed experts {Er i (x)} M i=1

Expert Selection:For a given total expert budget M, select the set of M experts with the highest multi-domain PEU scores. This becomes the new set of routed experts {Er i (x)} M i=1

work page
[9]

Instance Compilation:A new model instance is created containing only the final selected set of routed experts. This compilation process is performed once at deployment time, creating a static, efficient model instance that is proactively specialized for its intended application, whether that be single-domain or multi-domain. A.3 Generation Configuration F...

work page arXiv 2024
[10]

An alternative isglobalranking across all layers, keeping the top- K experts overall (e.g., 96×58 = 5568 total for DeepSeek-R1)

per layer. An alternative isglobalranking across all layers, keeping the top- K experts overall (e.g., 96×58 = 5568 total for DeepSeek-R1). Table C.4 compares these strategies. Table C.4: Local vs. global expert ranking on DeepSeek-R1 at 62.5% sparsity (96 experts per layer for local; 5568 total for global). Strategy MATH-500 GPQA LCB Local (96/layer) 96....

work page 2024
[11]

E.2 Efficiency in Large Language Models The challenge of deploying massive LLMs has spurred extensive research into model efficiency

achieve performance comparable to much larger dense models through dynamic, sparse activation of parameters. E.2 Efficiency in Large Language Models The challenge of deploying massive LLMs has spurred extensive research into model efficiency. Techniques such as quantization (Dettmers et al., 2022; Frantar et al., 2022), which reduces the numerical precisi...

work page 2022
[12]

60% off the original price,

So, 15. So, 15. So, 15. So, 15. So, 15. So, 15. So, 15. So, 15. So, 15. So, 15. So, 15. So, 15. So, 15. So, 15. (Keep repeating……) DeepSeek-R1 (8/32) w/o output collection: Figure E.1: Comparison of reasoning generation of DeepSeek-R1 when collecting PEU patterns with or without considering the model’s reasoning output. The top example uses our default co...

work page

[1] [1]

Compiling a Domain-Specific Specialist.This is the most straightforward application of a computational pattern

work page

[2] [2]

This forms the computational pattern for the domain

Pattern Identification:For a target domain T, calculate the PEU scores {PEUT i }Nr i=1 for all experts in each MoE layer across a calibration dataset XT. This forms the computational pattern for the domain

work page

[3] [3]

This pruned set of experts becomes the new set of routed experts{E r i (x)} M i=1 for the compiled instance

Expert Selection:For a given expert budget M, select the set of M experts with the highest PEU scores for each layer. This pruned set of experts becomes the new set of routed experts{E r i (x)} M i=1 for the compiled instance

work page

[4] [4]

The router weights are also pruned to remove parameters corresponding to the discarded experts

Instance Compilation:A new, sparse model instance is created containing only the selected routed experts. The router weights are also pruned to remove parameters corresponding to the discarded experts

work page

[5] [5]

Compiling a High-Efficiency Generalist.This strategy creates a single, sparse model that retains capability across multiple domains by creating a synthesized, multi-domain computational pattern

work page

[6] [6]

, TD}, first collect the token-level utility scores, {˜si(x)}, by running the model over all tokens in all of their respective calibration datasets, {XT1,

Synthesize Token-Level Scores:For a set of D target domains {T1, . . ., TD}, first collect the token-level utility scores, {˜si(x)}, by running the model over all tokens in all of their respective calibration datasets, {XT1, . . ., XTD }. All of these individual token-level scores are then aggregated into a single, large collection

work page

[7] [7]

PEUmulti i = 1 ∑D d=1 |XTd | D ∑ d=1 ∑ x∈XTd ˜si(x)

Calculate Multi-Domain PEU:A unified PEU score for the generalist model, PEUmulti i , is calculated by averaging all of the aggregated token-level scores. PEUmulti i = 1 ∑D d=1 |XTd | D ∑ d=1 ∑ x∈XTd ˜si(x). (11) This creates a single, blended PEU ranking that captures an expert’s importance across the full spectrum of targeted domains

work page

[8] [8]

This becomes the new set of routed experts {Er i (x)} M i=1

Expert Selection:For a given total expert budget M, select the set of M experts with the highest multi-domain PEU scores. This becomes the new set of routed experts {Er i (x)} M i=1

work page

[9] [9]

Instance Compilation:A new model instance is created containing only the final selected set of routed experts. This compilation process is performed once at deployment time, creating a static, efficient model instance that is proactively specialized for its intended application, whether that be single-domain or multi-domain. A.3 Generation Configuration F...

work page arXiv 2024

[10] [10]

An alternative isglobalranking across all layers, keeping the top- K experts overall (e.g., 96×58 = 5568 total for DeepSeek-R1)

per layer. An alternative isglobalranking across all layers, keeping the top- K experts overall (e.g., 96×58 = 5568 total for DeepSeek-R1). Table C.4 compares these strategies. Table C.4: Local vs. global expert ranking on DeepSeek-R1 at 62.5% sparsity (96 experts per layer for local; 5568 total for global). Strategy MATH-500 GPQA LCB Local (96/layer) 96....

work page 2024

[11] [11]

E.2 Efficiency in Large Language Models The challenge of deploying massive LLMs has spurred extensive research into model efficiency

achieve performance comparable to much larger dense models through dynamic, sparse activation of parameters. E.2 Efficiency in Large Language Models The challenge of deploying massive LLMs has spurred extensive research into model efficiency. Techniques such as quantization (Dettmers et al., 2022; Frantar et al., 2022), which reduces the numerical precisi...

work page 2022

[12] [12]

60% off the original price,

So, 15. So, 15. So, 15. So, 15. So, 15. So, 15. So, 15. So, 15. So, 15. So, 15. So, 15. So, 15. So, 15. So, 15. (Keep repeating……) DeepSeek-R1 (8/32) w/o output collection: Figure E.1: Comparison of reasoning generation of DeepSeek-R1 when collecting PEU patterns with or without considering the model’s reasoning output. The top example uses our default co...

work page