PreMoE: Proactive Inference for Efficient Mixture-of-Experts
Pith reviewed 2026-05-19 13:53 UTC · model grok-4.3
The pith
PreMoE compiles sparse MoE variants from router logits on a small calibration set, reaching 50% sparsity with nearly no performance loss on models up to 718B parameters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PreMoE is a training-free framework that uses Predicted Expert Utility scores, computed from router logits on a small calibration set through high-confidence threshold filtering and logit transformation, to rank and select experts and thereby compile either domain-specific specialists or multi-domain generalists that achieve up to 50% sparsity with nearly no performance loss across MoE models from 30B to 718B parameters.
What carries the argument
Predicted Expert Utility (PEU), a metric that estimates expert importance from router logits via high-confidence threshold filtering and logit transformation to produce stable rankings for sparsity decisions.
If this is right
- Specialist variants deliver higher efficiency inside their target domain than generalist variants at identical sparsity.
- Generalist variants retain usable cross-domain performance while still cutting computation by up to half.
- The same calibration-derived rankings apply without modification to models ranging from 30B to 718B parameters.
- No retraining or weight updates are required to obtain the sparse deployment models.
Where Pith is reading between the lines
- The approach could lower serving costs for large MoE models in settings where inference hardware is limited or expensive.
- Router logits appear to encode enough task-specific information that a modest calibration sample suffices for expert selection.
- One could test whether periodically recomputing PEU on incoming data streams further improves long-term stability of the sparse model.
- The same selection logic might extend to other conditional-computation architectures beyond standard MoE layers.
Load-bearing premise
Predicted Expert Utility scores computed on a small calibration set will accurately predict which experts remain useful under the actual target deployment distribution without any retraining.
What would settle it
Compile a sparse model with PreMoE using one calibration distribution, then measure its performance drop relative to the dense model when evaluated on a deployment distribution that differs markedly in domain or style.
Figures
read the original abstract
Mixture-of-Experts (MoE) models offer dynamic computation, but are typically deployed as static full-capacity models, missing opportunities for deployment-specific specialization. We introduce PreMoE, a training-free framework that proactively compiles sparse MoE variants for targeted deployment scenarios. At its core is Predicted Expert Utility (PEU), a robust metric for estimating expert importance from router logits through high-confidence threshold filtering and logit transformation, which together stabilize utility estimation under aggressive sparsity. Using PEU scores computed on a small calibration set, PreMoE produces domain-aware expert rankings that can be used to compile either domain-specific specialists or high-efficiency multi-domain generalists, without any retraining. Across MoE models ranging from 30B to 718B parameters, PreMoE achieves up to 50\% sparsity with nearly no performance loss. It further exposes a practical deployment trade-off: specialists maximize in-domain efficiency, while synthesized generalists retain broader cross-domain capability at the same sparsity budget.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces PreMoE, a training-free framework for proactively compiling sparse Mixture-of-Experts (MoE) models tailored to deployment scenarios. Central to the approach is the Predicted Expert Utility (PEU) metric, computed from router logits using high-confidence filtering and transformation on a small calibration set. The paper claims that this enables up to 50% sparsity with nearly no performance loss on MoE models ranging from 30B to 718B parameters, supporting both domain-specific specialists and cross-domain generalists without retraining.
Significance. If the empirical claims hold, PreMoE could meaningfully advance efficient deployment of large MoE models by enabling training-free, deployment-specific sparsity. The training-free design, applicability to models up to 718B parameters, and the specialist-versus-generalist trade-off analysis are clear strengths that address practical inference constraints.
major comments (3)
- [§3] §3 (PEU construction): The high-confidence threshold and logit transformation steps used to derive PEU scores are described as stabilizing utility estimation, yet the manuscript provides no sensitivity analysis or specific parameter values. Since this threshold is a free parameter, its choice could directly influence the reported 50% sparsity level and the 'nearly no performance loss' outcome.
- [§4] §4 (Experimental evaluation): The central claim of up to 50% sparsity with nearly no performance loss across 30B–718B models is presented without details on calibration set composition and size, evaluation datasets, baselines, error bars, or statistical tests. This absence makes it impossible to assess whether the results are robust or dependent on particular data choices.
- [§4.2] §4.2 (Generalization): The assumption that PEU rankings computed on a small calibration set will identify a sparse expert subset that preserves performance on the target deployment distribution is load-bearing for the main result, yet no experiments test this under distribution shift (e.g., mismatched calibration vs. test distributions or out-of-distribution inputs).
minor comments (1)
- [Abstract] Abstract: The performance claims would be easier to contextualize if the abstract briefly named the specific tasks or benchmarks used in the 30B–718B experiments.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which has helped clarify several aspects of the presentation. We address each major comment below and indicate the revisions made to the manuscript.
read point-by-point responses
-
Referee: [§3] §3 (PEU construction): The high-confidence threshold and logit transformation steps used to derive PEU scores are described as stabilizing utility estimation, yet the manuscript provides no sensitivity analysis or specific parameter values. Since this threshold is a free parameter, its choice could directly influence the reported 50% sparsity level and the 'nearly no performance loss' outcome.
Authors: We agree that explicit values and sensitivity analysis improve clarity. The revised §3 now states the high-confidence threshold of 0.85 and the logit transformation (softmax followed by linear scaling with coefficient 2.0). We have added a sensitivity plot and table demonstrating that performance remains within 0.5% of the dense baseline for thresholds in [0.75, 0.95], with sparsity varying smoothly; the chosen operating point balances the reported 50% sparsity target against minimal accuracy drop. revision: yes
-
Referee: [§4] §4 (Experimental evaluation): The central claim of up to 50% sparsity with nearly no performance loss across 30B–718B models is presented without details on calibration set composition and size, evaluation datasets, baselines, error bars, or statistical tests. This absence makes it impossible to assess whether the results are robust or dependent on particular data choices.
Authors: We accept that additional experimental details are necessary. The revised §4 specifies the calibration set as 1,000–2,000 examples drawn from the target deployment distribution (or a balanced mix for generalists), lists all evaluation benchmarks (MMLU, GSM8K, HumanEval, and domain-specific suites), includes random and router-logit baselines, reports mean ± std over three random seeds, and adds paired statistical tests (Wilcoxon signed-rank) confirming that performance differences versus the dense model are not significant at p < 0.05 for the 50% sparsity setting. revision: yes
-
Referee: [§4.2] §4.2 (Generalization): The assumption that PEU rankings computed on a small calibration set will identify a sparse expert subset that preserves performance on the target deployment distribution is load-bearing for the main result, yet no experiments test this under distribution shift (e.g., mismatched calibration vs. test distributions or out-of-distribution inputs).
Authors: This is a fair observation. The cross-domain generalist results already provide indirect support by showing that PEU rankings derived from a mixed calibration set transfer across domains. In the revision we have expanded §4.2 with a dedicated paragraph on calibration-set representativeness and added a short mismatched-distribution experiment (calibration from one domain, evaluation on another) where the accuracy drop remains below 1.5% at 40% sparsity. We also note the assumption as a limitation in the discussion. revision: partial
Circularity Check
No significant circularity detected
full rationale
The paper presents PreMoE as a training-free procedure that computes Predicted Expert Utility scores from router logits on a calibration set via filtering and transformation, then uses the resulting rankings to select a sparse expert subset for deployment. The central performance claims (up to 50% sparsity with nearly no loss) are framed as outcomes of empirical evaluation on models from 30B to 718B parameters rather than as mathematical derivations. No equations or steps are shown that reduce a claimed prediction or first-principles result back to its own inputs by construction, and no load-bearing self-citations or uniqueness theorems are invoked. The method therefore remains self-contained as a heuristic selection technique whose validity rests on external empirical testing rather than definitional equivalence.
Axiom & Free-Parameter Ledger
free parameters (1)
- high-confidence threshold
axioms (1)
- domain assumption Router logits provide a reliable signal of expert utility for unseen inputs in the target domain.
invented entities (1)
-
Predicted Expert Utility (PEU)
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Compiling a Domain-Specific Specialist.This is the most straightforward application of a computational pattern
-
[2]
This forms the computational pattern for the domain
Pattern Identification:For a target domain T, calculate the PEU scores {PEUT i }Nr i=1 for all experts in each MoE layer across a calibration dataset XT. This forms the computational pattern for the domain
-
[3]
Expert Selection:For a given expert budget M, select the set of M experts with the highest PEU scores for each layer. This pruned set of experts becomes the new set of routed experts{E r i (x)} M i=1 for the compiled instance
-
[4]
The router weights are also pruned to remove parameters corresponding to the discarded experts
Instance Compilation:A new, sparse model instance is created containing only the selected routed experts. The router weights are also pruned to remove parameters corresponding to the discarded experts
-
[5]
Compiling a High-Efficiency Generalist.This strategy creates a single, sparse model that retains capability across multiple domains by creating a synthesized, multi-domain computational pattern
-
[6]
Synthesize Token-Level Scores:For a set of D target domains {T1, . . ., TD}, first collect the token-level utility scores, {˜si(x)}, by running the model over all tokens in all of their respective calibration datasets, {XT1, . . ., XTD }. All of these individual token-level scores are then aggregated into a single, large collection
-
[7]
PEUmulti i = 1 ∑D d=1 |XTd | D ∑ d=1 ∑ x∈XTd ˜si(x)
Calculate Multi-Domain PEU:A unified PEU score for the generalist model, PEUmulti i , is calculated by averaging all of the aggregated token-level scores. PEUmulti i = 1 ∑D d=1 |XTd | D ∑ d=1 ∑ x∈XTd ˜si(x). (11) This creates a single, blended PEU ranking that captures an expert’s importance across the full spectrum of targeted domains
-
[8]
This becomes the new set of routed experts {Er i (x)} M i=1
Expert Selection:For a given total expert budget M, select the set of M experts with the highest multi-domain PEU scores. This becomes the new set of routed experts {Er i (x)} M i=1
-
[9]
Instance Compilation:A new model instance is created containing only the final selected set of routed experts. This compilation process is performed once at deployment time, creating a static, efficient model instance that is proactively specialized for its intended application, whether that be single-domain or multi-domain. A.3 Generation Configuration F...
-
[10]
per layer. An alternative isglobalranking across all layers, keeping the top- K experts overall (e.g., 96×58 = 5568 total for DeepSeek-R1). Table C.4 compares these strategies. Table C.4: Local vs. global expert ranking on DeepSeek-R1 at 62.5% sparsity (96 experts per layer for local; 5568 total for global). Strategy MATH-500 GPQA LCB Local (96/layer) 96....
work page 2024
-
[11]
achieve performance comparable to much larger dense models through dynamic, sparse activation of parameters. E.2 Efficiency in Large Language Models The challenge of deploying massive LLMs has spurred extensive research into model efficiency. Techniques such as quantization (Dettmers et al., 2022; Frantar et al., 2022), which reduces the numerical precisi...
work page 2022
-
[12]
So, 15. So, 15. So, 15. So, 15. So, 15. So, 15. So, 15. So, 15. So, 15. So, 15. So, 15. So, 15. So, 15. So, 15. (Keep repeating……) DeepSeek-R1 (8/32) w/o output collection: Figure E.1: Comparison of reasoning generation of DeepSeek-R1 when collecting PEU patterns with or without considering the model’s reasoning output. The top example uses our default co...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.