pith. machine review for the scientific record.

arxiv: 2603.18492 · v2 · submitted 2026-03-19 · 💻 cs.LG

Recognition: 2 theorem links

· Lean Theorem

AIMER: Calibration-Free Task-Agnostic MoE Pruning

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 08:17 UTC · model grok-4.3

classification 💻 cs.LG
keywords mixture of experts · expert pruning · model compression · calibration-free · task-agnostic · large language models · MoE

The pith

A weight-only statistic ranks experts in MoE models without any calibration data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Mixture-of-Experts language models keep many specialized sub-networks but must store every expert in memory. Prior task-agnostic pruning methods measure expert importance from activations or routing counts collected on a separate calibration set, which adds time and makes results depend on the chosen data. AIMER instead scores each expert directly from its weights using the absolute mean divided by the root mean square. This produces clear separation inside each layer and supports pruning at 25 percent or 50 percent ratios. On 7B to 30B parameter models evaluated across 16 benchmarks, the resulting pruned models match or exceed calibration-based baselines while completing the entire scoring step in 0.22 to 1.27 seconds.

Core claim

AIMER computes an importance score for each expert as the absolute mean of its weights divided by their root mean square, creating distinct within-layer stratification that permits effective task-agnostic pruning without calibration data or activation statistics.

What carries the argument

The AIMER score, the absolute mean over the root mean square of an expert's weights, separates experts by importance using only model parameters.
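The scoring and pruning steps described above can be sketched in a few lines. A minimal NumPy sketch, assuming "absolute mean" means the mean of |w| (the paper's exact convention, mean of |w| versus |mean of w|, is not pinned down here) and that pruning keeps the top-scored experts within each layer:

```python
import numpy as np

def aimer_score(weights: np.ndarray) -> float:
    """Score one expert from its weights alone; no data or activations
    are needed. Assumes "absolute mean" = mean of |w|; by Cauchy-Schwarz
    the ratio always lands in (0, 1]."""
    w = weights.ravel()
    abs_mean = np.abs(w).mean()
    rms = np.sqrt(np.mean(w ** 2))
    return float(abs_mean / rms)

def prune_layer(expert_weights, ratio=0.25):
    """Keep the top-(1 - ratio) experts within one MoE layer,
    returning the sorted indices of the retained experts."""
    scores = [aimer_score(w) for w in expert_weights]
    n_keep = int(round(len(expert_weights) * (1 - ratio)))
    order = np.argsort(scores)[::-1]        # descending importance
    return sorted(order[:n_keep].tolist())  # indices of kept experts
```

Because the ranking depends only on stored parameters, the whole scoring pass is a single cheap sweep over the weight tensors, which is consistent with the sub-second scoring times the paper reports.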

Load-bearing premise

The absolute mean over root mean square of an expert's weights reliably ranks its usefulness across any future task without needing to observe activations or data.

What would settle it

The claim would be undermined if, on a new MoE model, pruning with AIMER produced accuracy more than five percent lower on average than a calibration-based method across the same benchmarks, with both methods tested on distributions far from the original training data.
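Stated as a check, the settling condition might look like the following hypothetical helper, reading the five-percent margin as absolute accuracy points averaged over paired benchmarks:

```python
def aimer_refuted(acc_aimer, acc_calibrated, margin=0.05):
    """Return True if AIMER trails the calibration-based baseline by
    more than `margin` (absolute accuracy on a 0-1 scale) on average
    across the same out-of-distribution benchmarks. Both inputs are
    paired per-benchmark accuracies; names and margin convention are
    assumptions, not the paper's protocol."""
    assert len(acc_aimer) == len(acc_calibrated)
    gaps = [c - a for a, c in zip(acc_aimer, acc_calibrated)]
    return sum(gaps) / len(gaps) > margin
```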

read the original abstract

Mixture-of-Experts (MoE) language models increase parameter capacity without proportional per-token compute, but the deployment still requires storing all experts, making expert pruning important for reducing memory and serving overhead. Existing task-agnostic expert pruning methods are typically calibration-dependent: they estimate expert importance from routing or activation statistics on a calibration set, which makes pruning outcomes sensitive to the choice of calibration set and adds substantial preprocessing cost. We introduce AIMER (Absolute mean over root mean square IMportance for Expert Ranking), a simple calibration-free criterion that yields clear within-layer score separation and distinct expert stratification. Across 7B to 30B MoE language models at 25% and 50% pruning ratios over 16 benchmarks, AIMER consistently delivers competitive or stronger overall performance against state-of-the-art calibration-based expert pruning baselines with only 0.22–1.27 seconds for scoring the experts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces AIMER, a calibration-free expert importance scoring method for pruning Mixture-of-Experts (MoE) language models. AIMER computes per-expert scores as the absolute mean divided by the root-mean-square of the expert's weight matrix (or activations), producing a within-layer ranking that is used to prune at 25% or 50% ratios. The central empirical claim is that this parameter-free, data-free criterion matches or exceeds the accuracy of calibration-dependent baselines (e.g., routing-frequency or activation-magnitude methods) on 7B–30B MoE models across 16 benchmarks while requiring only 0.22–1.27 seconds of scoring time.

Significance. If the performance parity holds under rigorous statistical controls, AIMER removes a practical deployment bottleneck by eliminating calibration-set selection and the associated preprocessing cost. The method’s simplicity, speed, and lack of fitted parameters constitute a clear engineering contribution for serving large MoE models under memory constraints.

major comments (3)
  1. [§4.2, Table 2] The reported accuracy numbers are given as single-point estimates without standard deviations, confidence intervals, or the number of random seeds. Because the central claim is that AIMER is “consistently competitive or stronger,” the absence of variability measures makes it impossible to assess whether observed differences are statistically meaningful or within noise.
  2. [§3.1, Eq. (3)] The importance score is defined solely from the training-distribution statistics of the expert weights. The manuscript provides no out-of-distribution or domain-shift experiments (e.g., code vs. math vs. multilingual test sets) that would directly test whether the fixed ranking remains task-agnostic when the downstream distribution diverges from pre-training data.
  3. [§4.3] The ablation that isolates the contribution of the absolute-mean versus RMS normalization is missing; without it, it is unclear whether the reported gains are driven by the specific functional form or simply by any reasonable magnitude-based ranking.
minor comments (2)
  1. Notation: the manuscript alternates between “expert weight matrix” and “activation tensor” when describing the input to the RMS operation; a single consistent symbol and a short appendix derivation would remove ambiguity.
  2. Figure 1 caption: the y-axis label “Importance Score” should explicitly state the normalization (absolute mean / RMS) so readers can reproduce the plot from the text alone.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful comments, which help improve the clarity and rigor of our work. We address each major point below.

read point-by-point responses
  1. Referee: [§4.2, Table 2] The reported accuracy numbers are given as single-point estimates without standard deviations, confidence intervals, or the number of random seeds. Because the central claim is that AIMER is “consistently competitive or stronger,” the absence of variability measures makes it impossible to assess whether observed differences are statistically meaningful or within noise.

    Authors: We agree with this observation. To address it, we will rerun the experiments reported in Table 2 using three different random seeds for the pruning process (where applicable) and report the mean accuracy along with standard deviations. This will be included in the revised manuscript to demonstrate that the performance differences are consistent and not due to noise. revision: yes

  2. Referee: [§3.1, Eq. (3)] The importance score is defined solely from the training-distribution statistics of the expert weights. The manuscript provides no out-of-distribution or domain-shift experiments (e.g., code vs. math vs. multilingual test sets) that would directly test whether the fixed ranking remains task-agnostic when the downstream distribution diverges from pre-training data.

    Authors: While our evaluation spans 16 diverse benchmarks covering various domains (including code, math, and multilingual tasks), we acknowledge the value of explicit domain-shift experiments. In the revision, we will add a discussion in §3.1 on why the weight-based criterion is expected to be robust to distribution shifts, and include results on additional domain-specific benchmarks if space permits. revision: partial

  3. Referee: [§4.3] The ablation that isolates the contribution of the absolute-mean versus RMS normalization is missing; without it, it is unclear whether the reported gains are driven by the specific functional form or simply by any reasonable magnitude-based ranking.

    Authors: We will add the requested ablation study to §4.3 in the revised manuscript. Specifically, we will compare AIMER against two variants: one using only the absolute mean and another using only the RMS of the weights. This will clarify that the ratio provides superior expert stratification and pruning performance compared to the individual components. revision: yes
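The promised ablation can be framed as three competing scores (variant names here are hypothetical, not the paper's). One design point worth making explicit: the full ratio is scale-invariant, while either component alone changes if all of an expert's weights are rescaled.

```python
import numpy as np

def score_variants(weights: np.ndarray) -> dict:
    """Candidate expert scores for the ablation: the full AIMER ratio
    versus its two components in isolation."""
    w = weights.ravel()
    abs_mean = np.abs(w).mean()      # magnitude component
    rms = np.sqrt(np.mean(w ** 2))   # spread component
    return {
        "aimer": abs_mean / rms,     # the paper's ratio
        "abs_mean_only": abs_mean,   # ablation variant 1
        "rms_only": rms,             # ablation variant 2
    }
```

Multiplying an expert's weights by a constant leaves the "aimer" entry unchanged but rescales both ablation variants, so the ratio ranks experts by the shape of their weight distribution rather than by raw magnitude; the ablation would test whether that invariance is what actually drives the reported gains.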

Circularity Check

0 steps flagged

No significant circularity in AIMER's importance scoring

full rationale

The paper defines AIMER directly via a simple statistical operation (absolute mean over RMS) on expert weight values, with no fitted parameters, no self-referential equations, and no reduction of any claimed prediction back to its own inputs. No load-bearing self-citations or uniqueness theorems are invoked to justify the core criterion. Performance claims rest on empirical comparisons to baselines rather than a closed derivation chain. This is the most common honest finding for a paper whose central contribution is an explicit, parameter-free statistic.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on the domain assumption that raw weight statistics suffice for expert importance without task-specific calibration; no free parameters are introduced and no new entities are postulated.

axioms (1)
  • domain assumption Weight statistics computed from the model itself, without external calibration data, provide a reliable task-agnostic ranking of expert importance
    Invoked when the paper claims the score yields clear separation and competitive performance independent of calibration set choice.

pith-pipeline@v0.9.0 · 5476 in / 1226 out tokens · 56055 ms · 2026-05-15T08:17:40.858277+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts

    cs.LG 2026-05 unverdicted novelty 8.0

    HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-wei...