arxiv: 2606.10703 · v2 · pith:G46PY77Jnew · submitted 2026-06-09 · 💻 cs.LG · cs.CL

From Observation to Intervention: A Causal Audit of Expert Importance in Mixture-of-Experts Models

Pith reviewed 2026-06-27 14:16 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords mixture-of-experts · causal interpretability · expert pruning · routing statistics · interventional audit · observational metrics · model compression · MoE architectures

The pith

Observational routing metrics do not predict causal expert importance in Mixture-of-Experts models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether statistics such as expert utilization rates or activation norms can forecast what happens when an expert is removed from a Mixture-of-Experts model. It runs token-level interventions that change which expert processes each token and measures the resulting change in model behavior. Across three different high-redundancy MoE models and sixty metric-layer pairs, every observational statistic shows effect sizes below Cohen's d of 0.23 and fails a dual statistical test. A control intervention that directly alters routing weights does recover a detectable signal, showing the audit has power. The results indicate that existing pruning methods succeed only because early layers contain enough redundant experts that almost any selection rule works.

Core claim

A token-level interventional audit across three high-redundancy MoE architectures finds no observational metric predicts causal expert importance in any model: across all 60 metric-layer combinations effect sizes stay below Cohen's d = 0.23, and no metric is reliably positive under our corrected, dual-test criterion. A per-token routing weight control, run with identical n, rules out insufficient power, recovering a signal whose CI excludes zero at OLMoE's final MoE layer (d = +0.231, 95% CI [+0.09, +0.37], p = 0.0013). Existing pruning methods succeed in this regime not by identifying dispensable experts but because early-layer redundancy renders most selection criteria interchangeable.
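
For readers who want the effect-size arithmetic spelled out, here is a minimal sketch of Cohen's d with a large-sample confidence interval. It assumes the textbook two-sample form with a pooled standard deviation and a normal approximation for the interval; the summary above does not say which variant of d or which interval method the paper uses, so the function names `cohens_d` and `d_confidence_interval` and the formulas are illustrative, not the authors' procedure.

```python
import math
from statistics import mean, variance


def cohens_d(a, b):
    """Two-sample Cohen's d using the pooled standard deviation (assumed variant)."""
    na, nb = len(a), len(b)
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def d_confidence_interval(d, na, nb, z=1.96):
    """Large-sample normal approximation to the CI of d (z=1.96 gives about 95%)."""
    se = math.sqrt((na + nb) / (na * nb) + d * d / (2 * (na + nb)))
    return d - z * se, d + z * se


# Toy data, not taken from the paper.
d = cohens_d([1, 2, 3, 4, 5], [0, 1, 2, 3, 4])
low, high = d_confidence_interval(d, 5, 5)
print(f"d = {d:.3f}, approx. 95% CI [{low:.3f}, {high:.3f}]")
```

An interval that excludes zero, as reported for the routing-weight control, indicates an effect distinguishable from no effect at the given sample size; an interval that straddles zero does not.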

What carries the argument

The token-level interventional audit that forces or blocks routing to specific experts on individual tokens and measures the downstream effect on model output.
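
The forcing and blocking described above can be sketched for a single token under top-k routing. This is a toy illustration under stated assumptions: the function names `route` and `blocking_effect` are hypothetical, routing weights are taken to be a softmax renormalised over the selected experts (real MoE implementations differ on this), and each expert's output is a plain scalar rather than a hidden-state vector. It is not the paper's implementation.

```python
import math


def route(logits, k, blocked=None, forced=None):
    """Return {expert_index: weight} for one token under top-k routing.

    blocked: expert excluded from selection for this token.
    forced:  expert guaranteed a slot in the selected set.
    """
    if forced is not None and forced == blocked:
        raise ValueError("an expert cannot be both forced and blocked")
    candidates = [i for i in range(len(logits)) if i != blocked]
    ranked = sorted(candidates, key=lambda i: logits[i], reverse=True)
    if forced is not None:
        ranked = [forced] + [i for i in ranked if i != forced]
    chosen = ranked[:k]
    # Assumed weighting: softmax renormalised over the chosen experts only.
    top = max(logits[i] for i in chosen)
    exps = {i: math.exp(logits[i] - top) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def blocking_effect(logits, expert_outputs, k, expert):
    """Change in a toy scalar layer output when `expert` is blocked for this token."""
    def output(weights):
        return sum(w * expert_outputs[i] for i, w in weights.items())
    return output(route(logits, k, blocked=expert)) - output(route(logits, k))


# Toy router logits and expert outputs, not taken from any model.
logits = [2.0, 1.0, 0.0, -1.0]
print(route(logits, k=2))             # baseline selection
print(route(logits, k=2, blocked=0))  # expert 0 removed for this token
print(blocking_effect(logits, [1.0, 2.0, 3.0, 4.0], k=2, expert=0))
```

Blocking an expert that the router would not have selected leaves the output unchanged in this toy, which is why an intervention of this kind is measured per token rather than inferred from population-level statistics.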

If this is right

  • Pruning decisions based on utilization rates, activation norms, or routing weights do not target experts whose removal changes model behavior.
  • Early-layer redundancy makes most selection criteria interchangeable, so pruning succeeds without identifying truly dispensable experts.
  • Population-level observational summaries cannot be treated as evidence for token-level interventional claims about expert importance.
  • The same inferential gap from rung-1 associations to rung-2 interventions appears in at least one concrete interpretability practice.
  • A direct routing-weight control produces measurable effects, confirming that causal signals are detectable when they exist.
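
As a concrete instance of the population-level summaries these bullets refer to, the sketch below computes a per-expert utilization rate from recorded routing decisions. The function name `utilization_rates` and the input layout (one list of selected expert indices per token) are assumptions for illustration; pruning methods may define utilization differently.

```python
from collections import Counter


def utilization_rates(assignments, num_experts):
    """Fraction of tokens routed to each expert.

    assignments: one list of selected expert indices per token (assumed layout).
    """
    if not assignments:
        raise ValueError("need at least one token")
    counts = Counter(e for token in assignments for e in token)
    n_tokens = len(assignments)
    return [counts[e] / n_tokens for e in range(num_experts)]


# Toy routing decisions for four tokens with top-2 routing.
rates = utilization_rates([[0, 1], [0, 2], [0, 1], [1, 3]], num_experts=4)
print(rates)
```

A statistic like this aggregates over all tokens, so it says how often an expert is chosen but not what happens to the output when a given token is routed elsewhere.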

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same audit design could be applied to other neural-network components where observational proxies are used to justify edits or pruning.
  • High redundancy may be masking the need for more precise expert routing in current MoE training regimes.
  • Testing the same metrics in lower-redundancy or differently scaled MoE models would show whether the disconnect is architecture-specific.
  • The results supply a concrete template for checking whether other interpretability claims rest on untested moves from observation to intervention.

Load-bearing premise

The chosen token-level routing interventions isolate the causal contribution of individual experts without introducing confounding changes to other model computations or to the router itself.

What would settle it

Observing an observational metric whose effect size exceeds Cohen's d = 0.5 and passes the dual-test criterion when predicting the outcome of expert-removal interventions in any of the three tested models would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.10703 by Christian Medeiros Adriano, Holger Giese, Leonard Engmann.

Figure 1: In OLMoE, effect size grows monotonically with depth and reaches d = +0.231 at Layer 15 (p = 0.0013, the one result in the entire experiment to survive Bonferroni correction). Its 95% CI, [+0.09, +0.37], excludes zero under identical n; the contrast with the observational cells is therefore one of where effects are centred, not of statistical power. Qwen and DeepSeek show no comparable depth concentratio…
Figure 2: Two-level dissociation in redistribution: Spearman ρ across depth for OLMoE-1B-7B-0924 and Qwen1.5-MoE-A2.7B. Left: routing weight versus gap norm; the router tracks expert contribution magnitude at every layer in both models. Right: gap norm versus relative compensation; in Qwen the second chain closes from Layer 6, in OLMoE only at Layer 15. Significance markers: ∗ p < 0.05; ∗∗ p < 0.01; ∗∗∗ p < 0.001 [PI…
Abstract

Interpretability methods routinely use population-level summary statistics over observed model behaviour to license claims about the effects of targeted interventions on specific computations; in Pearl's terms, they treat rung-1 associational evidence as if it supported rung-2 interventional conclusions, a move whose validity is rarely tested. We examine one concrete instance: the use of routing statistics in Mixture-of-Experts (MoE) pruning, where utilization rates, activation norms, and routing weight distributions are treated as predictors of which experts can be removed without functional cost. A token-level interventional audit across three high-redundancy MoE architectures (OLMoE-1B-7B-0924, Qwen1.5-MoE-A2.7B, DeepSeek-V2-Lite) finds no observational metric predicts causal expert importance in any model: across all 60 metric-layer combinations effect sizes stay below Cohen's $d = 0.23$, and no metric is reliably positive under our corrected, dual-test criterion. A per-token routing weight control, run with identical $n$, rules out insufficient power, recovering a signal whose CI excludes zero at OLMoE's final MoE layer ($d = +0.231$, 95\% CI $[+0.09, +0.37]$, $p = 0.0013$). Existing pruning methods succeed in this regime not by identifying dispensable experts but because early-layer redundancy renders most selection criteria interchangeable. Our results provide an explicit counterexample to the common inferential step from population-level observational summaries to token-level interventional claims about expert importance, and illustrate how interventional audits can calibrate the evidential standards for interpretability claims.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper claims that common observational metrics (utilization rates, activation norms, routing weight distributions) do not predict the causal importance of experts in Mixture-of-Experts models under token-level routing interventions. An audit across three high-redundancy MoE architectures (OLMoE-1B-7B-0924, Qwen1.5-MoE-A2.7B, DeepSeek-V2-Lite) finds effect sizes below Cohen's d = 0.23 for all 60 metric-layer combinations, with no metric reliably positive under a corrected dual-test criterion. A per-token routing-weight positive control recovers a detectable signal (d = +0.231, 95% CI [+0.09, +0.37], p = 0.0013 at OLMoE's final layer), and the authors conclude that pruning succeeds due to early-layer redundancy rather than precise expert identification.

Significance. If the interventional results hold, the work supplies a concrete counterexample to the routine use of rung-1 observational summaries to support rung-2 interventional claims in interpretability. The explicit reporting of effect sizes, confidence intervals, p-values, and a positive control that rules out insufficient power are strengths that make the negative finding falsifiable and reproducible. This calibrates evidential standards for MoE pruning methods and suggests that apparent success of observational pruning criteria stems from interchangeability under redundancy rather than accurate causal identification.

minor comments (3)
  1. [Methods] The Methods section should include a table or supplementary figure summarizing effect sizes, CIs, and test outcomes for all 60 metric-layer combinations to allow direct inspection of the uniformity claim.
  2. [Methods] The 'corrected, dual-test criterion' is referenced in the abstract, but its exact definition and implementation would benefit from an explicit algorithmic description or pseudocode in the main text for reproducibility.
  3. [Experimental Setup] The manuscript would be strengthened by reporting the precise number of tokens and layers analyzed per model, as well as any preprocessing steps for the token-level interventions.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the supportive review, accurate summary of our findings, and recommendation for minor revision. The report correctly identifies the core contribution as a falsifiable counterexample to rung-1 to rung-2 inference in MoE interpretability, along with the value of the positive control and explicit effect-size reporting.

Circularity Check

0 steps flagged

No circularity in empirical interventional audit

The paper derives its central claim from direct token-level routing interventions on three MoE models, followed by computation of Cohen's d effect sizes and dual-test statistical criteria on the resulting output changes. These steps are experimental measurements of causal effects, not self-definitional reductions, fitted parameters renamed as predictions, or load-bearing self-citations. The per-token routing-weight positive control is an independent verification of statistical power using the same n, and no equations or prior-author citations are invoked to force the null result. The audit is self-contained against external benchmarks via the reported interventions and CIs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper is an empirical study that relies on standard causal-inference and statistical assumptions without introducing new free parameters or postulated entities.

axioms (1)
  • domain assumption: Token-level routing interventions can be performed in a manner that isolates the causal effect of a single expert.
    The audit design treats the router output as directly manipulable without side effects on other computations.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. How Modular Is a Frontier Mixture-of-Experts? A Pre-registered Causal Test in Which Apparent Expert Modularity Mostly Dissolves

    cs.LG 2026-06 conditional novelty 8.0

    Pre-registered ablation tests on Command A+ reveal that only one of six expert families (Arabic) shows clean selective modularity; all others fail selectivity or are measurement-dependent.
