
arxiv: 2602.02443 · v2 · submitted 2026-02-02 · 💻 cs.LG

Certain Head, Uncertain Tail: Expert-Sample for Test-Time Scaling in Fine-Grained MoE

Pith reviewed 2026-05-16 08:16 UTC · model grok-4.3

classification 💻 cs.LG
keywords fine-grained MoE · test-time scaling · expert sampling · router scores · pass@n · mixture of experts · training-free method · LLM inference

The pith

Expert-Sample improves test-time scaling in fine-grained MoE by sampling stochastically only from the uncertain tail of router scores while fixing the certain head.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Fine-grained MoE models route each token to many experts, producing router scores that split into a stable high-confidence head and a variable low-confidence tail. The head appears to anchor core reasoning, since greedy accuracy holds when fewer experts are activated, yet pass@n falls sharply without the tail, showing that the tail supplies the diversity needed for multiple samples. Expert-Sample is a training-free method that locks the certain head in place and adds controlled randomness only to the tail. This produces varied yet stable outputs, lifting pass@32 from 85.4% to 91.9% and Best-of-N accuracy from 59.1% to 62.6% on GPQA-Diamond with Qwen3-30B-A3B-Instruct. Gains appear across math, knowledge reasoning, and code tasks on several fine-grained MoE models.

Core claim

Router scores in fine-grained MoE exhibit a certain head of high-confidence experts followed by an uncertain tail of low-confidence candidates. The certain head governs core reasoning capability while the uncertain tail correlates with reasoning diversity. Expert-Sample preserves high-confidence selections and injects controlled stochasticity into the uncertain tail, enabling diverse generation without destabilizing outputs and consistently improving pass@n and verification-based accuracy.

What carries the argument

Expert-Sample, the training-free procedure that keeps the certain head of router scores fixed and adds stochastic sampling only to the uncertain tail.
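
As a minimal sketch of the idea, assume one token's router scores are available as a plain list of positive probabilities. The function below always keeps the `head_size` highest-scoring experts and fills the remaining `k - head_size` slots by sampling without replacement from the next-ranked experts, weighted by score raised to `1 / tau`. The names `expert_sample`, `head_size`, `tail_pool`, and `tau`, and the score-proportional sampling rule itself, are illustrative assumptions; this summary does not give the paper's exact split or sampling rule.

```python
import random
from typing import List, Optional


def expert_sample(
    scores: List[float],
    k: int,
    head_size: int,
    tail_pool: int,
    tau: float = 1.0,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Pick k expert indices: a fixed head plus a sampled tail (illustrative)."""
    if not 0 <= head_size <= k <= len(scores):
        raise ValueError("need 0 <= head_size <= k <= number of experts")
    rng = rng or random.Random()
    # Rank experts from highest to lowest router score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    head = ranked[:head_size]  # certain head: always selected
    # Candidate tail: the next `tail_pool` experts after the head (assumption).
    candidates = ranked[head_size:head_size + tail_pool]
    need = k - head_size
    if need > len(candidates):
        raise ValueError("tail_pool is too small to fill k slots")
    # Weighted sampling without replacement; assumes strictly positive scores.
    weights = [scores[i] ** (1.0 / tau) for i in candidates]
    tail: List[int] = []
    for _ in range(need):
        pick = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
        tail.append(candidates.pop(pick))
        weights.pop(pick)
    return head + tail
```

Setting `head_size == k` recovers ordinary deterministic top-k routing, so in this sketch the gap between `head_size` and `k` is the only source of routing randomness.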

Load-bearing premise

The empirical split of router scores into a certain head that controls core reasoning and an uncertain tail that supplies diversity holds across fine-grained MoE models and tasks.

What would settle it

A test on a new fine-grained MoE model in which sampling from the tail produces no pass@n gain or in which randomly perturbing the certain head improves diversity more than perturbing the tail.

Original abstract

Test-time scaling improves LLM performance by generating multiple candidate solutions, yet token-level sampling requires temperature tuning that trades off diversity against stability. Fine-grained MoE, featuring hundreds of well-trained experts per layer and multi-expert activation per token, offers an unexplored alternative through its rich routing space. We empirically characterize fine-grained MoE routing and uncover an informative pattern: router scores exhibit a certain head of high-confidence experts followed by an uncertain tail of low-confidence candidates. While single-run greedy accuracy remains stable when fewer experts are activated, multi-sample pass@n degrades significantly, suggesting that the certain head governs core reasoning capability while the uncertain tail correlates with reasoning diversity. Motivated by these findings, we propose Expert-Sample, a training-free method that preserves high-confidence selections while injecting controlled stochasticity into the uncertain tail, enabling diverse generation without destabilizing outputs. Evaluated on multiple fine-grained MoE models across math, knowledge reasoning, and code tasks, Expert-Sample consistently improves pass@n and verification-based accuracy. On Qwen3-30B-A3B-Instruct evaluated on GPQA-Diamond with 32 parallel samples, pass@32 rises from 85.4% to 91.9%, and accuracy improves from 59.1% to 62.6% with Best-of-N verification.
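
For readers unfamiliar with the metric: pass@k is commonly computed with the unbiased estimator popularized by the HumanEval code-generation benchmark. For a problem with `n` samples of which `c` are correct, pass@k = 1 - C(n-c, k) / C(n, k). The sketch below implements that formula. The abstract does not say whether the paper uses this estimator or a direct "any of the 32 samples is correct" count, so treat the helper names `pass_at_k` and `mean_pass_at_k` as assumptions about the metric, not as the authors' evaluation code.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: Sequence[int], n: int, k: int) -> float:
    """Average pass@k over problems, each evaluated with n samples."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

When `k == n`, as with 32 samples and pass@32, the estimator reduces to checking whether at least one sample is correct.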

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Expert-Sample, a training-free test-time scaling method for fine-grained MoE LLMs. Router scores are observed to exhibit a high-confidence 'certain head' that stabilizes core reasoning and a low-confidence 'uncertain tail' that supplies diversity. The method keeps the head fixed while sampling only from the tail, yielding higher pass@n and Best-of-N verification accuracy than greedy routing. Concrete gains are reported on Qwen3-30B-A3B-Instruct (GPQA-Diamond: pass@32 85.4% → 91.9%, accuracy 59.1% → 62.6%) and across math, reasoning, and code tasks with multiple MoE models.

Significance. If the head-tail pattern and the resulting gains prove robust, the work supplies a structure-aware, hyperparameter-light alternative to temperature sampling for multi-sample inference in MoE architectures. The empirical improvements on pass@n and verification accuracy indicate a practical route to better diversity-stability trade-offs without retraining, which could be valuable for scaling test-time compute on models that already activate many experts per token.

major comments (3)
  1. Experimental section: the central claim that Expert-Sample improves the diversity-stability trade-off beyond temperature tuning rests on an untested assumption. No direct baselines are provided for temperature sampling (T = 0.7–1.5) or nucleus sampling on the identical model, prompt set, and number of samples; only greedy/default routing is contrasted. Without these controls the routing-specific construction is not shown to be load-bearing.
  2. Method section: the precise rule used to delineate the 'certain head' from the 'uncertain tail' (fixed score threshold, dynamic per-token quantile, or fixed expert count) is not stated with sufficient detail for reproducibility. The paper must specify how the split is computed and whether it is held constant across layers and models.
  3. Empirical characterization (§ on routing analysis): the claim that the certain head governs stability while the tail supplies diversity is supported only by the observation that pass@n drops when fewer experts are activated. Alternative explanations (overall capacity reduction rather than the specific head-tail split) are not ruled out by controlled ablations that isolate the effect of the split itself.
minor comments (2)
  1. Abstract: the phrase 'multiple fine-grained MoE models' should list the exact models evaluated so readers can immediately assess the scope of the claims.
  2. Notation: the terms 'certain head' and 'uncertain tail' appear in the abstract before any definition; a brief parenthetical gloss or forward reference would improve readability.
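
The token-level baselines requested in major comment 1 are standard techniques. As a reference point, here is a minimal sketch of temperature scaling followed by nucleus (top-p) filtering over one token's logits, in pure Python. The function name `sample_token` and its defaults are illustrative and are not taken from the paper.

```python
import math
import random
from typing import List, Optional


def sample_token(
    logits: List[float],
    temperature: float = 1.0,
    top_p: float = 1.0,
    rng: Optional[random.Random] = None,
) -> int:
    """Sample a token index with temperature scaling and nucleus filtering."""
    if temperature <= 0 or not 0 < top_p <= 1:
        raise ValueError("need temperature > 0 and 0 < top_p <= 1")
    rng = rng or random.Random()
    # Temperature-scaled softmax, shifted by the max for numerical stability.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Nucleus: keep the smallest top-ranked set whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: List[int] = []
    mass = 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # random.choices normalizes the kept weights implicitly.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

A like-for-like comparison would draw the same number of samples per prompt with this kind of sampler as with routing-level sampling, so that any difference in pass@n reflects where the randomness is injected, not how much compute is spent.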

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and have revised the manuscript to incorporate the suggested improvements where appropriate.

Point-by-point responses
  1. Referee: Experimental section: the central claim that Expert-Sample improves the diversity-stability trade-off beyond temperature tuning rests on an untested assumption. No direct baselines are provided for temperature sampling (T = 0.7–1.5) or nucleus sampling on the identical model, prompt set, and number of samples; only greedy/default routing is contrasted. Without these controls the routing-specific construction is not shown to be load-bearing.

    Authors: We agree that direct comparisons to temperature and nucleus sampling are required to substantiate the claim. In the revised manuscript we have added these baselines on the identical models, prompt sets, and sample counts. Expert-Sample yields higher pass@n and Best-of-N accuracy than temperature sampling (T=0.7, 1.0, 1.5) and nucleus sampling (p=0.9), confirming that the head-tail routing structure provides a superior diversity-stability trade-off beyond generic sampling methods. revision: yes

  2. Referee: Method section: the precise rule used to delineate the 'certain head' from the 'uncertain tail' (fixed score threshold, dynamic per-token quantile, or fixed expert count) is not stated with sufficient detail for reproducibility. The paper must specify how the split is computed and whether it is held constant across layers and models.

    Authors: We have revised the Method section to provide the exact delineation rule: the certain head comprises the top two experts whose router scores exceed a dynamic per-token threshold set at the 75th percentile of the router-score distribution for that token; the uncertain tail consists of all remaining activated experts below this threshold. The same quantile rule is applied uniformly across layers and models. Pseudocode and implementation details have been added to ensure full reproducibility. revision: yes

  3. Referee: Empirical characterization (§ on routing analysis): the claim that the certain head governs stability while the tail supplies diversity is supported only by the observation that pass@n drops when fewer experts are activated. Alternative explanations (overall capacity reduction rather than the specific head-tail split) are not ruled out by controlled ablations that isolate the effect of the split itself.

    Authors: The fact that greedy accuracy remains stable while pass@n degrades when fewer experts are activated already indicates that the head preserves core capacity and the tail drives diversity. To isolate the split effect more rigorously, we have added a controlled ablation that selectively removes or randomizes experts from the head versus the tail (while keeping total expert count constant). The results show that tail perturbation primarily reduces diversity metrics with negligible impact on single-sample accuracy, whereas head perturbation affects both, supporting the specific head-tail distinction beyond generic capacity reduction. revision: partial
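
The delineation rule in response 2 can be read in more than one way. Under one possible reading, sketched below, the 75th-percentile score is computed over the experts activated for a token, the head holds at most two of the highest-scoring experts that strictly exceed it, and every other activated expert forms the tail. The function name `split_head_tail`, the use of `statistics.quantiles` with the inclusive method, and the choice to compute the percentile over activated experts only are assumptions, not details confirmed by the rebuttal.

```python
from statistics import quantiles
from typing import List, Tuple


def split_head_tail(
    scores: List[float],
    activated: List[int],
    max_head: int = 2,
) -> Tuple[List[int], List[int]]:
    """Split activated experts into (head, tail) by a per-token 75th percentile."""
    if len(activated) < 2:
        raise ValueError("need at least two activated experts")
    act_scores = [scores[i] for i in activated]
    # quantiles(..., n=4) returns the three quartile cut points; index 2 is Q3.
    threshold = quantiles(act_scores, n=4, method="inclusive")[2]
    ranked = sorted(activated, key=lambda i: scores[i], reverse=True)
    # Head: highest-scoring experts strictly above the threshold, capped at max_head.
    head = [i for i in ranked if scores[i] > threshold][:max_head]
    tail = [i for i in ranked if i not in head]
    return head, tail
```

Under this reading, a token with eight activated experts and distinct scores always yields a head of exactly two, because the inclusive third quartile falls strictly between the second- and third-highest scores.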

Circularity Check

0 steps flagged

Purely empirical characterization with no circular derivation

Full rationale

The paper performs direct empirical observation of router scores in fine-grained MoE models, identifies the certain-head/uncertain-tail pattern from data, and defines the Expert-Sample procedure explicitly from that pattern. All reported gains (e.g., pass@32 improvements) are measured on held-out tasks with no fitted parameters, no equations that equate outputs to inputs by construction, and no load-bearing self-citations or uniqueness theorems. The method is therefore self-contained and does not reduce to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on an empirical observation of routing scores rather than new mathematical axioms or invented entities.

axioms (1)
  • domain assumption: Router scores in fine-grained MoE exhibit a stable high-confidence head followed by an uncertain low-confidence tail.
    This pattern is presented as the key empirical finding that motivates the sampling strategy.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models

    cs.LG · 2026-05 · unverdicted · novelty 7.0

    Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.