ProbMoE: Differentiable Probabilistic Routing for Mixture-of-Experts

Guy Van den Broeck; Heng Zhao; Zhe Zeng; Zilei Shao

arxiv: 2606.01509 · v1 · pith:X5HLR35Anew · submitted 2026-06-01 · 💻 cs.LG · cs.AI

Heng Zhao, Zilei Shao, Guy Van den Broeck, Zhe Zeng

Pith reviewed 2026-06-28 15:56 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords mixture of experts · probabilistic routing · differentiable discrete selection · expert utilization · sparse activation · cardinality constrained subsets · neural network scaling

The pith

ProbMoE makes MoE expert routing differentiable by modeling selection as inference over subset distributions and using exact marginal probabilities for gradients.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ProbMoE to solve the non-differentiability problem in top-k routing for mixture-of-experts models. It represents expert selection as a probability distribution over subsets with fixed or bounded cardinality and casts routing as inference inside that discrete space. Exact-k routing samples a k-expert subset on the forward pass and routes the backward gradient through each expert's exact marginal probability. A dynamic-k extension keeps the same cardinality range at train and test time so the number of active experts can adapt per token. This matters to readers because the paper reports stronger expert utilization and routing diversity, with performance that matches or exceeds baselines across standard benchmarks and backbones.

Core claim

ProbMoE models expert selection as a distribution over cardinality-constrained expert subsets and formulates routing as probabilistic inference in this discrete subset space. ProbMoE Exact-k routing samples k-expert subsets in the forward pass, and the backward pass uses gradients through each expert's exact marginal probability as a tractable surrogate for the true gradient. ProbMoE naturally generalizes to a dynamic-k routing setting, where both training and inference constrain the routing cardinality to the same predefined range, allowing adaptive expert allocation per token. Across benchmarks and model backbones, ProbMoE Exact-k achieves strong performance compared to competitive baselines, with improved expert utilization and routing diversity; ProbMoE Dynamic-k achieves comparable performance with fewer activated experts.
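The forward-pass step, drawing a k-expert subset, can be illustrated with a minimal sketch. It assumes a product-form distribution p(S) ∝ ∏_{i∈S} exp(logit_i) over subsets of size exactly k; this page does not state the paper's parameterization, so that form and the function names `suffix_esp` and `sample_k_subset` are hypothetical. Sampling proceeds expert by expert, using a table of elementary symmetric polynomials to get each conditional inclusion probability.

```python
import random
from math import exp


def suffix_esp(weights, k):
    """table[i][j] = elementary symmetric polynomial e_j of weights[i:]."""
    n = len(weights)
    table = [[0.0] * (k + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        table[i][0] = 1.0  # e_0 of any set is 1
    for i in range(n - 1, -1, -1):
        for j in range(1, k + 1):
            # Either skip weights[i] or use it as one of the j factors.
            table[i][j] = table[i + 1][j] + weights[i] * table[i + 1][j - 1]
    return table


def sample_k_subset(logits, k, rng):
    """Draw S with |S| = k and p(S) proportional to prod_{i in S} exp(logit_i).

    Assumed distribution (not taken from the paper). Requires k <= len(logits).
    """
    weights = [exp(x) for x in logits]
    table = suffix_esp(weights, k)
    chosen, remaining = [], k
    for i, w in enumerate(weights):
        if remaining == 0:
            break
        # P(include i | earlier decisions), with `remaining` slots still open.
        p_include = w * table[i + 1][remaining - 1] / table[i][remaining]
        if rng.random() < p_include:
            chosen.append(i)
            remaining -= 1
    return chosen


rng = random.Random(0)
subset = sample_k_subset([2.0, 0.5, 0.0, -1.0], k=2, rng=rng)
```

The sketch works with raw `exp(logit)` weights for readability; a practical version would likely work in log space to avoid overflow.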

What carries the argument

The exact marginal probability of each expert computed from the distribution over cardinality-constrained subsets, used as the surrogate gradient signal in the backward pass.
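To make "exact marginal probability" concrete, here is a small sketch under the same kind of assumption: a product-form distribution p(S) ∝ ∏_{i∈S} exp(logit_i) over subsets of size k, which this page does not confirm as the paper's choice. Under that assumption the marginal of expert i is w_i · e_{k-1}(w without i) / e_k(w), where e_j is the j-th elementary symmetric polynomial. The function names are illustrative, and a brute-force enumeration is included only as a cross-check.

```python
from itertools import combinations
from math import exp, prod


def elementary_symmetric(weights, k):
    """Return [e_0, ..., e_k] of the weights via the O(n*k) recurrence."""
    e = [1.0] + [0.0] * k
    for w in weights:
        for j in range(k, 0, -1):  # descending j so each weight is used once
            e[j] += w * e[j - 1]
    return e


def exact_k_marginals(logits, k):
    """P(i in S) for each expert i under the assumed product-form distribution."""
    weights = [exp(x) for x in logits]
    z = elementary_symmetric(weights, k)[k]
    marginals = []
    for i, w in enumerate(weights):
        rest = weights[:i] + weights[i + 1:]
        marginals.append(w * elementary_symmetric(rest, k - 1)[k - 1] / z)
    return marginals


def brute_force_marginals(logits, k):
    """Same quantity by enumerating every k-subset (only feasible for tiny n)."""
    weights = [exp(x) for x in logits]
    n = len(weights)
    subsets = list(combinations(range(n), k))
    scores = [prod(weights[i] for i in s) for s in subsets]
    total = sum(scores)
    return [
        sum(sc for s, sc in zip(subsets, scores) if i in s) / total
        for i in range(n)
    ]


logits = [2.0, 0.5, 0.0, -1.0]
marginals = exact_k_marginals(logits, k=2)
```

Because every subset has exactly k members, the marginals sum to k, which is a quick sanity check on any implementation.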

If this is right

Exact-k routing matches or exceeds baseline accuracy while raising expert utilization and routing diversity.
Dynamic-k routing matches baseline performance yet activates fewer experts on average by adapting cardinality within a fixed range.
The same probabilistic formulation applies across multiple model backbones and standard MoE benchmarks.
Both variants keep the cardinality constraint identical at training and inference time.
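The dynamic-k idea in the list above can be sketched as follows. This is an illustration under an assumption the page does not confirm: that subsets S with k_min ≤ |S| ≤ k_max are weighted as p(S) ∝ ∏_{i∈S} exp(logit_i). Under that form, the probability of activating exactly j experts is e_j(w) divided by the sum of e_j(w) over the allowed range. The function names are hypothetical.

```python
from math import exp


def elementary_symmetric(weights, k):
    """Return [e_0, ..., e_k] of the weights via the O(n*k) recurrence."""
    e = [1.0] + [0.0] * k
    for w in weights:
        for j in range(k, 0, -1):
            e[j] += w * e[j - 1]
    return e


def cardinality_distribution(logits, k_min, k_max):
    """P(|S| = j) for j in [k_min, k_max] under the assumed product form."""
    weights = [exp(x) for x in logits]
    e = elementary_symmetric(weights, k_max)
    z = sum(e[k_min:k_max + 1])
    return {j: e[j] / z for j in range(k_min, k_max + 1)}


def expected_cardinality(logits, k_min, k_max):
    """Average number of active experts for one token."""
    dist = cardinality_distribution(logits, k_min, k_max)
    return sum(j * p for j, p in dist.items())


# Two hypothetical tokens: uniformly low logits versus uniformly high logits.
low = expected_cardinality([-2.0] * 4, k_min=1, k_max=3)
high = expected_cardinality([2.0] * 4, k_min=1, k_max=3)
```

In this toy form, raising all logits shifts mass toward larger subsets, so `high` exceeds `low`; whether the paper's per-token adaptation behaves the same way depends on its actual parameterization.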

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The marginal-probability surrogate may reduce reliance on auxiliary load-balancing losses that are common in current MoE training.
The subset-distribution view could be applied to other discrete gating or selection mechanisms inside neural networks.
Measuring gradient variance directly between the marginal surrogate and alternative estimators would quantify how much the method stabilizes training.
Dynamic cardinality might yield additional efficiency gains when input complexity varies widely across a dataset.

Load-bearing premise

That routing the gradient through each expert's exact marginal probability rather than through the sampled subset gives a low-variance unbiased surrogate for the true discrete selection gradient.

What would settle it

An experiment that replaces the marginal-probability gradient with a direct gradient through the sampled subset (or with REINFORCE) on the same models and benchmarks and measures whether convergence, final performance, or expert balance degrades.
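A toy version of this comparison can be run exactly by enumeration, without any training. The sketch below is not the paper's experiment and makes two labeled assumptions: the subset distribution is p(S) ∝ ∏_{i∈S} exp(θ_i) over k-subsets, and the surrogate backward pass is a straight-through rule that multiplies ∂loss/∂z_i at the sampled subset by ∂μ_i/∂θ_j, where μ_i is the marginal. All names are hypothetical.

```python
from itertools import combinations
from math import exp, prod


def enumerate_subsets(theta, k):
    """Indicator vectors and probabilities for p(S) ∝ prod_{i in S} exp(theta_i)."""
    n = len(theta)
    subsets = list(combinations(range(n), k))
    scores = [prod(exp(theta[i]) for i in s) for s in subsets]
    total = sum(scores)
    zs = [[1.0 if i in s else 0.0 for i in range(n)] for s in subsets]
    return zs, [sc / total for sc in scores]


def true_gradient(theta, k, loss):
    """Exact d/dtheta_j of E_S[loss(z_S)], using d log p(S)/d theta_j = z_j - mu_j."""
    zs, ps = enumerate_subsets(theta, k)
    n = len(theta)
    mu = [sum(p * z[j] for z, p in zip(zs, ps)) for j in range(n)]
    return [sum(p * loss(z) * (z[j] - mu[j]) for z, p in zip(zs, ps))
            for j in range(n)]


def expected_surrogate_gradient(theta, k, loss_grad):
    """Expectation of the assumed straight-through-via-marginals estimator."""
    zs, ps = enumerate_subsets(theta, k)
    n = len(theta)
    mu = [sum(p * z[j] for z, p in zip(zs, ps)) for j in range(n)]
    # For this exponential-family form, d mu_i / d theta_j = Cov(z_i, z_j).
    cov = [[sum(p * (z[i] - mu[i]) * (z[j] - mu[j]) for z, p in zip(zs, ps))
            for j in range(n)] for i in range(n)]
    mean_dl = [sum(p * loss_grad(z)[i] for z, p in zip(zs, ps)) for i in range(n)]
    return [sum(mean_dl[i] * cov[i][j] for i in range(n)) for j in range(n)]


c = [1.0, 2.0, 4.0]
theta = [0.0, 0.0, 0.0]


def linear(z):
    return sum(ci * zi for ci, zi in zip(c, z))


def linear_grad(z):
    return list(c)


def quadratic(z):
    return linear(z) ** 2


def quadratic_grad(z):
    return [2.0 * linear(z) * ci for ci in c]


lin_true = true_gradient(theta, 2, linear)
lin_sur = expected_surrogate_gradient(theta, 2, linear_grad)
quad_true = true_gradient(theta, 2, quadratic)
quad_sur = expected_surrogate_gradient(theta, 2, quadratic_grad)
```

In this toy, the two gradients coincide when the loss is linear in the indicators and differ when it is quadratic. That says nothing about how large the gap is for real MoE losses, which is what the proposed experiment would have to measure.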

Figures

Figures reproduced from arXiv: 2606.01509 by Guy Van den Broeck, Heng Zhao, Zhe Zeng, Zilei Shao.

**Figure 1.** Comparison of conventional Top-k training and ProbMoE training. Left: Conventional MoE applies a deterministic top-k operator to the softmax routing probabilities for expert selection, while propagating gradients only through these probabilities. Right: ProbMoE models expert routing as probabilistic inference over discrete expert subsets. ProbMoE samples an expert subset S from a cardinality-constrained di…

**Figure 2.** Ablation study of forward routing and backward gradient estimation under exact-k routing on OLMoE for GSM. Box plots show exact-match (EM) accuracy, where higher is better.

Adjacent body text (truncated): …experimental setup of DenseMixer (Yao et al., 2026), using the same datasets, data splits, and evaluation protocols. We evaluate ProbMoE across a diverse set of tasks, including mathematical reasoning on GSM8K (Cobbe et al., 2021), legal-do…

**Figure 5.** Average routing cardinality at each layer under ProbMoE Dynamic-k with OLMoE backbone on different datasets.

Adjacent body text (truncated): …a broader set of experts to routing probability over the course of training. Rather than relying on a small group of consistently dominant experts, ProbMoE distributes routing probability more broadly, allowing a wider range of experts to participate across prompts and layers. This broader routing…

**Figure 6.** Token frequency versus average expert activation under ProbMoE Dynamic-k routing (over 656k tokens). Tokens are ordered by increasing frequency. The solid curve shows the average number of active experts per token, while the shaded histogram (right axis, log scale) shows token frequency. Dashed line indicates the expected activation under uniform expert assignment, with rarer tokens activating more exper…

**Figure 7.** Wall-clock and memory analysis on the GSM fine-tuning task. Comparison across Conventional fine-tuning, DenseMixer, ProbMoE DP, and ProbMoE SDD on OLMoE-1B-7B. (a) Per-step throughput in tokens per second over 30 training steps. (b) Steady-state per-step compute time. (c) End-to-end wall-clock time. (d) Peak GPU memory per GPU. Values in parentheses denote ratios relative to Conventional…

Original abstract

Mixture-of-Experts (MoE) models scale by activating only a small subset of experts per token. However, training such models remains challenging because top-$k$ routing is discrete and non-differentiable, requiring gradient estimators for expert selection whose design remains a central open problem. We introduce ProbMoE, a probabilistic routing framework that models expert selection as a distribution over cardinality-constrained expert subsets and formulates routing as probabilistic inference in this discrete subset space. We first propose ProbMoE Exact-$k$ routing, which samples $k$-expert subsets in the forward pass, and the backward pass uses gradients through each expert's exact marginal probability as a tractable surrogate for the true gradient. ProbMoE naturally generalizes to a dynamic-$k$ routing setting, where both training and inference constrain the routing cardinality to the same predefined range, allowing adaptive expert allocation per token. Across benchmarks and model backbones, ProbMoE Exact-$k$ achieves strong performance compared to competitive baselines, with improved expert utilization and routing diversity; ProbMoE Dynamic-$k$ achieves comparable performance with fewer activated experts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ProbMoE frames routing as subset inference and uses exact marginals as gradient surrogates, but the cardinality dependence likely introduces bias the abstract never bounds.

The core move is to treat expert selection as sampling from a distribution over cardinality-constrained subsets, then back-propagate through each expert's marginal inclusion probability rather than through the sampled subset itself. That is distinct from the usual Gumbel or straight-through tricks and is the main thing worth noting.

The paper does two concrete things. First, it spells out Exact-k, where the forward pass draws a k-subset and the backward pass substitutes the marginals. Second, it extends the same construction to Dynamic-k so that both training and inference respect the same cardinality range, which lets the model allocate different numbers of experts per token. The abstract claims this yields better utilization and diversity than standard baselines.

The soft spot is exactly the one the stress-test note flags. Because the subset is forced to size k, the inclusion indicators are negatively dependent; the gradient of the expected loss with respect to the logits is not the sum of the per-expert marginal gradients. The abstract presents the marginal route as a tractable surrogate without deriving the bias or showing it is negligible, so the stability claim rests on an unverified equality. The abstract also contains no numbers, ablation results, or variance estimates, so it cannot show on its own whether the method works in practice.
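The negative dependence mentioned here is easy to see on a toy case. The sketch below assumes a product-form distribution p(S) ∝ ∏_{i∈S} exp(logit_i) over k-subsets (an assumption for illustration, not the paper's stated model) and computes the covariance matrix of the inclusion indicators by enumeration. The function name is hypothetical.

```python
from itertools import combinations
from math import exp, prod


def inclusion_covariance(logits, k):
    """Marginals and Cov(z_i, z_j) of inclusion indicators, by enumeration."""
    n = len(logits)
    subsets = list(combinations(range(n), k))
    scores = [prod(exp(logits[i]) for i in s) for s in subsets]
    total = sum(scores)
    probs = [sc / total for sc in scores]
    mu = [sum(p for s, p in zip(subsets, probs) if i in s) for i in range(n)]
    cov = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # P(i in S and j in S); for i == j this is just the marginal.
            joint = sum(p for s, p in zip(subsets, probs) if i in s and j in s)
            cov[i][j] = joint - mu[i] * mu[j]
    return mu, cov


mu, cov = inclusion_covariance([2.0, 0.5, 0.0, -1.0], k=2)
off_diagonal = [cov[i][j] for i in range(4) for j in range(4) if i != j]
```

Every off-diagonal entry is negative in this example, and each row sums to zero because the subset size is fixed at k: selecting one expert necessarily lowers the chance of selecting the others.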

This is for people who build or scale sparse models and need a new handle on the routing gradient. A reader who already knows the MoE literature will see the marginal construction quickly and can decide whether the dependence issue is fatal or fixable.

I would send it to peer review. The problem is central, the construction is explicit, and the dynamic-k extension is a natural next step; referees can check whether the bias is controlled and whether the experiments hold up.

Referee Report

2 major / 1 minor

Summary. The paper introduces ProbMoE, a probabilistic routing framework for Mixture-of-Experts that treats expert selection as inference over cardinality-constrained subsets. Exact-k routing samples k-expert subsets in the forward pass and routes gradients through each expert's exact marginal probability p_i as a surrogate for the discrete selection gradient; Dynamic-k extends this to allow per-token k within a fixed range. The manuscript claims that both variants achieve strong benchmark performance relative to baselines, with improved expert utilization and routing diversity.

Significance. If the marginal-probability surrogate yields stable, low-bias training dynamics, the approach would supply a more principled and differentiable alternative to existing top-k routing estimators, potentially improving both performance and expert load balancing in large-scale MoE models.

major comments (2)

[Abstract / §3 (method description)] The central technical claim—that routing the gradient through each expert’s exact marginal probability p_i supplies a tractable and sufficiently accurate surrogate for the true gradient of E[loss | subset]—is load-bearing for all reported results. No derivation or bias bound is supplied showing that the per-expert marginal gradients equal the gradient of the cardinality-constrained expectation; the negative dependence among indicators induced by the exact-k constraint is not addressed.
[Abstract] The abstract asserts “strong performance … with improved expert utilization” yet supplies neither quantitative tables, ablation results, nor error bars. Without these data the empirical support for the surrogate’s effectiveness cannot be evaluated.

minor comments (1)

[§3] Notation for the marginal probability p_i and the sampling distribution over subsets should be introduced with explicit equations rather than prose only.
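As a sketch of what such explicit notation might look like, the equations below assume a product-form subset distribution with per-expert logits $\theta_i$. The page does not confirm that this is the paper's parameterization, and the symbols $w_i$, $z_i$, $\mu_i$, and $e_k$ are introduced here for illustration only, with $\mu_i$ standing in for the report's $p_i$.

```latex
\begin{align}
p_\theta(S) &= \frac{\prod_{i \in S} w_i}{e_k(w)},
  \qquad w_i = e^{\theta_i}, \quad |S| = k, \\
e_k(w) &= \sum_{|S'| = k} \; \prod_{i \in S'} w_i, \\
\mu_i(\theta) &= \sum_{S \ni i} p_\theta(S)
  = \frac{w_i \, e_{k-1}(w_{-i})}{e_k(w)},
  \qquad \sum_i \mu_i = k, \\
\frac{\partial \mu_i}{\partial \theta_j} &= \operatorname{Cov}(z_i, z_j),
  \qquad z_i = \mathbb{1}[i \in S].
\end{align}
```

Here $e_k$ is the $k$-th elementary symmetric polynomial and $w_{-i}$ denotes the weights with expert $i$ removed. The last line follows from the exponential-family form of $p_\theta$, and it is where the negative dependence raised in the major comments enters the gradient.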

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. Below we provide point-by-point responses to the major comments. We propose revisions where they strengthen the presentation without altering the core contributions.

Referee: [Abstract / §3 (method description)] The central technical claim—that routing the gradient through each expert’s exact marginal probability p_i supplies a tractable and sufficiently accurate surrogate for the true gradient of E[loss | subset]—is load-bearing for all reported results. No derivation or bias bound is supplied showing that the per-expert marginal gradients equal the gradient of the cardinality-constrained expectation; the negative dependence among indicators induced by the exact-k constraint is not addressed.

Authors: We agree that an explicit derivation of the marginal-probability surrogate and its relationship to the cardinality-constrained expectation would improve clarity. In the revised manuscript we will add a dedicated subsection in §3 that (i) derives the exact marginal probabilities under the cardinality constraint, (ii) shows that the per-expert gradient through p_i is the expectation of the indicator gradient conditional on the remaining experts, and (iii) discusses the effect of negative dependence among the indicators. A formal bias bound is not currently derived in the paper; we will include a short discussion of the approximation error together with empirical evidence that the surrogate remains stable across the reported scales.

Revision: yes

Referee: [Abstract] The abstract asserts “strong performance … with improved expert utilization” yet supplies neither quantitative tables, ablation results, nor error bars. Without these data the empirical support for the surrogate’s effectiveness cannot be evaluated.

Authors: Abstracts are intentionally concise and do not contain tables or error bars; the quantitative support for the performance claims, including benchmark tables, ablation studies on Exact-k versus Dynamic-k, expert utilization histograms, routing diversity metrics, and results with standard error bars from multiple random seeds, is provided in §4 and the associated figures. We will verify that every claim in the abstract is directly traceable to a specific result in the experimental section and, if space permits, add one or two representative numbers to the abstract for immediate context.

Revision: partial

Circularity Check

0 steps flagged

No circularity: direct probabilistic construction with empirical validation

The paper presents ProbMoE as a probabilistic model over cardinality-constrained subsets, with forward sampling of k-subsets and backward use of per-expert marginal probabilities as an explicit surrogate. No equations reduce the surrogate gradient to the true gradient by algebraic identity, no parameters are fitted on a data subset and then relabeled as predictions, and no self-citations are used to justify uniqueness or load-bearing premises. Performance results are reported as empirical outcomes on benchmarks rather than derived from the method's own inputs. The derivation chain therefore remains self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are stated. The approach relies on standard discrete probability over subsets and the existence of tractable marginals.

Reference graph

Works this paper leans on

15 extracted references · 9 canonical work pages · 5 internal anchors

[1] Aghdam, M. A., Jin, H., and Wu, Y. DA-MoE: Towards dynamic expert allocation for mixture-of-experts models. CoRR, abs/2409.06669.

[2] Bengio, Y., Léonard, N., and Courville, A. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.

[3] Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

[4] … doi: 10.18653/v1/2024.acl-long.696.
Jang, E., Gu, S., and Poole, B. Categorical reparameterization with Gumbel-softmax. In International Conference on Learning Representations.

[5] Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D. S., Casas, D. d. l., Hanna, E. B., Bressand, F., et al. Mixtral of experts. arXiv preprint arXiv:2401.04088.

[6] Jin, C., Peng, H., Xiang, M., Zhang, Q., Yuan, X., Hasan, A., Dibua, O., Gong, Y., Kang, Y., and Metaxas, D. N. Sparsity-controllable dynamic top-p MoE for large foundation model pre-training. arXiv preprint arXiv:2512.13996, 2025a.
Jin, P., Zhu, B., Yuan, L., and Yan, S. MoE++: Accelerating mixture-of-experts methods with zero-computation experts. I...

[7] Liu, L., Gao, J., and Chen, W. Sparse backpropagation for MoE training. In Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization (WANT@NeurIPS 2023).

[8] Qian, C., Manolache, A., Ahmed, K., Zeng, Z., Van den Broeck, G., Niepert, M., and Morris, C. Probabilistically rewired message-passing neural networks. In International Conference on Learning Representations, volume 2024, pp. 32051–32076.

[9] Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.

[10] … URL https://qwenlm.github.io/blog/qwen1.5/.
Wang, Z., Chen, D., Dai, D., Xu, R., Li, Z., and Wu, Y. Let the expert stick to his last: Expert-specialized fine-tuning for sparse architectural large language models. In Al-Onaizan, Y., Bansal, M., and Chen, Y.-N. (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processi...

[11] … doi: 10.18653/v1/2024.emnlp-main.46.
Wang, Z., Zhu, J., and Chen, J. ReMoE: Fully differentiable mixture-of-experts with ReLU routing. In The Thirteenth International Conference on Learning Representations.

[12] Zeng, Z., Miao, Y., Gao, H., Zhang, H., and Deng, Z. AdaMoE: Token-adaptive routing with null experts for mixture-of-experts language models. In Al-Onaizan, Y., Bansal, M., and Chen, Y.-N. (eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 6223–6235, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-emnlp.361.

[13] Zuo, S., Liu, X., Jiao, J., Kim, Y. J., Hassan, H., Zhang, R., Gao, J., and Zhao, T. Taming sparsely activated transformer with stochastic experts. In International Conference on Learning Representations.

[14] (Appendix text extracted as a reference.) A. Methodological and Implementation Details. A.1. DenseMixer: Dense Training-Time Routing. DenseMixer addresses the non-differentiability of top-k routing by modifying gradient propagation during training (Yao et al., 2026). Rather than modeling expert subset selection directly, it int...

[15] (Body text extracted as a reference.) In terms of peak GPU memory, ProbMoE DP and ProbMoE SDD closely match Conventional fine-tuning, whereas DenseMixer requires substantially more memory than any other method. DenseMixer requires a dense forward pass through all experts for every token in order to compute its straight-through gradient, dramatically increasing forward-pass FLOPs. In contrast,...
