Expert Routing for Communication-Efficient MoE via Finite Expert Banks

Ali Khalesi; Mohammad Reza Deylam Salehi

arxiv: 2605.05278 · v1 · submitted 2026-05-06 · cs.LG · cs.IT · math.IT

Pith reviewed 2026-05-08 16:33 UTC · model grok-4.3

keywords: mixture of experts · expert routing · mutual information · generalization gap · communication efficiency · stochastic channel · finite expert bank · accuracy-rate curve

The pith

Finite expert banks let algorithmic mutual information track the generalization gap in mixture-of-experts routing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a finite bank of pretrained CNN experts on MNIST together with a discrete data-dependent selection rule so that information-theoretic analysis of MoE gating becomes computable. Treating the gate as a stochastic channel, it quantifies routing information through I(X;T) and introduces a closed-form discrete-entropy estimator for the algorithmic mutual information I(S;W), computed from the empirical posterior q(W|S) over the finite bank. Sweeping the data-dependence parameter α shows that this estimated quantity follows the generalization gap monotonically while the Xu-Raginsky bound stays loose. The same construction supports an empirical estimator for the input-to-expert mutual information I(X;T) and a Blahut-Arimoto procedure that traces accuracy-rate curves over the bank.

Core claim

By restricting the expert set to a finite collection of pretrained CNNs and adopting a discrete selection rule whose dependence on data is controlled by a single parameter α, the algorithmic mutual information I(S;W) admits a closed-form discrete-entropy estimator computed from the empirical posterior q(W|S). As α is swept, the estimate Î(S;W) tracks the generalization gap monotonically, while the Xu-Raginsky bound remains loose; a uniform union-bound baseline is reported for comparison. The framework further supplies an empirical estimator of I(X;T) and a Blahut-Arimoto routine that traces an accuracy-rate curve over the expert bank.
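A minimal sketch of what a discrete-entropy estimator of this kind could look like, using the identity I(S;W) = H(W) − H(W|S). It assumes, as in the Xu-Raginsky setting, that S denotes the training sample and W the selected expert, and that the posterior q(W|S) is available as one probability vector per sampled training set. The function names and the equal weighting of sampled training sets are assumptions made for illustration; the paper's exact estimator may differ.

```python
import numpy as np

def entropy(p, axis=-1):
    """Shannon entropy in nats, treating 0 * log 0 as 0."""
    p = np.asarray(p, dtype=float)
    logp = np.log(np.where(p > 0, p, 1.0))  # log(1) = 0 where p == 0
    return -np.sum(p * logp, axis=axis)

def mi_from_posteriors(q):
    """Plug-in estimate of I(S;W) = H(W) - H(W|S), in nats.

    q has shape (M, K): row m is the posterior q(W | S_m) over the K
    experts for the m-th sampled training set (hypothetical layout,
    each S_m weighted equally).
    """
    q = np.asarray(q, dtype=float)
    marginal = q.mean(axis=0)                # estimate of p(W)
    return float(entropy(marginal) - entropy(q, axis=1).mean())
```

Under these assumptions the estimate is 0 when every row is the same distribution (selection ignores the data) and reaches log K when each training set deterministically picks a different one of K equally likely experts.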

What carries the argument

The finite expert bank constructed from pretrained CNN experts on MNIST together with the discrete data-dependent selection rule, which renders the algorithmic mutual information I(S;W) computable from the empirical posterior via a closed-form discrete-entropy estimator.

If this is right

MoE systems can treat I(X;T) and the accuracy-rate curve as practical proxies when designing routing policies that trade communication cost against accuracy.
Data-dependent expert selection yields higher routing information efficiency than uniform selection over the same bank.
The closed-form estimator supplies a concrete diagnostic for when a chosen routing rule will generalize.
Resource-aware inference architectures gain an explicit tool for balancing computation, communication, and performance.
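To make the accuracy-rate curve in the first item concrete, here is a sketch of the standard rate-distortion form of the Blahut-Arimoto iteration applied to a finite bank. It assumes a per-input loss table `loss[i, k]` (loss of expert k on input i), uniformly weighted inputs, and a trade-off parameter `beta`; these names, and the toy loss table, are illustrative assumptions rather than the paper's setup.

```python
import numpy as np

def blahut_arimoto(loss, beta, n_iter=200):
    """Rate-distortion Blahut-Arimoto for a gate p(t|x) over a finite bank.

    loss[i, k]: loss of expert k on input x_i (inputs weighted uniformly).
    beta >= 0:  larger values favour low loss over low rate.
    Returns (rate I(X;T) in nats, expected loss, gate matrix).
    """
    loss = np.asarray(loss, dtype=float)
    n, k = loss.shape
    r = np.full(k, 1.0 / k)                        # marginal over experts
    for _ in range(n_iter):
        # p(t|x) proportional to r(t) * exp(-beta * loss), computed stably
        logits = np.log(np.maximum(r, 1e-300)) - beta * loss
        logits -= logits.max(axis=1, keepdims=True)
        gate = np.exp(logits)
        gate /= gate.sum(axis=1, keepdims=True)
        r = gate.mean(axis=0)                      # update the marginal
    safe_r = np.where(r > 0, r, 1.0)
    ratio = np.where(gate > 0, gate / safe_r, 1.0)
    rate = float(np.sum(gate * np.log(ratio)) / n)
    expected_loss = float(np.sum(gate * loss) / n)
    return rate, expected_loss, gate

# Toy bank (illustrative only): two experts, each correct on half the inputs.
toy_loss = np.array([[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]])
curve = [blahut_arimoto(toy_loss, b)[:2] for b in (0.0, 1.0, 2.0, 5.0)]
```

Sweeping `beta` yields (rate, loss) pairs; plotting 1 − loss against rate gives one point of the curve per value of `beta`.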

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The finite-bank construction could be replicated on other pretrained expert pools to analyze routing in vision or language transformers.
Replacing the discrete selection rule with a soft gating mechanism would test how closely the information quantities approximate real deployed MoE behavior.
The accuracy-rate curve produced by the Blahut-Arimoto procedure might serve as a benchmark when comparing learned routers against information-theoretically optimal ones.
If the tracking relation holds, mutual-information proxies could guide pruning or compression decisions inside larger MoE training loops.

Load-bearing premise

The finite expert bank built from pretrained CNNs on MNIST together with the discrete data-dependent selection rule is representative of routing behavior in large-scale MoE systems used in practice.

What would settle it

Observing whether the estimated I(S;W) continues to track the generalization gap monotonically when the same sweeping procedure is applied to a large-scale MoE model trained on a high-complexity dataset such as ImageNet.

Figures

Figures reproduced from arXiv: 2605.05278 by Ali Khalesi, Mohammad Reza Deylam Salehi.

**Figure 1.** Finite-bank MNIST protocol. A small sample…

**Figure 2.** Empirical distribution of the MNIST finite-bank generalization gap b…

Original abstract

Resource-efficient machine learning increasingly uses sparse Mixture-of-Experts (MoE) architectures, where the gate acts as both a learning component and a routing interface controlling computation, communication, and accuracy. Motivated by finite-rate interpretations of MoE gating, we treat the gate as a stochastic channel and use $I(X;T)$ to quantify the routing information available to the selected expert. To make the associated information quantities tractable beyond synthetic examples, we develop a finite-bank MNIST construction using pretrained CNN experts and a discrete, data-dependent selection rule. Since the selected model belongs to a finite candidate set, the algorithmic mutual information $I(S;W)$ admits a closed-form discrete-entropy estimator from the empirical posterior $q(W|S)$. Sweeping a data-dependence parameter $\alpha$, we observe that $\widehat I(S;W)$ monotonically tracks the generalization gap, while the Xu-Raginsky bound exhibits the expected looseness. We also compare with a uniform union-bound baseline and introduce an empirical estimator of $I(X;T)$ together with a Blahut-Arimoto procedure for tracing an accuracy-rate curve over the expert bank. The proposed framework provides a practical tool for analyzing resource-aware MoE inference systems and for interpreting $I(X;T)$ and $D(R_g)$ as design proxies for efficient expert routing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

A finite MNIST expert bank lets the authors compute a closed-form estimator of I(S;W) that tracks the generalization gap under discrete selection, but the result stays inside a toy regime far from real MoE routing.

The paper's main contribution is a controlled construction: a small bank of pretrained CNN experts on MNIST plus a discrete, data-dependent selection rule. This makes the algorithmic mutual information I(S;W) computable in closed form from the empirical posterior q(W|S) over the finite set. Sweeping the dependence parameter alpha, they observe that their estimator tracks the generalization gap monotonically and sits tighter than the Xu-Raginsky bound. They also supply an empirical I(X;T) estimator and run Blahut-Arimoto to draw an accuracy-rate curve over the bank. That package is a practical, reproducible way to turn routing analysis into something you can actually calculate on a laptop.

Referee Report

2 major / 2 minor

Summary. The paper models the MoE gating mechanism as a stochastic channel and quantifies routing information via I(X;T). To render algorithmic mutual information I(S;W) tractable, it constructs a finite expert bank from pretrained CNNs on MNIST together with a discrete, data-dependent selection rule controlled by a parameter α. From the resulting empirical posterior q(W|S) over the finite candidate set, it derives a closed-form estimator Î(S;W) and reports that sweeping α produces monotonic tracking of the generalization gap, outperforming the Xu-Raginsky bound; it further introduces an empirical estimator of I(X;T) and a Blahut-Arimoto procedure to trace accuracy-rate curves over the expert bank.

Significance. If the observed monotonic tracking generalizes, the finite-bank construction and closed-form estimator would supply a practical, reproducible tool for using information quantities as design proxies in resource-constrained MoE inference. The explicit comparison against the Xu-Raginsky bound and the introduction of the Blahut-Arimoto accuracy-rate procedure are concrete strengths that could aid interpretability of routing efficiency.

major comments (2)

Abstract and experimental results: the central claim that Î(S;W) 'monotonically tracks the generalization gap' is presented without error bars, statistical tests for monotonicity, or ablations on the pretrained CNN experts and the precise form of the selection rule, leaving the robustness of the tracking result unclear.
Method and experiments: the finite expert bank and discrete data-dependent selection rule are used to obtain a tractable empirical posterior q(W|S); however, the paper does not demonstrate that the observed monotonicity persists under continuous softmax gating or with the scale and data complexity of practical MoE systems (e.g., Switch Transformer-style routing), so the load-bearing empirical observation remains tied to the low-dimensional MNIST regime.

minor comments (2)

The Xu-Raginsky bound is invoked for comparison but its precise statement and reference should be stated explicitly in the text or appendix.
The range and discretization of the data-dependence parameter α, together with the number of independent runs used to generate the accuracy-rate curves, should be reported for reproducibility.
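For reference, the standard textbook statements of the two comparison bounds are sketched below. The notation (gen for the expected generalization gap, L_μ and L_S for population and empirical risk, K for the bank size) is an assumption of this sketch, and this page does not confirm which normalisation the paper itself uses.

```latex
% Xu-Raginsky (2017), ref. [5]: the loss \ell(w, Z) is \sigma-sub-Gaussian
% under the data distribution \mu for every w; S = (Z_1, \dots, Z_n) i.i.d.
\left| \mathrm{gen}(\mu, P_{W|S}) \right|
  \le \sqrt{\frac{2\sigma^2}{n}\, I(S;W)},
\qquad
\mathrm{gen}(\mu, P_{W|S}) = \mathbb{E}\!\left[ L_\mu(W) - L_S(W) \right].

% Finite bank of K experts: I(S;W) \le H(W) \le \log K, hence
\left| \mathrm{gen}(\mu, P_{W|S}) \right|
  \le \sqrt{\frac{2\sigma^2 \log K}{n}}.

% Uniform union bound (Hoeffding), loss in [0,1], K fixed experts:
% with probability at least 1 - \delta over S,
\max_{k \in \{1,\dots,K\}} \left| L_\mu(w_k) - L_S(w_k) \right|
  \le \sqrt{\frac{\log(2K/\delta)}{2n}}.
```

For a loss bounded in [0,1], σ = 1/2 is admissible, so the first bound becomes sqrt(I(S;W) / (2n)).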

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate the revisions planned for the manuscript.

Referee: Abstract and experimental results: the central claim that Î(S;W) 'monotonically tracks the generalization gap' is presented without error bars, statistical tests for monotonicity, or ablations on the pretrained CNN experts and the precise form of the selection rule, leaving the robustness of the tracking result unclear.

Authors: We agree that the presentation of the monotonic tracking result would be strengthened by additional statistical support. In the revised manuscript we will add error bars computed over multiple random seeds for both expert pretraining and the data-dependent selection process. We will also report a formal test for monotonicity (Spearman rank correlation with p-value) between Î(S;W) and the generalization gap across the swept values of α. Finally, we will include ablations that vary the number of pretrained CNN experts in the bank and the precise functional form of the selection rule. These additions will appear in the experimental section and updated figures.

revision: yes

Referee: Method and experiments: the finite expert bank and discrete data-dependent selection rule are used to obtain a tractable empirical posterior q(W|S); however, the paper does not demonstrate that the observed monotonicity persists under continuous softmax gating or with the scale and data complexity of practical MoE systems (e.g., Switch Transformer-style routing), so the load-bearing empirical observation remains tied to the low-dimensional MNIST regime.

Authors: The finite-bank construction with discrete selection is deliberately introduced to obtain a closed-form estimator of I(S;W) from the empirical posterior q(W|S). This tractability is the central methodological contribution. We do not claim that the specific monotonic relationship extends to continuous softmax gating or to large-scale MoE models; the MNIST setting serves as a controlled, reproducible testbed. We will revise the discussion and conclusion to explicitly state the scope of the current experiments and to list extensions to continuous routing and larger architectures as future work.

revision: partial
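The monotonicity test proposed in the first response could be run along the following lines. This is a NumPy-only sketch that computes Spearman's rank correlation and a two-sided permutation p-value; the helper names are hypothetical, and the numbers in the example are placeholders, not results from the paper.

```python
import numpy as np

def average_ranks(x):
    """Ranks starting at 1; tied values share their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    ranks = np.empty(x.size, dtype=float)
    ranks[order] = np.arange(1, x.size + 1)
    for value in np.unique(x):
        tied = x == value
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman_rho(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])

def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Two-sided p-value obtained by shuffling y relative to x."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    observed = abs(spearman_rho(x, y))
    hits = sum(
        abs(spearman_rho(x, rng.permutation(y))) >= observed - 1e-12
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)

# Placeholder values for illustration only; NOT results from the paper.
mi_hat = [0.02, 0.10, 0.31, 0.55, 0.90, 1.40]        # estimate per alpha
gap = [0.001, 0.004, 0.006, 0.011, 0.015, 0.024]     # gap per alpha
rho = spearman_rho(mi_hat, gap)
p_value = permutation_pvalue(mi_hat, gap)
```

If SciPy is available, `scipy.stats.spearmanr` computes the correlation directly; the permutation version here avoids the extra dependency and stays valid for the small number of α values such a sweep typically has.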

standing simulated objections not resolved

Demonstration that the observed monotonicity persists under continuous softmax gating or at the scale of practical MoE systems such as Switch Transformers.

Circularity Check

0 steps flagged

No significant circularity; empirical observation on finite MNIST construction is independent

The paper constructs a finite expert bank from pretrained CNNs on MNIST together with a discrete data-dependent selection rule controlled by parameter α. It then defines the estimator Î(S;W) via the closed-form discrete entropy expression applied to the empirical posterior q(W|S) over that finite candidate set. As α is swept, both Î(S;W) and the separately computed generalization gap vary, and their monotonic tracking is reported as an observation rather than a mathematical identity. The Blahut-Arimoto procedure is the standard rate-distortion algorithm applied to the same bank to trace the accuracy-rate curve; it does not reduce the main claim to its inputs by construction. No self-definitional equations, fitted parameters renamed as predictions, load-bearing self-citations, or imported uniqueness theorems appear. The framework is therefore a self-contained practical tool whose results are tied to the explicit construction without circular reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The framework rests on standard information-theoretic definitions of mutual information and channel capacity plus the modeling choice that the gate can be treated as a discrete stochastic channel whose output is the expert index. The finite candidate set for W is an explicit modeling assumption that enables the closed-form estimator.

free parameters (1)

α
Data-dependence parameter swept to control how strongly expert selection depends on the input; its values determine the observed tracking behavior.
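The precise form of the selection rule is not given on this page. As a purely hypothetical illustration of how a single parameter α can interpolate between data-independent and data-dependent selection, the sketch below mixes a uniform choice over the bank with picking the expert of lowest empirical risk, and estimates the resulting expected generalization gap by Monte Carlo. The rule, the function names, and the independent-binomial risk model are all assumptions, not the paper's construction.

```python
import numpy as np

def selection_posterior(emp_risk, alpha):
    """Hypothetical rule q(W=k | S): with weight alpha pick the expert of
    lowest empirical risk, with weight 1 - alpha pick uniformly."""
    emp_risk = np.asarray(emp_risk, dtype=float)
    k = emp_risk.shape[-1]
    q = np.full(emp_risk.shape, (1.0 - alpha) / k)
    best = np.expand_dims(emp_risk.argmin(axis=-1), -1)
    np.put_along_axis(q, best, (1.0 - alpha) / k + alpha, axis=-1)
    return q

def expected_gap(true_risk, n, alpha, n_datasets=2000, seed=0):
    """Monte Carlo estimate of E[L(W) - L_S(W)] under the rule above."""
    rng = np.random.default_rng(seed)
    true_risk = np.asarray(true_risk, dtype=float)
    # Simplification: each expert's empirical 0/1 risk on an n-sample is
    # drawn independently as Binomial(n, true_risk) / n.
    emp = rng.binomial(n, true_risk, size=(n_datasets, true_risk.size)) / n
    q = selection_posterior(emp, alpha)
    return float(np.mean(np.sum(q * (true_risk - emp), axis=1)))
```

In this toy model the gap is close to zero at α = 0, because selection ignores the sample, and grows with α as the rule increasingly favours experts that happened to look good on S.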

axioms (2)

domain assumption: The gate can be modeled as a stochastic channel with output in a finite expert index set.
Invoked to justify use of I(X;T) and the Blahut-Arimoto procedure.
domain assumption: The selected model belongs to a finite candidate set, so that I(S;W) admits a closed-form discrete-entropy estimator.
Central modeling step that makes the estimator tractable.

pith-pipeline@v0.9.0 · 5536 in / 1541 out tokens · 63310 ms · 2026-05-08T16:33:55.371665+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Sparse In-Network Learning via Shortest-Path Backpropagation and Finite-Rate Gating
cs.IT 2026-05 unverdicted novelty 5.0

D-INL reduces training exchange by 70.4% while keeping accuracy within standard deviation of dense INL, with finite-rate regularization cutting estimated latent rate by 45.7% in a distributed classification experiment.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · cited by 1 Pith paper

[1] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural Computation, vol. 3, no. 1, pp. 79–87, 1991.
[2] M. I. Jordan and R. A. Jacobs, “Hierarchical mixtures of experts and the EM algorithm,” Neural Computation, vol. 6, no. 2, pp. 181–214, 1994.
[3] W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,” Journal of Machine Learning Research, vol. 23, no. 120, pp. 1–39, 2022.
[4] A. Khalesi and M. R. D. Salehi, “Mixture-of-experts under finite-rate gating: Communication–generalization trade-offs,” IEEE Communications Letters, 2026.
[5] A. Xu and M. Raginsky, “Information-theoretic analysis of generalization capability of learning algorithms,” in Advances in Neural Information Processing Systems, vol. 30, 2017.
[6] Y. Bu, S. Zou, and V. V. Veeravalli, “Tightening mutual information-based bounds on generalization error,” IEEE Journal on Selected Areas in Information Theory, vol. 1, no. 1, pp. 121–130, 2020.
[7] Y. Polyanskiy and Y. Wu, Information Theory: From Coding to Learning. Cambridge University Press, 2022, draft manuscript.
[8] O. Shamir, “Fundamental limits of online and distributed learning,” in International Conference on Machine Learning. PMLR, 2014, pp. 1314–1322.
[9] H. Hihn, S. Gottwald, and D. A. Braun, “An information-theoretic online learning principle for specialization in hierarchical decision-making systems,” in 2019 IEEE 58th Conference on Decision and Control (CDC). IEEE, 2019, pp. 3677–3684.
[10] H. Hihn and D. A. Braun, “Hierarchically structured task-agnostic continual learning,” Machine Learning, vol. 112, no. 2, pp. 655–686, 2023.
[11] A. Khalesi and M. R. Deylam Salehi, “Typical solutions of multi-user linearly-decomposable distributed computing,” IEEE Networking Letters, vol. 8, pp. 10–13, 2026.
[12] Z. He, M. R. D. Salehi, D. Malak, and P. A. Stavrou, “Learning-augmented perfectly secure collaborative matrix multiplication,” in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Jun. 2026.
[13] A. Azran and R. Meir, “Data dependent risk bounds for hierarchical mixture of experts classifiers,” in International Conference on Computational Learning Theory. Springer, 2004, pp. 427–441.
[14] W. Akretche, F. LeBlanc, and M. Marchand, “Tighter risk bounds for mixtures of experts,” arXiv preprint arXiv:2410.10397, 2024.
[15] A. Pensia, V. Jog, and P.-L. Loh, “Generalization error bounds for noisy, iterative algorithms,” in Proc. IEEE Int. Symp. Inf. Theory (ISIT), 2018, pp. 546–550.
[16] J. Negrea, M. Haghifam, G. K. Dziugaite, A. Khisti, and D. M. Roy, “Information-theoretic generalization bounds for SGLD via data-dependent estimates,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 32, 2019.
[17] R. E. Blahut, “An hypothesis testing approach to information theory,” Ph.D. dissertation, Cornell University, 1972.
[18] M. I. Belghazi, A. Baratin, S. Rajeshwar, S. Ozair, Y. Bengio, A. Courville, and D. Hjelm, “Mutual information neural estimation,” in International Conference on Machine Learning. PMLR, 2018, pp. 531–540.
[19] J. C. Duchi, M. I. Jordan, and M. J. Wainwright, “Local privacy and statistical minimax rates,” in 2013 IEEE 54th Annual Symposium on Foundations of Computer Science. IEEE, 2013, pp. 429–438.
[20] Y. Cao, W. Yu, W. Ren, and G. Chen, “An overview of recent progress in the study of distributed multi-agent coordination,” IEEE Transactions on Industrial Informatics, vol. 9, no. 1, pp. 427–438, 2013.
[21] Y. Zeng, Q. Wu, and R. Zhang, “Accessing from the sky: A tutorial on UAV communications for 5G and beyond,” Proceedings of the IEEE, vol. 107, no. 12, pp. 2327–2375, 2019.
