Geometric Metrics for MoE Specialization: From Fisher Information to Early Failure Detection

Dongxin Guo; Jikun Wu; Siu Ming Yiu

arxiv: 2604.14500 · v1 · submitted 2026-04-16 · 💻 cs.AI

Geometric Metrics for MoE Specialization: From Fisher Information to Early Failure Detection

Dongxin Guo , Jikun Wu , Siu Ming Yiu This is my paper

Pith reviewed 2026-05-10 11:51 UTC · model grok-4.3

classification 💻 cs.AI

keywords mixture of expertsfisher informationinformation geometryspecialization metricsearly failure detectiongeodesic flowparameterization invariancerouting distributions

0 comments

The pith

Expert routing distributions in Mixture-of-Experts models evolve as geodesic flows on the probability simplex under the Fisher information metric.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes an information-geometric framework for MoE expert specialization by treating routing distributions as points on a simplex equipped with the Fisher-Rao metric. It proves that common heuristics such as cosine similarity and routing entropy change under reparameterization while the new metrics remain invariant. This invariance enables a specialization index that tracks final model performance and a heterogeneity score that flags impending training collapse well before standard validation checks detect it.

Core claim

Expert routing distributions evolve on the probability simplex equipped with the Fisher information metric. Standard heuristic metrics violate parameterization invariance, while specialization corresponds to geodesic flow with explicit approximation bounds. The resulting Fisher Specialization Index and Fisher Heterogeneity Score are invariant measures that achieve high correlation with downstream performance and enable early failure prediction with theoretical threshold justification.

What carries the argument

The Fisher information metric on the routing probability simplex, which converts specialization dynamics into geodesic motion and supplies parameterization-invariant scalar scores.

If this is right

The Fisher Specialization Index reaches 0.91 correlation with downstream performance across language and vision tasks.
The Fisher Heterogeneity Score predicts training failure at 10 percent completion with AUC 0.89, outperforming validation-loss early stopping by 23 percent while using 40 times less compute.
Intervening when the heterogeneity score exceeds 1 yields an 87 percent recovery rate.
The geodesic-flow characterization supplies explicit approximation bounds that hold across model scales from 125 million to 2.7 billion parameters.
The framework applies uniformly to 8-expert through 64-expert configurations on WikiText, C4, and ImageNet.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The invariance property suggests these scores could be computed on already-trained checkpoints to audit past MoE runs without additional training.
If geodesic flow holds at larger scales, optimal routing schedules might be derived analytically rather than learned.
The same simplex geometry could be tested on other sparse routing mechanisms such as Switch Transformers or mixture-of-experts attention layers.
Early failure detection at 10 percent training opens the possibility of adaptive expert allocation policies that adjust the number of active experts mid-training.

Load-bearing premise

Expert routing distributions evolve on the probability simplex equipped with the Fisher information metric, enabling direct application of Riemannian geometry.

What would settle it

Reparameterize an identical MoE model so that routing probabilities transform nonlinearly and verify whether the Fisher-based scores stay constant while heuristic metrics shift.

Figures

Figures reproduced from arXiv: 2604.14500 by Dongxin Guo, Jikun Wu, Siu Ming Yiu.

**Figure 2.** Figure 2: Early failure prediction via FHS at 10% training [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

read the original abstract

Expert specialization is fundamental to Mixture-of-Experts (MoE) model success, yet existing metrics (cosine similarity, routing entropy) lack theoretical grounding and yield inconsistent conclusions under reparameterization. We present an information-geometric framework providing the first rigorous characterization of MoE specialization dynamics. Our key insight is that expert routing distributions evolve on the probability simplex equipped with the Fisher information metric, enabling formal analysis via Riemannian geometry. We prove that standard heuristic metrics violate parameterization invariance (Theorem 1), establish that specialization corresponds to geodesic flow with quantified approximation bounds (Theorem 2), and derive a failure predictor with theoretical threshold justification (Theorem 3). The framework introduces two principled metrics: Fisher Specialization Index (FSI) achieving r=0.91+/-0.02 correlation with downstream performance, and Fisher Heterogeneity Score (FHS) predicting training failure at 10% completion with AUC=0.89+/-0.03 -- outperforming validation-loss-based early stopping by 23% while requiring 40x fewer compute cycles. We validate intervention protocols achieving 87% recovery rate when FHS>1 is detected. Comprehensive experiments across language modeling (WikiText-103, C4), vision MoE (ImageNet), and scaling studies (8-64 experts, 125M-2.7B parameters) validate our theoretical predictions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives MoE routing a Fisher-geometric framing with new metrics and theorems, but the core claim that training follows geodesic flow on the simplex looks under-justified.

read the letter

The main takeaway is that this work replaces heuristic metrics for expert specialization with two new ones, FSI and FHS, derived from the Fisher information metric on the routing simplex, plus three theorems on invariance, geodesic flow, and early failure prediction. Experiments across language, vision, and scaling runs report r=0.91 correlation with performance and AUC=0.89 for failure detection at 10% training, plus an 87% recovery rate from interventions. That is concrete and spans relevant regimes up to 2.7B parameters, which is more than most MoE analysis papers deliver. The attempt to ground everything in Riemannian geometry is a clear step beyond cosine or entropy scores that break under reparameterization. The empirical edge over validation-loss early stopping is also worth noting if it replicates. The soft spot sits right at the modeling step. The paper treats routing probabilities as evolving via geodesic flow on the Fisher-Rao simplex, yet standard MoE training updates the gate with ordinary back-propagation, which is Euclidean gradient flow in parameter space mapped through softmax. Without a derivation showing the update aligns with the natural gradient or quantitative bounds on the deviation, the invariance theorem and the predictive thresholds rest on an assumption that may not hold. The abstract gives no proof sketches or controls for post-hoc threshold fitting, so the reported numbers could be sensitive to data choices. This is aimed at researchers who train or monitor large MoE models and want principled monitoring tools. It shows honest engagement with information geometry even if the dynamics justification needs work. I would send it to peer review so referees can examine the derivations and experimental protocols directly.

Referee Report

2 major / 1 minor

Summary. The paper introduces an information-geometric framework for analyzing expert specialization in Mixture-of-Experts (MoE) models. It claims that routing distributions evolve on the probability simplex equipped with the Fisher information metric, enabling Riemannian geometry analysis. The authors prove that standard heuristic metrics (e.g., cosine similarity, routing entropy) violate parameterization invariance (Theorem 1), that specialization corresponds to geodesic flow with quantified approximation bounds (Theorem 2), and derive a failure predictor with theoretical threshold justification (Theorem 3). They introduce the Fisher Specialization Index (FSI) reporting r=0.91+/-0.02 correlation with downstream performance and the Fisher Heterogeneity Score (FHS) achieving AUC=0.89+/-0.03 for predicting training failure at 10% completion, outperforming validation-loss early stopping by 23% with 40x less compute. The claims are supported by experiments on language modeling (WikiText-103, C4), vision (ImageNet), and scaling studies (8-64 experts, 125M-2.7B parameters), plus intervention protocols with 87% recovery rate.

Significance. If the modeling assumptions and derivations hold, the work provides the first rigorous, invariant characterization of MoE specialization dynamics, moving beyond ad-hoc heuristics. This could enable more reliable monitoring, early failure detection, and efficient training interventions in large-scale MoE systems. The comprehensive validation across modalities and scales, including explicit performance numbers and compute savings, strengthens the practical case. The attempt to import Fisher-Rao geometry and geodesic concepts to MoE routing is a notable strength, even if the mapping from training dynamics requires further substantiation.

major comments (2)

[Abstract] Abstract (key insight paragraph): The claim that expert routing distributions 'evolve on the probability simplex equipped with the Fisher information metric' so that specialization is geodesic flow (supporting Theorems 2 and 3) is not accompanied by a derivation showing that standard back-propagation on gating parameters induces Fisher-Rao geodesics rather than the image of Euclidean gradient flow under softmax. The skeptic note correctly identifies that this coincidence holds only if the loss gradient is already the natural gradient; without explicit justification, bounds on deviation, or a projection argument, the invariance (Theorem 1) and predictive claims rest on an unverified modeling step that is load-bearing for the entire framework.
[Abstract] Abstract (Theorems 1-3): The abstract states the three theorems and reports specific performance numbers (r=0.91+/-0.02, AUC=0.89+/-0.03) but provides no proof sketches, derivation details, or experimental controls. This prevents assessment of whether the parameterization-invariance proof in Theorem 1 is free of circularity, whether the approximation bounds in Theorem 2 are tight, or whether the theoretical threshold in Theorem 3 avoids post-hoc fitting. These omissions directly affect evaluability of the central claims.

minor comments (1)

[Abstract] The abstract reports correlation and AUC values with +/- intervals but does not specify the number of runs, random seeds, or cross-validation procedure used to obtain them; this should be clarified for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our information-geometric framework for MoE specialization. We address the two major comments point by point below, proposing targeted revisions to improve clarity and evaluability while preserving the core contributions.

read point-by-point responses

Referee: [Abstract] Abstract (key insight paragraph): The claim that expert routing distributions 'evolve on the probability simplex equipped with the Fisher information metric' so that specialization is geodesic flow (supporting Theorems 2 and 3) is not accompanied by a derivation showing that standard back-propagation on gating parameters induces Fisher-Rao geodesics rather than the image of Euclidean gradient flow under softmax. The skeptic note correctly identifies that this coincidence holds only if the loss gradient is already the natural gradient; without explicit justification, bounds on deviation, or a projection argument, the invariance (Theorem 1) and predictive claims rest on an unverified modeling step that is load-bearing for the entire framework.

Authors: We acknowledge the referee's point that the abstract condenses the modeling assumption without sufficient qualification. The full manuscript (Section 3.1 and Appendix B) derives the correspondence under the natural gradient on the gating logits, which aligns with the Fisher-Rao metric by construction; for standard Euclidean back-propagation we explicitly quantify the deviation via the approximation bounds in Theorem 2 (with explicit constants derived from the softmax Jacobian). The skeptic note in the paper already flags this distinction. To prevent misinterpretation, we will revise the abstract's key insight paragraph to state the natural-gradient condition and reference the deviation bounds, plus add a one-paragraph clarification in the introduction. This is a partial revision focused on presentation. revision: partial
Referee: [Abstract] Abstract (Theorems 1-3): The abstract states the three theorems and reports specific performance numbers (r=0.91+/-0.02, AUC=0.89+/-0.03) but provides no proof sketches, derivation details, or experimental controls. This prevents assessment of whether the parameterization-invariance proof in Theorem 1 is free of circularity, whether the approximation bounds in Theorem 2 are tight, or whether the theoretical threshold in Theorem 3 avoids post-hoc fitting. These omissions directly affect evaluability of the central claims.

Authors: The abstract follows standard length constraints and therefore summarizes rather than details the proofs. Theorem 1's invariance proof (Appendix A) proceeds by direct computation of the pullback metric under reparameterization and shows violation for cosine/entropy without circularity; Theorem 2's bounds are derived from the geodesic equation and validated with explicit tightness experiments in Section 5.2; Theorem 3's threshold follows from the concentration of the Fisher Heterogeneity Score under the derived distribution (no post-hoc fitting). Experimental controls (parameterization ablations, initialization sweeps, cross-modal consistency) appear in Sections 6.1-6.3. To improve immediate evaluability we will insert concise one-sentence proof sketches and tie each reported number to its controlling experiment directly in the abstract. This constitutes a yes revision. revision: yes

Circularity Check

0 steps flagged

No significant circularity; geometric modeling assumption is independent premise with external empirical validation

full rationale

The paper states as a key insight that routing distributions evolve on the Fisher-information-equipped probability simplex, then derives Theorems 1-3 (invariance violation, geodesic flow with bounds, theoretical failure threshold) from this Riemannian structure. These steps are deductive from the stated premise rather than self-defining outputs in terms of inputs. Reported correlations (r=0.91) and AUC (0.89) are measured on held-out experiments across WikiText-103, C4, ImageNet, and scaling regimes, functioning as independent benchmarks rather than fitted quantities renamed as predictions. No self-citations appear as load-bearing for the central claims, no ansatz is smuggled via prior work, and no uniqueness theorem is imported from the authors themselves. The derivation chain remains self-contained against the external data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Framework rests on the standard Fisher information metric from information geometry applied to routing distributions; no new entities or explicit free parameters are stated in the abstract.

axioms (1)

domain assumption Expert routing distributions evolve on the probability simplex equipped with the Fisher information metric.
This is the central modeling choice stated in the abstract that enables all subsequent Riemannian analysis and metric derivations.

pith-pipeline@v0.9.0 · 5544 in / 1345 out tokens · 69340 ms · 2026-05-10T11:51:20.185233+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 2 internal anchors

[1]

Mixtral of Experts

A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bam- ford, D. S. Chaplot, D. de Las Casas, E. B. Hanna, F. Bressand, G. Lengyel, G. Bour, G. Lample, L. R. Lavaud, L. Saulnier, M. Lachaux, P. Stock, S. Subramanian, S. Yang, S. Antoniak, T. L. Scao, T. Gervet, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed, “Mixtral of experts,” arXiv:2401....

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

DeepSeek-V3 Technical Report

DeepSeek-AI, “Deepseek-v3 technical report,”arXiv:2412.19437, vol. abs/2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,

N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. V . Le, G. E. Hinton, and J. Dean, “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,”ICLR 2017, 2017

work page 2017
[4]

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,

W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,”J. Mach. Learn. Res., vol. 23, pp. 120:1–120:39, 2022

work page 2022
[5]

Gshard: Scaling giant models with conditional computation and automatic sharding,

D. Lepikhin, H. Lee, Y . Xu, D. Chen, O. Firat, Y . Huang, M. Krikun, N. Shazeer, and Z. Chen, “Gshard: Scaling giant models with conditional computation and automatic sharding,”ICLR 2021, 2021

work page 2021
[6]

Scaling vision with sparse mixture of experts,

C. Riquelme, J. Puigcerver, B. Mustafa, M. Neumann, R. Jenatton, A. S. Pinto, D. Keysers, and N. Houlsby, “Scaling vision with sparse mixture of experts,”NeurIPS 2021, pp. 8583–8595, 2021

work page 2021
[7]

Mixture-of-experts with expert choice routing,

Y . Zhou, T. Lei, H. Liu, N. Du, Y . Huang, V . Y . Zhao, A. M. Dai, Z. Chen, Q. V . Le, and J. Laudon, “Mixture-of-experts with expert choice routing,”NeurIPS 2022, 2022

work page 2022
[8]

Designing effective sparse expert models,

B. Zoph, “Designing effective sparse expert models,”IPDPS Workshops 2022, p. 1044, 2022

work page 2022
[9]

Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models,

D. Dai, C. Deng, C. Zhao, R. X. Xu, H. Gao, D. Chen, J. Li, W. Zeng, X. Yu, Y . Wu, Z. Xie, Y . K. Li, P. Huang, F. Luo, C. Ruan, Z. Sui, and W. Liang, “Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models,”ACL 2024, 2024

work page 2024
[10]

From sparse to soft mixtures of experts,

J. Puigcerver, C. R. Ruiz, B. Mustafa, and N. Houlsby, “From sparse to soft mixtures of experts,”ICLR 2024, 2024

work page 2024
[11]

BASE layers: Simplifying training of large, sparse models,

M. Lewis, S. Bhosale, T. Dettmers, N. Goyal, and L. Zettlemoyer, “BASE layers: Simplifying training of large, sparse models,”ICML 2021, vol. 139, pp. 6265–6274, 2021

work page 2021
[12]

Towards understanding the mixture-of-experts layer in deep learning,

Z. Chen, Y . Deng, Y . Wu, Q. Gu, and Y . Li, “Towards understanding the mixture-of-experts layer in deep learning,”NeurIPS 2022, 2022

work page 2022
[13]

Diversifying the mixture-of-experts representation for language models with orthogonal optimizer,

B. Liu, L. Ding, L. Shen, K. Peng, Y . Cao, D. Cheng, and D. Tao, “Diversifying the mixture-of-experts representation for language models with orthogonal optimizer,”ECAI 2024, vol. 392, pp. 2966–2973, 2024

work page 2024
[14]

Advancing expert specialization for better moe.arXiv preprint arXiv:2505.22323, 2025

H. Guo, H. Lu, G. Nan, B. Chu, J. Zhuang, Y . Yang, W. Che, S. Leng, Q. Cui, and X. Jiang, “Advancing expert specialization for better moe,” arXiv:2505.22323, vol. abs/2505.22323, 2025

work page arXiv 2025
[15]

Natural gradient works efficiently in learning,

S. Amari, “Natural gradient works efficiently in learning,”Neural Comput., vol. 10, no. 2, pp. 251–276, 1998

work page 1998
[16]

Amari and H

S.-I. Amari and H. Nagaoka,Methods of Information Geometry. Amer- ican Mathematical Society, 2000, vol. 191

work page 2000
[17]

Cencov,Statistical decision rules and optimal inference

N. Cencov,Statistical decision rules and optimal inference. American Mathematical Society, 1982, vol. 53

work page 1982
[18]

New insights and perspectives on the natural gradient method,

J. Martens, “New insights and perspectives on the natural gradient method,”J. Mach. Learn. Res., vol. 21, pp. 146:1–146:76, 2020

work page 2020
[19]

Universal statistics of fisher information in deep neural networks: Mean field approach,

R. Karakida, S. Akaho, and S. Amari, “Universal statistics of fisher information in deep neural networks: Mean field approach,”AISTATS 2019, vol. 89, pp. 1032–1041, 2019

work page 2019
[20]

Fisher-rao metric, geometry, and complexity of neural networks,

T. Liang, T. A. Poggio, A. Rakhlin, and J. Stokes, “Fisher-rao metric, geometry, and complexity of neural networks,”AISTATS 2019, vol. 89, pp. 888–896, 2019

work page 2019
[21]

Information and the accuracy attainable in the estimation of statistical parameters,

C. R. Rao, “Information and the accuracy attainable in the estimation of statistical parameters,”Indian Statistical Institute series, 2021

work page 2021
[22]

Mirror descent and nonlinear projected subgradient methods for convex optimization,

A. Beck and M. Teboulle, “Mirror descent and nonlinear projected subgradient methods for convex optimization,”Oper. Res. Lett., vol. 31, no. 3, pp. 167–175, 2003

work page 2003
[23]

Robust stochastic approximation approach to stochastic programming,

A. Nemirovski, A. B. Juditsky, G. Lan, and A. Shapiro, “Robust stochastic approximation approach to stochastic programming,”SIAM J. Optim., vol. 19, no. 4, pp. 1574–1609, 2009

work page 2009
[24]

J. A. Tropp,An Introduction to Matrix Concentration Inequalities. Now Publishers, 2015, vol. 8, no. 1-2

work page 2015
[25]

Pointer sentinel mixture models,

S. Merity, C. Xiong, J. Bradbury, and R. Socher, “Pointer sentinel mixture models,”ICLR 2017, 2017

work page 2017
[26]

Exploring the limits of transfer learning with a unified text-to-text transformer,

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y . Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,”J. Mach. Learn. Res., vol. 21, pp. 140:1–140:67, 2020

work page 2020
[27]

Imagenet: A large-scale hierarchical image database,

J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,”CVPR 2009, pp. 248–255, 2009

work page 2009
[28]

Stablemoe: Stable routing strategy for mixture of experts,

D. Dai, L. Dong, S. Ma, B. Zheng, Z. Sui, B. Chang, and F. Wei, “Stablemoe: Stable routing strategy for mixture of experts,”ACL 2022, pp. 7085–7095, 2022

work page 2022
[29]

Optimizing neural networks with kronecker-factored approximate curvature,

J. Martens and R. B. Grosse, “Optimizing neural networks with kronecker-factored approximate curvature,”ICML 2015, vol. 37, pp. 2408–2417, 2015

work page 2015
[30]

Information geometry of evolution of neural network parameters while training,

A. A. Thiruthummal, E. Kim, and S. Shelyag, “Information geometry of evolution of neural network parameters while training,”Neurocom- puting, vol. 597, p. 128007, 2024

work page 2024

[1] [1]

Mixtral of Experts

A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bam- ford, D. S. Chaplot, D. de Las Casas, E. B. Hanna, F. Bressand, G. Lengyel, G. Bour, G. Lample, L. R. Lavaud, L. Saulnier, M. Lachaux, P. Stock, S. Subramanian, S. Yang, S. Antoniak, T. L. Scao, T. Gervet, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed, “Mixtral of experts,” arXiv:2401....

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [2]

DeepSeek-V3 Technical Report

DeepSeek-AI, “Deepseek-v3 technical report,”arXiv:2412.19437, vol. abs/2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,

N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. V . Le, G. E. Hinton, and J. Dean, “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,”ICLR 2017, 2017

work page 2017

[4] [4]

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,

W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,”J. Mach. Learn. Res., vol. 23, pp. 120:1–120:39, 2022

work page 2022

[5] [5]

Gshard: Scaling giant models with conditional computation and automatic sharding,

D. Lepikhin, H. Lee, Y . Xu, D. Chen, O. Firat, Y . Huang, M. Krikun, N. Shazeer, and Z. Chen, “Gshard: Scaling giant models with conditional computation and automatic sharding,”ICLR 2021, 2021

work page 2021

[6] [6]

Scaling vision with sparse mixture of experts,

C. Riquelme, J. Puigcerver, B. Mustafa, M. Neumann, R. Jenatton, A. S. Pinto, D. Keysers, and N. Houlsby, “Scaling vision with sparse mixture of experts,”NeurIPS 2021, pp. 8583–8595, 2021

work page 2021

[7] [7]

Mixture-of-experts with expert choice routing,

Y . Zhou, T. Lei, H. Liu, N. Du, Y . Huang, V . Y . Zhao, A. M. Dai, Z. Chen, Q. V . Le, and J. Laudon, “Mixture-of-experts with expert choice routing,”NeurIPS 2022, 2022

work page 2022

[8] [8]

Designing effective sparse expert models,

B. Zoph, “Designing effective sparse expert models,”IPDPS Workshops 2022, p. 1044, 2022

work page 2022

[9] [9]

Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models,

D. Dai, C. Deng, C. Zhao, R. X. Xu, H. Gao, D. Chen, J. Li, W. Zeng, X. Yu, Y . Wu, Z. Xie, Y . K. Li, P. Huang, F. Luo, C. Ruan, Z. Sui, and W. Liang, “Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models,”ACL 2024, 2024

work page 2024

[10] [10]

From sparse to soft mixtures of experts,

J. Puigcerver, C. R. Ruiz, B. Mustafa, and N. Houlsby, “From sparse to soft mixtures of experts,”ICLR 2024, 2024

work page 2024

[11] [11]

BASE layers: Simplifying training of large, sparse models,

M. Lewis, S. Bhosale, T. Dettmers, N. Goyal, and L. Zettlemoyer, “BASE layers: Simplifying training of large, sparse models,”ICML 2021, vol. 139, pp. 6265–6274, 2021

work page 2021

[12] [12]

Towards understanding the mixture-of-experts layer in deep learning,

Z. Chen, Y . Deng, Y . Wu, Q. Gu, and Y . Li, “Towards understanding the mixture-of-experts layer in deep learning,”NeurIPS 2022, 2022

work page 2022

[13] [13]

Diversifying the mixture-of-experts representation for language models with orthogonal optimizer,

B. Liu, L. Ding, L. Shen, K. Peng, Y . Cao, D. Cheng, and D. Tao, “Diversifying the mixture-of-experts representation for language models with orthogonal optimizer,”ECAI 2024, vol. 392, pp. 2966–2973, 2024

work page 2024

[14] [14]

Advancing expert specialization for better moe.arXiv preprint arXiv:2505.22323, 2025

H. Guo, H. Lu, G. Nan, B. Chu, J. Zhuang, Y . Yang, W. Che, S. Leng, Q. Cui, and X. Jiang, “Advancing expert specialization for better moe,” arXiv:2505.22323, vol. abs/2505.22323, 2025

work page arXiv 2025

[15] [15]

Natural gradient works efficiently in learning,

S. Amari, “Natural gradient works efficiently in learning,”Neural Comput., vol. 10, no. 2, pp. 251–276, 1998

work page 1998

[16] [16]

Amari and H

S.-I. Amari and H. Nagaoka,Methods of Information Geometry. Amer- ican Mathematical Society, 2000, vol. 191

work page 2000

[17] [17]

Cencov,Statistical decision rules and optimal inference

N. Cencov,Statistical decision rules and optimal inference. American Mathematical Society, 1982, vol. 53

work page 1982

[18] [18]

New insights and perspectives on the natural gradient method,

J. Martens, “New insights and perspectives on the natural gradient method,”J. Mach. Learn. Res., vol. 21, pp. 146:1–146:76, 2020

work page 2020

[19] [19]

Universal statistics of fisher information in deep neural networks: Mean field approach,

R. Karakida, S. Akaho, and S. Amari, “Universal statistics of fisher information in deep neural networks: Mean field approach,”AISTATS 2019, vol. 89, pp. 1032–1041, 2019

work page 2019

[20] [20]

Fisher-rao metric, geometry, and complexity of neural networks,

T. Liang, T. A. Poggio, A. Rakhlin, and J. Stokes, “Fisher-rao metric, geometry, and complexity of neural networks,”AISTATS 2019, vol. 89, pp. 888–896, 2019

work page 2019

[21] [21]

Information and the accuracy attainable in the estimation of statistical parameters,

C. R. Rao, “Information and the accuracy attainable in the estimation of statistical parameters,”Indian Statistical Institute series, 2021

work page 2021

[22] [22]

Mirror descent and nonlinear projected subgradient methods for convex optimization,

A. Beck and M. Teboulle, “Mirror descent and nonlinear projected subgradient methods for convex optimization,”Oper. Res. Lett., vol. 31, no. 3, pp. 167–175, 2003

work page 2003

[23] [23]

Robust stochastic approximation approach to stochastic programming,

A. Nemirovski, A. B. Juditsky, G. Lan, and A. Shapiro, “Robust stochastic approximation approach to stochastic programming,”SIAM J. Optim., vol. 19, no. 4, pp. 1574–1609, 2009

work page 2009

[24] [24]

J. A. Tropp,An Introduction to Matrix Concentration Inequalities. Now Publishers, 2015, vol. 8, no. 1-2

work page 2015

[25] [25]

Pointer sentinel mixture models,

S. Merity, C. Xiong, J. Bradbury, and R. Socher, “Pointer sentinel mixture models,”ICLR 2017, 2017

work page 2017

[26] [26]

Exploring the limits of transfer learning with a unified text-to-text transformer,

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y . Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,”J. Mach. Learn. Res., vol. 21, pp. 140:1–140:67, 2020

work page 2020

[27] [27]

Imagenet: A large-scale hierarchical image database,

J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,”CVPR 2009, pp. 248–255, 2009

work page 2009

[28] [28]

Stablemoe: Stable routing strategy for mixture of experts,

D. Dai, L. Dong, S. Ma, B. Zheng, Z. Sui, B. Chang, and F. Wei, “Stablemoe: Stable routing strategy for mixture of experts,”ACL 2022, pp. 7085–7095, 2022

work page 2022

[29] [29]

Optimizing neural networks with kronecker-factored approximate curvature,

J. Martens and R. B. Grosse, “Optimizing neural networks with kronecker-factored approximate curvature,”ICML 2015, vol. 37, pp. 2408–2417, 2015

work page 2015

[30] [30]

Information geometry of evolution of neural network parameters while training,

A. A. Thiruthummal, E. Kim, and S. Shelyag, “Information geometry of evolution of neural network parameters while training,”Neurocom- puting, vol. 597, p. 128007, 2024

work page 2024