A Minimal Bifurcation Model of Load Imbalance in a Softmax Mixture-of-Experts Router

O. M. Kiselev (Innopolis University, Innopolis, Russia)

arXiv: 2605.29121 · v1 · pith:TIB4MVMRnew · submitted 2026-05-27 · 🧮 math.DS · cs.AI · cs.LG

Pith reviewed 2026-06-29 09:25 UTC · model grok-4.3

classification 🧮 math.DS · cs.AI · cs.LG

keywords mixture of experts · softmax routing · load imbalance · pitchfork bifurcation · cusp catastrophe · mean-field limit · dynamical systems · bifurcation analysis

The pith

A mean-field model of two-expert softmax routing exhibits a supercritical pitchfork bifurcation to load imbalance above a critical feedback strength.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a minimal dynamical system as the mean-field limit of a reinforcement rule where selected experts gain score increments and all scores decay. In the symmetric case this system undergoes a supercritical pitchfork bifurcation: weak feedback keeps a unique stable balanced load state while stronger feedback produces two stable asymmetric states. Adding external asymmetry unfolds the pitchfork into fold bifurcations that form a cusp in parameter space. Exact parametric equations are derived for the bifurcation set and the cusp normal form. Numerical simulations link the model to observed load imbalance in small MoE models and classification tasks.

Core claim

In the symmetric case the limiting system has a supercritical pitchfork bifurcation: for weak feedback there is a unique stable balanced state, whereas above a critical feedback strength two stable asymmetric states appear. When an external asymmetry is added, the pitchfork unfolds into a pair of fold bifurcations forming a cusp in the control-parameter plane. Exact parametric equations for the bifurcation set and the local normal form of the cusp catastrophe are derived.
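The claim can be made concrete with a short worked calculation. The reduced equation below is an assumption chosen for illustration, since this summary does not reproduce the paper's equations: $u$ is the score difference between the two experts, $a$ the feedback strength, $\gamma$ the decay rate, $T$ the softmax temperature, and $h$ an external bias that is assumed to enter additively. The resulting critical value agrees with the mean-field threshold a = 2γT quoted in the Figure 8 caption, which is the only check available here.

```latex
% Assumed reduced equation for the score difference u = s_1 - s_2 (illustrative only).
\[
\dot u = F(u; a, h) = a \tanh\!\left(\frac{u}{2T}\right) - \gamma u + h .
\]
% Symmetric case h = 0: the balanced state u = 0 loses stability when the slope changes sign.
\[
\left.\frac{\partial F}{\partial u}\right|_{u=0} = \frac{a}{2T} - \gamma
\quad\Longrightarrow\quad
a_c = 2\gamma T .
\]
% Cubic truncation, using tanh x = x - x^3/3 + O(x^5). The negative cubic
% coefficient is what makes the pitchfork supercritical.
\[
\dot u = \frac{a - a_c}{2T}\,u - \frac{a}{24T^{3}}\,u^{3} + h + O(u^{5}),
\qquad
u_{\pm}^{2} \approx \frac{12\,T^{2}\,(a - a_c)}{a}
\quad (h = 0,\ 0 < a - a_c \ll a_c) .
\]
% Fold set: solve F = 0 and dF/du = 0 simultaneously, parametrized by u.
\[
a(u) = 2\gamma T \cosh^{2}\!\left(\frac{u}{2T}\right),
\qquad
h(u) = \gamma\left(u - T \sinh\frac{u}{T}\right) .
\]
% Near the cusp point (a, h) = (a_c, 0) the two fold curves meet in a semicubical parabola.
\[
a - a_c \approx \frac{\gamma u^{2}}{2T},
\qquad
h \approx -\frac{\gamma u^{3}}{6T^{2}},
\qquad
h^{2} \approx \frac{2\,(a - a_c)^{3}}{9\,\gamma\,T} .
\]
```

Under this assumed form, three equilibria coexist inside the wedge bounded by the two fold curves and a single equilibrium exists outside it.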

What carries the argument

The mean-field limit of the discrete reinforcement rule for expert scores, which yields a two-dimensional dynamical system whose equilibria and stability are analyzed with bifurcation theory.
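As a minimal sketch of these two objects, the snippet below simulates one plausible form of the discrete reinforcement rule and evaluates the drift of its mean-field limit. The update `s_i += eps * (a * [i selected] - gamma * s_i)`, the temperature `T`, and every function name are assumptions made for illustration; the summary does not give the paper's exact equations.

```python
import numpy as np


def softmax_two(s, T):
    """Softmax routing probabilities for a score vector s at temperature T."""
    z = (s - s.max()) / T  # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()


def simulate_discrete(a, gamma, T, eps, steps, seed=0):
    """Assumed discrete rule (hypothetical form, not quoted from the paper).

    Each step samples one expert from the softmax of the scores, gives it
    an increment eps * a, and decays both scores by eps * gamma * s.
    """
    rng = np.random.default_rng(seed)
    s = np.zeros(2)
    for _ in range(steps):
        p = softmax_two(s, T)
        k = rng.choice(2, p=p)           # index of the selected expert
        reward = np.zeros(2)
        reward[k] = a
        s = s + eps * (reward - gamma * s)
    return s


def mean_field_rhs(s, a, gamma, T):
    """Drift of the assumed two-dimensional mean-field ODE: ds/dt = a p(s) - gamma s."""
    return a * softmax_two(s, T) - gamma * s
```

In this assumed rule exactly one expert is rewarded per step, so the sum of the two scores evolves deterministically toward `a / gamma`; only the score difference is random.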

If this is right

Below the critical feedback strength the router maintains balanced expert utilization.
Above the critical value the system can spontaneously settle into one of two imbalanced states.
External input asymmetries replace the pitchfork with a cusp catastrophe, introducing regions of hysteresis between balanced and imbalanced loads.
The model provides a low-dimensional explanation for abrupt transitions to load imbalance observed in adaptive MoE routers.
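These consequences can be checked numerically on a stand-in model. The sketch below counts equilibria of an assumed reduced equation `du/dt = a*tanh(u/(2T)) - gamma*u + h` for the score difference `u`; this form, the additive bias `h`, and the helper names are illustrative assumptions rather than the paper's stated system. Hysteresis corresponds to the parameter region where two stable equilibria coexist.

```python
import numpy as np


def drift(u, a, gamma, T, h=0.0):
    """Right-hand side of the assumed reduced equation for the score difference."""
    return a * np.tanh(u / (2.0 * T)) - gamma * u + h


def equilibria(a, gamma, T, h=0.0, n_grid=4000):
    """Locate the zeros of drift() by sign changes on a grid, refined by bisection."""
    # Every equilibrium satisfies |u| <= (a + |h|) / gamma because |tanh| <= 1.
    bound = (a + abs(h)) / gamma + 1.0
    # An even number of symmetric grid points keeps u = 0 off the grid.
    grid = np.linspace(-bound, bound, n_grid)
    vals = drift(grid, a, gamma, T, h)
    roots = []
    for lo, hi, f_lo, f_hi in zip(grid[:-1], grid[1:], vals[:-1], vals[1:]):
        if f_lo * f_hi < 0.0:
            for _ in range(60):
                mid = 0.5 * (lo + hi)
                f_mid = drift(mid, a, gamma, T, h)
                if f_lo * f_mid <= 0.0:
                    hi = mid
                else:
                    lo, f_lo = mid, f_mid
            roots.append(0.5 * (lo + hi))
    return roots


def is_stable(u, a, gamma, T):
    """An equilibrium is stable when the slope of the drift is negative there."""
    slope = a / (2.0 * T) / np.cosh(u / (2.0 * T)) ** 2 - gamma
    return slope < 0.0
```

With `gamma = T = 1`, calling `equilibria(1.5, 1.0, 1.0)` and `equilibria(3.0, 1.0, 1.0)` contrasts the regimes below and above the assumed threshold `a = 2*gamma*T`, and a large bias such as `h = 1.0` removes the coexistence region again.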

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar reinforcement dynamics might appear in other routing or selection mechanisms beyond MoE, such as in neural network pruning or resource allocation.
Testing the predicted cusp shape in larger MoE models could confirm whether the two-expert minimal case captures the dominant instability mechanism.
Control strategies that modulate the feedback strength or add explicit balancing terms could be designed to keep the system below the bifurcation threshold.

Load-bearing premise

The discrete reinforcement rule possesses a well-defined mean-field limit whose long-term behavior accurately represents the load dynamics of actual discrete softmax routing in MoE layers.

What would settle it

A controlled experiment in a two-expert MoE layer where the feedback strength parameter is gradually increased and a sudden transition from balanced to imbalanced expert loads is observed at the predicted critical value.
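A toy version of such a sweep can be sketched on a simulated two-expert router rather than a real MoE layer. The update rule, the parameter names `a`, `gamma`, `T`, `eps`, and both functions are hypothetical stand-ins; with this rule the mean-field threshold would be `a = 2*gamma*T`, matching the value quoted in the Figure 8 caption. The measured quantity is the empirical hard load imbalance over the second half of the run.

```python
import math
import random


def final_load_imbalance(a, gamma=1.0, T=1.0, eps=0.01, steps=20000, seed=0):
    """Run the assumed reinforcement rule and return |load_0 - load_1|.

    The load is the fraction of tokens routed to each expert, measured
    over the second half of the run so that the transient is discarded.
    """
    rng = random.Random(seed)
    s0 = s1 = 0.0
    count0 = 0
    window_start = steps // 2
    for n in range(steps):
        # Two-expert softmax probability of choosing expert 0.
        p0 = 1.0 / (1.0 + math.exp(-(s0 - s1) / T))
        pick0 = rng.random() < p0
        # Selected expert gains eps * a; both scores decay by eps * gamma * s.
        s0 += eps * ((a if pick0 else 0.0) - gamma * s0)
        s1 += eps * ((0.0 if pick0 else a) - gamma * s1)
        if n >= window_start and pick0:
            count0 += 1
    frac0 = count0 / (steps - window_start)
    return abs(2.0 * frac0 - 1.0)


def sweep(a_values, **kwargs):
    """Return (a, imbalance) pairs for a list of feedback strengths."""
    return [(a, final_load_imbalance(a, **kwargs)) for a in a_values]
```

A call such as `sweep([1.0, 1.5, 2.0, 2.5, 3.0, 4.0])` gives a rough onset curve. Because each run is finite and noisy, the observed onset need not coincide exactly with the mean-field threshold, especially close to the critical value where relaxation is slow.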

Figures

Figures reproduced from arXiv: 2605.29121 by O. M. Kiselev (Innopolis University, Russia).

**Figure 2.** Pilot PyTorch MoE with hard top-1 routing. Top: empirical hard expert load. Middle: ...

**Figure 3.** Effect of a load-balancing penalty on a trainable hard top-1 MoE at fixed bias ...

**Figure 4.** Hard top-1 MoE on digits. Top: absolute test load imbalance. Bottom: test accuracy. Without balancing, external bias increases collapse and degrades performance; a load-balancing penalty keeps the load closer to symmetry.

**Figure 5.** Scan over the load-balancing strength on ...

**Figure 6.** Batch token routing and the mean-field limit. The shaded band shows one standard ...

**Figure 7.** Hysteresis of empirical load in the two-expert batch router. The vertical axis shows ...

**Figure 8.** Final empirical load imbalance at h = 0 as a function of softmax temperature and regularization. The dashed line is the mean-field threshold a = 2γT.

[Plot text recovered from the adjacent image: axis labels "analytic threshold T_c = a/(2γ)" and "finite-time onset from final |u| = 0.1"; tick range 1.0 to 2.2; legend values 0.7, 0.9, 1.1, 1.3, 1.5 (legend symbol lost in extraction); reference line "mean-field threshold".]

**Figure 9.** Test of the temperature threshold. Points show the finite-time onset of load collapse in ...

**Figure 10.** Hysteresis width as a function of positive feedback strength ...

**Figure 11.** Suppression of hysteresis by negative feedback on load. The dashed line is the prediction ...

Original abstract

We propose a minimal dynamical model of adaptive softmax routing for a two-expert Mixture-of-Experts (MoE) layer. The model is obtained as a mean-field limit of a discrete reinforcement rule: the selected expert receives a small score increment, while all scores undergo regularizing decay. In the symmetric case the limiting system has a supercritical pitchfork bifurcation: for weak feedback there is a unique stable balanced state, whereas above a critical feedback strength two stable asymmetric states appear. When an external asymmetry is added, the pitchfork unfolds into a pair of fold bifurcations forming a cusp in the control-parameter plane. We derive exact parametric equations for the bifurcation set and the local normal form of the cusp catastrophe. Numerical experiments connect this picture to empirical expert load, a small trainable MoE model, hard top-1 PyTorch routing, and a small classification experiment on digits. The results provide a controlled low-dimensional mechanism for abrupt transitions to load imbalance in adaptive MoE routers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a clean pitchfork-to-cusp story for two-expert MoE load imbalance but asserts the mean-field limit without showing the derivation or convergence.

The core contribution is a two-dimensional ODE obtained from a reinforcement rule on expert scores: the chosen expert gets a small increment while all scores decay. In the symmetric case the ODE undergoes a supercritical pitchfork at a critical feedback strength, so the balanced fixed point loses stability and two asymmetric attractors appear. Adding an external bias unfolds the pitchfork into a cusp with explicit parametric equations for the bifurcation curves and the local normal form. Numerical runs on a small trainable MoE, hard top-1 PyTorch routing, and a digit classifier are used to illustrate the transition.

The work is useful because it supplies an explicit low-dimensional mechanism that can produce abrupt load imbalance, and the bifurcation calculations are standard and reproducible once the ODE is accepted. The connection to actual router behavior is at least attempted through the small-scale experiments.

The main gap is the missing justification for the mean-field step. The abstract states that the ODE “is obtained as” the limit of the discrete rule, yet no scaling, stochastic-approximation argument, or error bound is supplied to show that the long-term statistics of the discrete process converge to the ODE attractors as the increment size goes to zero. Without that link the bifurcation analysis applies only to the continuous proxy. The experiments remain on toy models, so they do not test whether the predicted critical value appears at realistic scale.

This is worth sending to referees for readers interested in dynamical models of routing or training stability. The thinking is coherent and the math is carried through once the ODE is granted, so a serious editor should let it go to review even if the convergence claim needs work.

Referee Report

2 major / 2 minor

Summary. The paper proposes a minimal dynamical model of adaptive softmax routing in a two-expert MoE layer, obtained as the mean-field limit of a discrete reinforcement rule in which the selected expert receives a small score increment while all scores undergo regularizing decay. In the symmetric case the limiting ODE exhibits a supercritical pitchfork bifurcation separating a unique stable balanced state from a pair of stable asymmetric states above a critical feedback strength. An external asymmetry unfolds the pitchfork into a cusp catastrophe; the authors derive exact parametric equations for the bifurcation set and the local normal form. Numerical experiments connect the bifurcation diagram to empirical expert loads, a trainable MoE model, hard top-1 routing, and a digit-classification task.

Significance. If the mean-field limit is rigorously justified, the work supplies a low-dimensional, analytically tractable mechanism that explains abrupt transitions to load imbalance in adaptive MoE routers. The explicit parametric description of the cusp and its normal form constitutes a concrete analytical contribution that could be used for stability analysis or router design. The multi-scale numerical validation (empirical loads, small MoE, PyTorch routing, classification) is a positive feature.

major comments (2)

[Model derivation / §2] The central claim that the analyzed ODE is the mean-field limit of the discrete reinforcement rule is load-bearing for every subsequent bifurcation statement, yet the manuscript provides no explicit derivation. No scaling regime, stochastic-approximation steps, or convergence estimates (e.g., as increment size → 0) appear in the model-construction section; the abstract simply states that the system “is obtained as” the limit. Without this step the pitchfork and cusp analyses apply only to an unverified continuous proxy.
[Numerical experiments / §4] The claim that the long-term attractors of the discrete process are accurately represented by the ODE attractors is asserted but not verified. No error bounds, numerical convergence tests, or comparison of discrete trajectories to the ODE flow as the increment parameter vanishes are reported, undermining the link between the bifurcation diagram and the “empirical expert load” experiments.
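For orientation, the step that both major comments ask for can be sketched heuristically. The discrete rule written below is an assumed form (the symbols $\varepsilon$, $a$, $\gamma$, $T$ and the indicator notation are illustrative, not quoted from the manuscript), and the calculation is a formal averaging argument over a finite time horizon, not a convergence proof.

```latex
% Assumed discrete rule: the selected expert k_n gains eps * a, all scores decay.
\[
s_i^{n+1} = s_i^{n} + \varepsilon\left(a\,\mathbf{1}\{k_n = i\} - \gamma s_i^{n}\right),
\qquad
\Pr(k_n = i \mid s^{n}) = p_i(s^{n}) = \frac{e^{s_i^{n}/T}}{e^{s_1^{n}/T} + e^{s_2^{n}/T}} .
\]
% Split each step into its conditional mean and a bounded martingale difference.
\[
s_i^{n+1} - s_i^{n} = \varepsilon\left(a\,p_i(s^{n}) - \gamma s_i^{n}\right) + \varepsilon\,\xi_i^{n+1},
\qquad
\xi_i^{n+1} = a\left(\mathbf{1}\{k_n = i\} - p_i(s^{n})\right),
\qquad
\mathbb{E}\left[\xi_i^{n+1} \mid s^{n}\right] = 0,
\quad
|\xi_i^{n+1}| \le a .
\]
% With slow time t = n * eps the drift gives the formal mean-field system.
\[
\dot s_i = a\,p_i(s) - \gamma s_i, \qquad i = 1, 2 .
\]
% Martingale differences are uncorrelated, so over a horizon t_max the
% accumulated noise has variance of order eps.
\[
\operatorname{Var}\Bigl(\varepsilon \sum_{m \le t_{\max}/\varepsilon} \xi_i^{m}\Bigr)
\le \varepsilon\, a^{2}\, t_{\max} .
\]
% In sum and difference coordinates the two-dimensional system decouples.
\[
\sigma = s_1 + s_2,\quad u = s_1 - s_2:
\qquad
\dot\sigma = a - \gamma\sigma,
\qquad
\dot u = a \tanh\!\left(\frac{u}{2T}\right) - \gamma u .
\]
```

The bound controls fluctuations only on fixed horizons. Statements about long-term attractors, and about behavior near the bifurcation point where the drift is weak, need the additional arguments the referee requests.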

minor comments (2)

[§2] Notation for the score vector and the decay rate is introduced without a compact table of symbols; a short nomenclature table would improve readability.
[Figures 2–4] Figure captions for the bifurcation diagrams should explicitly state the numerical values of the fixed parameters used to generate each panel.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive report. The two major comments correctly identify gaps in the presentation of the mean-field derivation and its numerical validation. We address each point below and will revise the manuscript accordingly.

Referee: [Model derivation / §2] The central claim that the analyzed ODE is the mean-field limit of the discrete reinforcement rule is load-bearing for every subsequent bifurcation statement, yet the manuscript provides no explicit derivation. No scaling regime, stochastic-approximation steps, or convergence estimates (e.g., as increment size → 0) appear in the model-construction section; the abstract simply states that the system “is obtained as” the limit. Without this step the pitchfork and cusp analyses apply only to an unverified continuous proxy.

Authors: We agree that the manuscript does not contain an explicit derivation of the ODE as the mean-field limit. In the revised version we will insert a new subsection in §2 that derives the continuous limit from the discrete reinforcement rule via stochastic approximation. The derivation will specify the scaling regime (increment size ε → 0 with time scaled by 1/ε), state the associated martingale and averaging arguments, and cite standard convergence theorems for such processes. This will make the subsequent bifurcation analysis rest on a rigorously justified ODE. revision: yes
Referee: [Numerical experiments / §4] The claim that the long-term attractors of the discrete process are accurately represented by the ODE attractors is asserted but not verified. No error bounds, numerical convergence tests, or comparison of discrete trajectories to the ODE flow as the increment parameter vanishes are reported, undermining the link between the bifurcation diagram and the “empirical expert load” experiments.

Authors: We acknowledge that the current numerical section asserts rather than demonstrates convergence. The revision will add a dedicated convergence study in §4: for a sequence of decreasing increment sizes we will plot sample paths of the discrete process against the ODE flow, report the distance between their long-term attractors, and supply quantitative error bounds. These tests will directly corroborate the link between the bifurcation diagram and the reported empirical load statistics. revision: yes
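The kind of convergence study promised here might look like the following sketch. It compares sample paths of an assumed discrete rule, written directly for the score difference `u`, against a Runge-Kutta solution of the assumed ODE `du/dt = a*tanh(u/(2T)) - gamma*u`, for several increment sizes `eps`. The rule, the parameter values, and all function names are illustrative assumptions, and the parameters are chosen below the assumed threshold so that paths relax to the balanced state.

```python
import math
import random


def rhs(u, a, gamma, T):
    """Drift of the assumed mean-field equation for the score difference."""
    return a * math.tanh(u / (2.0 * T)) - gamma * u


def ode_path(u0, a, gamma, T, eps, n_steps):
    """Classical RK4 solution of the ODE sampled on the time grid t = n * eps."""
    path, u = [u0], u0
    for _ in range(n_steps):
        k1 = rhs(u, a, gamma, T)
        k2 = rhs(u + 0.5 * eps * k1, a, gamma, T)
        k3 = rhs(u + 0.5 * eps * k2, a, gamma, T)
        k4 = rhs(u + eps * k3, a, gamma, T)
        u += eps * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
        path.append(u)
    return path


def discrete_path(u0, a, gamma, T, eps, n_steps, seed):
    """Assumed discrete rule: u moves by +eps*a or -eps*a, then decays."""
    rng = random.Random(seed)
    path, u = [u0], u0
    for _ in range(n_steps):
        p0 = 1.0 / (1.0 + math.exp(-u / T))   # probability of picking expert 0
        sign = 1.0 if rng.random() < p0 else -1.0
        u += eps * (a * sign - gamma * u)
        path.append(u)
    return path


def mean_sup_error(eps, a=1.5, gamma=1.0, T=1.0, u0=2.0, horizon=10.0, n_seeds=8):
    """Average over seeds of the maximum gap between discrete path and ODE."""
    n_steps = int(round(horizon / eps))
    ref = ode_path(u0, a, gamma, T, eps, n_steps)
    total = 0.0
    for seed in range(n_seeds):
        path = discrete_path(u0, a, gamma, T, eps, n_steps, seed)
        total += max(abs(x - y) for x, y in zip(path, ref))
    return total / n_seeds
```

Evaluating `mean_sup_error` for a decreasing sequence such as `0.1, 0.01, 0.001` and plotting the result against `eps` on logarithmic axes would show whether the gap shrinks as the increment vanishes. This finite-horizon check does not by itself address attractor-level convergence above the threshold.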

Circularity Check

0 steps flagged

No circularity; mean-field ODE derived from discrete rule before bifurcation analysis

The paper states that the continuous model is obtained as the mean-field limit of an explicit discrete reinforcement rule (selected-expert increment plus decay), then analyzes the resulting ODE for its pitchfork and cusp bifurcations. No parameters are fitted to the bifurcation diagram itself, no self-citations carry the central claims, and numerical experiments on discrete routers serve as separate validation. The derivation chain therefore remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central construction rests on the mean-field limit of the discrete reinforcement rule; no free parameters are explicitly fitted in the abstract, and no new entities are postulated.

axioms (1)

Domain assumption: the discrete reinforcement rule for expert scores possesses a well-defined mean-field limit whose equilibria and stability capture the long-term load behavior of the discrete router.
This limit is invoked to obtain the continuous dynamical system whose bifurcations are analyzed.

pith-pipeline@v0.9.1-grok · 5712 in / 1462 out tokens · 40876 ms · 2026-06-29T09:25:47.646449+00:00 · methodology

