arXiv: 2605.29121 · v1 · pith:TIB4MVMRnew · submitted 2026-05-27 · 🧮 math.DS · cs.AI · cs.LG

A Minimal Bifurcation Model of Load Imbalance in a Softmax Mixture-of-Experts Router

Pith reviewed 2026-06-29 09:25 UTC · model grok-4.3

classification 🧮 math.DS · cs.AI · cs.LG
keywords mixture of experts · softmax routing · load imbalance · pitchfork bifurcation · cusp catastrophe · mean-field limit · dynamical systems · bifurcation analysis

The pith

A mean-field model of two-expert softmax routing exhibits a supercritical pitchfork bifurcation to load imbalance above a critical feedback strength.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a minimal dynamical system as the mean-field limit of a reinforcement rule where selected experts gain score increments and all scores decay. In the symmetric case this system undergoes a supercritical pitchfork bifurcation: weak feedback keeps a unique stable balanced load state while stronger feedback produces two stable asymmetric states. Adding external asymmetry unfolds the pitchfork into fold bifurcations that form a cusp in parameter space. Exact parametric equations are derived for the bifurcation set and the cusp normal form. Numerical simulations link the model to observed load imbalance in small MoE models and classification tasks.

Core claim

In the symmetric case the limiting system has a supercritical pitchfork bifurcation: for weak feedback there is a unique stable balanced state, whereas above a critical feedback strength two stable asymmetric states appear. When an external asymmetry is added, the pitchfork unfolds into a pair of fold bifurcations forming a cusp in the control-parameter plane. Exact parametric equations for the bifurcation set and the local normal form of the cusp catastrophe are derived.
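
To illustrate how such a pitchfork can arise, here is a hedged sketch. It assumes (this form is not quoted from the paper) that the mean-field equations read $\dot s_i = a\,p_i - \gamma s_i$ with $p = \mathrm{softmax}(s/T)$, where $a$ is the feedback strength, $\gamma$ the decay rate and $T$ the softmax temperature. Under that assumption the score difference $u = s_1 - s_2$ obeys a closed scalar equation:

```latex
\begin{aligned}
\dot u &= a\,(p_1 - p_2) - \gamma u
        = a \tanh\!\left(\frac{u}{2T}\right) - \gamma u, \\
\dot u &= \left(\frac{a}{2T} - \gamma\right) u
          - \frac{a}{24\,T^{3}}\,u^{3} + O(u^{5}), \\
u_{\pm}^{2} &\approx \frac{12\,T^{2}\,(a - 2\gamma T)}{a}
\qquad \text{for } a \text{ slightly above } 2\gamma T .
\end{aligned}
```

The linear coefficient changes sign at $a = 2\gamma T$, which agrees with the mean-field threshold quoted in the Figure 8 caption, and the negative cubic coefficient is what makes the bifurcation supercritical in this assumed form.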

What carries the argument

The mean-field limit of the discrete reinforcement rule for expert scores. It yields a two-dimensional dynamical system whose equilibria and stability are analyzed with bifurcation theory.

If this is right

  • Below the critical feedback strength the router maintains balanced expert utilization.
  • Above the critical value the system can spontaneously settle into one of two imbalanced states.
  • External input asymmetries replace the pitchfork with a cusp catastrophe, introducing regions of hysteresis between balanced and imbalanced loads.
  • The model provides a low-dimensional explanation for abrupt transitions to load imbalance observed in adaptive MoE routers.
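
The first two points can be checked numerically on a reduced equation. The sketch below assumes the scalar form du/dt = a·tanh((u + h)/(2T)) − γ·u for the score difference u; this form, the function names and the parameter values are illustrative assumptions, not the paper's code. It counts equilibria and classifies their stability on either side of a = 2γT.

```python
import numpy as np

def drift(u, a, gamma=1.0, T=1.0, h=0.0):
    """Assumed reduced mean-field drift for the score difference u."""
    return a * np.tanh((u + h) / (2.0 * T)) - gamma * u

def equilibria(a, gamma=1.0, T=1.0, h=0.0, n=4001):
    """Return [(u_star, is_stable), ...] sorted by u_star.

    Roots are bracketed by sign changes on a grid and refined by bisection.
    Any root satisfies |u| <= a/gamma because |tanh| <= 1.
    """
    bound = a / gamma + abs(h) + 1.0
    grid = np.linspace(-bound, bound, n)
    f = drift(grid, a, gamma, T, h)
    roots = []
    for i in range(n - 1):
        if f[i] == 0.0:                      # grid point is an exact root
            roots.append(float(grid[i]))
        elif f[i] * f[i + 1] < 0.0:          # sign change: bisect
            lo, hi, flo = grid[i], grid[i + 1], f[i]
            for _ in range(60):
                mid = 0.5 * (lo + hi)
                fm = drift(mid, a, gamma, T, h)
                if flo * fm <= 0.0:
                    hi = mid
                else:
                    lo, flo = mid, fm
            roots.append(float(0.5 * (lo + hi)))
    out = []
    for u in roots:
        # Analytic derivative of the drift; negative slope means stable.
        slope = a / (2.0 * T) / np.cosh((u + h) / (2.0 * T)) ** 2 - gamma
        out.append((u, bool(slope < 0.0)))
    return out

if __name__ == "__main__":
    for a in (1.0, 3.0):                     # threshold is a = 2 here
        print(a, equilibria(a))
```

With γ = T = 1 the assumed threshold is a = 2, so a = 1 should give a single stable balanced state and a = 3 a stable pair separated by an unstable balanced state.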

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar reinforcement dynamics might appear in other routing or selection mechanisms beyond MoE, such as in neural network pruning or resource allocation.
  • Testing the predicted cusp shape in larger MoE models could confirm whether the two-expert minimal case captures the dominant instability mechanism.
  • Control strategies that modulate the feedback strength or add explicit balancing terms could be designed to keep the system below the bifurcation threshold.

Load-bearing premise

The discrete reinforcement rule possesses a well-defined mean-field limit whose long-term behavior accurately represents the load dynamics of actual discrete softmax routing in MoE layers.
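
A standard stochastic-approximation decomposition indicates what this premise involves. The sketch assumes the rule has the form below, with step size $\varepsilon$, feedback strength $a$, decay rate $\gamma$, temperature $T$ and unit vectors $e_k$; all of these symbols are assumptions introduced for illustration.

```latex
\begin{aligned}
s_{n+1} &= s_n + \varepsilon\,\bigl(a\,e_{k_n} - \gamma\,s_n\bigr),
\qquad
\Pr(k_n = i \mid s_n) = p_i(s_n)
  = \frac{e^{s_{n,i}/T}}{\sum_j e^{s_{n,j}/T}}, \\
s_{n+1} &= s_n + \varepsilon\,\bigl(F(s_n) + \xi_{n+1}\bigr),
\qquad
F(s) = a\,p(s) - \gamma\,s,
\qquad
\xi_{n+1} = a\,\bigl(e_{k_n} - p(s_n)\bigr), \\
\mathbb{E}\bigl[\xi_{n+1} \mid s_n\bigr] &= 0,
\qquad
\|\xi_{n+1}\| \le \sqrt{2}\,a .
\end{aligned}
```

Since $\xi_{n+1}$ is a bounded martingale difference, one expects the interpolated path at times $t = n\varepsilon$ to follow $\dot s = F(s)$ on finite time intervals as $\varepsilon \to 0$. Whether the long-term attractors also agree, especially near the bifurcation, is the part of the premise that needs separate verification.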

What would settle it

A controlled experiment in a two-expert MoE layer where the feedback strength parameter is gradually increased and a sudden transition from balanced to imbalanced expert loads is observed at the predicted critical value.
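
A toy version of such an experiment can be sketched with a bare score-based router, without any trained experts. The update rule, parameter names and values below are assumptions chosen for illustration; under them the predicted critical value would be a = 2γT.

```python
import numpy as np

def simulate_router(a, gamma=1.0, T=1.0, eps=0.01, steps=20000, seed=0):
    """Simulate an assumed two-expert reinforcement rule.

    Each step samples one expert from softmax(s / T); the scores decay by a
    factor (1 - eps * gamma) and the selected expert gains eps * a.
    Returns the history of the score difference u = s[0] - s[1].
    """
    rng = np.random.default_rng(seed)
    s = np.zeros(2)
    u_hist = np.empty(steps)
    for n in range(steps):
        logits = s / T
        p = np.exp(logits - logits.max())    # numerically stable softmax
        p /= p.sum()
        k = 0 if rng.random() < p[0] else 1  # sampled expert index
        s *= 1.0 - eps * gamma               # regularizing decay
        s[k] += eps * a                      # reinforcement of the winner
        u_hist[n] = s[0] - s[1]
    return u_hist

def final_imbalance(a, **kwargs):
    """Mean |u| over the second half of the run (late-time imbalance)."""
    u = simulate_router(a, **kwargs)
    return float(np.mean(np.abs(u[len(u) // 2:])))

if __name__ == "__main__":
    # With gamma = T = 1 the assumed mean-field threshold is a = 2.
    for a in (1.0, 1.5, 2.5, 3.0, 4.0):
        print(f"a = {a:.1f}  late-time |u| = {final_imbalance(a):.3f}")
```

In this sketch the late-time |u| should stay near the noise floor for a below 2 and settle at a clearly nonzero value above it; close to the threshold, finite run length and noise blur the transition.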

Figures

Figures reproduced from arXiv: 2605.29121 by O. M. Kiselev (Innopolis University, Russia).

Figure 1: A small trainable MoE model with an input-dependent router. In the central region the ...
Figure 2: Pilot PyTorch MoE with hard top-1 routing. Top: empirical hard expert load. Middle: ...
Figure 3: Effect of a load-balancing penalty on a trainable hard top-1 MoE at fixed bias ...
Figure 4: Hard top-1 MoE on digits. Top: absolute test load imbalance. Bottom: test accuracy. Without balancing, external bias increases collapse and degrades performance; a load-balancing penalty keeps the load closer to symmetry.
Figure 5: Scan over the load-balancing strength on ...
Figure 6: Batch token routing and the mean-field limit. The shaded band shows one standard ...
Figure 7: Hysteresis of empirical load in the two-expert batch router. The vertical axis shows ...
Figure 8: Final empirical load imbalance at h = 0 as a function of softmax temperature and regularization. The dashed line is the mean-field threshold a = 2γT.
[Plot labels recovered from the page: analytic threshold T_c = a/(2γ); finite-time onset from final |u| = 0.1; mean-field threshold.]
Figure 9: Test of the temperature threshold. Points show the finite-time onset of load collapse in ...
Figure 10: Hysteresis width as a function of positive feedback strength
Figure 11: Suppression of hysteresis by negative feedback on load. The dashed line is the prediction ...
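
The hysteresis described in the captions of Figures 7 and 10 can be imitated on a reduced equation. The sketch below assumes du/dt = a·tanh((u + h)/(2T)) − γ·u, with the external asymmetry h entering as a bias inside the tanh; this placement of h, the parameter values and all names are assumptions, not the paper's setup. It sweeps h slowly up and then down and compares the two branches.

```python
import math
import numpy as np

A, GAMMA, T = 4.0, 1.0, 1.0   # hypothetical values with A > 2 * GAMMA * T

def relax(u, h, dt=0.05, steps=2000):
    """Let u settle at fixed h using explicit Euler steps."""
    for _ in range(steps):
        u += dt * (A * math.tanh((u + h) / (2.0 * T)) - GAMMA * u)
    return u

def sweep(h_values, u0):
    """Quasi-static sweep: reuse the previous state as the next start."""
    u, branch = u0, []
    for h in h_values:
        u = relax(u, h)
        branch.append(u)
    return np.array(branch)

h_grid = np.linspace(-2.0, 2.0, 81)
u_up = sweep(h_grid, u0=0.0)                        # h increasing
u_down = sweep(h_grid[::-1], u0=u_up[-1])[::-1]     # h decreasing, reordered

# Width of the h-interval on which the two sweeps sit on different branches.
bistable = np.abs(u_down - u_up) > 1.0
hysteresis_width = float(bistable.sum() * (h_grid[1] - h_grid[0]))

# Fold positions of the assumed equation: cosh(w)^2 = A / (2 * GAMMA * T),
# h = +/- T * (sinh(2w) - 2w).
w = math.acosh(math.sqrt(A / (2.0 * GAMMA * T)))
predicted_width = 2.0 * T * (math.sinh(2.0 * w) - 2.0 * w)

if __name__ == "__main__":
    print("numerical width:", hysteresis_width)
    print("predicted width:", predicted_width)
```

The numerical width is limited by the grid spacing in h and by slow passage near the folds, so only rough agreement with the prediction should be expected.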
Original abstract

We propose a minimal dynamical model of adaptive softmax routing for a two-expert Mixture-of-Experts (MoE) layer. The model is obtained as a mean-field limit of a discrete reinforcement rule: the selected expert receives a small score increment, while all scores undergo regularizing decay. In the symmetric case the limiting system has a supercritical pitchfork bifurcation: for weak feedback there is a unique stable balanced state, whereas above a critical feedback strength two stable asymmetric states appear. When an external asymmetry is added, the pitchfork unfolds into a pair of fold bifurcations forming a cusp in the control-parameter plane. We derive exact parametric equations for the bifurcation set and the local normal form of the cusp catastrophe. Numerical experiments connect this picture to empirical expert load, a small trainable MoE model, hard top-1 PyTorch routing, and a small classification experiment on digits. The results provide a controlled low-dimensional mechanism for abrupt transitions to load imbalance in adaptive MoE routers.
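
For orientation, here is what a parametric bifurcation set looks like in one assumed reduced form, $\dot u = f(u) = a \tanh\!\bigl((u+h)/(2T)\bigr) - \gamma u$, where $h$ is an external bias on the logit difference. This form and the symbols are assumptions for illustration; the paper's own equations may differ. Fold points satisfy $f = 0$ and $\partial_u f = 0$, and with the parameter $w = (u+h)/(2T)$ this gives:

```latex
\begin{aligned}
a(w) &= 2\gamma T \cosh^{2} w, \qquad
u(w) = T \sinh 2w, \qquad
h(w) = T\,(2w - \sinh 2w), \\
a - 2\gamma T &\approx 2\gamma T\,w^{2}, \qquad
h \approx -\tfrac{4}{3}\,T\,w^{3}
\quad\Longrightarrow\quad
h^{2} \approx \frac{2\,(a - 2\gamma T)^{3}}{9\,\gamma^{3}\,T}, \\
\dot u &\approx \gamma h + \mu u - \frac{\gamma}{12\,T^{2}}\,u^{3},
\qquad \mu = \frac{a}{2T} - \gamma .
\end{aligned}
```

The two fold branches ($w > 0$ and $w < 0$) meet at the cusp point $(a, h) = (2\gamma T, 0)$, and the last line is the corresponding local cubic form near that point.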

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a minimal dynamical model of adaptive softmax routing in a two-expert MoE layer, obtained as the mean-field limit of a discrete reinforcement rule in which the selected expert receives a small score increment while all scores undergo regularizing decay. In the symmetric case the limiting ODE exhibits a supercritical pitchfork bifurcation separating a unique stable balanced state from a pair of stable asymmetric states above a critical feedback strength. An external asymmetry unfolds the pitchfork into a cusp catastrophe; the authors derive exact parametric equations for the bifurcation set and the local normal form. Numerical experiments connect the bifurcation diagram to empirical expert loads, a trainable MoE model, hard top-1 routing, and a digit-classification task.

Significance. If the mean-field limit is rigorously justified, the work supplies a low-dimensional, analytically tractable mechanism that explains abrupt transitions to load imbalance in adaptive MoE routers. The explicit parametric description of the cusp and its normal form constitutes a concrete analytical contribution that could be used for stability analysis or router design. The multi-scale numerical validation (empirical loads, small MoE, PyTorch routing, classification) is a positive feature.

major comments (2)
  1. [Model derivation / §2] The central claim that the analyzed ODE is the mean-field limit of the discrete reinforcement rule is load-bearing for every subsequent bifurcation statement, yet the manuscript provides no explicit derivation. No scaling regime, stochastic-approximation steps, or convergence estimates (e.g., as increment size → 0) appear in the model-construction section; the abstract simply states that the system “is obtained as” the limit. Without this step the pitchfork and cusp analyses apply only to an unverified continuous proxy.
  2. [Numerical experiments / §4] The claim that the long-term attractors of the discrete process are accurately represented by the ODE attractors is asserted but not verified. No error bounds, numerical convergence tests, or comparison of discrete trajectories to the ODE flow as the increment parameter vanishes are reported, undermining the link between the bifurcation diagram and the “empirical expert load” experiments.
minor comments (2)
  1. [§2] Notation for the score vector and the decay rate is introduced without a compact table of symbols; a short nomenclature table would improve readability.
  2. [Figures 2–4] Figure captions for the bifurcation diagrams should explicitly state the numerical values of the fixed parameters used to generate each panel.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive report. The two major comments correctly identify gaps in the presentation of the mean-field derivation and its numerical validation. We address each point below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Model derivation / §2] The central claim that the analyzed ODE is the mean-field limit of the discrete reinforcement rule is load-bearing for every subsequent bifurcation statement, yet the manuscript provides no explicit derivation. No scaling regime, stochastic-approximation steps, or convergence estimates (e.g., as increment size → 0) appear in the model-construction section; the abstract simply states that the system “is obtained as” the limit. Without this step the pitchfork and cusp analyses apply only to an unverified continuous proxy.

    Authors: We agree that the manuscript does not contain an explicit derivation of the ODE as the mean-field limit. In the revised version we will insert a new subsection in §2 that derives the continuous limit from the discrete reinforcement rule via stochastic approximation. The derivation will specify the scaling regime (increment size ε → 0 with time scaled by 1/ε), state the associated martingale and averaging arguments, and cite standard convergence theorems for such processes. This will make the subsequent bifurcation analysis rest on a rigorously justified ODE.
    Revision: yes

  2. Referee: [Numerical experiments / §4] The claim that the long-term attractors of the discrete process are accurately represented by the ODE attractors is asserted but not verified. No error bounds, numerical convergence tests, or comparison of discrete trajectories to the ODE flow as the increment parameter vanishes are reported, undermining the link between the bifurcation diagram and the “empirical expert load” experiments.

    Authors: We acknowledge that the current numerical section asserts rather than demonstrates convergence. The revision will add a dedicated convergence study in §4: for a sequence of decreasing increment sizes we will plot sample paths of the discrete process against the ODE flow, report the distance between their long-term attractors, and supply quantitative error bounds. These tests will directly corroborate the link between the bifurcation diagram and the reported empirical load statistics.
    Revision: yes
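
A convergence study of the kind promised in the second response could look roughly like the sketch below. It works directly with the score difference u and assumes the discrete step u ← u + ε(±a − γu), with the sign drawn from the two-expert softmax, and the limiting ODE du/dt = a·tanh(u/(2T)) − γ·u. These forms, the parameter values and the function names are illustrative assumptions, not the authors' code.

```python
import math
import numpy as np

A, GAMMA, T = 4.0, 1.0, 1.0   # hypothetical parameters

def drift(u):
    """Assumed mean-field drift for the score difference."""
    return A * math.tanh(u / (2.0 * T)) - GAMMA * u

def ode_path(u0, eps, steps):
    """RK4 reference solution sampled at times n * eps."""
    u, path = u0, [u0]
    for _ in range(steps):
        k1 = drift(u)
        k2 = drift(u + 0.5 * eps * k1)
        k3 = drift(u + 0.5 * eps * k2)
        k4 = drift(u + eps * k3)
        u += eps * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
        path.append(u)
    return np.array(path)

def discrete_path(u0, eps, steps, rng):
    """One sample path of the assumed discrete rule."""
    u, path = u0, [u0]
    for _ in range(steps):
        p1 = 1.0 / (1.0 + math.exp(-u / T))   # softmax prob. of expert 1
        kick = A if rng.random() < p1 else -A
        u += eps * (kick - GAMMA * u)
        path.append(u)
    return np.array(path)

def mean_sup_error(eps, u0=2.0, horizon=10.0, n_seeds=8):
    """Average over seeds of max_t |discrete(t) - ode(t)| on [0, horizon]."""
    steps = int(round(horizon / eps))
    ref = ode_path(u0, eps, steps)
    errs = []
    for seed in range(n_seeds):
        rng = np.random.default_rng(seed)
        errs.append(np.max(np.abs(discrete_path(u0, eps, steps, rng) - ref)))
    return float(np.mean(errs))

if __name__ == "__main__":
    for eps in (0.1, 0.01, 0.001):
        print(f"eps = {eps:<6} mean sup error = {mean_sup_error(eps):.4f}")
```

For a rule of this type the deviation is expected to shrink roughly like the square root of ε on a fixed time horizon; this check covers finite-time tracking only and does not by itself establish agreement of long-term attractors.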

Circularity Check

0 steps flagged

No circularity; mean-field ODE derived from discrete rule before bifurcation analysis

The paper states that the continuous model is obtained as the mean-field limit of an explicit discrete reinforcement rule (an increment for the selected expert plus decay of all scores), then analyzes the resulting ODE for its pitchfork and cusp bifurcations. No parameters are fitted to the bifurcation diagram itself, the central claims do not rest on self-citations, and the numerical experiments on discrete routers serve as separate validation. The derivation chain therefore remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central construction rests on the mean-field limit of the discrete reinforcement rule; no free parameters are explicitly fitted in the abstract, and no new entities are postulated.

axioms (1)
  • Domain assumption: The discrete reinforcement rule for expert scores possesses a well-defined mean-field limit whose equilibria and stability capture the long-term load behavior of the discrete router.
    This limit is invoked to obtain the continuous dynamical system whose bifurcations are analyzed.

pith-pipeline@v0.9.1-grok · 5712 in / 1462 out tokens · 40876 ms · 2026-06-29T09:25:47.646449+00:00 · methodology
