
arXiv: 2605.15403 · v1 · pith:OZUPR4BVnew · submitted 2026-05-14 · 💻 cs.LG · math.OC · stat.ML

φ-Balancing for Mixture-of-Experts Training

Pith reviewed 2026-05-19 16:11 UTC · model grok-4.3

classification 💻 cs.LG · math.OC · stat.ML
keywords mixture of experts · load balancing · expert utilization · convex optimization · mirror descent · deep learning · model scaling

The pith

Mixture-of-experts models achieve population-level expert balance by minimizing a strictly convex potential of the expected routing distribution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing load-balancing methods in Mixture-of-Experts models rely on noisy mini-batch assignment counts that bias the system away from true population-level objectives. The paper proposes φ-balancing to minimize a strictly convex, symmetric, and differentiable potential directly on the expected routing distribution instead. Convex duality converts the problem into a min-max saddle-point task, which mirror descent solves through a simple exponential moving average update on the routing logits. This adjustment carries negligible overhead yet produces more stable expert activation than Switch-style or loss-free baselines during both pretraining and downstream fine-tuning. A reader would care because balanced utilization lets the model deploy its full parameter count without leaving experts idle.

Core claim

φ-balancing directly targets population-level expert balance by minimizing a strictly convex, symmetric, and differentiable potential of the expected routing distribution. Using convex duality, this yields an equivalent min-max formulation solved by a simple online algorithm based on mirror descent, resulting in an efficient EMA-based routing adjustment that consistently improves stability and effectiveness of expert utilization across large-scale pretraining and fine-tuning.

What carries the argument

The φ-potential, a strictly convex symmetric differentiable function of the population expected routing distribution, whose dual produces the min-max problem solved by mirror descent on routing logits with EMA tracking.
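
As a rough illustration of that mechanism, the sketch below tracks an EMA of the batch-mean routing distribution and subtracts the gradient of a negative-entropy potential from the router logits. Only the EMA recursion, m ← (1 − η) m + η p, is taken from the derivation excerpt quoted in the reference graph on this page. The function names, the coefficient `alpha`, the toy skewed router, and the way the adjustment enters the logits are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ema_update(m, p_batch, eta):
    # Convex combination of the running estimate and the batch mean.
    return (1.0 - eta) * m + eta * p_batch

def neg_entropy_grad(m, eps=1e-12):
    # Gradient of phi(m) = sum_e m_e log m_e (assumed choice of potential).
    return np.log(m + eps) + 1.0

def adjusted_logits(logits, m, alpha):
    # Hypothetical adjustment: penalize experts whose tracked load is high.
    return logits - alpha * neg_entropy_grad(m)

rng = np.random.default_rng(0)
num_experts = 4
skew = np.array([2.0, 0.0, 0.0, 0.0])  # toy router that favors expert 0

# Reference: mean routing distribution with no adjustment.
baseline = softmax(rng.normal(size=(4096, num_experts)) + skew).mean(axis=0)

m = np.full(num_experts, 1.0 / num_experts)
for _ in range(200):
    logits = rng.normal(size=(256, num_experts)) + skew
    probs = softmax(adjusted_logits(logits, m, alpha=1.0))
    m = ema_update(m, probs.mean(axis=0), eta=0.1)

print("unadjusted:", baseline.round(3))
print("adjusted  :", m.round(3))
```

In this toy setting the tracked distribution stays on the simplex and the favored expert receives a smaller share than without the adjustment. It says nothing about behavior at scale.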

If this is right

  • Expert activation patterns become more stable across training steps because mini-batch noise is replaced by a population-level target.
  • Large-scale pretraining and downstream fine-tuning exhibit higher effective parameter usage without extra compute cost.
  • The same balancing objective can be applied at every MoE layer independently while preserving differentiability.
  • Routing decisions remain compatible with top-k selection and standard loss functions used in existing MoE architectures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The convex-potential approach could be applied to non-top-k routing schemes such as soft or learned gating to enforce similar balance guarantees.
  • Theoretical analysis of convergence rates for the mirror-descent update might yield explicit bounds on how quickly the routing distribution approaches uniformity.
  • Joint optimization of the φ-potential together with the task loss could further reduce any residual tension between balance and model accuracy.

Load-bearing premise

That minimizing the chosen strictly convex potential of the expected routing distribution produces the desired population-level balance and that the EMA-based online approximation via mirror descent faithfully tracks the population objective without introducing new bias.

What would settle it

Measuring the gap between the empirical routing distribution obtained under φ-balancing and the exact population optimum in a synthetic router with known true probabilities, or checking whether utilization variance increases when EMA decay is deliberately mismatched to training dynamics.
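
A minimal version of the first check could look like the sketch below, which assumes a stationary synthetic router with a known assignment distribution and top-1 assignments drawn from it. The function name `ema_tracking_error`, the batch size, and the chosen distribution are hypothetical. A stationary router is a simplification: it does not reproduce the drifting routing statistics of real pretraining.

```python
import numpy as np

def ema_tracking_error(p_true, eta, batch_size=64, steps=2000, seed=0):
    """Mean L1 gap between the EMA estimate and the known distribution."""
    rng = np.random.default_rng(seed)
    m = np.full(p_true.size, 1.0 / p_true.size)  # start from uniform
    errs = []
    for t in range(steps):
        # Noisy mini-batch assignment frequencies.
        p_batch = rng.multinomial(batch_size, p_true) / batch_size
        m = (1.0 - eta) * m + eta * p_batch
        if t >= steps // 2:  # skip the warm-up half
            errs.append(np.abs(m - p_true).sum())
    return float(np.mean(errs))

p_true = np.array([0.4, 0.3, 0.2, 0.1])
err_smooth = ema_tracking_error(p_true, eta=0.05)
err_single = ema_tracking_error(p_true, eta=1.0)  # pure single-batch statistics
print(err_smooth, err_single)
```

With eta = 1.0 the estimate is just the latest batch frequency, so comparing the two errors isolates the variance the EMA removes in the stationary case.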

Figures

Figures reproduced from arXiv: 2605.15403 by Chen Liang, Jonathan Li, Lizhang Chen, Ni Lao, Qiang Liu, Qi Wang, Runlong Liao, Shuozhe Li.

Figure 1: Performance gains on reasoning and code generation benchmarks. We compare the proposed method (Ours) against the ST-MoE baseline on the Moonlight-16B-A3B-Instruct architecture (Liu et al., 2025). The proposed approach outperforms the baseline across all selected tasks, yielding significant gains in mathematical reasoning (Math500), general capability (LiveBench), code synthesis (HumanEval), and logic (BB…
Figure 2: Pretraining scaling studies under controlled per-token compute. We evaluate routing stability and optimization across three orthogonal MoE scaling axes, while keeping the per-token computational cost (FLOPs) approximately constant within each study by adjusting expert size as needed. (Left) Active-parameter scaling: we train models with E = 16 experts and A = 2 active experts per token, varying the number …
Figure 3: Pre-training dynamics and expert utilization. We compare ϕ-balancing (red, solid) against ST-MoE (blue, dashed) over 10k steps. (Left) Validation loss and accuracy show that ϕ-balancing (negative entropy) achieves comparable or superior convergence. (Right) The Gini coefficient and expert loading analysis demonstrate significantly lower routing imbalance for ϕ-balancing. ϕ-balancing maintains tighter bounds b…
Figure 4: Sensitivity analysis of the EMA decay parameter η. Validation loss (red; left) and accuracy (blue; right) are shown as η varies over [0, 1]. We see that the best trade-off is achieved for η ∈ [0.6, 0.7]. Performance becomes unstable at high decay, where load estimates revert to single-batch statistics and exhibit high variance.
Figure 5: Performance comparison of ablation method combinations across three model architectures. The radar charts illustrate the evaluation of DeepSeek-MoE-Chat, DeepSeek-V2-Lite, and Moonlight-16B-A3B-Instruct on seven diverse benchmarks. Radial axes represent the corresponding benchmark scores, identified by the labels and the color-coded outer rim segments. The proposed method (Ours) demonstrates a consistent e…
Figure 6: Domain specialization in routing. Routed-token ratio (fraction of tokens) assigned to each expert (IDs 0–7) for different data domains (Arxiv, Books, C4, Github, Stack, Wiki) at two representative layers (Layer 5 and Layer 11). Compared to ST-MoE, our router exhibits sharper domain-to-expert preferences (stronger specialization), albeit with mildly uneven expert loads.
Figure 7: Hyperparameter sensitivity analysis. Heatmaps displaying validation accuracy (left) and Gini coefficient (right) across varying learning rates (γ ∈ {1e-3, …, 4e-3}) and ϕ-balancing loss coefficients (α ∈ {0.001, …, 0.015}). While accuracy remains robust (peak 0.4136), increasing α drastically reduces the Gini coefficient. For the hyperparameters listed in …
Figure 8: Effect of entropy order α on validation performance. Validation loss and accuracy are reported after 20.5K training steps for Tsallis and Rényi entropy variants across different values of α. Both objectives exhibit a stable accuracy profile near α ≈ 0.9–1.0. Conversely, extreme values of α degrade validation loss and/or accuracy, indicating that moderate entropy orders yield the most robust performance. …
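
Figures 3 and 7 report a Gini coefficient of expert load. The captions do not give the exact formula, so the sketch below uses the standard rank-based definition as an assumption; the function name `gini` is hypothetical. A value of 0 means perfectly even load, and values near 1 mean that a few experts receive almost all tokens.

```python
import numpy as np

def gini(loads):
    """Standard Gini coefficient of a non-negative load vector."""
    x = np.sort(np.asarray(loads, dtype=float))  # ascending order
    n = x.size
    total = x.sum()
    if total == 0:
        return 0.0  # no tokens routed: treat as balanced
    ranks = np.arange(1, n + 1)
    return float(2.0 * (ranks * x).sum() / (n * total) - (n + 1) / n)

print(gini([1, 1, 1, 1]))  # even load
print(gini([0, 0, 0, 1]))  # one expert takes everything
```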
Original abstract

Mixture-of-Experts (MoE) models rely on balanced expert utilization to fully realize their scalability. However, existing load-balancing methods are largely heuristic and operate on noisy mini-batch assignment statistics, introducing bias relative to population-level objectives. We propose $\phi$-balancing, a principled framework that directly targets population-level expert balance by minimizing a strictly convex, symmetric, and differentiable potential of the expected routing distribution. Using convex duality, we derive an equivalent min-max formulation and obtain a simple online algorithm via mirror descent, yielding an efficient EMA-based routing adjustment with negligible overhead. Across large-scale pretraining and downstream fine-tuning, $\phi$-balancing consistently outperforms prior Switch-style and loss-free baselines, demonstrating more stable and effective expert utilization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes φ-balancing, a framework for load balancing in Mixture-of-Experts models that minimizes a strictly convex, symmetric, and differentiable potential of the expected routing distribution to target population-level balance. Using convex duality, it derives an equivalent min-max problem and obtains a simple online algorithm via mirror descent that yields an EMA-based routing adjustment with negligible overhead. Empirical evaluations on large-scale pretraining and downstream fine-tuning tasks claim consistent outperformance over Switch-style and loss-free baselines in stability and expert utilization.

Significance. If the derivation and results hold, the work offers a principled, optimization-based alternative to heuristic batch-level balancing methods in MoE training. The negligible overhead and focus on population-level objectives could improve scalability and stability in large models. The clean use of convex duality and mirror descent on a new potential is a methodological strength that may generalize beyond the reported settings.

major comments (2)
  1. [§3.2] §3.2 (derivation of the online algorithm): the claim that the EMA-based mirror descent update faithfully tracks the population objective without new bias lacks a convergence rate, bias bound, or analysis under non-stationary routing distributions. This is load-bearing for the central claim, as pretraining routing statistics change over time and the skeptic's concern about systematic lag is not addressed.
  2. [§5] §5 (experimental results): the reported outperformance is asserted without accompanying details on the number of experts, routing temperature, dataset scale, or variance across runs in the main tables; this makes it difficult to evaluate whether the gains are robust or merely reflect favorable hyperparameter choices relative to the Switch and loss-free baselines.
minor comments (2)
  1. [§2] The potential function φ is introduced as strictly convex and symmetric, but its explicit form and differentiability properties should be stated earlier (e.g., in §2) to aid readability before the duality step.
  2. [§3] Notation for the expected routing distribution and the EMA update could be unified across the derivation and algorithm box to avoid minor confusion between population and batch quantities.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. We address each major comment below and have revised the manuscript to incorporate clarifications and additional details where appropriate.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (derivation of the online algorithm): the claim that the EMA-based mirror descent update faithfully tracks the population objective without new bias lacks a convergence rate, bias bound, or analysis under non-stationary routing distributions. This is load-bearing for the central claim, as pretraining routing statistics change over time and the skeptic's concern about systematic lag is not addressed.

    Authors: We agree that a formal convergence rate or bias bound under non-stationary routing would provide stronger theoretical support. The derivation in §3.2 shows that the mirror-descent update on the dual problem yields an unbiased direction with respect to the population gradient of the chosen potential; the EMA is introduced solely as a variance-reduction mechanism whose bias vanishes as the decay factor approaches 1. We have added a new paragraph in the revised §3.2 that (i) explicitly states the unbiasedness property, (ii) discusses why the symmetric convex potential limits systematic lag compared with batch-level heuristics, and (iii) reports an empirical tracking-error plot over the full pretraining trajectory. A complete non-asymptotic analysis under arbitrary non-stationarity would require mixing-time assumptions on the router that lie outside the paper’s scope; we therefore treat the combination of the exact dual derivation and the long-run empirical evidence as sufficient for the practical claims made. revision: partial

  2. Referee: [§5] §5 (experimental results): the reported outperformance is asserted without accompanying details on the number of experts, routing temperature, dataset scale, or variance across runs in the main tables; this makes it difficult to evaluate whether the gains are robust or merely reflect favorable hyperparameter choices relative to the Switch and loss-free baselines.

    Authors: We accept the criticism and have substantially expanded the experimental section. The revised manuscript now includes, in the caption of Table 1 and in §5.1, the exact number of experts (8/16/32), routing temperature (τ=1.0 unless otherwise noted), pretraining dataset scale (approximately 100 B tokens), and standard deviations computed over three independent random seeds. A new supplementary table further reports results across a wider hyper-parameter sweep. These additions confirm that the reported gains remain consistent and are not artifacts of a single favorable configuration. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation applies standard convex duality and mirror descent to a new potential

Full rationale

The paper introduces φ-balancing by defining a strictly convex symmetric differentiable potential over the expected routing distribution, then applies convex duality to derive an equivalent min-max problem and mirror descent to obtain an EMA-based online update. This chain relies on well-established optimization primitives (convex duality, mirror descent) applied to a freshly proposed potential rather than re-expressing fitted parameters, prior self-citations, or input statistics as outputs. No load-bearing step reduces by construction to its own inputs, and the framework remains self-contained against external benchmarks without invoking author-specific uniqueness theorems or ansatzes smuggled via citation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on the existence of a suitable strictly convex symmetric differentiable potential whose population minimum corresponds to balanced routing, plus the validity of the EMA approximation for the online setting.

axioms (2)
  • domain assumption: A strictly convex, symmetric, differentiable potential exists whose minimization yields population-level expert balance.
    Central to the framework stated in the abstract.
  • standard math: Convex duality produces an equivalent min-max problem solvable by mirror descent.
    Invoked to obtain the online algorithm.
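
For readers who want the second axiom spelled out, the block below writes the duality step in standard Fenchel-conjugate form. It is a textbook identity under the assumption that φ is closed, proper, and convex; the symbols θ (router parameters) and p(θ) (expected routing distribution) are notation introduced here, and the paper's own statement may differ in detail.

```latex
% Fenchel-Moreau: a closed, proper, convex \phi equals its biconjugate.
\phi(p) = \sup_{q}\; \langle q, p \rangle - \phi^{*}(q),
\qquad
\phi^{*}(q) = \sup_{p}\; \langle q, p \rangle - \phi(p).

% Hence, writing p(\theta) for the expected routing distribution:
\min_{\theta}\; \phi\bigl(p(\theta)\bigr)
  = \min_{\theta}\, \sup_{q}\; \langle q, p(\theta) \rangle - \phi^{*}(q).

% Example (negative entropy restricted to the probability simplex):
\phi(p) = \sum_{e} p_e \log p_e
\quad\Longrightarrow\quad
\phi^{*}(q) = \log \sum_{e} \exp(q_e).
```

When φ is differentiable at p, the supremum over q is attained at q = ∇φ(p), which is why a dual variable can stand in for the gradient of the potential.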




Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 10 internal anchors

  1. [1]

    Lion secretly solves a constrained optimization: As Lyapunov predicts

    Chen, L., Liu, B., Liang, K., and Liu, Q. Lion secretly solves a constrained optimization: As Lyapunov predicts. In The Twelfth International Conference on Learning Representations, ICLR 2024.

  2. [2]

    Evaluating Large Language Models Trained on Code

    Chen, L., Li, J., Liang, K., Su, B., Xie, C., Pierse, N. W., Liang, C., Lao, N., and Liu, Q. Cautious weight decay. In The Fourteenth International Conference on Learning Representations, ICLR 2026, 2026a. Chen, L., Li, J., and Liu, Q. Muon optimizes under spectral norm constraints. Trans. Mach. Learn. Res., 2026, 2026b. Chen, M., Tworek, J., Jun, H., Yuan,...

  3. [3]

    Chen, X., Liang, C., Huang, D., Real, E., Wang, K., Pham, H., Dong, X., Luong, T., Hsieh, C., Lu, Y., and Le, Q. V. Symbolic discovery of optimization algorithms. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023.

  4. [4]

    ERMoE: Eigen-reparameterized mixture-of-experts for stable routing and interpretable specialization

    Cheng, A., Duan, S., Li, S., Yin, C., Cheng, M., Ping, H., Chattopadhyay, T., Thomopoulos, S. I., Nazarian, S., Thompson, P. M., and Bogdan, P. ERMoE: Eigen-reparameterized mixture-of-experts for stable routing and interpretable specialization. CoRR, abs/2511.10971.

  5. [5]

    Training Verifiers to Solve Math Word Problems

    Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. CoRR, abs/2110.14168.

  6. [6]

    StableMoE: Stable routing strategy for mixture of experts

    Dai, D., Dong, L., Ma, S., Zheng, B., Sui, Z., Chang, B., and Wei, F. StableMoE: Stable routing strategy for mixture of experts. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, pp. 7085–7095.

  7. [7]

    DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models

    Dai, D., Deng, C., Zhao, C., Xu, R. X., Gao, H., Chen, D., Li, J., Zeng, W., Yu, X., Wu, Y., Xie, Z., Li, Y. K., Huang, P., Luo, F., Ruan, C., Sui, Z., and Liang, W. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volum...

  8. [8]

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    DeepSeek-AI. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. CoRR, abs/2405.04434, 2024a. DeepSeek-AI. DeepSeek-V3 technical report. CoRR, abs/2412.19437, 2024b. Fedus, W., Zoph, B., and Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. J. Mach. Learn. Res., 23(120):1–39.

  9. [9]

    Learning mixtures of experts with EM: A mirror descent perspective

    Fruytier, Q., Mokhtari, A., and Sanghavi, S. Learning mixtures of experts with EM: A mirror descent perspective. In Forty-second International Conference on Machine Learning, ICML 2025.

  10. [10]

    Advancing expert specialization for better MoE

    Guo, H., Lu, H., Nan, G., Chu, B., Zhuang, J., Yang, Y., Che, W., Leng, S., Cui, Q., and Jiang, X. Advancing expert specialization for better MoE. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2025, NeurIPS 2025.

  11. [11]

    Harder task needs more experts: Dynamic routing in MoE models

    Huang, Q., An, Z., Zhuang, N., Tao, M., Zhang, C., Jin, Y., Xu, K., Chen, L., Huang, S., and Feng, Y. Harder task needs more experts: Dynamic routing in MoE models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, pp. 12883–12895.

  12. [12]

    Gemma 3 Technical Report

    Kamath, A., Ferret, J., Pathak, S., Vieillard, N., Merhej, R., Perrin, S., Matejovicova, T., Ramé, A., Rivière, M., Rouillard, L., Mesnard, T., Cideron, G., Grill, J., Ramos, S., Yvinec, E., Casbon, M., Pot, E., Penchev, I., Liu, G., Visin, F., Kenealy, K., Beyer, L., Zhai, X., Tsitsulin, A., Busa-Fekete, R., Feng, A., Sachdeva, N., Coleman, B., Gao, Y ...

  13. [13]

    Revisit visual prompt tuning: The expressiveness of prompt experts

    Le, M., Nguyen, A., Nguyen, H., Nguyen, C., Tran, A., and Ho, N. Revisit visual prompt tuning: The expressiveness of prompt experts. In The Fourteenth International Conference on Learning Representations, ICLR 2026.

  14. [14]

    GShard: Scaling giant models with conditional computation and automatic sharding

    Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., and Chen, Z. GShard: Scaling giant models with conditional computation and automatic sharding. In 9th International Conference on Learning Representations, ICLR 2021.

  15. [15]

    co/datasets/AI-MO/NuminaMath-CoT

    URL https://huggingface.co/datasets/AI-MO/NuminaMath-CoT. Li, S., Tadiparthi, V., Lee, K., Agarwal, N., Mahjoub, H. N., Moradi-Pari, E., Chen, L., Zhang, A., and Leqi, L. Learning robust reasoning through guided adversarial self-play. CoRR, abs/2602.00173.

  16. [16]

    Memory-efficient LLM training with online subspace descent

    URL https://github.com/google-deepmind/simply. Liang, K., Liu, B., Chen, L., and Liu, Q. Memory-efficient LLM training with online subspace descent. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024.

  17. [17]

    Cautious optimizers: Improving training with one line of code

    Liang, K., Chen, L., Liu, B., and Liu, Q. Cautious optimizers: Improving training with one line of code. In The Fourteenth International Conference on Learning Representations, ICLR 2026.

  18. [18]

    Let’s verify step by step

    Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, ICLR 2024.

  19. [19]

    Communication efficient distributed training with distributed Lion

    Liu, B., Wu, L., Chen, L., Liang, K., Zhu, J., Liang, C., Krishnamoorthi, R., and Liu, Q. Communication efficient distributed training with distributed Lion. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024.

  20. [20]

    Muon is Scalable for LLM Training

    Liu, J., Su, J., Yao, X., Jiang, Z., Lai, G., Du, Y., Qin, Y., Xu, W., Lu, E., Yan, J., Chen, Y., Zheng, H., Liu, Y., Liu, S., Yin, B., He, W., Zhu, H., Wang, Y., Wang, J., Dong, M., Zhang, Z., Kang, Y., Zhang, H., Xu, X., Zhang, Y., Wu, Y., Zhou, X., and Yang, Z. Muon is scalable for LLM training. CoR...

  21. [21]

    Coupling experts and routers in mixture-of-experts via an auxiliary loss

    Lv, A., Ma, J., Ma, Y., and Qiao, S. Coupling experts and routers in mixture-of-experts via an auxiliary loss. In The Fourteenth International Conference on Learning Representations, ICLR 2026.

  22. [22]

    A comprehensive survey of mixture-of-experts: Algorithms, theory, and applications

    Mu, S. and Lin, S. A comprehensive survey of mixture-of-experts: Algorithms, theory, and applications. CoRR, abs/2503.07137.

  23. [23]

    OLMoE: Open mixture-of-experts language models

    Muennighoff, N., Soldaini, L., Groeneveld, D., Lo, K., Morrison, J., Min, S., Shi, W., Walsh, E. P., Tafjord, O., Lambert, N., Gu, Y., Arora, S., Bhagia, A., Schwenk, D., Wadden, D., Wettig, A., Hui, B., Dettmers, T., Kiela, D., Farhadi, A., Smith, N. A., Koh, P. W., Singh, A., and Hajishirzi, H. OLMoE: Open mixture-of-experts language models. In The ...

  24. [24]

    Sigmoid gating is more sample efficient than softmax gating in mixture of experts

    Nguyen, H., Ho, N., and Rinaldo, A. Sigmoid gating is more sample efficient than softmax gating in mixture of experts. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024.

  25. [25]

    Memory-efficient optimization with factorized Hamiltonian descent

    Nguyen, S., Chen, L., Liu, B., and Liu, Q. Memory-efficient optimization with factorized Hamiltonian descent. In International Conference on Artificial Intelligence and Statistics, AISTATS 2025, Proceedings of Machine Learning Research, pp. 2863–2871.

  26. [26]

    Load balancing mixture of experts with similarity preserving routers

    Omi, N., Sen, S., and Farhadi, A. Load balancing mixture of experts with similarity preserving routers. CoRR, abs/2506.14038.

  27. [27]

    gpt-oss-120b & gpt-oss-20b Model Card

    OpenAI. gpt-oss-120b & gpt-oss-20b model card. CoRR, abs/2508.10925.

  28. [28]

    DeMo: Decoupled momentum optimization

    Peng, B., Chen, L., Su, B., Quesnelle, J., Kingma, D. P., and Liu, Q. DeMo: Decoupled momentum optimization. In The Fourteenth International Conference on Learning Representations, ICLR 2026.

  29. [29]

    Qwen3 Technical Report

    Qiu, Z., Huang, Z., Cheng, S., Zhou, Y., Wang, Z., Titov, I., and Fu, J. Layerwise recurrent router for mixture-of-experts. In The Thirteenth International Conference on Learning Representations, ICLR 2025, 2025a. Qiu, Z., Huang, Z., Zheng, B., Wen, K., Wang, Z., Men, R., Titov, I., Liu, D., Zhou, J., and Lin, J. Demons in the detail: On implementing loa...

  30. [30]

    DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale

    Rajbhandari, S., Li, C., Yao, Z., Zhang, M., Aminabadi, R. Y., Awan, A. A., Rasley, J., and He, Y. DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale. In International Conference on Machine Learning, ICML 2022, Proceedings of Machine Learning Research, pp. 18332–18346.

  31. [31]

    SQuAD: 100,000+ questions for machine comprehension of text

    Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, pp. 2383–2392.

  32. [32]

    Scaling vision with sparse mixture of experts

    Riquelme, C., Puigcerver, J., Mustafa, B., Neumann, M., Jenatton, R., Pinto, A. S., Keysers, D., and Houlsby, N. Scaling vision with sparse mixture of experts. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, pp. 8583–8595.

  33. [33]

    GLU Variants Improve Transformer

    Shazeer, N. GLU variants improve transformer. CoRR, abs/2002.05202.

  34. [34]

    Outrageously large neural networks: The sparsely-gated mixture-of-experts layer

    Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q. V., Hinton, G. E., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In 5th International Conference on Learning Representations, ICLR 2017.

  35. [35]

    Recursive deep models for semantic compositionality over a sentiment treebank

    Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A. Y., and Potts, C. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, pp. 1631–1642.

  36. [36]

    Challenging BIG-bench tasks and whether chain-of-thought can solve them

    URL https://openreview.net/forum?id=kW5hSRG5wq. Suzgun, M., Scales, N., Schärli, N., Gehrmann, S., Tay, Y., Chung, H. W., Chowdhery, A., Le, Q. V., Chi, E. H., Zhou, D., and Wei, J. Challenging BIG-bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, volume ACL 2023 of Findings...

  37. [37]

    SAGE: Shape-adapting gated experts for adaptive histopathology image segmentation

    Thai, G. H., Vu, H.-N., Phan, A.-M., Ly, Q.-T., Dinh, T., Nguyen, T.-N.-T., and Ho, N. SAGE: Shape-adapting gated experts for adaptive histopathology image segmentation. CoRR, abs/2511.18493.

  38. [38]

    Towards greater leverage: Scaling laws for efficient mixture-of-experts language models

    Tian, C., Chen, K., Liu, J., Liu, Z., Zhang, Z., and Zhou, J. Towards greater leverage: Scaling laws for efficient mixture-of-experts language models. In The Fourteenth International Conference on Learning Representations, ICLR 2026.

  39. [39]

    Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In 7th International Conference on Learning Representations, ICLR 2019.

  40. [40]

    Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts

    Wang, A., Sun, X., Xie, R., Li, S., Zhu, J., Yang, Z., Zhao, P., Han, W., Kang, Z., Wang, D., Okazaki, N., and Xu, C. HMoE: Heterogeneous mixture of experts for language modeling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, pp. 21943–21957, 2025a. Wang, L., Gao, H., Zhao, C., Sun, X., and Dai, D....

  41. [41]

    ReMoE: Fully differentiable mixture-of-experts with ReLU routing

    Wang, Z., Zhu, J., and Chen, J. ReMoE: Fully differentiable mixture-of-experts with ReLU routing. In The Thirteenth International Conference on Learning Representations, ICLR 2025, 2025b. Warstadt, A., Singh, A., and Bowman, S. R. Neural network acceptability judgments. Trans. Assoc. Comput. Linguistics, 7:625–641.

  42. [42]

    Williams, A., Nangia, N., and Bowman, S. R. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, pp. 1112–1122.

  43. [43]

    TC-MoE: Augmenting mixture of experts with ternary expert choice

    Yan, S., Bin, X., Zhang, S., Wang, Y., and Lin, Z. TC-MoE: Augmenting mixture of experts with ternary expert choice. In The Thirteenth International Conference on Learning Representations, ICLR 2025.

  44. [44]

    Latent prototype routing: Achieving near-perfect load balancing in mixture-of-experts

    Yang, J. Latent prototype routing: Achieving near-perfect load balancing in mixture-of-experts. CoRR, abs/2506.21328.

  45. [45]

    AdaMoE: Token-adaptive routing with null experts for mixture-of-experts language models

    Zeng, Z., Miao, Y., Gao, H., Zhang, H., and Deng, Z. AdaMoE: Token-adaptive routing with null experts for mixture-of-experts language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, Findings of ACL, pp. 6223–6235.

  46. [46]

    FLEX-MoE: Federated mixture-of-experts with load-balanced expert assignment

    Zhang, B., Chen, X., Zhang, S., Zhang, S., Zhou, X., and Sun, M. FLEX-MoE: Federated mixture-of-experts with load-balanced expert assignment. CoRR, abs/2512.23070, 2025a. Zhang, D., Song, J., Bi, Z., Yuan, Y., Wang, T., Yeong, J., and Hao, J. Mixture of experts in large language models. CoRR, abs/2507.11181, 2025b. Zhang, K., Li, B., Zhang, P., Pu, F., Ca...

  47. [47]

    Mixture-of-experts with expert choice routing

    Zhou, Y., Lei, T., Liu, H., Du, N., Huang, Y., Zhao, V. Y., Dai, A. M., Chen, Z., Le, Q. V., and Laudon, J. Mixture-of-experts with expert choice routing. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022.

  48. [48]

    … is a metric that quantifies load imbalance in MoE models, defined as $\mathrm{MaxVio}_{\mathrm{global}} = \max_e \frac{\mathrm{Load}_e - \overline{\mathrm{Load}}}{\overline{\mathrm{Load}}}$, where $\mathrm{Load}_e$ is the number of tokens assigned to expert $e$ and $\overline{\mathrm{Load}}$ is the average (ideal balanced) load across experts. A lower value indicates more balanced expert utilization, while a higher value reflects severe imbalance. It evaluates global lo...

  49. [49]

    requires that the gradient of its objective in $q$ vanish at $q_{t+1}$, i.e. $\nabla_q F(q_t; p_t) - \frac{1}{\eta}\bigl(\nabla\phi^*(q_{t+1}) - \nabla\phi^*(q_t)\bigr) = 0$ (14). Substituting (13) and $\nabla\phi^*(q_t) = m_t$ into (14) and rearranging yields the primal-space update $m_{t+1} = \nabla\phi^*(q_{t+1}) \leftarrow m_t + \eta(p_t - m_t) = (1-\eta)\,m_t + \eta\,p_t$, which is a convex combination of $m_t$ and $p_t$. Since $D_\phi$ is convex, the iterate $m_{t+1}$ remains in $D_\phi$. M...
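
The MaxVio definition excerpted in item 48 is simple enough to compute directly. The sketch below is one possible reading of it, assuming `loads` holds per-expert token counts for a single layer; the function name `max_vio_global` is hypothetical and the paper's own evaluation code may aggregate differently.

```python
import numpy as np

def max_vio_global(loads):
    """(max_e Load_e - mean load) / mean load, per the excerpted definition."""
    loads = np.asarray(loads, dtype=float)
    mean_load = loads.mean()  # the ideal balanced load per expert
    return float((loads.max() - mean_load) / mean_load)

print(max_vio_global([10, 10, 10, 10]))  # perfectly balanced
print(max_vio_global([20, 10, 5, 5]))    # busiest expert carries twice the mean
```

A value of 0 corresponds to perfectly even load, and a value of 1 means the busiest expert receives twice the average number of tokens.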