φ-Balancing for Mixture-of-Experts Training
Pith reviewed 2026-05-19 16:11 UTC · model grok-4.3
The pith
Mixture-of-experts models achieve population-level expert balance by minimizing a strictly convex potential of the expected routing distribution.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
φ-balancing directly targets population-level expert balance by minimizing a strictly convex, symmetric, and differentiable potential of the expected routing distribution. Using convex duality, this yields an equivalent min-max formulation solved by a simple online algorithm based on mirror descent, resulting in an efficient EMA-based routing adjustment that consistently improves stability and effectiveness of expert utilization across large-scale pretraining and fine-tuning.
What carries the argument
The φ-potential, a strictly convex symmetric differentiable function of the population expected routing distribution, whose dual produces the min-max problem solved by mirror descent on routing logits with EMA tracking.
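A minimal sketch of one online step may make this machinery concrete. It assumes the update quoted in the page's paper excerpt, $m_{t+1} \leftarrow (1-\eta) m_t + \eta p_t$ followed by $q_{t+1} \leftarrow \nabla\phi(m_{t+1})$, and it assumes a log-cosh potential whose gradient is $\tanh(\beta m)$, as the excerpt's LOG-COSH variant suggests. The function names, the value of `beta`, and the example vectors are hypothetical, and the page does not show how the authors wire this into the router.

```python
import math

def ema_dual_step(m, p, eta, grad_phi):
    """One online step: EMA in the primal, then map to the dual via grad_phi."""
    m_next = [(1.0 - eta) * mi + eta * pi for mi, pi in zip(m, p)]
    q_next = grad_phi(m_next)
    return m_next, q_next

def grad_log_cosh(beta):
    """Gradient of the assumed potential phi(m) = sum_e log(cosh(beta * m_e)) / beta."""
    return lambda m: [math.tanh(beta * mi) for mi in m]

def aux_loss(p, q):
    """Inner product <p, q>; q is treated as a constant when differentiating."""
    return sum(pi * qi for pi, qi in zip(p, q))

m = [0.25, 0.25, 0.25, 0.25]   # EMA estimate of the expected routing distribution
p = [0.70, 0.10, 0.10, 0.10]   # hypothetical batch-mean routing probabilities
m, q = ema_dual_step(m, p, eta=0.1, grad_phi=grad_log_cosh(beta=4.0))
loss = aux_loss(p, q)          # penalizes probability mass on over-used experts
```

Because the dual variable is largest for the expert the EMA considers over-used, the inner product pushes routing probability away from that expert while the EMA itself stays on the probability simplex.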
If this is right
- Expert activation patterns become more stable across training steps because mini-batch noise is replaced by a population-level target.
- Large-scale pretraining and downstream fine-tuning exhibit higher effective parameter usage without extra compute cost.
- The same balancing objective can be applied at every MoE layer independently while preserving differentiability.
- Routing decisions remain compatible with top-k selection and standard loss functions used in existing MoE architectures.
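The first implication above, that a population-level target damps mini-batch noise, can be illustrated with a small simulation. This is a sketch under strong assumptions: the true routing distribution is fixed (stationary), tokens are routed by sampling from it, and the batch size, `eta`, and seed are arbitrary illustrative choices rather than values from the paper.

```python
import random

def batch_routing_fraction(true_probs, batch_size, rng):
    """Empirical fraction of tokens routed to each expert in one mini-batch."""
    experts = range(len(true_probs))
    counts = [0] * len(true_probs)
    for e in rng.choices(experts, weights=true_probs, k=batch_size):
        counts[e] += 1
    return [c / batch_size for c in counts]

def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

rng = random.Random(0)
true_probs = [0.4, 0.3, 0.2, 0.1]   # hypothetical population routing distribution
eta = 0.05
m = list(true_probs)                # start the EMA at the truth (simplification)
batch_est, ema_est = [], []
for _ in range(2000):
    p = batch_routing_fraction(true_probs, batch_size=64, rng=rng)
    m = [(1 - eta) * mi + eta * pi for mi, pi in zip(m, p)]
    batch_est.append(p[0])
    ema_est.append(m[0])

var_batch = variance(batch_est)     # step-to-step noise of the raw batch statistic
var_ema = variance(ema_est)         # noise of the EMA estimate for the same expert
print(var_batch, var_ema)
```

In the stationary case the EMA estimate fluctuates far less than the per-batch fraction. The sketch says nothing about non-stationary routing, where the same smoothing introduces lag.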
Where Pith is reading between the lines
- The convex-potential approach could be applied to non-top-k routing schemes such as soft or learned gating to enforce similar balance guarantees.
- Theoretical analysis of convergence rates for the mirror-descent update might yield explicit bounds on how quickly the routing distribution approaches uniformity.
- Joint optimization of the φ-potential together with the task loss could further reduce any residual tension between balance and model accuracy.
Load-bearing premise
That minimizing the chosen strictly convex potential of the expected routing distribution produces the desired population-level balance and that the EMA-based online approximation via mirror descent faithfully tracks the population objective without introducing new bias.
What would settle it
Measuring the gap between the empirical routing distribution obtained under φ-balancing and the exact population optimum in a synthetic router with known true probabilities, or checking whether utilization variance increases when EMA decay is deliberately mismatched to training dynamics.
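The EMA-mismatch test can be prototyped without any model. The sketch below is a noise-free scalar illustration, not the paper's experiment: one expert's true routing probability drifts linearly (a hypothetical `slope`), and the EMA recursion $m \leftarrow (1-\eta) m + \eta p$ is run for several values of `eta`. For a linear drift the steady-state lag of this recursion is `slope * (1 - eta) / eta`, so slower decay tracks the drift with a larger systematic gap.

```python
def ema_track(targets, eta, m0):
    """Run m <- (1 - eta) * m + eta * p over a scalar sequence and return the path."""
    m, path = m0, []
    for p in targets:
        m = (1.0 - eta) * m + eta * p
        path.append(m)
    return path

# Hypothetical drift: one expert's true routing probability rises linearly.
steps = 2000
slope = 1e-4
targets = [0.1 + slope * t for t in range(steps)]

lags = {}
for eta in (0.5, 0.1, 0.01):
    path = ema_track(targets, eta, m0=targets[0])
    lags[eta] = targets[-1] - path[-1]   # how far the EMA trails the true value
    print(f"eta={eta}: final lag = {lags[eta]:.6f}")
```

A fuller version of the test would add sampling noise and compare the resulting utilization variance across decay settings, which is the quantity the paragraph above proposes to measure.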
Original abstract
Mixture-of-Experts (MoE) models rely on balanced expert utilization to fully realize their scalability. However, existing load-balancing methods are largely heuristic and operate on noisy mini-batch assignment statistics, introducing bias relative to population-level objectives. We propose $\phi$-balancing, a principled framework that directly targets population-level expert balance by minimizing a strictly convex, symmetric, and differentiable potential of the expected routing distribution. Using convex duality, we derive an equivalent min-max formulation and obtain a simple online algorithm via mirror descent, yielding an efficient EMA-based routing adjustment with negligible overhead. Across large-scale pretraining and downstream fine-tuning, $\phi$-balancing consistently outperforms prior Switch-style and loss-free baselines, demonstrating more stable and effective expert utilization.
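For contrast, the kind of mini-batch statistic the abstract criticizes can be sketched with the auxiliary loss usually attributed to the Switch Transformer, $\alpha N \sum_e f_e P_e$ with top-1 routing. This is an illustration of that commonly stated form, not the baseline implementation used in the paper; the function name, the default `alpha`, and the toy batches are assumptions.

```python
def switch_aux_loss(router_probs, alpha=0.01):
    """Switch-style load-balancing loss for one mini-batch with top-1 routing.

    router_probs: list of per-token probability lists, each summing to 1.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    # f[e]: fraction of tokens whose top-1 expert is e (a hard, noisy batch statistic).
    f = [0.0] * num_experts
    for probs in router_probs:
        top1 = max(range(num_experts), key=probs.__getitem__)
        f[top1] += 1.0 / num_tokens
    # P[e]: mean router probability assigned to expert e over the batch.
    P = [sum(probs[e] for probs in router_probs) / num_tokens
         for e in range(num_experts)]
    return alpha * num_experts * sum(fe * pe for fe, pe in zip(f, P))

balanced = switch_aux_loss([[0.9, 0.1], [0.1, 0.9]])    # tokens split across experts
collapsed = switch_aux_loss([[0.9, 0.1], [0.8, 0.2]])   # every token picks expert 0
print(balanced, collapsed)
```

Both `f` and `P` are computed from a single batch, so the loss inherits that batch's sampling noise; this is the dependence that a population-level objective is meant to avoid.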
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes φ-balancing, a framework for load balancing in Mixture-of-Experts models that minimizes a strictly convex, symmetric, and differentiable potential of the expected routing distribution to target population-level balance. Using convex duality, it derives an equivalent min-max problem and obtains a simple online algorithm via mirror descent that yields an EMA-based routing adjustment with negligible overhead. Empirical evaluations on large-scale pretraining and downstream fine-tuning tasks claim consistent outperformance over Switch-style and loss-free baselines in stability and expert utilization.
Significance. If the derivation and results hold, the work offers a principled, optimization-based alternative to heuristic batch-level balancing methods in MoE training. The negligible overhead and focus on population-level objectives could improve scalability and stability in large models. The clean use of convex duality and mirror descent on a new potential is a methodological strength that may generalize beyond the reported settings.
major comments (2)
- [§3.2] §3.2 (derivation of the online algorithm): the claim that the EMA-based mirror descent update faithfully tracks the population objective without new bias lacks a convergence rate, bias bound, or analysis under non-stationary routing distributions. This is load-bearing for the central claim, as pretraining routing statistics change over time and the skeptic's concern about systematic lag is not addressed.
- [§5] §5 (experimental results): the reported outperformance is asserted without accompanying details on the number of experts, routing temperature, dataset scale, or variance across runs in the main tables; this makes it difficult to evaluate whether the gains are robust or merely reflect favorable hyperparameter choices relative to the Switch and loss-free baselines.
minor comments (2)
- [§2] The potential function φ is introduced as strictly convex and symmetric, but its explicit form and differentiability properties should be stated earlier (e.g., in §2) to aid readability before the duality step.
- [§3] Notation for the expected routing distribution and the EMA update could be unified across the derivation and algorithm box to avoid minor confusion between population and batch quantities.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive report. We address each major comment below and have revised the manuscript to incorporate clarifications and additional details where appropriate.
- Referee: [§3.2] §3.2 (derivation of the online algorithm): the claim that the EMA-based mirror descent update faithfully tracks the population objective without new bias lacks a convergence rate, bias bound, or analysis under non-stationary routing distributions. This is load-bearing for the central claim, as pretraining routing statistics change over time and the skeptic's concern about systematic lag is not addressed.
  Authors: We agree that a formal convergence rate or bias bound under non-stationary routing would provide stronger theoretical support. The derivation in §3.2 shows that the mirror-descent update on the dual problem yields an unbiased direction with respect to the population gradient of the chosen potential; the EMA is introduced solely as a variance-reduction mechanism whose bias vanishes as the decay factor approaches 1. We have added a new paragraph in the revised §3.2 that (i) explicitly states the unbiasedness property, (ii) discusses why the symmetric convex potential limits systematic lag compared with batch-level heuristics, and (iii) reports an empirical tracking-error plot over the full pretraining trajectory. A complete non-asymptotic analysis under arbitrary non-stationarity would require mixing-time assumptions on the router that lie outside the paper’s scope; we therefore treat the combination of the exact dual derivation and the long-run empirical evidence as sufficient for the practical claims made.
  revision: partial
- Referee: [§5] §5 (experimental results): the reported outperformance is asserted without accompanying details on the number of experts, routing temperature, dataset scale, or variance across runs in the main tables; this makes it difficult to evaluate whether the gains are robust or merely reflect favorable hyperparameter choices relative to the Switch and loss-free baselines.
  Authors: We accept the criticism and have substantially expanded the experimental section. The revised manuscript now includes, in the caption of Table 1 and in §5.1, the exact number of experts (8/16/32), routing temperature (τ=1.0 unless otherwise noted), pretraining dataset scale (approximately 100B tokens), and standard deviations computed over three independent random seeds. A new supplementary table further reports results across a wider hyperparameter sweep. These additions confirm that the reported gains remain consistent and are not artifacts of a single favorable configuration.
  revision: yes
Circularity Check
No significant circularity; derivation applies standard convex duality and mirror descent to a new potential
The paper introduces φ-balancing by defining a strictly convex symmetric differentiable potential over the expected routing distribution, then applies convex duality to derive an equivalent min-max problem and mirror descent to obtain an EMA-based online update. This chain relies on well-established optimization primitives (convex duality, mirror descent) applied to a freshly proposed potential rather than re-expressing fitted parameters, prior self-citations, or input statistics as outputs. No load-bearing step reduces by construction to its own inputs, and the framework remains self-contained against external benchmarks without invoking author-specific uniqueness theorems or ansatzes smuggled via citation.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: A strictly convex, symmetric, differentiable potential exists whose minimization yields population-level expert balance
- standard math: Convex duality produces an equivalent min-max problem solvable by mirror descent
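The second axiom can be spelled out in one step. The following is a sketch of the standard Fenchel conjugacy argument, assuming $\phi$ is closed and convex so that it equals its biconjugate; the notation $p_\theta(x)$ for the router's output on input $x$ is an assumption, since the page does not reproduce the paper's symbols.

```latex
\phi(m) \;=\; \max_{q}\; \langle q, m\rangle - \phi^{*}(q),
\qquad
\phi^{*}(q) \;=\; \sup_{m}\; \langle q, m\rangle - \phi(m)

\min_{\theta}\; \phi\big(\mathbb{E}_{x}[p_{\theta}(x)]\big)
\;=\;
\min_{\theta}\,\max_{q}\;
\mathbb{E}_{x}\big[\langle q, p_{\theta}(x)\rangle\big] - \phi^{*}(q)
```

For differentiable $\phi$ the inner maximum is attained at $q = \nabla\phi(m)$, which matches the dual update $q_{t+1} \leftarrow \nabla\phi(m_{t+1})$ quoted elsewhere on this page. The inner objective is linear in the expectation, so for a fixed $q$ a mini-batch average estimates it without bias, whereas plugging a mini-batch mean directly into the nonlinear $\phi$ does not.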
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel / dAlembert_to_ODE_general · echoes
  echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: minimizing a strictly convex, symmetric, and differentiable potential $\phi$ applied to the population mean routing distribution... Using convex duality, we derive an equivalent min-max formulation and obtain a simple online algorithm via mirror descent... $m_{t+1} \leftarrow (1-\eta)\, m_t + \eta\, p_t$, $q_{t+1} \leftarrow \nabla\phi(m_{t+1})$
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · costAlphaLog_fourth_deriv_at_zero / alpha_pin_under_high_calibration · refines
  refines: Relation between the paper passage and the cited Recognition theorem.
  Passage: LOG-COSH ($\beta > 0$) ... $L_{\text{aux}} = \sum_e p_{t,e} \cdot \tanh(\beta\, m_{t+1,e})$
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection / RCLCombiner_isCoupling_iff · echoes
  echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: RCLCombiner_isCoupling_iff ... branch_selection (c ≠ 0 forces bilinear branch)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
[48]
• Load is the average (ideal balanced) load across experts
is a metric that quantifies load imbalance in MoE models, defined as MaxVioglobal = maxe Loade − Load Load , where • Load e is the number of tokens assigned to experte. • Load is the average (ideal balanced) load across experts. A lower value indicates more balanced expert utilization, while a higher value reflects severe imbalance. It evaluates global lo...
work page 1970
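The MaxVio definition in this excerpt is simple to compute from per-expert token counts. The helper below is a minimal sketch of that formula; the function name and the example counts are hypothetical.

```python
def max_violation(loads):
    """MaxVio = (max_e Load_e - mean Load) / mean Load for per-expert token counts."""
    mean_load = sum(loads) / len(loads)
    if mean_load == 0:
        raise ValueError("no tokens were routed")
    return (max(loads) - mean_load) / mean_load

even = max_violation([100, 100, 100, 100])     # perfectly balanced experts
skewed = max_violation([250, 50, 50, 50])      # one expert takes most tokens
print(even, skewed)
```

A value of 0 means every expert receives the average load, and the metric grows with the overload of the busiest expert relative to that average.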
- [49] requires that the gradient of its objective in $q$ vanish at $q_{t+1}$, i.e.
  $$\nabla_q F(q_t; p_t) - \frac{1}{\eta}\big(\nabla\phi^{*}(q_{t+1}) - \nabla\phi^{*}(q_t)\big) = 0. \tag{14}$$
  Substituting (13) and $\nabla\phi^{*}(q_t) = m_t$ into (14) and rearranging yields the primal-space update
  $$m_{t+1} = \nabla\phi^{*}(q_{t+1}) \leftarrow m_t + \eta\,(p_t - m_t) = (1-\eta)\, m_t + \eta\, p_t,$$
  which is a convex combination of $m_t$ and $p_t$. Since $D_\phi$ is convex, the iterate $m_{t+1}$ remains in $D_\phi$. M...