$\phi$-Balancing for Mixture-of-Experts Training

Chen Liang; Jonathan Li; Lizhang Chen; Ni Lao; Qiang Liu; Qi Wang; Runlong Liao; Shuozhe Li

arXiv: 2605.15403 · v1 · pith:OZUPR4BVnew · submitted 2026-05-14 · cs.LG · math.OC · stat.ML

Lizhang Chen, Jonathan Li, Qi Wang, Runlong Liao, Shuozhe Li, Chen Liang, Ni Lao, Qiang Liu

Pith reviewed 2026-05-19 16:11 UTC · model grok-4.3

classification: cs.LG · math.OC · stat.ML

keywords: mixture of experts, load balancing, expert utilization, convex optimization, mirror descent, deep learning, model scaling

The pith

Mixture-of-experts models achieve population-level expert balance by minimizing a strictly convex potential of the expected routing distribution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing load-balancing methods in Mixture-of-Experts models rely on noisy mini-batch assignment counts that bias the system away from true population-level objectives. The paper proposes φ-balancing to minimize a strictly convex, symmetric, and differentiable potential directly on the expected routing distribution instead. Convex duality converts the problem into a min-max saddle-point task, which mirror descent solves through a simple exponential moving average update on the routing logits. This adjustment carries negligible overhead yet produces more stable expert activation than Switch-style or loss-free baselines during both pretraining and downstream fine-tuning. A reader would care because balanced utilization lets the model deploy its full parameter count without leaving experts idle.

Core claim

φ-balancing directly targets population-level expert balance by minimizing a strictly convex, symmetric, and differentiable potential of the expected routing distribution. Using convex duality, this yields an equivalent min-max formulation solved by a simple online algorithm based on mirror descent, resulting in an efficient EMA-based routing adjustment that consistently improves stability and effectiveness of expert utilization across large-scale pretraining and fine-tuning.
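The EMA-based adjustment can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it uses the update $m_{t+1} \leftarrow (1-\eta)\,m_t + \eta\,p_t$, $q_{t+1} \leftarrow \nabla\phi(m_{t+1})$ quoted later on this page, picks negative entropy as the potential (the variant named in the Figure 3 caption), and assumes that the dual variable enters training through a linear auxiliary term $\langle p_t, q_{t+1}\rangle$. The function names, the 4-expert batch, and the value of `eta` are all hypothetical.

```python
import math

def ema_update(m, p_batch, eta):
    """One EMA step: m_{t+1} = (1 - eta) * m_t + eta * p_t."""
    return [(1.0 - eta) * mi + eta * pi for mi, pi in zip(m, p_batch)]

def grad_neg_entropy(m, eps=1e-12):
    """Gradient of phi(m) = sum_e m_e * log(m_e), which is 1 + log(m_e)."""
    return [1.0 + math.log(max(mi, eps)) for mi in m]

def aux_loss(p_batch, q):
    """Linear surrogate <p_t, q_{t+1}>; q is treated as a constant (assumption)."""
    return sum(pi * qi for pi, qi in zip(p_batch, q))

# Hypothetical 4-expert example in which expert 0 is overloaded in this batch.
m = [0.25, 0.25, 0.25, 0.25]        # running estimate of the mean routing distribution
p_batch = [0.55, 0.15, 0.15, 0.15]  # batch-mean router probabilities (made-up numbers)
eta = 0.5                           # EMA weight on the new batch (illustrative value)

m = ema_update(m, p_batch, eta)
q = grad_neg_entropy(m)
loss = aux_loss(p_batch, q)
```

In this sketch the dual variable is largest for the overloaded expert, so the gradient of the linear term with respect to the router probabilities pushes mass away from it.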

What carries the argument

The φ-potential, a strictly convex symmetric differentiable function of the population expected routing distribution, whose dual produces the min-max problem solved by mirror descent on routing logits with EMA tracking.
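A minimal sketch of the duality step, in notation assumed here rather than taken from the paper: write $m_\theta = \mathbb{E}_x[p_\theta(x)]$ for the expected routing distribution under router parameters $\theta$, and $\phi^*$ for the convex conjugate of $\phi$. For a closed convex $\phi$, the Fenchel-Moreau identity gives

$$
\phi(m) = \sup_{q}\,\big\{\langle q, m\rangle - \phi^*(q)\big\},
\qquad
\phi^*(q) = \sup_{m}\,\big\{\langle q, m\rangle - \phi(m)\big\},
$$

so the balancing objective can be rewritten as

$$
\min_\theta\, \phi(m_\theta)
= \min_\theta\, \sup_{q}\,\Big\{\mathbb{E}_x\big[\langle q, p_\theta(x)\rangle\big] - \phi^*(q)\Big\}.
$$

For differentiable $\phi$ the inner supremum is attained at $q = \nabla\phi(m_\theta)$. The point of the rewrite is that, for a fixed $q$, the objective is linear in $p_\theta(x)$, so a mini-batch average is an unbiased estimate of it. Plugging a mini-batch average directly into the nonlinear $\phi$ is biased in general (by Jensen's inequality).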

If this is right

Expert activation patterns become more stable across training steps because mini-batch noise is replaced by a population-level target.
Large-scale pretraining and downstream fine-tuning exhibit higher effective parameter usage without extra compute cost.
The same balancing objective can be applied at every MoE layer independently while preserving differentiability.
Routing decisions remain compatible with top-k selection and standard loss functions used in existing MoE architectures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The convex-potential approach could be applied to non-top-k routing schemes such as soft or learned gating to enforce similar balance guarantees.
Theoretical analysis of convergence rates for the mirror-descent update might yield explicit bounds on how quickly the routing distribution approaches uniformity.
Joint optimization of the φ-potential together with the task loss could further reduce any residual tension between balance and model accuracy.

Load-bearing premise

That minimizing the chosen strictly convex potential of the expected routing distribution produces the desired population-level balance and that the EMA-based online approximation via mirror descent faithfully tracks the population objective without introducing new bias.

What would settle it

Measuring the gap between the empirical routing distribution obtained under φ-balancing and the exact population optimum in a synthetic router with known true probabilities, or checking whether utilization variance increases when EMA decay is deliberately mismatched to training dynamics.
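The first of these checks can be prototyped with a toy simulation. The sketch below is an editorial illustration under strong assumptions: the "router" is a fixed, stationary categorical distribution with made-up probabilities, tokens are routed top-1 by sampling, and the gap is measured as the average L1 distance between the EMA estimate and the known population distribution. It does not model a drifting router, which the second check would require.

```python
import random

def sample_batch_mean(probs, batch_size, rng):
    """Empirical routing frequencies from `batch_size` sampled top-1 assignments."""
    counts = [0] * len(probs)
    for e in rng.choices(range(len(probs)), weights=probs, k=batch_size):
        counts[e] += 1
    return [c / batch_size for c in counts]

def mean_tracking_error(eta, steps=2000, burn_in=200, batch_size=64, seed=0):
    """Average L1 gap between the EMA estimate and the known population distribution."""
    rng = random.Random(seed)
    true_p = [0.30, 0.20, 0.15, 0.10, 0.10, 0.05, 0.05, 0.05]  # synthetic, stationary
    m = [1.0 / len(true_p)] * len(true_p)
    total = 0.0
    for t in range(steps):
        p_t = sample_batch_mean(true_p, batch_size, rng)
        # Same EMA form as the update quoted on this page.
        m = [(1.0 - eta) * mi + eta * pi for mi, pi in zip(m, p_t)]
        if t >= burn_in:
            total += sum(abs(mi - pi) for mi, pi in zip(m, true_p))
    return total / (steps - burn_in)

# eta = 1.0 reduces the EMA to raw single-batch statistics.
errors = {eta: mean_tracking_error(eta) for eta in (0.05, 0.3, 1.0)}
```

With a stationary target, smaller `eta` should give a smaller gap because the EMA averages over more batches. Replacing `true_p` with a time-varying distribution would expose the lag that the premise above worries about.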

Figures

Figures reproduced from arXiv: 2605.15403 by Chen Liang, Jonathan Li, Lizhang Chen, Ni Lao, Qiang Liu, Qi Wang, Runlong Liao, Shuozhe Li.

**Figure 1.** Performance gains on reasoning and code generation benchmarks. We compare the proposed method (Ours) against the ST-MoE baseline on the Moonlight-16B-A3B-Instruct architecture (Liu et al., 2025). The proposed approach outperforms the baseline across all selected tasks, yielding significant gains in mathematical reasoning (Math500), general capability (LiveBench), code synthesis (HumanEval), and logic (BB…

**Figure 2.** Pretraining scaling studies under controlled per-token compute. We evaluate routing stability and optimization across three orthogonal MoE scaling axes, while keeping the per-token computational cost (FLOPs) approximately constant within each study by adjusting expert size as needed. (Left) Active-parameter scaling: we train models with E = 16 experts and A = 2 active experts per token, varying the number …

**Figure 3.** Pre-training dynamics and expert utilization. We compare ϕ-balancing (red, solid) against ST-MoE (blue, dashed) over 10k steps. (Left) Validation Loss and Accuracy show that ϕ-balancing (negative entropy) achieves comparable or superior convergence. (Right) Gini coefficient and Expert Loading Analysis demonstrate significantly lower routing imbalance for ϕ-balancing. ϕ-balancing maintains tighter bounds b…

**Figure 4.** Sensitivity analysis of the EMA decay parameter η. Validation loss (red; left) and accuracy (blue; right) are shown as η varies over [0, 1]. We see that the best trade-off is achieved for η ∈ [0.6, 0.7]. Performance becomes unstable at high decay, where load estimates revert to single-batch statistics and exhibit high variance.

**Figure 5.** Performance comparison of ablation method combinations across three model architectures. The radar charts illustrate the evaluation of DeepSeek-MoE-Chat, DeepSeek-V2-Lite, and Moonlight-16B-A3B-Instruct on seven diverse benchmarks. Radial axes represent the corresponding benchmark scores, identified by the labels and the color-coded outer rim segments. The proposed method (Ours) demonstrates a consistent e…

**Figure 6.** Domain specialization in routing. Routed-token ratio (fraction of tokens) assigned to each expert (IDs 0–7) for different data domains (Arxiv, Books, C4, Github, Stack, Wiki) at two representative layers (Layer 5 and Layer 11). Compared to ST-MoE, our router exhibits sharper domain-to-expert preferences (stronger specialization), albeit with mildly uneven expert loads.

**Figure 7.** Hyperparameter sensitivity analysis. Heatmaps displaying Validation Accuracy (left) and Gini Coefficient (right) across varying Learning Rates (γ ∈ {1e-3, …, 4e-3}) and ϕ-balancing loss coefficient (α ∈ {0.001, …, 0.015}). While accuracy remains robust (peak 0.4136), increasing α drastically reduces the Gini coefficient. For the hyperparameters listed in …

**Figure 8.** Effect of entropy order α on validation performance. Validation loss and accuracy are reported after 20.5K training steps for Tsallis and Rényi entropy variants across different values of α. Both objectives exhibit a stable accuracy profile near α ≈ 0.9–1.0. Conversely, extreme values of α degrade validation loss and/or accuracy, indicating that moderate entropy orders yield the most robust performance. …
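The Gini coefficient used as the imbalance metric in Figures 3 and 7 can be computed as below. This is a sketch using the standard mean-absolute-difference definition; the captions do not state which normalization the paper uses, so treat that choice, the function name, and the example loads as assumptions.

```python
def gini(loads):
    """Gini coefficient of non-negative expert loads (0 means perfectly even)."""
    n = len(loads)
    total = sum(loads)
    if n == 0 or total == 0:
        return 0.0
    # Sum of absolute differences over all ordered pairs of experts.
    abs_diff_sum = sum(abs(a - b) for a in loads for b in loads)
    return abs_diff_sum / (2.0 * n * total)

balanced = gini([100, 100, 100, 100])  # every expert receives the same token count
collapsed = gini([400, 0, 0, 0])       # all tokens routed to a single expert
```

Under this definition a fully collapsed router over n experts scores (n - 1) / n, so values are comparable only between runs with the same expert count.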

Abstract

Mixture-of-Experts (MoE) models rely on balanced expert utilization to fully realize their scalability. However, existing load-balancing methods are largely heuristic and operate on noisy mini-batch assignment statistics, introducing bias relative to population-level objectives. We propose $\phi$-balancing, a principled framework that directly targets population-level expert balance by minimizing a strictly convex, symmetric, and differentiable potential of the expected routing distribution. Using convex duality, we derive an equivalent min-max formulation and obtain a simple online algorithm via mirror descent, yielding an efficient EMA-based routing adjustment with negligible overhead. Across large-scale pretraining and downstream fine-tuning, $\phi$-balancing consistently outperforms prior Switch-style and loss-free baselines, demonstrating more stable and effective expert utilization.
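For contrast, the Switch-style auxiliary loss that the abstract describes as operating on mini-batch assignment statistics is usually written as $\alpha \cdot N \cdot \sum_i f_i P_i$ (Fedus et al.), where $f_i$ is the fraction of tokens in the batch dispatched to expert $i$ and $P_i$ is the mean router probability for expert $i$. The sketch below follows that common form; the exact baseline configuration used in the paper is not given on this page, and the default `alpha` and the two toy batches are illustrative.

```python
def switch_aux_loss(router_probs, alpha=0.01):
    """Switch-style load-balancing loss: alpha * N * sum_i f_i * P_i on one mini-batch.

    router_probs: list of per-token probability vectors (each sums to 1).
    f_i: fraction of tokens whose top-1 expert is i (hard counts from this batch).
    P_i: mean router probability assigned to expert i over this batch.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    f = [0.0] * num_experts
    P = [0.0] * num_experts
    for probs in router_probs:
        top1 = max(range(num_experts), key=lambda i: probs[i])
        f[top1] += 1.0 / num_tokens
        for i, p in enumerate(probs):
            P[i] += p / num_tokens
    return alpha * num_experts * sum(fi * Pi for fi, Pi in zip(f, P))

even = switch_aux_loss([[0.9, 0.1], [0.1, 0.9]])    # tokens split across both experts
skewed = switch_aux_loss([[0.9, 0.1], [0.8, 0.2]])  # both tokens pick expert 0
```

Both `f` and `P` are recomputed from whatever tokens happen to be in the current batch, which is the dependence on mini-batch counts that the abstract contrasts with a population-level objective.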

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

φ-balancing gives a convex potential plus duality route to population-level MoE balance with an EMA update, but the non-stationary approximation lacks visible bounds.

Hi, the main point on this paper is that it replaces batch-heuristic balancing with a strictly convex symmetric potential on the expected routing distribution, then uses duality to get a min-max problem solved by mirror descent into a cheap EMA adjustment. That setup is distinct from the Switch and loss-free baselines they cite and targets the population objective directly rather than noisy mini-batch stats. They report better stability and utilization on large pretraining and downstream fine-tuning runs, which is the practical test that matters here. The derivation itself uses standard convex tools, so the math looks reproducible if the potential is written out clearly. The experiments apparently run at scale with negligible overhead, which is a plus if the numbers hold.

The soft spot is the online approximation. The EMA is supposed to track the population expectation without new bias, but pretraining routing is non-stationary, and there is no convergence rate or bias bound shown in the abstract. If the distribution shifts fast, the lag could weaken the claimed advantage over simpler heuristics; that needs checking in the full derivation and any supporting analysis. The abstract also omits the actual equations and dataset details, so the outperformance numbers are hard to evaluate without the full results section.

This is for people training large MoE models who want a more principled balancing objective. A reader working on scalability or routing would get something concrete to test or extend. The core idea is clear and separate from prior work, so it deserves a serious referee rather than a desk reject. I would send it for review but ask specifically for the approximation analysis and the full experimental protocol.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes φ-balancing, a framework for load balancing in Mixture-of-Experts models that minimizes a strictly convex, symmetric, and differentiable potential of the expected routing distribution to target population-level balance. Using convex duality, it derives an equivalent min-max problem and obtains a simple online algorithm via mirror descent that yields an EMA-based routing adjustment with negligible overhead. Empirical evaluations on large-scale pretraining and downstream fine-tuning tasks claim consistent outperformance over Switch-style and loss-free baselines in stability and expert utilization.

Significance. If the derivation and results hold, the work offers a principled, optimization-based alternative to heuristic batch-level balancing methods in MoE training. The negligible overhead and focus on population-level objectives could improve scalability and stability in large models. The clean use of convex duality and mirror descent on a new potential is a methodological strength that may generalize beyond the reported settings.

major comments (2)

[§3.2] (derivation of the online algorithm): the claim that the EMA-based mirror descent update faithfully tracks the population objective without new bias lacks a convergence rate, bias bound, or analysis under non-stationary routing distributions. This is load-bearing for the central claim, as pretraining routing statistics change over time and the skeptic concern about systematic lag is not addressed.
[§5] (experimental results): the reported outperformance is asserted without accompanying details on the number of experts, routing temperature, dataset scale, or variance across runs in the main tables; this makes it difficult to evaluate whether the gains are robust or merely reflect favorable hyperparameter choices relative to the Switch and loss-free baselines.

minor comments (2)

[§2] The potential function φ is introduced as strictly convex and symmetric, but its explicit form and differentiability properties should be stated earlier (e.g., in §2) to aid readability before the duality step.
[§3] Notation for the expected routing distribution and the EMA update could be unified across the derivation and algorithm box to avoid minor confusion between population and batch quantities.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. We address each major comment below and have revised the manuscript to incorporate clarifications and additional details where appropriate.

Referee: [§3.2] (derivation of the online algorithm): the claim that the EMA-based mirror descent update faithfully tracks the population objective without new bias lacks a convergence rate, bias bound, or analysis under non-stationary routing distributions. This is load-bearing for the central claim, as pretraining routing statistics change over time and the skeptic concern about systematic lag is not addressed.

Authors: We agree that a formal convergence rate or bias bound under non-stationary routing would provide stronger theoretical support. The derivation in §3.2 shows that the mirror-descent update on the dual problem yields an unbiased direction with respect to the population gradient of the chosen potential; the EMA is introduced solely as a variance-reduction mechanism whose bias vanishes as the decay factor approaches 1. We have added a new paragraph in the revised §3.2 that (i) explicitly states the unbiasedness property, (ii) discusses why the symmetric convex potential limits systematic lag compared with batch-level heuristics, and (iii) reports an empirical tracking-error plot over the full pretraining trajectory. A complete non-asymptotic analysis under arbitrary non-stationarity would require mixing-time assumptions on the router that lie outside the paper’s scope; we therefore treat the combination of the exact dual derivation and the long-run empirical evidence as sufficient for the practical claims made.
revision: partial

Referee: [§5] (experimental results): the reported outperformance is asserted without accompanying details on the number of experts, routing temperature, dataset scale, or variance across runs in the main tables; this makes it difficult to evaluate whether the gains are robust or merely reflect favorable hyperparameter choices relative to the Switch and loss-free baselines.

Authors: We accept the criticism and have substantially expanded the experimental section. The revised manuscript now includes, in the caption of Table 1 and in §5.1, the exact number of experts (8/16/32), routing temperature (τ=1.0 unless otherwise noted), pretraining dataset scale (approximately 100 B tokens), and standard deviations computed over three independent random seeds. A new supplementary table further reports results across a wider hyper-parameter sweep. These additions confirm that the reported gains remain consistent and are not artifacts of a single favorable configuration.
revision: yes
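A hedged back-of-the-envelope calculation, not taken from the paper, makes the bias and variance trade-off in the first response concrete. It uses the update $m_{t+1} = (1-\eta)\,m_t + \eta\,p_t$ quoted further down this page and assumes, purely for illustration, that the batch estimates $p_t$ carry independent noise with per-coordinate variance $\sigma^2$ and that the population mean drifts linearly, $\mu_t := \mathbb{E}[p_t] = \mu_0 + t\,\delta$. Unrolling the recursion gives

$$
m_t = (1-\eta)^t\, m_0 + \eta \sum_{s=0}^{t-1} (1-\eta)^{t-1-s}\, p_s ,
$$

and, once the initial term has decayed,

$$
\mathbb{E}[m_t] - \mu_{t-1} = -\frac{1-\eta}{\eta}\,\delta ,
\qquad
\operatorname{Var}[m_t] = \frac{\eta}{2-\eta}\,\sigma^2 .
$$

Under these assumptions the lag vanishes as $\eta \to 1$ while the variance returns to the single-batch value $\sigma^2$; a small $\eta$ does the opposite. This matches the direction of the authors' statement but says nothing about routers whose drift is not linear.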

Circularity Check

0 steps flagged

No significant circularity; derivation applies standard convex duality and mirror descent to a new potential

The paper introduces φ-balancing by defining a strictly convex symmetric differentiable potential over the expected routing distribution, then applies convex duality to derive an equivalent min-max problem and mirror descent to obtain an EMA-based online update. This chain relies on well-established optimization primitives (convex duality, mirror descent) applied to a freshly proposed potential rather than re-expressing fitted parameters, prior self-citations, or input statistics as outputs. No load-bearing step reduces by construction to its own inputs, and the framework remains self-contained against external benchmarks without invoking author-specific uniqueness theorems or ansatzes smuggled via citation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on the existence of a suitable strictly convex symmetric differentiable potential whose population minimum corresponds to balanced routing, plus the validity of the EMA approximation for the online setting.

axioms (2)

domain assumption: A strictly convex, symmetric, differentiable potential exists whose minimization yields population-level expert balance. (Central to the framework stated in the abstract.)
standard math: Convex duality produces an equivalent min-max problem solvable by mirror descent. (Invoked to obtain the online algorithm.)
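Why the first axiom is plausible can be sketched in a few lines, assuming that "symmetric" means invariant under permutations of the experts and that $\phi$ is continuous on the probability simplex $\Delta_E$ over $E$ experts. Let $m^\star$ minimize $\phi$ over $\Delta_E$; a minimizer exists because $\Delta_E$ is compact. For any permutation $\pi$, symmetry gives $\phi(\pi m^\star) = \phi(m^\star)$, so $\pi m^\star$ is also a minimizer. Strict convexity makes the minimizer unique, hence $\pi m^\star = m^\star$ for every $\pi$, which forces

$$
m^\star = \left(\tfrac{1}{E}, \dots, \tfrac{1}{E}\right).
$$

Two potentials consistent with expressions that appear elsewhere on this page, with their gradients:

$$
\phi_{\mathrm{ent}}(m) = \sum_{e} m_e \log m_e, \qquad \frac{\partial \phi_{\mathrm{ent}}}{\partial m_e} = 1 + \log m_e ,
$$

$$
\phi_{\mathrm{lc}}(m) = \frac{1}{\beta} \sum_{e} \log\cosh(\beta\, m_e), \qquad \frac{\partial \phi_{\mathrm{lc}}}{\partial m_e} = \tanh(\beta\, m_e), \quad \beta > 0 .
$$

The $1/\beta$ normalization of the log-cosh form is an assumption, chosen so that the gradient matches the $\tanh(\beta\, m)$ term quoted in the Lean section below.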

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel / dAlembert_to_ODE_general echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Paper passage: minimizing a strictly convex, symmetric, and differentiable potential $\phi$ applied to the population mean routing distribution... Using convex duality, we derive an equivalent min-max formulation and obtain a simple online algorithm via mirror descent... $m_{t+1} \leftarrow (1-\eta)\, m_t + \eta\, p_t$, $q_{t+1} \leftarrow \nabla\phi(m_{t+1})$

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean costAlphaLog_fourth_deriv_at_zero / alpha_pin_under_high_calibration refines
REFINES: Relation between the paper passage and the cited Recognition theorem.
Paper passage: LOG-COSH ($\beta > 0$) ... $L_{\mathrm{aux}} = \sum p_{t,e} \cdot \tanh(\beta\, m_{t+1,e})$

IndisputableMonolith/Foundation/BranchSelection.lean branch_selection / RCLCombiner_isCoupling_iff echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Paper passage: RCLCombiner_isCoupling_iff ... branch_selection (c ≠ 0 forces bilinear branch)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 10 internal anchors

[1]

Lion secretly solves a constrained optimization: As Lyapunov predicts

Chen, L., Liu, B., Liang, K., and Liu, Q. Lion secretly solves a constrained optimization: As Lyapunov predicts. In The Twelfth International Conference on Learning Representations, ICLR 2024.
[2]

Evaluating Large Language Models Trained on Code

Chen, L., Li, J., Liang, K., Su, B., Xie, C., Pierse, N. W., Liang, C., Lao, N., and Liu, Q. Cautious weight decay. In The Fourteenth International Conference on Learning Representations, ICLR 2026, 2026a.
Chen, L., Li, J., and Liu, Q. Muon optimizes under spectral norm constraints. Trans. Mach. Learn. Res., 2026, 2026b.
Chen, M., Tworek, J., Jun, H., Yuan,...
[3]

Chen, X., Liang, C., Huang, D., Real, E., Wang, K., Pham, H., Dong, X., Luong, T., Hsieh, C., Lu, Y., and Le, Q. V. Symbolic discovery of optimization algorithms. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023.
[4]

ERMoE: Eigen-reparameterized mixture-of-experts for stable routing and interpretable specialization

Cheng, A., Duan, S., Li, S., Yin, C., Cheng, M., Ping, H., Chattopadhyay, T., Thomopoulos, S. I., Nazarian, S., Thompson, P. M., and Bogdan, P. ERMoE: Eigen-reparameterized mixture-of-experts for stable routing and interpretable specialization. CoRR, abs/2511.10971.
[5]

Training Verifiers to Solve Math Word Problems

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. CoRR, abs/2110.14168.
[6]

StableMoE: Stable routing strategy for mixture of experts

Dai, D., Dong, L., Ma, S., Zheng, B., Sui, Z., Chang, B., and Wei, F. StableMoE: Stable routing strategy for mixture of experts. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, pp. 7085–7095.
[7]

DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models

Dai, D., Deng, C., Zhao, C., Xu, R. X., Gao, H., Chen, D., Li, J., Zeng, W., Yu, X., Wu, Y., Xie, Z., Li, Y. K., Huang, P., Luo, F., Ruan, C., Sui, Z., and Liang, W. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volum...
[8]

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek-AI. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. CoRR, abs/2405.04434, 2024a.
DeepSeek-AI. DeepSeek-V3 technical report. CoRR, abs/2412.19437, 2024b.
Fedus, W., Zoph, B., and Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. J. Mach. Learn. Res., 23(120):1–39.
[9]

Learning mixtures of experts with EM: A mirror descent perspective

Fruytier, Q., Mokhtari, A., and Sanghavi, S. Learning mixtures of experts with EM: A mirror descent perspective. In Forty-second International Conference on Machine Learning, ICML 2025.
[10]

Advancing expert specialization for better MoE

Guo, H., Lu, H., Nan, G., Chu, B., Zhuang, J., Yang, Y., Che, W., Leng, S., Cui, Q., and Jiang, X. Advancing expert specialization for better MoE. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2025, NeurIPS 2025.
[11]

Harder task needs more experts: Dynamic routing in MoE models

Huang, Q., An, Z., Zhuang, N., Tao, M., Zhang, C., Jin, Y., Xu, K., Chen, L., Huang, S., and Feng, Y. Harder task needs more experts: Dynamic routing in MoE models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, pp. 12883–12895.
[12]

Gemma 3 Technical Report

Kamath, A., Ferret, J., Pathak, S., Vieillard, N., Merhej, R., Perrin, S., Matejovicova, T., Ramé, A., Rivière, M., Rouillard, L., Mesnard, T., Cideron, G., Grill, J., Ramos, S., Yvinec, E., Casbon, M., Pot, E., Penchev, I., Liu, G., Visin, F., Kenealy, K., Beyer, L., Zhai, X., Tsitsulin, A., Busa-Fekete, R., Feng, A., Sachdeva, N., Coleman, B., Gao, Y ...
[13]

Revisit visual prompt tuning: The expressiveness of prompt experts

Le, M., Nguyen, A., Nguyen, H., Nguyen, C., Tran, A., and Ho, N. Revisit visual prompt tuning: The expressiveness of prompt experts. In The Fourteenth International Conference on Learning Representations, ICLR 2026.
[14]

GShard: Scaling giant models with conditional computation and automatic sharding

Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., and Chen, Z. GShard: Scaling giant models with conditional computation and automatic sharding. In 9th International Conference on Learning Representations, ICLR 2021.
[15]

URL https://huggingface.co/datasets/AI-MO/NuminaMath-CoT.
Li, S., Tadiparthi, V., Lee, K., Agarwal, N., Mahjoub, H. N., Moradi-Pari, E., Chen, L., Zhang, A., and Leqi, L. Learning robust reasoning through guided adversarial self-play. CoRR, abs/2602.00173.
[16]

Memory-efficient LLM training with online subspace descent

URL https://github.com/google-deepmind/simply.
Liang, K., Liu, B., Chen, L., and Liu, Q. Memory-efficient LLM training with online subspace descent. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024.
[17]

Cautious optimizers: Improving training with one line of code

Liang, K., Chen, L., Liu, B., and Liu, Q. Cautious optimizers: Improving training with one line of code. In The Fourteenth International Conference on Learning Representations, ICLR 2026.
[18]

Let’s verify step by step

Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, ICLR 2024.
[19]

Communication efficient distributed training with distributed Lion

Liu, B., Wu, L., Chen, L., Liang, K., Zhu, J., Liang, C., Krishnamoorthi, R., and Liu, Q. Communication efficient distributed training with distributed Lion. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024.
[20]

Muon is Scalable for LLM Training

Liu, J., Su, J., Yao, X., Jiang, Z., Lai, G., Du, Y., Qin, Y., Xu, W., Lu, E., Yan, J., Chen, Y., Zheng, H., Liu, Y., Liu, S., Yin, B., He, W., Zhu, H., Wang, Y., Wang, J., Dong, M., Zhang, Z., Kang, Y., Zhang, H., Xu, X., Zhang, Y., Wu, Y., Zhou, X., and Yang, Z. Muon is scalable for LLM training. CoR...
[21]

Coupling experts and routers in mixture-of-experts via an auxiliary loss

Lv, A., Ma, J., Ma, Y., and Qiao, S. Coupling experts and routers in mixture-of-experts via an auxiliary loss. In The Fourteenth International Conference on Learning Representations, ICLR 2026.
[22]

A comprehensive survey of mixture-of-experts: Algorithms, theory, and applications

Mu, S. and Lin, S. A comprehensive survey of mixture-of-experts: Algorithms, theory, and applications. CoRR, abs/2503.07137.
[23]

OLMoE: Open mixture-of-experts language models

Muennighoff, N., Soldaini, L., Groeneveld, D., Lo, K., Morrison, J., Min, S., Shi, W., Walsh, E. P., Tafjord, O., Lambert, N., Gu, Y., Arora, S., Bhagia, A., Schwenk, D., Wadden, D., Wettig, A., Hui, B., Dettmers, T., Kiela, D., Farhadi, A., Smith, N. A., Koh, P. W., Singh, A., and Hajishirzi, H. OLMoE: Open mixture-of-experts language models. In The ...
[24]

Sigmoid gating is more sample efficient than softmax gating in mixture of experts

Nguyen, H., Ho, N., and Rinaldo, A. Sigmoid gating is more sample efficient than softmax gating in mixture of experts. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024.
[25]

Memory-efficient optimization with factorized Hamiltonian descent

Nguyen, S., Chen, L., Liu, B., and Liu, Q. Memory-efficient optimization with factorized Hamiltonian descent. In International Conference on Artificial Intelligence and Statistics, AISTATS 2025, Proceedings of Machine Learning Research, pp. 2863–2871.
[26]

Load balancing mixture of experts with similarity preserving routers

Omi, N., Sen, S., and Farhadi, A. Load balancing mixture of experts with similarity preserving routers. CoRR, abs/2506.14038.
[27]

gpt-oss-120b & gpt-oss-20b Model Card

OpenAI. gpt-oss-120b & gpt-oss-20b model card. CoRR, abs/2508.10925.
[28]

DeMo: Decoupled momentum optimization

Peng, B., Chen, L., Su, B., Quesnelle, J., Kingma, D. P., and Liu, Q. DeMo: Decoupled momentum optimization. In The Fourteenth International Conference on Learning Representations, ICLR 2026.
[29]

Qwen3 Technical Report

Qiu, Z., Huang, Z., Cheng, S., Zhou, Y., Wang, Z., Titov, I., and Fu, J. Layerwise recurrent router for mixture-of-experts. In The Thirteenth International Conference on Learning Representations, ICLR 2025, 2025a.
Qiu, Z., Huang, Z., Zheng, B., Wen, K., Wang, Z., Men, R., Titov, I., Liu, D., Zhou, J., and Lin, J. Demons in the detail: On implementing loa...
[30]

DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale

Rajbhandari, S., Li, C., Yao, Z., Zhang, M., Aminabadi, R. Y., Awan, A. A., Rasley, J., and He, Y. DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale. In International Conference on Machine Learning, ICML 2022, Proceedings of Machine Learning Research, pp. 18332–18346.
[31]

SQuAD: 100,000+ questions for machine comprehension of text

Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, pp. 2383–2392.
[32]

Scaling vision with sparse mixture of experts

Riquelme, C., Puigcerver, J., Mustafa, B., Neumann, M., Jenatton, R., Pinto, A. S., Keysers, D., and Houlsby, N. Scaling vision with sparse mixture of experts. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, pp. 8583–8595.
[33]

GLU Variants Improve Transformer

Shazeer, N. GLU variants improve transformer. CoRR, abs/2002.05202.
[34]

Outrageously large neural networks: The sparsely-gated mixture-of-experts layer

Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q. V., Hinton, G. E., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In 5th International Conference on Learning Representations, ICLR 2017.
[35]

Recursive deep models for semantic compositionality over a sentiment treebank

Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A. Y., and Potts, C. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, pp. 1631–1642.
[36]

Challenging BIG-bench tasks and whether chain-of-thought can solve them

URL https://openreview.net/forum?id=kW5hSRG5wq.
Suzgun, M., Scales, N., Schärli, N., Gehrmann, S., Tay, Y., Chung, H. W., Chowdhery, A., Le, Q. V., Chi, E. H., Zhou, D., and Wei, J. Challenging BIG-bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, volume ACL 2023 of Findings...
[37]

Thai, G. H., Vu, H.-N., Phan, A.-M., Ly, Q.-T., Dinh, T., Nguyen, T.-N.-T., and Ho, N. SAGE: Shape-adapting gated experts for adaptive histopathology image segmentation. CoRR, abs/2511.18493.
[38]

Tian, C., Chen, K., Liu, J., Liu, Z., Zhang, Z., and Zhou, J. Towards greater leverage: Scaling laws for efficient mixture-of-experts language models. In The Fourteenth International Conference on Learning Representations, ICLR 2026, 2026.
[39]

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In 7th International Conference on Learning Representations, ICLR 2019, 2019.
[40]

Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts

Wang, A., Sun, X., Xie, R., Li, S., Zhu, J., Yang, Z., Zhao, P., Han, W., Kang, Z., Wang, D., Okazaki, N., and Xu, C. HMoE: Heterogeneous mixture of experts for language modeling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, pp. 21943–21957, 2025a.

Wang, L., Gao, H., Zhao, C., Sun, X., and Dai, D....
[41]

Wang, Z., Zhu, J., and Chen, J. ReMoE: Fully differentiable mixture-of-experts with ReLU routing. In The Thirteenth International Conference on Learning Representations, ICLR 2025, 2025b.

Warstadt, A., Singh, A., and Bowman, S. R. Neural network acceptability judgments. Trans. Assoc. Comput. Linguistics, 7:625–641.
[42]

Williams, A., Nangia, N., and Bowman, S. R. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, pp. 1112–1122, 2018.
[43]

Yan, S., Bin, X., Zhang, S., Wang, Y., and Lin, Z. TC-MoE: Augmenting mixture of experts with ternary expert choice. In The Thirteenth International Conference on Learning Representations, ICLR 2025, 2025.
[44]

Yang, J. Latent prototype routing: Achieving near-perfect load balancing in mixture-of-experts. CoRR, abs/2506.21328.
[45]

Zeng, Z., Miao, Y., Gao, H., Zhang, H., and Deng, Z. AdaMoE: Token-adaptive routing with null experts for mixture-of-experts language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, Findings of ACL, pp. 6223–6235, 2024.
[46]

Zhang, B., Chen, X., Zhang, S., Zhang, S., Zhou, X., and Sun, M. FLEX-MoE: Federated mixture-of-experts with load-balanced expert assignment. CoRR, abs/2512.23070, 2025a.

Zhang, D., Song, J., Bi, Z., Yuan, Y., Wang, T., Yeong, J., and Hao, J. Mixture of experts in large language models. CoRR, abs/2507.11181, 2025b.

Zhang, K., Li, B., Zhang, P., Pu, F., Ca...
[47]

Zhou, Y., Lei, T., Liu, H., Du, N., Huang, Y., Zhao, V. Y., Dai, A. M., Chen, Z., Le, Q. V., and Laudon, J. Mixture-of-experts with expert choice routing. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, 2022.
[48]

$\mathrm{MaxVio}_{\mathrm{global}}$ is a metric that quantifies load imbalance in MoE models, defined as

$$\mathrm{MaxVio}_{\mathrm{global}} = \frac{\max_e \mathrm{Load}_e - \overline{\mathrm{Load}}}{\overline{\mathrm{Load}}},$$

where

• $\mathrm{Load}_e$ is the number of tokens assigned to expert $e$.
• $\overline{\mathrm{Load}}$ is the average (ideal balanced) load across experts.

A lower value indicates more balanced expert utilization, while a higher value reflects severe imbalance. It evaluates global lo...
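As a rough illustration of the MaxVio definition above, the following Python sketch computes the metric from a flat list of expert assignments. The function name `max_violation`, its arguments, and the toy inputs are hypothetical and not taken from the paper; with top-k routing, each token would simply contribute k entries to the list.

```python
from collections import Counter


def max_violation(assignments, num_experts):
    """Return (max_e Load_e - mean Load) / mean Load for one set of assignments.

    `assignments` is a flat list of expert indices, one entry per routed token
    (illustrative input format, assumed for this sketch).
    """
    if num_experts <= 0:
        raise ValueError("num_experts must be positive")
    counts = Counter(assignments)
    # Experts that received no tokens still count toward the average load.
    loads = [counts.get(e, 0) for e in range(num_experts)]
    mean_load = sum(loads) / num_experts
    if mean_load == 0:
        raise ValueError("no tokens were routed")
    return (max(loads) - mean_load) / mean_load


# Perfectly balanced routing: every expert receives 2 of 8 tokens.
balanced = max_violation([0, 1, 2, 3, 0, 1, 2, 3], num_experts=4)

# Skewed routing: expert 0 receives 6 of 8 tokens, expert 3 receives none.
skewed = max_violation([0, 0, 0, 0, 0, 0, 1, 2], num_experts=4)

print(balanced, skewed)
```

In this toy case the balanced assignment gives 0, and the skewed one gives 2 because the busiest expert carries three times the average load.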
[49]

requires that the gradient of its objective in $q$ vanish at $q_{t+1}$, i.e.

$$\nabla_q F(q_t; p_t) - \frac{1}{\eta}\left(\nabla\phi^*(q_{t+1}) - \nabla\phi^*(q_t)\right) = 0. \tag{14}$$

Substituting (13) and $\nabla\phi^*(q_t) = m_t$ into (14) and rearranging yields the primal-space update

$$m_{t+1} = \nabla\phi^*(q_{t+1}) \leftarrow m_t + \eta\,(p_t - m_t) = (1-\eta)\,m_t + \eta\,p_t,$$

which is a convex combination of $m_t$ and $p_t$. Since $D_\phi$ is convex, the iterate $m_{t+1}$ remains in $D_\phi$. M...
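The primal-space update above is an exponential moving average, which the following minimal Python sketch makes concrete. It assumes that $m_t$ is a running estimate of the routing distribution and $p_t$ is the distribution observed on the current mini-batch; this excerpt does not define them, so that reading, along with the names `ema_update`, `m`, `p`, and `eta`, is an assumption made for illustration.

```python
def ema_update(m, p, eta):
    """One step of m_{t+1} = (1 - eta) * m_t + eta * p_t, applied elementwise."""
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1] for a convex combination")
    if len(m) != len(p):
        raise ValueError("m and p must have the same length")
    return [(1.0 - eta) * m_i + eta * p_i for m_i, p_i in zip(m, p)]


# Hypothetical values: a uniform running estimate over 4 experts and a
# skewed mini-batch routing distribution.
m = [0.25, 0.25, 0.25, 0.25]
p = [0.7, 0.1, 0.1, 0.1]
eta = 0.1

m_next = ema_update(m, p, eta)

# The same step written in the increment form m_t + eta * (p_t - m_t).
m_next_increment = [m_i + eta * (p_i - m_i) for m_i, p_i in zip(m, p)]

print(m_next)
```

Because each coordinate is a convex combination of the corresponding entries of `m` and `p`, the result stays a probability vector whenever both inputs are, which mirrors the remark that the iterate remains in $D_\phi$.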
