Hierarchical Mixture-of-Experts with Two-Stage Optimization
Pith reviewed 2026-05-12 01:54 UTC · model grok-4.3 · Recognition: 2 Lean theorem links
The pith
Hierarchical routing in MoE models reduces perplexity by 5.6% and improves expert balance by 40% in 7B-scale pre-training on 58B tokens.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Hi-MoE introduces a grouped MoE framework that decomposes routing control into two coupled levels: inter-group balancing that enforces fair traffic across expert groups, and intra-group specialization that promotes complementary expert behaviors while preventing within-group collapse. This hierarchical structure, combined with a two-stage optimization procedure, reshapes the router to promote stable specialization and mitigate collapse, resulting in consistent improvements over baselines: a 5.6% perplexity reduction and 40% better expert balance in 7B-scale pre-training on 58B tokens.
What carries the argument
The two coupled hierarchical objectives of inter-group balancing and intra-group specialization, jointly optimized via a two-stage procedure that separates control of traffic fairness from within-group diversity.
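To make the carrying mechanism concrete, the sketch below factors routing as p(expert) = p(group) · p(expert | group), which is the structure the two hierarchical objectives act on. This is a minimal illustration under assumed shapes and an assumed top-k scheme; the class name, parameters, and layout are not Hi-MoE's published implementation.

```python
# Hypothetical two-level (group -> expert) MoE router, for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalRouter(nn.Module):
    """Tokens first pick an expert group, then experts within that group."""
    def __init__(self, d_model, num_groups, experts_per_group, k=2):
        super().__init__()
        self.num_groups = num_groups
        self.experts_per_group = experts_per_group
        self.k = k
        # Level 1: inter-group gate over G groups.
        self.group_gate = nn.Linear(d_model, num_groups, bias=False)
        # Level 2: intra-group gate over every expert, reshaped per group.
        self.expert_gate = nn.Linear(d_model, num_groups * experts_per_group, bias=False)

    def forward(self, x):
        # x: [tokens, d_model]
        group_probs = F.softmax(self.group_gate(x), dim=-1)          # [T, G]
        logits = self.expert_gate(x).view(-1, self.num_groups, self.experts_per_group)
        within_probs = F.softmax(logits, dim=-1)                     # [T, G, E]
        joint = group_probs.unsqueeze(-1) * within_probs             # p(g) * p(e|g)
        weights, idx = joint.flatten(1).topk(self.k, dim=-1)         # top-k of [T, G*E]
        weights = weights / weights.sum(-1, keepdim=True)            # renormalize
        return weights, idx, group_probs, within_probs
```

Because the two factors are separate tensors, an inter-group regularizer can act on group_probs while an intra-group regularizer acts on within_probs, which is the split the two-stage procedure is described as exploiting.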
If this is right
- The improvements remain consistent as model size and expert count are scaled upward.
- Targeted ablations confirm that both inter-group and intra-group levels are necessary for the observed stability.
- Gains appear across diverse NLP and vision evaluation domains after the same pre-training regime.
- The two-stage procedure enables the joint objectives to be trained without collapse using standard optimizer settings.
Where Pith is reading between the lines
- Similar hierarchical decompositions could be tested in other conditional computation settings such as dynamic depth networks.
- The approach may reduce the hyperparameter burden when moving to models larger than 7B by limiting collapse modes at each scale.
- Practitioners could apply the inter-group versus intra-group split to improve load balancing in non-language sparse architectures.
Load-bearing premise
The two coupled hierarchical objectives of inter-group balancing and intra-group specialization can be jointly optimized in a stable manner without introducing new collapse modes or requiring extensive additional hyperparameter search beyond the described two-stage procedure.
What would settle it
A replication of the 58B-token pre-training runs for Hi-MoE-7B and OLMoE-7B that fails to reproduce the reported perplexity reduction or the 40% expert-balance improvement would falsify the central performance claims.
Original abstract
Sparse Mixture-of-Experts (MoE) models scale capacity by routing each token to a small subset of experts. However, their routers exhibit a fundamental trade-off: strong load balancing can suppress expert specialization, while aggressive diversity often causes routing collapse. We propose Hi-MoE, a grouped MoE framework that decomposes routing control into two coupled levels: (i) inter-group balancing that enforces fair traffic across expert groups, and (ii) intra-group specialization that promotes complementary expert behaviors while preventing within-group collapse. Our analysis provides a principled explanation of how our hierarchical objectives reshape the router, thereby promoting stable specialization and mitigating collapse. We observe consistent improvements over recent sparse-routing and grouped-MoE baselines across NLP and vision benchmarks, and confirm robustness via scaling studies (model size, expert count) and targeted ablations. In large-scale pre-training on 58B tokens, Hi-MoE-7B achieves a 5.6% perplexity reduction and a 40% improvement in expert balance over OLMoE-7B across diverse evaluation domains.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Hi-MoE, a hierarchical grouped Mixture-of-Experts architecture that decomposes router control into inter-group balancing (fair traffic across expert groups) and intra-group specialization (complementary behaviors without within-group collapse). These are jointly optimized via a two-stage procedure. The central empirical claims are consistent gains over sparse-routing and grouped-MoE baselines on NLP/vision tasks, plus a 5.6% perplexity reduction and 40% expert-balance improvement for Hi-MoE-7B versus OLMoE-7B after pre-training on 58B tokens, supported by scaling studies and ablations.
Significance. If the two-stage procedure stably resolves the balancing-specialization trade-off without new collapse modes, the work would provide a practical, scalable lever for MoE router design that could improve both efficiency and capacity utilization in large sparse models.
major comments (3)
- [§3] §3 (two-stage optimization procedure): the manuscript provides no explicit loss formulations, weighting schedule, or hyperparameter values for the coupled inter-group balancing and intra-group diversity terms. Without these, it is impossible to determine whether the reported stability and gains arise from the hierarchical decomposition itself or from unreported tuning that masks potential group-level under-utilization or intra-group collapse.
- [§4.2, Table 2] §4.2 and Table 2 (large-scale pre-training results): the 5.6% PPL reduction and 40% balance improvement versus OLMoE-7B are stated without error bars, multiple random seeds, or statistical tests. Given that MoE training variance is typically high, these point estimates alone do not establish that the hierarchical objectives reliably outperform the baseline (one common reading of the balance metric is sketched after this list).
- [§4.3] §4.3 (ablations): the ablation studies do not isolate the contribution of the two-stage schedule versus simply adding the two balancing terms simultaneously; a direct comparison is needed to confirm that the staged procedure is load-bearing for the claimed mitigation of collapse modes.
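The balance metric behind the 40% figure is not defined anywhere in this review, so the sketch below uses a common stand-in, the coefficient of variation (CV) of per-expert token loads. The function name and the choice of CV are assumptions, not the paper's metric.

```python
# Illustrative expert-balance proxy: coefficient of variation of expert loads.
import numpy as np

def expert_balance_cv(expert_ids: np.ndarray, num_experts: int) -> float:
    """CV of the fraction of tokens routed to each expert; 0.0 = perfectly uniform."""
    loads = np.bincount(expert_ids, minlength=num_experts).astype(float)
    frac = loads / loads.sum()
    return float(frac.std() / frac.mean())

# Under this reading, a "40% improvement in expert balance" would mean the CV
# dropping by 40% relative to the baseline router.
cv = expert_balance_cv(np.random.randint(0, 64, size=100_000), 64)
```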
minor comments (2)
- Notation for the router logits and group assignment variables is introduced without a consolidated table; a single reference table would improve readability.
- [Figure 3] Figure 3 (expert utilization heatmaps) lacks axis labels on the color scale and does not indicate the number of tokens sampled per domain.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below, indicating the revisions we will incorporate to improve clarity, rigor, and completeness of the manuscript.
Point-by-point responses
Referee: [§3] §3 (two-stage optimization procedure): the manuscript provides no explicit loss formulations, weighting schedule, or hyper-parameter values for the coupled inter-group balancing and intra-group diversity terms. Without these, it is impossible to determine whether the reported stability and gains arise from the hierarchical decomposition itself or from unreported tuning that masks potential group-level under-utilization or intra-group collapse.
Authors: We agree that the loss formulations and hyperparameters were insufficiently detailed in the original submission. In the revised manuscript we will expand Section 3 with the explicit equations for the inter-group balancing loss (L_inter = Σ_g |load_g − 1/G|²) and intra-group diversity loss (L_intra = −Σ entropy of expert activations within groups), the combined objective with weighting coefficients λ_inter and λ_intra, the precise two-stage schedule (stage 1 optimizes only balancing for the first 10% of training steps, stage 2 activates both terms), and the concrete hyperparameter values used in all experiments together with a brief sensitivity discussion. These additions will make clear that the reported gains derive from the hierarchical decomposition rather than undisclosed tuning.
revision: yes
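The rebuttal's equations are specific enough to sketch in code. The lambda values, the soft-load estimate, and computing the within-group entropy over batch-averaged usage (rather than per token) are assumptions layered on top of what is quoted, not the paper's stated choices.

```python
# Sketch of L_inter + L_intra with the two-stage schedule from the rebuttal:
#   L_inter = sum_g |load_g - 1/G|^2
#   L_intra = -(entropy of expert activations within groups)
#   stage 1 = balancing only, for the first 10% of training steps
import torch

def hi_moe_aux_loss(group_probs, within_probs, step, total_steps,
                    lam_inter=1e-2, lam_intra=1e-3, eps=1e-9):
    # group_probs:  [T, G] router probability of each group, per token
    # within_probs: [T, G, E] probability of each expert within its group
    G = group_probs.shape[-1]
    load = group_probs.mean(dim=0)                         # soft per-group load
    l_inter = ((load - 1.0 / G) ** 2).sum()                # sum_g |load_g - 1/G|^2

    # Batch-averaged within-group usage is assumed here: maximizing its entropy
    # keeps every expert in a group in play, counteracting within-group collapse.
    usage = within_probs.mean(dim=0)                       # [G, E]
    entropy = -(usage * (usage + eps).log()).sum(dim=-1)   # [G]
    l_intra = -entropy.sum()                               # minimize => maximize entropy

    # Two-stage schedule: balancing only, then both terms.
    if step < 0.1 * total_steps:
        return lam_inter * l_inter
    return lam_inter * l_inter + lam_intra * l_intra
```

In training, this term would simply be added to the task loss, e.g. loss = task_loss + hi_moe_aux_loss(group_probs, within_probs, step, total_steps).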
Referee: [§4.2, Table 2] §4.2 and Table 2 (large-scale pre-training results): the 5.6% PPL reduction and 40% balance improvement versus OLMoE-7B are stated without error bars, multiple random seeds, or statistical tests. Given that MoE training variance is typically high, these point estimates alone do not establish that the hierarchical objectives reliably outperform the baseline.
Authors: We acknowledge that single-run point estimates are insufficient to demonstrate reliability given known MoE training variance. Because of the prohibitive cost of 58B-token pre-training, we performed only one run for the 7B model. In the revision we will explicitly note this limitation, add error bars and multi-seed results (minimum three seeds) with statistical tests for all smaller-scale experiments in Tables 1, 3, and 4, and retain the large-scale numbers with an appropriate caveat while emphasizing the consistent trends across model scales and tasks.
revision: partial
Referee: [§4.3] §4.3 (ablations): the ablation studies do not isolate the contribution of the two-stage schedule versus simply adding the two balancing terms simultaneously; a direct comparison is needed to confirm that the staged procedure is load-bearing for the claimed mitigation of collapse modes.
Authors: We accept that the existing ablations do not directly isolate the staging procedure. The revised manuscript will include a new ablation subsection and table that compares (i) the full two-stage Hi-MoE, (ii) a single-stage variant that optimizes both inter-group and intra-group losses jointly from the start, and (iii) the individual-term baselines. Results will quantify the additional benefit of staging in preventing collapse and improving expert utilization, thereby confirming the load-bearing role of the two-stage schedule.
revision: yes
- We cannot rerun the 58B-token 7B-model pre-training with multiple random seeds because of the prohibitive computational resources required.
Circularity Check
No circularity; empirical claims rest on external benchmarks
Full rationale
The paper proposes a hierarchical MoE architecture and two-stage optimization procedure, then reports empirical gains (perplexity, balance) against external baselines such as OLMoE-7B on 58B-token pre-training. No mathematical derivation chain, loss-function identities, or fitted-parameter predictions are exhibited that reduce to the paper's own inputs by construction. The 'principled explanation' is described at a high level without equations that could be self-referential. All load-bearing claims are falsifiable via the reported scaling studies and ablations, satisfying the criteria for non-circularity.
Axiom & Free-Parameter Ledger
free parameters (2)
- inter-group balancing strength
- intra-group diversity coefficient
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean (Jcost uniqueness, washburn_uniqueness_aczel) · reality_from_one_distinction · tag: unclear
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Cited passage (reconstructed): L = L_task + L_load + R_intra + R_inter, with min L_task + L_load s.t. C_sys ≤ ε_sys and C_ov ≤ ε_ov, where R_inter = λ_inter ‖eπ(x)‖²₂ and R_intra = −λ_intra ‖π(x)‖²₂.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.