Hierarchical Mixture-of-Experts with Two-Stage Optimization
Pith reviewed 2026-05-12 01:54 UTC · model grok-4.3 · Recognition: 2 Lean theorem links
The pith
Hierarchical routing in MoE models reduces perplexity by 5.6% and improves expert balance by 40% in 7B-scale pre-training on 58B tokens.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Hi-MoE introduces a grouped MoE framework that decomposes routing control into two coupled levels: inter-group balancing that enforces fair traffic across expert groups, and intra-group specialization that promotes complementary expert behaviors while preventing within-group collapse. This hierarchical structure, combined with a two-stage optimization procedure, reshapes the router to promote stable specialization and mitigate collapse, resulting in consistent improvements over baselines: a 5.6% perplexity reduction and 40% better expert balance in 7B-scale pre-training on 58B tokens.
What carries the argument
The two coupled hierarchical objectives of inter-group balancing and intra-group specialization, jointly optimized via a two-stage procedure that separates control of traffic fairness from within-group diversity.
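To make the carrying mechanism concrete, the sketch below factors routing as p(expert) = p(group) · p(expert | group), which is the structure the two hierarchical objectives act on. This is a minimal illustration under assumed shapes and an assumed top-k scheme; the class name, parameters, and layout are not Hi-MoE's published implementation.

```python
# Hypothetical two-level (group -> expert) MoE router, for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalRouter(nn.Module):
    """Tokens first pick an expert group, then experts within that group."""
    def __init__(self, d_model, num_groups, experts_per_group, k=2):
        super().__init__()
        self.num_groups = num_groups
        self.experts_per_group = experts_per_group
        self.k = k
        # Level 1: inter-group gate over G groups.
        self.group_gate = nn.Linear(d_model, num_groups, bias=False)
        # Level 2: intra-group gate over every expert, reshaped per group.
        self.expert_gate = nn.Linear(d_model, num_groups * experts_per_group, bias=False)

    def forward(self, x):
        # x: [tokens, d_model]
        group_probs = F.softmax(self.group_gate(x), dim=-1)          # [T, G]
        logits = self.expert_gate(x).view(-1, self.num_groups, self.experts_per_group)
        within_probs = F.softmax(logits, dim=-1)                     # [T, G, E]
        joint = group_probs.unsqueeze(-1) * within_probs             # p(g) * p(e|g)
        weights, idx = joint.flatten(1).topk(self.k, dim=-1)         # top-k of [T, G*E]
        weights = weights / weights.sum(-1, keepdim=True)            # renormalize
        return weights, idx, group_probs, within_probs
```

Because the two factors are separate tensors, an inter-group regularizer can act on group_probs while an intra-group regularizer acts on within_probs, which is the split the two-stage procedure is described as exploiting.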
If this is right
- The improvements remain consistent as model size and expert count are scaled upward.
- Targeted ablations confirm that both inter-group and intra-group levels are necessary for the observed stability.
- Gains appear across diverse NLP and vision evaluation domains after the same pre-training regime.
- The two-stage procedure enables the joint objectives to be trained without collapse using standard optimizer settings.
Where Pith is reading between the lines
- Similar hierarchical decompositions could be tested in other conditional computation settings such as dynamic depth networks.
- The approach may reduce the hyperparameter burden when moving to models larger than 7B by limiting collapse modes at each scale.
- Practitioners could apply the inter-group versus intra-group split to improve load balancing in non-language sparse architectures.
Load-bearing premise
The two coupled hierarchical objectives of inter-group balancing and intra-group specialization can be jointly optimized in a stable manner without introducing new collapse modes or requiring extensive additional hyperparameter search beyond the described two-stage procedure.
What would settle it
A replication of the 58B-token pre-training runs for Hi-MoE-7B and OLMoE-7B that fails to reproduce the reported perplexity reduction or the 40% expert-balance improvement would falsify the central performance claims.
Original abstract
Sparse Mixture-of-Experts (MoE) models scale capacity by routing each token to a small subset of experts. However, their routers exhibit a fundamental trade-off: strong load balancing can suppress expert specialization, while aggressive diversity often causes routing collapse. We propose Hi-MoE, a grouped MoE framework that decomposes routing control into two coupled levels: (i) inter-group balancing that enforces fair traffic across expert groups, and (ii) intra-group specialization that promotes complementary expert behaviors while preventing within-group collapse. Our analysis provides a principled explanation of how our hierarchical objectives reshape the router, thereby promoting stable specialization and mitigating collapse. We observe consistent improvements over recent sparse-routing and grouped-MoE baselines across NLP and vision benchmarks, and confirm robustness via scaling studies (model size, expert count) and targeted ablations. In large-scale pre-training on 58B tokens, Hi-MoE-7B achieves a 5.6% perplexity reduction and a 40% improvement in expert balance over OLMoE-7B across diverse evaluation domains.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Hi-MoE, a hierarchical grouped Mixture-of-Experts architecture that decomposes router control into inter-group balancing (fair traffic across expert groups) and intra-group specialization (complementary behaviors without within-group collapse). These are jointly optimized via a two-stage procedure. The central empirical claims are consistent gains over sparse-routing and grouped-MoE baselines on NLP/vision tasks, plus a 5.6% perplexity reduction and 40% expert-balance improvement for Hi-MoE-7B versus OLMoE-7B after pre-training on 58B tokens, supported by scaling studies and ablations.
Significance. If the two-stage procedure stably resolves the balancing-specialization trade-off without new collapse modes, the work would provide a practical, scalable lever for MoE router design that could improve both efficiency and capacity utilization in large sparse models.
major comments (3)
- [§3] §3 (two-stage optimization procedure): the manuscript provides no explicit loss formulations, weighting schedule, or hyperparameter values for the coupled inter-group balancing and intra-group diversity terms. Without these, it is impossible to determine whether the reported stability and gains arise from the hierarchical decomposition itself or from unreported tuning that masks potential group-level under-utilization or intra-group collapse.
- [§4.2, Table 2] §4.2 and Table 2 (large-scale pre-training results): the 5.6% PPL reduction and 40% balance improvement versus OLMoE-7B are stated without error bars, multiple random seeds, or statistical tests. Given that MoE training variance is typically high, these point estimates alone do not establish that the hierarchical objectives reliably outperform the baseline (one common reading of the balance metric is sketched after this list).
- [§4.3] §4.3 (ablations): the ablation studies do not isolate the contribution of the two-stage schedule versus simply adding the two balancing terms simultaneously; a direct comparison is needed to confirm that the staged procedure is load-bearing for the claimed mitigation of collapse modes.
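The balance metric behind the 40% figure is not defined anywhere in this review, so the sketch below uses a common stand-in, the coefficient of variation (CV) of per-expert token loads. The function name and the choice of CV are assumptions, not the paper's metric.

```python
# Illustrative expert-balance proxy: coefficient of variation of expert loads.
import numpy as np

def expert_balance_cv(expert_ids: np.ndarray, num_experts: int) -> float:
    """CV of the fraction of tokens routed to each expert; 0.0 = perfectly uniform."""
    loads = np.bincount(expert_ids, minlength=num_experts).astype(float)
    frac = loads / loads.sum()
    return float(frac.std() / frac.mean())

# Under this reading, a "40% improvement in expert balance" would mean the CV
# dropping by 40% relative to the baseline router.
cv = expert_balance_cv(np.random.randint(0, 64, size=100_000), 64)
```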
minor comments (2)
- Notation for the router logits and group assignment variables is introduced without a consolidated table; a single reference table would improve readability.
- [Figure 3] Figure 3 (expert utilization heatmaps) lacks axis labels on the color scale and does not indicate the number of tokens sampled per domain.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below, indicating the revisions we will incorporate to improve clarity, rigor, and completeness of the manuscript.
Point-by-point responses
Referee: [§3] §3 (two-stage optimization procedure): the manuscript provides no explicit loss formulations, weighting schedule, or hyper-parameter values for the coupled inter-group balancing and intra-group diversity terms. Without these, it is impossible to determine whether the reported stability and gains arise from the hierarchical decomposition itself or from unreported tuning that masks potential group-level under-utilization or intra-group collapse.
Authors: We agree that the loss formulations and hyperparameters were insufficiently detailed in the original submission. In the revised manuscript we will expand Section 3 with the explicit equations for the inter-group balancing loss (L_inter = Σ_g |load_g − 1/G|²) and intra-group diversity loss (L_intra = −Σ entropy of expert activations within groups), the combined objective with weighting coefficients λ_inter and λ_intra, the precise two-stage schedule (stage 1 optimizes only balancing for the first 10% of training steps, stage 2 activates both terms), and the concrete hyperparameter values used in all experiments together with a brief sensitivity discussion. These additions will make clear that the reported gains derive from the hierarchical decomposition rather than undisclosed tuning.
revision: yes
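The rebuttal's equations are specific enough to sketch in code. The lambda values, the soft-load estimate, and computing the within-group entropy over batch-averaged usage (rather than per token) are assumptions layered on top of what is quoted, not the paper's stated choices.

```python
# Sketch of L_inter + L_intra with the two-stage schedule from the rebuttal:
#   L_inter = sum_g |load_g - 1/G|^2
#   L_intra = -(entropy of expert activations within groups)
#   stage 1 = balancing only, for the first 10% of training steps
import torch

def hi_moe_aux_loss(group_probs, within_probs, step, total_steps,
                    lam_inter=1e-2, lam_intra=1e-3, eps=1e-9):
    # group_probs:  [T, G] router probability of each group, per token
    # within_probs: [T, G, E] probability of each expert within its group
    G = group_probs.shape[-1]
    load = group_probs.mean(dim=0)                         # soft per-group load
    l_inter = ((load - 1.0 / G) ** 2).sum()                # sum_g |load_g - 1/G|^2

    # Batch-averaged within-group usage is assumed here: maximizing its entropy
    # keeps every expert in a group in play, counteracting within-group collapse.
    usage = within_probs.mean(dim=0)                       # [G, E]
    entropy = -(usage * (usage + eps).log()).sum(dim=-1)   # [G]
    l_intra = -entropy.sum()                               # minimize => maximize entropy

    # Two-stage schedule: balancing only, then both terms.
    if step < 0.1 * total_steps:
        return lam_inter * l_inter
    return lam_inter * l_inter + lam_intra * l_intra
```

In training, this term would simply be added to the task loss, e.g. loss = task_loss + hi_moe_aux_loss(group_probs, within_probs, step, total_steps).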
Referee: [§4.2, Table 2] §4.2 and Table 2 (large-scale pre-training results): the 5.6% PPL reduction and 40% balance improvement versus OLMoE-7B are stated without error bars, multiple random seeds, or statistical tests. Given that MoE training variance is typically high, these point estimates alone do not establish that the hierarchical objectives reliably outperform the baseline.
Authors: We acknowledge that single-run point estimates are insufficient to demonstrate reliability given known MoE training variance. Because of the prohibitive cost of 58B-token pre-training, we performed only one run for the 7B model. In the revision we will explicitly note this limitation, add error bars and multi-seed results (minimum three seeds) with statistical tests for all smaller-scale experiments in Tables 1, 3, and 4, and retain the large-scale numbers with an appropriate caveat while emphasizing the consistent trends across model scales and tasks.
revision: partial
Referee: [§4.3] §4.3 (ablations): the ablation studies do not isolate the contribution of the two-stage schedule versus simply adding the two balancing terms simultaneously; a direct comparison is needed to confirm that the staged procedure is load-bearing for the claimed mitigation of collapse modes.
Authors: We accept that the existing ablations do not directly isolate the staging procedure. The revised manuscript will include a new ablation subsection and table that compares (i) the full two-stage Hi-MoE, (ii) a single-stage variant that optimizes both inter-group and intra-group losses jointly from the start, and (iii) the individual-term baselines. Results will quantify the additional benefit of staging in preventing collapse and improving expert utilization, thereby confirming the load-bearing role of the two-stage schedule.
revision: yes
- We cannot rerun the 58B-token 7B-model pre-training with multiple random seeds because of the prohibitive computational resources required.
Circularity Check
No circularity; empirical claims rest on external benchmarks
Full rationale
The paper proposes a hierarchical MoE architecture and two-stage optimization procedure, then reports empirical gains (perplexity, balance) against external baselines such as OLMoE-7B on 58B-token pre-training. No mathematical derivation chain, loss-function identities, or fitted-parameter predictions are exhibited that reduce to the paper's own inputs by construction. The 'principled explanation' is described at a high level without equations that could be self-referential. All load-bearing claims are falsifiable via the reported scaling studies and ablations, satisfying the criteria for non-circularity.
Axiom & Free-Parameter Ledger
free parameters (2)
- inter-group balancing strength
- intra-group diversity coefficient
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean (Jcost uniqueness, washburn_uniqueness_aczel) · reality_from_one_distinction · tag: unclear
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Cited passage (reconstructed): L = L_task + L_load + R_intra + R_inter, with min L_task + L_load s.t. C_sys ≤ ε_sys and C_ov ≤ ε_ov, where R_inter = λ_inter ‖eπ(x)‖²₂ and R_intra = −λ_intra ‖π(x)‖²₂.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.