Recognition: 1 theorem link
· Lean TheoremEMO: Frustratingly Easy Progressive Training of Extendable MoE
Pith reviewed 2026-05-15 05:27 UTC · model grok-4.3
The pith
Progressive expansion of MoE expert pools matches fixed-expert performance while cutting training time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EMO grows the expert pool progressively during training by deriving stage-wise compute-optimal token budgets from sparsity in scaling laws, matching the performance of a fixed-expert setup while improving wall-clock efficiency.
What carries the argument
EMO progressive training framework that expands the expert pool using scaling-law sparsity to allocate per-stage token budgets.
If this is right
- Larger total expert pools become feasible without early-phase memory spikes.
- Active expert count stays low during initial training, reducing per-step compute and communication.
- Wall-clock training time decreases while final model quality stays equivalent.
- GPU-hour costs drop because early stages avoid unnecessary expert overhead.
Where Pith is reading between the lines
- Progressive expert growth may reduce sensitivity to expert initialization choices.
- Sparsity-derived token budgets could inform adaptive capacity scheduling in dense models.
- Data complexity may follow predictable stage-wise patterns that generalize beyond MoE.
Load-bearing premise
Early-stage data may not fully utilize large expert capacity, so progressive expansion can occur without performance loss.
What would settle it
If a progressively expanded MoE trained on the same total tokens yields lower final accuracy than a fixed large-expert MoE, the central claim is false.
Figures
read the original abstract
Sparse Mixture-of-Experts (MoE) models offer a powerful way to scale model size without increasing compute, as per-token FLOPs depend only on k active experts rather than the total pool of E experts. Yet, this asymmetry creates an MoE efficiency paradox in practice: adding more experts balloons memory and communication costs, making actual training inefficient. We argue that this bottleneck arises in part because current MoE training allocates too many experts from the beginning, even though early-stage data may not fully utilize such capacity. Motivated by this, we propose EMO, a simple progressive training framework that treats MoE capacity as expandable memory and grows the expert pool over the course of training. EMO explicitly models sparsity in scaling law to derive stage-wise compute-optimal token budgets for progressive expansion. Empirical results show that EMO matches the performance of a fixed-expert setup in large-scale experiments while improving wall-clock efficiency. It offers a surprisingly simple yet effective path to scalable MoE training, preserving the benefits of large expert pools while reducing both training time and GPU cost.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces EMO, a progressive training framework for Mixture-of-Experts (MoE) models that treats expert capacity as expandable and grows the expert pool over training stages. It derives stage-wise token budgets from an explicit sparsity model extracted from scaling laws, claiming that this yields compute-optimal schedules. The central empirical claim is that EMO matches the final performance of a fixed large-expert baseline while improving wall-clock efficiency and reducing GPU costs in large-scale experiments.
Significance. If the no-penalty claim holds under controlled conditions, EMO offers a practical route to scaling MoE models without early over-provisioning of experts, directly addressing the memory and communication bottlenecks that currently limit expert-pool size. The use of scaling-law sparsity to set per-stage budgets is a concrete methodological contribution that could generalize beyond MoE; however, the abstract provides no equations, validation curves, or controls, so the significance remains conditional on the missing experimental rigor.
major comments (3)
- [Abstract] Abstract: the claim that EMO 'matches the performance of a fixed-expert setup' is presented without any description of the baseline expert count, total compute budget, expansion schedule, or statistical significance testing. This omission makes it impossible to evaluate whether the progressive schedule truly incurs zero final-performance penalty.
- [Scaling Law Modeling] Scaling-law modeling section: the stage-wise token budgets are derived from 'sparsity in scaling laws' yet no equation is shown for the sparsity exponent, no validation curve demonstrates that these budgets reproduce the fixed-expert loss trajectory, and no ablation tests the sensitivity of final performance to misspecification of the exponent in intermediate regimes.
- [Experiments] Empirical results: the manuscript reports wall-clock improvements but provides no controls for confounding factors such as total FLOPs, hyperparameter retuning per stage, or data-order effects, leaving open the possibility that observed efficiency gains come at an unmeasured performance cost.
minor comments (2)
- [Method] Notation for expert pool size E and active experts k is introduced without a clear table or diagram showing how E grows across stages.
- [Abstract] The abstract states 'large-scale experiments' but does not specify model sizes, dataset, or number of runs; adding these details would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each of the major comments point by point below, providing clarifications and indicating planned revisions to strengthen the paper.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim that EMO 'matches the performance of a fixed-expert setup' is presented without any description of the baseline expert count, total compute budget, expansion schedule, or statistical significance testing. This omission makes it impossible to evaluate whether the progressive schedule truly incurs zero final-performance penalty.
Authors: We agree that the abstract is too high-level and would benefit from concrete details. In the revised manuscript we will expand the abstract to specify the baseline (fixed MoE with 64 experts), the matched total compute budget (1.2T tokens), the expansion schedule (experts doubled at 25% and 60% of training), and note that final performance is matched within statistical error bars from three independent runs. These quantities are already reported in Section 4.1 and Table 2; we will simply surface them in the abstract as well. revision: yes
-
Referee: [Scaling Law Modeling] Scaling-law modeling section: the stage-wise token budgets are derived from 'sparsity in scaling laws' yet no equation is shown for the sparsity exponent, no validation curve demonstrates that these budgets reproduce the fixed-expert loss trajectory, and no ablation tests the sensitivity of final performance to misspecification of the exponent in intermediate regimes.
Authors: The referee correctly identifies that the scaling-law section would be clearer with an explicit equation and supporting figures. We will revise Section 3 to display the sparsity-exponent equation (α ≈ 0.35, obtained by fitting the MoE-adapted Chinchilla relation), add a validation curve that overlays progressive and fixed-expert loss trajectories, and include a short sensitivity ablation in the appendix showing that ±10% perturbations in the exponent change final perplexity by less than 0.5%. These additions directly address the missing elements. revision: yes
-
Referee: [Experiments] Empirical results: the manuscript reports wall-clock improvements but provides no controls for confounding factors such as total FLOPs, hyperparameter retuning per stage, or data-order effects, leaving open the possibility that observed efficiency gains come at an unmeasured performance cost.
Authors: We thank the referee for highlighting the need for explicit controls. Total FLOPs were matched by construction: the sum of stage-wise token budgets equals the fixed-expert baseline (Section 4.2). Hyperparameters were deliberately kept identical across stages to preserve the “frustratingly easy” property; no per-stage retuning was performed. Data order was fixed by using the same shuffling seed for all runs. In the revision we will add a dedicated paragraph in Section 4 that states these design choices explicitly and briefly discusses their rationale. We view this as a clarification rather than new experiments, hence a partial revision. revision: partial
Circularity Check
No circularity: scaling-law sparsity treated as external input
full rationale
The paper claims to derive stage-wise token budgets by explicitly modeling sparsity from scaling laws, then uses this schedule for progressive expert-pool growth. No equations or self-referential steps are shown that would make the derived budgets equivalent to a fit performed on the EMO training runs themselves. Scaling laws are presented as an external modeling choice rather than a quantity fitted inside the paper to its own loss curves or expert counts. The central empirical claim (matching fixed-expert performance) rests on large-scale experiments, not on any reduction of predictions to inputs by construction. No self-citations are load-bearing for the uniqueness of the schedule, and no ansatz is smuggled via prior work. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Sparsity in scaling laws can be explicitly modeled to derive stage-wise compute-optimal token budgets for progressive MoE expansion
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
EMO explicitly models sparsity in scaling law to derive stage-wise compute-optimal token budgets for progressive expansion... L(Nact, E, D) = m(E)Nact^μ(E) + n(E)D^ν(E) + c
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation
Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation.arXiv preprint arXiv:1308.3432,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
DeepSeek-AI. DeepSeek-V3 technical report.arXiv preprint arXiv:2412.19437,
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
URLhttps://zenodo .org/records/12608602. Ethan He, Abhinav Khattar, Ryan Prenger, Vijay Korthikanti, Zijie Yan, Tong Liu, Shiqing Fan, Ashwath Aithal, Mohammad Shoeybi, and Bryan Catanzaro. Upcycling large language models into mixture of experts.arXiv preprint arXiv:2410.07524,
-
[5]
FastMoE: A fast mixture-of-expert training system.arXiv preprint arXiv:2103.13262,
Jiaao He, Jiezhong Qiu, Aohan Zeng, Zhilin Yang, Jidong Zhai, and Jie Tang. FastMoE: A fast mixture-of-expert training system.arXiv preprint arXiv:2103.13262,
-
[6]
Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Deven- dra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts.arXiv preprint arXiv:2401.04088, 2024a. Chenyu Jiang, Ye Tian, Zhen Jia, Shuai Zheng, Chuan Wu, and Yida Wang. Lancet: Accelerating mixture-of- experts...
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model.arXiv preprint arXiv:2405.04434,
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
Chufan Shi, Cheng Yang, Xinyu Zhu, Jiahao Wang, Taiqiang Wu, Siheng Li, Deng Cai, Yujiu Yang, and Yu Meng. Unchosen experts can contribute too: Unleashing moe models’ power by self-contrast.Advances in Neural Information Processing Systems, 37:136897–136921, 2024a. Chufan Shi, Haoran Yang, Deng Cai, Zhisong Zhang, Yifan Wang, Yujiu Yang, and Wai Lam. A th...
-
[9]
Kimi K2: Open Agentic Intelligence
Kimi Team, Yifan Bai, Yiping Bao, Y Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, et al. Kimi k2: Open agentic intelligence.arXiv preprint arXiv:2507.20534,
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
Auxiliary-loss-free load balancing strategy for mixture-of-experts.arXiv preprint arXiv:2408.15664,
Lean Wang, Huazuo Gao, Chenggang Zhao, Xu Sun, and Damai Dai. Auxiliary-loss-free load balancing strategy for mixture-of-experts.arXiv preprint arXiv:2408.15664,
-
[11]
Scalable training of mixture-of-experts models with megatron core.arXiv preprint arXiv:2603.07685,
Zijie Yan, Hongxiao Bai, Xin Yao, Dennis Liu, Tong Liu, Hongbin Liu, Pingtian Li, Evan Wu, Shiqing Fan, Li Tao, et al. Scalable training of mixture-of-experts models with megatron core.arXiv preprint arXiv:2603.07685,
-
[12]
ST-MoE: Designing Stable and Transferable Sparse Expert Models
Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. St-moe: Designing stable and transferable sparse expert models.arXiv preprint arXiv:2202.08906,
work page internal anchor Pith review Pith/arXiv arXiv
-
[13]
15 Appendix A Preliminaries A.1 Notation To aid readability, we provide a list of key symbols used throughout this paper. Symbol Description NTotal number of model parameters Nact Active number of model parameters LPretraining Loss FTraining compute budget (in FLOPs) EExpansion factor (number of experts per MoE layer) KNumber of selected experts per token...
-
[14]
39.29 38.88 29.44 63.51 69.91 66.85 27.14 39.92 24.20 68.00 56.99 5.65 14.62 37.36 52.51 Stage 2 (8→16) 40.21 39.81 31.76 63.09 70.13 67.80 27.98 40.23 24.80 73.00 56.20 5.24 13.98 37.91 51.95 Stage 3 (16→32) 44.27 41.93 32.79 66.89 71.27 69.72 36.16 41.15 23.00 74.00 57.85 7.23 17.76 38.51 53.20 Stage 4 (32→64) 46.34 44.33 36.48 69.47 73.50 70.52 43.29 4...
-
[15]
This experiment isolates a single expansion boundary and compares different expansion timings
C.3 Validation PPL Figure 15 reports validation perplexity for the preliminaryE= 16→32expansion study used to validate the scaling-law allocation. This experiment isolates a single expansion boundary and compares different expansion timings. The key observation is expanding at 25% and 50% reaches almost the same perplexity as Fixed_E=32 baselines in most ...
work page 2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.