
arxiv: 2605.23893 · v1 · pith:GSIZEKHAnew · submitted 2026-05-22 · 💻 cs.LG

Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models

Pith reviewed 2026-05-25 04:31 UTC · model grok-4.3

classification 💻 cs.LG
keywords hyperparameter transfer · mixture of experts · scaling · transformer · muP · SDE approximation · pretraining

The pith

Hyperparameters tuned on one dense model transfer near-optimally to any Mixture-of-Experts configuration via the Complete-muE rule.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to derive a single transfer rule that moves optimal learning rate, weight decay, and related settings from a dense feed-forward network to any dense or sparse MoE transformer block. It constructs the rule with two explicit bridges: one that rescales active width and router gain, and a second that rescales the number of activated experts while showing that first-order SDE corrections cancel. A reader would care because the rule removes the need to re-search hyperparameters whenever expert count, granularity, or total capacity changes, so that a single dense tuning run can serve every MoE variant examined.

Core claim

Complete-muE supplies an explicit transfer rule built from Bridge I, which maps dense FFN to dense MoE by active-width μP plus normalized router scale, and Bridge II, which maps dense MoE to sparse MoE by activated-expert scaling; in the second bridge the leading SDE terms for learning rate and weight decay cancel, leaving only a bounded residual σ₀ shift. The resulting rule also accommodates changes in total capacity, granularity, shared or group-balanced hybrids, network width and depth, batch size, and training duration. Language-model and diffusion-model pretraining runs show that the residual drift remains small enough for hyperparameters tuned on a single dense reference to stay near a

What carries the argument

The two-bridge Complete-muE transfer rule, where Bridge I uses active-width μP with normalized router scale and Bridge II uses activated-expert scaling that cancels first-order SDE corrections.

If this is right

  • A single dense-model hyperparameter search suffices for every tested MoE architecture and capacity.
  • MoE models can increase total capacity while retaining accelerated convergence without additional hyperparameter search cost.
  • The same rule covers simultaneous changes in activated-expert count, total capacity, granularity, and hybrid routing.
  • The transfer extends to width, depth, batch-size, and duration variations in ordinary transformers.
  • Minor residual drift appears but remains small enough that retuning is unnecessary for near-optimal performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the residual shift remains bounded at scales beyond those tested, the method could support zero-shot transfer to models orders of magnitude larger.
  • The same bridging logic might apply to other forms of conditional computation beyond standard MoE.
  • Practitioners could treat the dense reference run as a fixed calibration step before exploring any sparse variant.

Load-bearing premise

The bounded residual σ₀ shift left by Bridge II stays small enough in practice that the transferred hyperparameters remain near-optimal across MoE configurations.

What would settle it

An experiment that records a substantial shift in the optimal learning rate or weight decay when the number of activated experts is changed while holding all other factors fixed.
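A minimal sketch of how such a check could be scored: sweep the learning rate at each activated-expert count with everything else held fixed, take the loss-minimizing learning rate, and report its shift relative to a reference configuration in powers of two. The function names are hypothetical, and the loss values are made-up placeholders for illustration only, not results from the paper.

```python
import math


def optimal_lr(sweep: dict[float, float]) -> float:
    """Return the learning rate with the lowest loss in a {lr: loss} sweep."""
    return min(sweep, key=sweep.get)


def lr_shift_octaves(reference: dict[float, float], variant: dict[float, float]) -> float:
    """Shift of the optimal LR relative to the reference, in factors of two.

    0.0 means the optimum did not move; +1.0 means it doubled.
    """
    return math.log2(optimal_lr(variant) / optimal_lr(reference))


# PLACEHOLDER sweeps (synthetic numbers, NOT from the paper).
# Keys are configuration labels; each maps learning rate -> final loss.
sweeps = {
    "reference": {5e-4: 1.30, 1e-3: 1.20, 2e-3: 1.25},
    "variant_stable": {5e-4: 1.15, 1e-3: 1.10, 2e-3: 1.18},
    "variant_shifted": {5e-4: 1.22, 1e-3: 1.16, 2e-3: 1.12},
}

for name, sweep in sweeps.items():
    shift = lr_shift_octaves(sweeps["reference"], sweep)
    print(f"{name}: optimal LR = {optimal_lr(sweep):g}, shift = {shift:+.2f} octaves")
```

A shift that stays near zero as the activated-expert count varies would be consistent with the transfer claim; a shift that grows steadily with the count would be the counter-evidence described above. The same scoring applies to weight decay by sweeping it in place of the learning rate.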

Figures

Figures reproduced from arXiv: 2605.23893 by Hongwu Peng, Jianming Zhang, Ohiremen Dibua, Yan Kang, Yifan Gong, Yuanjun Xiong.

Figure 1: Complete-muE transfers across dense FFN, Dense MoE, and sparse MoE.
Figure 2: Complete-muE transfer across activated experts for both language-model (LM) and …
Figure 3: Transfer across MoE architectural variants.
Figure 4: Transfer across FFN/MoE FFN width expansion.
Figure 5: Transfer across batch size.
[Plot text spilled after this caption: two panels titled "Activated Expert Scaling (64 Total Experts)" for MoE FFN_1. Panel (a) LM, activated experts: settings 64e2a to 64e64a at LR=1e-3, y-axis val loss. Panel (b) DF, activated experts: settings 64e2a to 64e64a at LR=1.6e-3, y-axis train loss. A further panel with settings 4e4a, 8e4a, 16e4a, 32e4… is truncated.]
Figure 6: Fixed-hyperparameter loss scaling across MoE axes. We fix AdamW hyperparameters at the …
Figure 7: SwiGLU MoE latency benchmark on a single …
Figure 8: Complete-muE for larger-scale MoE training for multi-modal generation and LLMs.
Figure 9: Caption 1: A cinematic head-and-shoulders video of a Black man in casual attire against a neutral seamless backdrop, softly lit with shallow focus and centered rule-of-thirds composition. The camera performs a handheld dolly-in and zoom-in toward his face as he bursts into a broad, joyful smile with visible teeth, creating a warm and expressive portrait
Original abstract

We propose Complete-muE, a framework for hyperparameter transfer between a dense FFN and any Mixture-of-Experts (MoE) setup in transformer blocks. Existing tools such as μP (which requires a fixed architecture) and SDE (which requires a fixed per-step token count) cannot directly solve the hyperparameter transfer problem for MoE, because dense-to-MoE transfer and total-expert scaling change both the architecture and the tokens per expert. Complete-muE addresses this with a two-bridge system. Bridge I maps between dense FFN and Dense MoE by active-width μP with a normalized router scale. Bridge II maps between Dense MoE and sparse MoE by activated-expert scaling, where the first-order SDE LR/WD correction cancels while a bounded residual σ₀ shift remains. The resulting transfer rule, which we term Complete-muE, covers changes in activated experts, total capacity, granularity, and shared/group-balanced hybrids for MoE models, as well as network width/depth, batch size, and duration changes for general Transformer models. Extensive language-model and diffusion-model pretraining experiments confirm that Complete-muE yields relatively stable hyperparameter optima across model architectures and parameter counts, with only minor drift consistent with the non-strict SDE behavior of Bridge II. In practice this drift is small enough that hyperparameters tuned on a single dense reference transfer near-optimally to all MoE configurations: "tune dense once, transfer to all" is the practical recipe at the core of Complete-muE. This lets MoE models converge faster than dense models when scaling model capacity, without costly hyperparameter search.
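To make the "tokens per expert" point concrete, here is a small sketch. It assumes perfectly balanced top-k routing, so each expert sees an equal share of the routed tokens, and it reads the plot shorthand `64e2a` as 64 total experts with 2 activated; that reading, the helper names, and the batch size are illustrative assumptions rather than details confirmed by the paper.

```python
import re


def parse_moe_label(label: str) -> tuple[int, int]:
    """Parse a label such as '64e2a' into (total_experts, activated_experts).

    The '<N>e<a>a' reading is an assumption about the plot shorthand.
    """
    match = re.fullmatch(r"(\d+)e(\d+)a", label)
    if match is None:
        raise ValueError(f"unrecognized MoE label: {label!r}")
    return int(match.group(1)), int(match.group(2))


def tokens_per_expert(batch_tokens: int, total_experts: int, activated: int) -> float:
    """Average tokens one expert processes per step under balanced routing."""
    if not 1 <= activated <= total_experts:
        raise ValueError("activated must be between 1 and total_experts")
    # Each token is dispatched to `activated` experts, spread evenly over all experts.
    return batch_tokens * activated / total_experts


BATCH_TOKENS = 1_048_576  # hypothetical per-step token count

for label in ["64e2a", "64e8a", "64e64a", "4e4a", "32e4a"]:
    n_total, n_active = parse_moe_label(label)
    share = tokens_per_expert(BATCH_TOKENS, n_total, n_active)
    print(f"{label}: {share:,.0f} tokens per expert per step")
```

With the global batch held fixed, changing either the activated count or the total count changes how many tokens each expert sees per step, which is the quantity a fixed-token-count SDE argument holds constant.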

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Complete-muE, a hyperparameter transfer framework for dense FFN to any MoE configuration in transformers. Bridge I maps dense FFN to dense MoE via active-width μP with normalized router scale. Bridge II maps dense MoE to sparse MoE via activated-expert scaling, asserting that first-order SDE corrections to LR and WD cancel exactly while only a bounded residual σ₀ shift remains. The resulting rule is claimed to cover changes in activated experts, total capacity, granularity, shared/group-balanced hybrids, and general transformer scaling (width/depth/batch/duration). Experiments on language-model and diffusion-model pretraining are presented as confirming stable optima with only minor drift, enabling the practical recipe of tuning on a single dense reference and transferring near-optimally to all MoE setups.

Significance. If the transfer rule and the smallness of the residual drift hold across the claimed range of MoE configurations, the work would materially reduce the hyperparameter-search cost when scaling MoE capacity, allowing practitioners to tune once on a dense model and deploy across multiple expert counts and granularities. The extension to both language and diffusion pretraining would further increase its practical reach.

major comments (2)
  1. [Bridge II] Bridge II (abstract and § on Bridge II): the central claim that 'the first-order SDE LR/WD correction cancels while a bounded residual σ₀ shift remains' is stated without an explicit derivation, equation, or proof showing why the leading terms cancel exactly rather than approximately. No scaling bound is supplied for the size of the residual as a function of activated-expert count, total capacity, or granularity; this assumption is load-bearing for the 'tune dense once, transfer to all' rule.
  2. [Experimental validation] § on experimental results: the text asserts that 'hyperparameters tuned on a single dense reference transfer near-optimally' and that drift is 'small enough in practice,' yet the magnitude of the observed drift is not quantified against the number of activated experts or granularity changes; without such quantification the claim that the residual remains negligible for the targeted MoE regimes cannot be evaluated.
minor comments (2)
  1. The term 'non-strict SDE behavior' is used without a definition or reference to the precise deviation from the SDE assumptions.
  2. The symbol σ₀ is introduced in the abstract without an accompanying definition or equation in the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The two major comments identify opportunities to strengthen the rigor and clarity of the Bridge II derivation and the experimental quantification of drift. We will revise the manuscript to address both points directly.

Point-by-point responses
  1. Referee: [Bridge II] Bridge II (abstract and § on Bridge II): the central claim that 'the first-order SDE LR/WD correction cancels while a bounded residual σ₀ shift remains' is stated without an explicit derivation, equation, or proof showing why the leading terms cancel exactly rather than approximately. No scaling bound is supplied for the size of the residual as a function of activated-expert count, total capacity, or granularity; this assumption is load-bearing for the 'tune dense once, transfer to all' rule.

    Authors: We agree that an explicit derivation and bound would make the central claim more transparent. In the revised manuscript we will add a dedicated subsection under Bridge II that (i) derives the exact cancellation of the leading first-order SDE corrections to learning rate and weight decay when the number of activated experts changes while total capacity is held fixed, (ii) isolates the residual σ₀ term, and (iii) supplies a scaling bound on |σ₀| in terms of activated-expert count, granularity, and router temperature. The derivation will be placed immediately before the statement of the transfer rule so that readers can evaluate the exact-versus-approximate distinction. revision: yes

  2. Referee: [Experimental validation] § on experimental results: the text asserts that 'hyperparameters tuned on a single dense reference transfer near-optimally' and that drift is 'small enough in practice,' yet the magnitude of the observed drift is not quantified against the number of activated experts or granularity changes; without such quantification the claim that the residual remains negligible for the targeted MoE regimes cannot be evaluated.

    Authors: We accept that the current experimental section does not provide the requested quantitative mapping. In the revision we will augment the results with (a) a table reporting the measured optimal LR and WD shifts for each MoE configuration relative to the dense reference, (b) plots of these shifts versus number of activated experts (2, 4, 8, 16) and versus granularity (token, expert, layer), and (c) explicit comparison of observed drift magnitude against the theoretical residual bound derived in the new Bridge II subsection. All experiments already performed will be re-analyzed to produce these metrics; no new runs are required. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external frameworks plus empirical validation

full rationale

The paper constructs Complete-muE from two bridges: Bridge I applies active-width μP (an external method) with normalized router scale, while Bridge II invokes an SDE-based claim that first-order LR/WD corrections cancel leaving a bounded residual σ₀ shift. This residual is not derived as a function of expert count or granularity but is asserted to remain small and then confirmed by language-model and diffusion pretraining experiments showing only minor drift. No equation reduces the final transfer rule to a fitted parameter renamed as prediction, no self-citation chain is load-bearing, and no ansatz is smuggled via prior author work. The central claim is therefore supported by independent experimental evidence rather than tautological reduction to its inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The framework rests on the stated limitations of μP and SDE plus two new mapping rules whose details are not visible in the abstract; no explicit free parameters or invented entities are named beyond the normalized router scale and σ₀ shift.

free parameters (2)
  • normalized router scale
    Introduced in Bridge I to map dense FFN to Dense MoE.
  • bounded residual σ₀ shift
    Remains after first-order SDE correction in Bridge II.
axioms (2)
  • domain assumption: μP requires fixed architecture
    Cited as reason existing tools cannot handle MoE.
  • domain assumption: SDE requires fixed per-step token count
    Cited as reason existing tools cannot handle MoE.



Appendix excerpts

Table 5: Notation used in the method and appendix.

  Symbol            Meaning
  ρ_d, ρ_H          residual/model-width ratio d/d⋆ and active-width ratio H/d
  ρ_L, ρ_B, ρ_D     depth, global batch, and global token-budget ratios relative to the reference model
  ρ_{H_a}, ρ_N      active-width ratio induced by activated experts, ρ_{H_a} = ah/d, and total-exp...

Dense-width step. Transfer the auxiliary Dense MoE from total width $Nh$ to $N'h$ via Bridge I. The output multiplier and initialization become $c^{(1)} = \frac{d}{N'h}$, $\sigma^{(1)}_{\mathrm{down}} = \sqrt{\frac{N'h}{d}}\,\sigma^{(1)}_{\mathrm{down},\star}$, and the auxiliary route scale becomes $r_{N'} = N'$.

Reverse-sparsity step. At fixed total experts $N'$, reduce the activated count from $N'$ back to $a$ via Bridge II. The active-width rule at $H_a = ah$ gives $c^{(2)} = \frac{d}{ah}$, $\sigma^{(2)}_{\mathrm{down}} = \sqrt{\frac{ah}{d}}\,\sigma^{(1)}_{\mathrm{down},\star}$, and restores the route scale $r_a = a$. The extra dense-width factor from Step 1 is exactly reversed by Step 2. Because the sparsity rule exactly reverses the extra act...
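The two steps above can be sketched numerically. This is an illustrative reading under stated assumptions: the initialization factors are taken to be square roots of the width ratios (sqrt(N'h/d) and sqrt(ah/d)), `h` is taken to be the per-expert hidden width, and `sigma_ref` stands in for the reference down-projection initialization. The class and function names are hypothetical and do not come from the paper.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DownProjScaling:
    output_multiplier: float  # c in the excerpt
    init_std: float           # sigma_down in the excerpt
    route_scale: float        # r in the excerpt


def active_width_rule(d: int, h: int, n_active: int, sigma_ref: float) -> DownProjScaling:
    """Apply the active-width rule at active width H_a = n_active * h.

    ASSUMPTION: init_std scales with sqrt(H_a / d), as read from the excerpt.
    """
    active_width = n_active * h
    return DownProjScaling(
        output_multiplier=d / active_width,
        init_std=math.sqrt(active_width / d) * sigma_ref,
        route_scale=float(n_active),
    )


def dense_then_sparse(d: int, h: int, n_total: int, n_active: int, sigma_ref: float):
    """Step 1: widen to a Dense MoE with all n_total experts active.
    Step 2: reduce the activated count back to n_active.
    """
    step1 = active_width_rule(d, h, n_total, sigma_ref)
    step2 = active_width_rule(d, h, n_active, sigma_ref)
    return step1, step2


# Hypothetical sizes: model width 1024, expert width 256, 4 activated experts.
for n_total in (16, 64):
    step1, step2 = dense_then_sparse(d=1024, h=256, n_total=n_total, n_active=4, sigma_ref=0.02)
    print(f"N'={n_total}: step 1 -> {step1}")
    print(f"N'={n_total}: step 2 -> {step2}")
```

In this sketch the Step 2 values depend only on `d`, `h`, and the activated count, not on the intermediate total `N'`, which matches the statement that Step 2 reverses the extra dense-width factor introduced in Step 1.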