Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models
Pith reviewed 2026-05-25 04:31 UTC · model grok-4.3
The pith
Hyperparameters tuned on one dense model transfer near-optimally to any Mixture-of-Experts configuration via the Complete-muE rule.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Complete-muE supplies an explicit transfer rule built from Bridge I, which maps dense FFN to dense MoE by active-width μP plus normalized router scale, and Bridge II, which maps dense MoE to sparse MoE by activated-expert scaling; in the second bridge the leading SDE terms for learning rate and weight decay cancel, leaving only a bounded residual σ₀ shift. The resulting rule also accommodates changes in total capacity, granularity, shared or group-balanced hybrids, network width and depth, batch size, and training duration. Language-model and diffusion-model pretraining runs show that the residual drift remains small enough for hyperparameters tuned on a single dense reference to stay near a
What carries the argument
The two-bridge Complete-muE transfer rule, where Bridge I uses active-width μP with normalized router scale and Bridge II uses activated-expert scaling that cancels first-order SDE corrections.
If this is right
- A single dense-model hyperparameter search suffices for every tested MoE architecture and capacity.
- MoE models can increase total capacity while retaining accelerated convergence without additional hyperparameter search cost.
- The same rule covers simultaneous changes in activated-expert count, total capacity, granularity, and hybrid routing.
- The transfer extends to width, depth, batch-size, and duration variations in ordinary transformers.
- Minor residual drift appears but remains small enough that retuning is unnecessary for near-optimal performance.
Where Pith is reading between the lines
- If the residual shift remains bounded at scales beyond those tested, the method could support zero-shot transfer to models orders of magnitude larger.
- The same bridging logic might apply to other forms of conditional computation beyond standard MoE.
- Practitioners could treat the dense reference run as a fixed calibration step before exploring any sparse variant.
Load-bearing premise
The bounded residual σ₀ shift left by Bridge II stays small enough in practice that the transferred hyperparameters remain near-optimal across MoE configurations.
What would settle it
An experiment that records a substantial shift in the optimal learning rate or weight decay when the number of activated experts is changed while holding all other factors fixed.
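One way such an experiment could be scored is sketched below. It assumes each configuration has a learning-rate sweep of (learning rate, final loss) pairs with distinct learning rates, and reports how far the optimum moves relative to the dense reference in factors of two. The function names `optimal_lr` and `lr_drift_octaves` and the parabola refinement in log2 space are illustrative choices, not something taken from the paper.

```python
import math


def optimal_lr(sweep):
    """Estimate the loss-minimizing learning rate from (lr, final_loss) pairs.

    Picks the best grid point, then refines it with the vertex of a parabola
    through that point and its two neighbours in log2(lr) space. Falls back
    to the grid point at the edge of the sweep or when the fit is not convex.
    Assumes all learning rates in the sweep are distinct.
    """
    pts = sorted((math.log2(lr), loss) for lr, loss in sweep)
    i = min(range(len(pts)), key=lambda k: pts[k][1])
    if i == 0 or i == len(pts) - 1:
        return 2.0 ** pts[i][0]
    (x0, y0), (x1, y1), (x2, y2) = pts[i - 1], pts[i], pts[i + 1]
    # Coefficients of y = a*x^2 + b*x + c through the three points.
    denom = (x0 - x1) * (x0 - x2) * (x1 - x2)
    a = (x2 * (y1 - y0) + x1 * (y0 - y2) + x0 * (y2 - y1)) / denom
    b = (x2**2 * (y0 - y1) + x1**2 * (y2 - y0) + x0**2 * (y1 - y2)) / denom
    if a <= 0:
        return 2.0 ** x1
    return 2.0 ** (-b / (2.0 * a))


def lr_drift_octaves(reference_sweep, variant_sweep):
    """Shift of the optimal LR, in factors of two, relative to the reference."""
    return math.log2(optimal_lr(variant_sweep) / optimal_lr(reference_sweep))
```

Calling `lr_drift_octaves(dense_sweep, moe_sweep)` for each activated-expert count, with everything else held fixed, gives one number per configuration. A value near zero is what the load-bearing premise predicts; a value that grows steadily with the activated-expert count would count against it.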
Original abstract
We propose Complete-muE, a framework for hyperparameter transfer between dense FFN and arbitrary Mixture-of-Experts (MoE) setups in transformer blocks. Existing tools such as μP (which requires a fixed architecture) and SDE scaling rules (which require a fixed per-step token count) cannot directly solve the hyperparameter transfer problem for MoE, because dense-to-MoE transfer and scaling the total number of experts change both the architecture and the tokens per expert. Complete-muE addresses this with a two-bridge system. Bridge I maps between dense FFN and dense MoE by active-width μP with a normalized router scale. Bridge II maps between dense MoE and sparse MoE by activated-expert scaling, where the first-order SDE LR/WD correction cancels while a bounded residual σ₀ shift remains. The resulting transfer rule, which we term Complete-muE, covers changes in activated experts, total capacity, granularity, and shared/group-balanced hybrids for MoE models, as well as network width/depth, batch size, and duration changes for general Transformer models. Extensive language-model and diffusion-model pretraining experiments confirm that Complete-muE yields relatively stable hyperparameter optima across model architectures and parameter counts, with only minor drift consistent with the non-strict SDE behavior of Bridge II. In practice this drift is small enough that hyperparameters tuned on a single dense reference transfer near-optimally to all MoE configurations: "tune dense once, transfer to all" is the practical recipe at the core of Complete-muE. This enables MoE models to converge faster than dense models when scaling model capacity, without costly hyperparameter search.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Complete-muE, a hyperparameter transfer framework for dense FFN to any MoE configuration in transformers. Bridge I maps dense FFN to dense MoE via active-width μP with normalized router scale. Bridge II maps dense MoE to sparse MoE via activated-expert scaling, asserting that first-order SDE corrections to LR and WD cancel exactly while only a bounded residual σ₀ shift remains. The resulting rule is claimed to cover changes in activated experts, total capacity, granularity, shared/group-balanced hybrids, and general transformer scaling (width/depth/batch/duration). Experiments on language-model and diffusion-model pretraining are presented as confirming stable optima with only minor drift, enabling the practical recipe of tuning on a single dense reference and transferring near-optimally to all MoE setups.
Significance. If the transfer rule and the smallness of the residual drift hold across the claimed range of MoE configurations, the work would materially reduce the hyperparameter-search cost when scaling MoE capacity, allowing practitioners to tune once on a dense model and deploy across multiple expert counts and granularities. The extension to both language and diffusion pretraining would further increase its practical reach.
major comments (2)
- [Bridge II] Bridge II (abstract and § on Bridge II): the central claim that 'the first-order SDE LR/WD correction cancels while a bounded residual σ₀ shift remains' is stated without an explicit derivation, equation, or proof showing why the leading terms cancel exactly rather than approximately. No scaling bound is supplied for the size of the residual as a function of activated-expert count, total capacity, or granularity; this assumption is load-bearing for the 'tune dense once, transfer to all' rule.
- [Experimental validation] § on experimental results: the text asserts that 'hyperparameters tuned on a single dense reference transfer near-optimally' and that drift is 'small enough in practice,' yet the magnitude of the observed drift is not quantified against the number of activated experts or granularity changes; without such quantification the claim that the residual remains negligible for the targeted MoE regimes cannot be evaluated.
minor comments (2)
- The term 'non-strict SDE behavior' is used without a definition or reference to the precise deviation from the SDE assumptions.
- The symbol σ₀ is introduced in the abstract without an accompanying definition or equation in the main text.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive report. The two major comments identify opportunities to strengthen the rigor and clarity of the Bridge II derivation and the experimental quantification of drift. We will revise the manuscript to address both points directly.
Referee: [Bridge II] Bridge II (abstract and § on Bridge II): the central claim that 'the first-order SDE LR/WD correction cancels while a bounded residual σ₀ shift remains' is stated without an explicit derivation, equation, or proof showing why the leading terms cancel exactly rather than approximately. No scaling bound is supplied for the size of the residual as a function of activated-expert count, total capacity, or granularity; this assumption is load-bearing for the 'tune dense once, transfer to all' rule.
Authors: We agree that an explicit derivation and bound would make the central claim more transparent. In the revised manuscript we will add a dedicated subsection under Bridge II that (i) derives the exact cancellation of the leading first-order SDE corrections to learning rate and weight decay when the number of activated experts changes while total capacity is held fixed, (ii) isolates the residual σ₀ term, and (iii) supplies a scaling bound on |σ₀| in terms of activated-expert count, granularity, and router temperature. The derivation will be placed immediately before the statement of the transfer rule so that readers can evaluate the exact-versus-approximate distinction.
Revision: yes
Referee: [Experimental validation] § on experimental results: the text asserts that 'hyperparameters tuned on a single dense reference transfer near-optimally' and that drift is 'small enough in practice,' yet the magnitude of the observed drift is not quantified against the number of activated experts or granularity changes; without such quantification the claim that the residual remains negligible for the targeted MoE regimes cannot be evaluated.
Authors: We accept that the current experimental section does not provide the requested quantitative mapping. In the revision we will augment the results with (a) a table reporting the measured optimal LR and WD shifts for each MoE configuration relative to the dense reference, (b) plots of these shifts versus number of activated experts (2, 4, 8, 16) and versus granularity (token, expert, layer), and (c) an explicit comparison of observed drift magnitude against the theoretical residual bound derived in the new Bridge II subsection. All experiments already performed will be re-analyzed to produce these metrics; no new runs are required.
Revision: yes
Circularity Check
No significant circularity; derivation relies on external frameworks plus empirical validation
The paper constructs Complete-muE from two bridges: Bridge I applies active-width μP (an external method) with normalized router scale, while Bridge II invokes an SDE-based claim that first-order LR/WD corrections cancel leaving a bounded residual σ₀ shift. This residual is not derived as a function of expert count or granularity but is asserted to remain small and then confirmed by language-model and diffusion pretraining experiments showing only minor drift. No equation reduces the final transfer rule to a fitted parameter renamed as prediction, no self-citation chain is load-bearing, and no ansatz is smuggled via prior author work. The central claim is therefore supported by independent experimental evidence rather than tautological reduction to its inputs.
Axiom & Free-Parameter Ledger
free parameters (2)
- normalized router scale
- bounded residual σ₀ shift
axioms (2)
- domain assumption: μP requires a fixed architecture
- domain assumption: SDE requires a fixed per-step token count
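The second assumption is the one MoE scaling breaks, and a small calculation shows why. The sketch below assumes perfectly balanced top-a routing, so that each of N experts receives an equal share of the routed tokens. The function name `tokens_per_expert` and the token counts in the usage note are hypothetical and are not taken from the paper.

```python
def tokens_per_expert(tokens_per_step, num_experts, num_activated):
    """Average tokens one expert processes per step.

    Assumes perfectly balanced routing: every token is sent to
    `num_activated` of the `num_experts` experts, spread evenly.
    """
    if not 1 <= num_activated <= num_experts:
        raise ValueError("need 1 <= num_activated <= num_experts")
    return tokens_per_step * num_activated / num_experts
```

With an illustrative global batch of 1,048,576 tokens, a dense MoE with 8 of 8 experts active gives every expert the full batch, activating 2 of 8 gives each expert 262,144 tokens, and growing the pool to 64 experts at 2 active gives 32,768. The per-expert token count therefore moves with both sparsity and total capacity even when the global batch is held fixed, which is the gap Bridge II is meant to close.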
Table 5: Notation used in the method and appendix.
Symbol: Meaning
- $\rho_d$, $\rho_H$: residual/model-width ratio $d/d_\star$ and active-width ratio $H/d$.
- $\rho_L$, $\rho_B$, $\rho_D$: depth, global batch, and global token-budget ratios relative to the reference model.
- $\rho_{H_a}$, $\rho_N$: active-width ratio induced by activated experts, $\rho_{H_a} = ah/d$, and total-exp...
1. Dense-width step. Transfer the auxiliary Dense MoE from total width $Nh$ to $N'h$ via Bridge I. The output multiplier and initialization become $c^{(1)} = \frac{d}{N'h}$ and $\sigma^{(1)}_{\mathrm{down}} = \sqrt{\frac{N'h}{d}}\,\sigma^{(1)}_{\mathrm{down},\star}$, and the auxiliary route scale becomes $r_{N'} = N'$.
2. Reverse-sparsity step. At fixed total experts $N'$, reduce the activated count from $N'$ back to $a$ via Bridge II. The active-width rule at $H_a = ah$ gives $c^{(2)} = \frac{d}{ah}$ and $\sigma^{(2)}_{\mathrm{down}} = \sqrt{\frac{ah}{d}}\,\sigma^{(1)}_{\mathrm{down},\star}$, and restores the route scale $r_a = a$. The extra dense-width factor from Step 1 is exactly reversed by Step 2. Because the sparsity rule exactly reverses the extra act...
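The two steps can be sketched numerically. The code below is one reading of the extracted formulas and assumes that the stray "r" before each width ratio is a square-root sign, so that the output multiplier is d divided by the active width and the down-projection initialization is the reference value scaled by the square root of active width over d. The class and function names are hypothetical, and the sketch should not be read as the authors' implementation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DownProjSettings:
    output_multiplier: float  # c in the text
    init_std: float           # sigma_down in the text
    route_scale: float        # r in the text


def active_width_rule(d, active_width, route_scale, sigma_ref):
    """Settings at a given active width (assumed reading of the garbled text)."""
    return DownProjSettings(
        output_multiplier=d / active_width,
        init_std=math.sqrt(active_width / d) * sigma_ref,
        route_scale=float(route_scale),
    )


def dense_width_step(d, h, n_total, sigma_ref):
    """Step 1: auxiliary dense MoE with all n_total experts active."""
    return active_width_rule(d, n_total * h, n_total, sigma_ref)


def reverse_sparsity_step(d, h, n_activated, sigma_ref):
    """Step 2: drop back to n_activated experts at fixed total capacity."""
    return active_width_rule(d, n_activated * h, n_activated, sigma_ref)
```

Under this reading, the Step 2 settings depend only on the active width ah and not on the total expert count N', which matches the statement that the extra dense-width factor from Step 1 is exactly reversed by Step 2.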