Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models

Hongwu Peng; Jianming Zhang; Ohiremen Dibua; Yan Kang; Yifan Gong; Yuanjun Xiong

arxiv: 2605.23893 · v1 · pith:GSIZEKHAnew · submitted 2026-05-22 · 💻 cs.LG


Pith reviewed 2026-05-25 04:31 UTC · model grok-4.3


keywords: hyperparameter transfer, mixture of experts, scaling, transformer, muP, SDE approximation, pretraining


The pith

Hyperparameters tuned on one dense model transfer near-optimally to any Mixture-of-Experts configuration via the Complete-muE rule.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to derive a single transfer rule that moves optimal learning rate, weight decay, and related settings from a dense feed-forward network to any dense or sparse MoE transformer block. It constructs the rule with two explicit bridges: one that rescales active width and router gain, and a second that rescales the number of activated experts while showing that first-order SDE corrections cancel. A reader would care because the rule removes the need to re-search hyperparameters whenever expert count, granularity, or total capacity changes, so that a single dense tuning run can serve every MoE variant examined.

Core claim

Complete-muE supplies an explicit transfer rule built from two bridges. Bridge I maps dense FFN to dense MoE by active-width μP plus a normalized router scale. Bridge II maps dense MoE to sparse MoE by activated-expert scaling; in this second bridge the leading SDE terms for learning rate and weight decay cancel, leaving only a bounded residual σ₀ shift. The resulting rule also accommodates changes in total capacity, granularity, shared or group-balanced hybrids, network width and depth, batch size, and training duration. Language-model and diffusion-model pretraining runs show that the residual drift remains small enough for hyperparameters tuned on a single dense reference to stay near-optimal across the MoE configurations tested.

What carries the argument

The two-bridge Complete-muE transfer rule, where Bridge I uses active-width μP with normalized router scale and Bridge II uses activated-expert scaling that cancels first-order SDE corrections.
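
As a rough illustration of what active-width scaling in Bridge I could look like, the sketch below applies the standard μP convention for Adam, in which the learning rate of a hidden matrix scales inversely with its fan-in. The exact Complete-muE rule is not reproduced on this page, so everything here is an assumption: the reading of "active width" as activated experts times per-expert hidden width, the function names `active_width` and `mup_hidden_lr`, and all numeric values. Router-scale normalization and the Bridge II step are not modeled.

```python
# Illustrative sketch only: the exact Complete-muE rule is not given on this page.
# Assumption: "active width" = activated experts * per-expert hidden width,
# and the Adam LR of a hidden matrix scales as 1/fan_in (standard muP convention).


def active_width(activated_experts: int, expert_hidden: int) -> int:
    """Hidden width a token passes through in one MoE layer (assumed definition)."""
    return activated_experts * expert_hidden


def mup_hidden_lr(base_lr: float, base_fan_in: int, fan_in: int) -> float:
    """muP-style Adam learning rate for a hidden matrix with the given fan-in."""
    return base_lr * base_fan_in / fan_in


# Hypothetical dense reference: FFN hidden width 2048, down-projection LR 1e-3.
dense_hidden = 2048
dense_lr = 1e-3

# Hypothetical MoE variants: (activated experts, per-expert hidden width).
variants = {"4 x 512": (4, 512), "8 x 512": (8, 512), "2 x 512": (2, 512)}

for name, (k, d_expert) in variants.items():
    width = active_width(k, d_expert)
    lr = mup_hidden_lr(dense_lr, dense_hidden, width)
    print(f"{name}: active width {width}, down-projection LR {lr:.2e}")
```

With these assumed numbers, the 4 x 512 variant matches the dense active width and keeps the reference learning rate, while doubling the active width halves it.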

If this is right

A single dense-model hyperparameter search suffices for every tested MoE architecture and capacity.
MoE models can increase total capacity while retaining accelerated convergence without additional hyperparameter search cost.
The same rule covers simultaneous changes in activated-expert count, total capacity, granularity, and hybrid routing.
The transfer extends to width, depth, batch-size, and duration variations in ordinary transformers.
Minor residual drift appears but remains small enough that retuning is unnecessary for near-optimal performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the residual shift remains bounded at scales beyond those tested, the method could support zero-shot transfer to models orders of magnitude larger.
The same bridging logic might apply to other forms of conditional computation beyond standard MoE.
Practitioners could treat the dense reference run as a fixed calibration step before exploring any sparse variant.

Load-bearing premise

The bounded residual σ₀ shift left by Bridge II stays small enough in practice that the transferred hyperparameters remain near-optimal across MoE configurations.

What would settle it

An experiment that records a substantial shift in the optimal learning rate or weight decay when the number of activated experts is changed while holding all other factors fixed.
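
One way to run such a test is to sweep the learning rate at several activated-expert counts and report how far each optimum sits from the reference, measured in factors of two. The sketch below is a hypothetical helper, not the paper's evaluation code; `best_lr`, `log2_shift`, the configuration labels, and the loss values are all invented for illustration.

```python
import math
from typing import Mapping


def best_lr(sweep: Mapping[float, float]) -> float:
    """Learning rate with the lowest loss in one sweep (grid argmin)."""
    return min(sweep, key=lambda lr: sweep[lr])


def log2_shift(reference_lr: float, other_lr: float) -> float:
    """Shift of the optimum in factors of two; 0.0 means no shift."""
    return math.log2(other_lr / reference_lr)


# Made-up numbers for illustration only, NOT results from the paper.
# Keys are learning rates, values are final losses.
sweeps = {
    "reference": {2.5e-4: 4.10, 5e-4: 4.02, 1e-3: 3.98, 2e-3: 4.05},
    "k=2": {2.5e-4: 4.20, 5e-4: 4.11, 1e-3: 4.07, 2e-3: 4.13},
    "k=32": {2.5e-4: 4.01, 5e-4: 3.93, 1e-3: 3.95, 2e-3: 4.04},
}

reference_lr = best_lr(sweeps["reference"])
for name, sweep in sweeps.items():
    shift = log2_shift(reference_lr, best_lr(sweep))
    print(f"{name}: best LR {best_lr(sweep):.1e}, shift {shift:+.1f} (log2)")
```

A shift of 0 means the grid optimum did not move; a shift of plus or minus 1 means it moved by one factor of two. The resolution of this measure is limited by the spacing of the learning-rate grid.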

Figures

Figures reproduced from arXiv: 2605.23893 by Hongwu Peng, Jianming Zhang, Ohiremen Dibua, Yan Kang, Yifan Gong, Yuanjun Xiong.

**Figure 2.** Complete muE transfer across activated experts for both language-model (LM) and …

**Figure 3.** Transfer across MoE architectural variants.

**Figure 4.** Transfer across FFN/MoE FFN width expansion.

**Figure 5.** Transfer across batch size. Recovered plot text: (a) LM, activated experts (validation loss vs. MoE setting 64e2a to 64e64a, LR=1e-3); (b) DF, activated experts (train loss vs. MoE setting 64e2a to 64e64a, LR=1.6e-3); both panels titled "Activated Expert Scaling (64 Total Experts)"; …

**Figure 6.** Fixed-hyperparameter loss scaling across MoE axes. We fix AdamW hyperparameters at the …

**Figure 7.** SwiGLU MoE latency benchmark on a single …

**Figure 8.** Complete-muE for larger-scale MoE training for multi-modal generation and LLMs.

**Figure 9.** Caption 1: A cinematic head-and-shoulders video of a Black man in casual attire against a neutral seamless backdrop, softly lit with shallow focus and centered rule-of-thirds composition. The camera performs a handheld dolly-in and zoom-in toward his face as he bursts into a broad, joyful smile with visible teeth, creating a warm and expressive portrait

Original abstract

We propose Complete-muE, a framework for hyperparameter transfer between dense FFN and arbitrary Mixture-of-Experts (MoE) setups in transformer blocks. Existing tools such as μP (which requires a fixed architecture) or SDE scaling (which requires a fixed per-step token count) cannot directly solve the hyperparameter transfer problem in MoE setups, because dense-to-MoE transfer and scaling of the total expert count change both the architecture and the tokens per expert. Complete-muE addresses this with a two-bridge system. Bridge I maps between dense FFN and dense MoE by active-width μP with a normalized router scale. Bridge II maps between dense MoE and sparse MoE by activated-expert scaling, where the first-order SDE LR/WD correction cancels while a bounded residual σ₀ shift remains. The resulting transfer rule, which we term Complete-muE, covers changes in activated experts, total capacity, granularity, and shared/group-balanced hybrids for MoE models, as well as network width/depth, batch size, and duration changes for general Transformer models. Extensive language-model and diffusion-model pretraining experiments confirm that Complete-muE yields relatively stable hyperparameter optima across model architectures and parameter counts, with only minor drift consistent with the non-strict SDE behavior of Bridge II. In practice this drift is small enough that hyperparameters tuned on a single dense reference transfer near-optimally to all MoE configurations: 'tune dense once, transfer to all' is the practical recipe at the core of Complete-muE. This enables MoE models to converge faster than dense models when scaling model capacity, without costly hyperparameter search.
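
The tokens-per-expert effect mentioned in the abstract can be made concrete with a little arithmetic. The sketch below assumes perfectly balanced routing, so that each expert receives an equal share of the routed tokens; the function name `tokens_per_expert` and the batch size are illustrative choices, not values from the paper.

```python
def tokens_per_expert(tokens_per_step: int, total_experts: int, activated_experts: int) -> float:
    """Average tokens each expert processes per step.

    Assumes perfectly balanced routing: every token is sent to
    `activated_experts` experts and the load is spread evenly.
    """
    return tokens_per_step * activated_experts / total_experts


# Hypothetical global batch of 2**20 tokens per optimizer step.
batch_tokens = 2**20

# (total experts, activated experts); (1, 1) stands in for a dense FFN.
for total, active in [(1, 1), (64, 64), (64, 8), (64, 2), (256, 8)]:
    per_expert = tokens_per_expert(batch_tokens, total, active)
    print(f"{total:>3} experts, {active:>2} active: {per_expert:>9.0f} tokens per expert per step")
```

Under this assumption a dense FFN (1 expert, 1 active) and a dense MoE (64 of 64 active) see the full batch per expert, while 64 experts with 8 active see one eighth of it, even though the global batch size is unchanged.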

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Complete-muE gives a workable two-bridge transfer from dense to MoE but the exact cancellation claimed in Bridge II lacks a derivation and scaling bound.


The main takeaway is that this paper offers a concrete rule for moving hyperparameters from a dense reference model to a wide range of MoE configurations without retuning at each step. The rule rests on two bridges: Bridge I applies active-width μP with a normalized router scale to go from dense FFN to dense MoE, and Bridge II applies activated-expert scaling to go from dense MoE to sparse MoE, where the leading SDE corrections to learning rate and weight decay are said to cancel and leave only a small bounded residual σ₀ shift. This combination handles changes in activated experts, total capacity, granularity, and hybrids, plus the usual width, depth, batch, and duration adjustments for transformers in general.

The language-model and diffusion pretraining runs show that the transferred values stay close to optimal across architectures and sizes, with only the minor drift the authors attribute to non-strict SDE behavior. That empirical stability is the practical strength and supports the claim that one dense tuning run can serve many MoE variants. The synthesis itself is new; prior μP and SDE work did not jointly address the architecture change and the token-per-expert change that occur when experts are introduced or scaled. The experiments give direct evidence that the recipe produces usable transfer in the tested regimes.

The soft spot is the justification for Bridge II. The abstract states that the first-order SDE terms cancel while a bounded residual remains, yet no derivation is supplied to show why the cancellation is exact rather than approximate, nor is there an explicit bound on how the residual grows with expert count or granularity. If that residual increases even modestly, the near-optimal transfer would break for the largest MoE setups the method is meant to cover. The experiments report the drift stays small enough in practice, but the paper would be stronger with controls that map the residual size against the quantities it claims to handle.

This work is aimed at practitioners who scale sparse models and need to cut hyperparameter search cost. A reader who cares about transfer rules for large training runs will find the recipe and the reported stability useful. The problem is real and the proposed solution is specific, so the paper deserves a serious referee to examine the derivations and test the residual bound more systematically.

Referee Report

2 major / 2 minor

Summary. The paper proposes Complete-muE, a hyperparameter transfer framework for dense FFN to any MoE configuration in transformers. Bridge I maps dense FFN to dense MoE via active-width μP with normalized router scale. Bridge II maps dense MoE to sparse MoE via activated-expert scaling, asserting that first-order SDE corrections to LR and WD cancel exactly while only a bounded residual σ₀ shift remains. The resulting rule is claimed to cover changes in activated experts, total capacity, granularity, shared/group-balanced hybrids, and general transformer scaling (width/depth/batch/duration). Experiments on language-model and diffusion-model pretraining are presented as confirming stable optima with only minor drift, enabling the practical recipe of tuning on a single dense reference and transferring near-optimally to all MoE setups.

Significance. If the transfer rule and the smallness of the residual drift hold across the claimed range of MoE configurations, the work would materially reduce the hyperparameter-search cost when scaling MoE capacity, allowing practitioners to tune once on a dense model and deploy across multiple expert counts and granularities. The extension to both language and diffusion pretraining would further increase its practical reach.

major comments (2)

[Bridge II] Bridge II (abstract and § on Bridge II): the central claim that 'the first-order SDE LR/WD correction cancels while a bounded residual σ₀ shift remains' is stated without an explicit derivation, equation, or proof showing why the leading terms cancel exactly rather than approximately. No scaling bound is supplied for the size of the residual as a function of activated-expert count, total capacity, or granularity; this assumption is load-bearing for the 'tune dense once, transfer to all' rule.
[Experimental validation] § on experimental results: the text asserts that 'hyperparameters tuned on a single dense reference transfer near-optimally' and that drift is 'small enough in practice,' yet the magnitude of the observed drift is not quantified against the number of activated experts or granularity changes; without such quantification the claim that the residual remains negligible for the targeted MoE regimes cannot be evaluated.

minor comments (2)

The term 'non-strict SDE behavior' is used without a definition or reference to the precise deviation from the SDE assumptions.
The symbol σ₀ is introduced in the abstract without an accompanying definition or equation in the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The two major comments identify opportunities to strengthen the rigor and clarity of the Bridge II derivation and the experimental quantification of drift. We will revise the manuscript to address both points directly.


Referee: [Bridge II] Bridge II (abstract and § on Bridge II): the central claim that 'the first-order SDE LR/WD correction cancels while a bounded residual σ₀ shift remains' is stated without an explicit derivation, equation, or proof showing why the leading terms cancel exactly rather than approximately. No scaling bound is supplied for the size of the residual as a function of activated-expert count, total capacity, or granularity; this assumption is load-bearing for the 'tune dense once, transfer to all' rule.

Authors: We agree that an explicit derivation and bound would make the central claim more transparent. In the revised manuscript we will add a dedicated subsection under Bridge II that (i) derives the exact cancellation of the leading first-order SDE corrections to learning rate and weight decay when the number of activated experts changes while total capacity is held fixed, (ii) isolates the residual σ₀ term, and (iii) supplies a scaling bound on |σ₀| in terms of activated-expert count, granularity, and router temperature. The derivation will be placed immediately before the statement of the transfer rule so that readers can evaluate the exact-versus-approximate distinction.
Revision: yes

Referee: [Experimental validation] § on experimental results: the text asserts that 'hyperparameters tuned on a single dense reference transfer near-optimally' and that drift is 'small enough in practice,' yet the magnitude of the observed drift is not quantified against the number of activated experts or granularity changes; without such quantification the claim that the residual remains negligible for the targeted MoE regimes cannot be evaluated.

Authors: We accept that the current experimental section does not provide the requested quantitative mapping. In the revision we will augment the results with (a) a table reporting the measured optimal LR and WD shifts for each MoE configuration relative to the dense reference, (b) plots of these shifts versus number of activated experts (2, 4, 8, 16) and versus granularity (token, expert, layer), and (c) explicit comparison of observed drift magnitude against the theoretical residual bound derived in the new Bridge II subsection. All experiments already performed will be re-analyzed to produce these metrics; no new runs are required.
Revision: yes
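
For the kind of shift table the authors describe, a grid argmin can only resolve the optimum to the spacing of the sweep. A common refinement is to fit a parabola through the three points around the grid minimum in log-learning-rate space and read off its vertex. The sketch below shows that technique under the assumption that the loss is roughly quadratic in log2(LR) near the optimum; `parabola_vertex` and `refined_optimum` are hypothetical helper names, and no data from the paper is used.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def parabola_vertex(p0: Point, p1: Point, p2: Point) -> float:
    """x-coordinate of the vertex of the parabola through three points."""
    (x0, y0), (x1, y1), (x2, y2) = p0, p1, p2
    num = (x1 - x0) ** 2 * (y1 - y2) - (x1 - x2) ** 2 * (y1 - y0)
    den = (x1 - x0) * (y1 - y2) - (x1 - x2) * (y1 - y0)
    if den == 0:
        return x1  # points are collinear: fall back to the middle point
    return x1 - 0.5 * num / den


def refined_optimum(lrs: Sequence[float], losses: Sequence[float]) -> float:
    """Estimate the optimal LR from a sweep sorted by increasing LR.

    Fits a parabola in log2(LR) through the grid minimum and its two
    neighbours. If the minimum lies on the edge of the sweep, the grid
    value is returned unchanged because there is nothing to interpolate.
    """
    i = min(range(len(losses)), key=lambda j: losses[j])
    if i == 0 or i == len(losses) - 1:
        return lrs[i]
    points = [(math.log2(lrs[j]), losses[j]) for j in (i - 1, i, i + 1)]
    return 2.0 ** parabola_vertex(*points)


# Synthetic sweep whose true optimum sits halfway between two grid points.
lrs = [2.0 ** -12, 2.0 ** -11, 2.0 ** -10, 2.0 ** -9]
losses = [(math.log2(lr) + 10.5) ** 2 for lr in lrs]
print(f"grid argmin: {lrs[losses.index(min(losses))]:.3e}")
print(f"refined optimum: {refined_optimum(lrs, losses):.3e}")
```

The refined value can then be compared across MoE configurations on a log2 scale. An optimum on the edge of the sweep is a sign that the sweep range should be widened before any shift is reported.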

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external frameworks plus empirical validation


The paper constructs Complete-muE from two bridges: Bridge I applies active-width μP (an external method) with normalized router scale, while Bridge II invokes an SDE-based claim that first-order LR/WD corrections cancel leaving a bounded residual σ₀ shift. This residual is not derived as a function of expert count or granularity but is asserted to remain small and then confirmed by language-model and diffusion pretraining experiments showing only minor drift. No equation reduces the final transfer rule to a fitted parameter renamed as prediction, no self-citation chain is load-bearing, and no ansatz is smuggled via prior author work. The central claim is therefore supported by independent experimental evidence rather than tautological reduction to its inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The framework rests on the stated limitations of μP and SDE plus two new mapping rules whose details are not visible in the abstract; no explicit free parameters or invented entities are named beyond the normalized router scale and σ₀ shift.

free parameters (2)

normalized router scale: introduced in Bridge I to map dense FFN to dense MoE.
bounded residual σ₀ shift: remains after the first-order SDE correction in Bridge II.

axioms (2)

domain assumption: μP requires a fixed architecture. Cited as a reason existing tools cannot handle MoE.
domain assumption: SDE requires a fixed per-step token count. Cited as a reason existing tools cannot handle MoE.


Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 11 internal anchors

[1]

Outrageously large neural networks: The sparsely-gated mixture-of-experts layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. InInternational Conference on Learning Representations, 2017. 12

work page 2017
[2]

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39, 2022

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39, 2022

work page 2022
[3]

ST-MoE: Designing Stable and Transferable Sparse Expert Models

Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. St-moe: Designing stable and transferable sparse expert models.arXiv preprint arXiv:2202.08906, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[4]

Deepseekmoe: Towards ultimate expert specializa- tion in mixture-of-experts language models

Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Yu Wu, et al. Deepseekmoe: Towards ultimate expert specializa- tion in mixture-of-experts language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1280–1297, 2024

work page 2024
[5]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

Pangu pro moe: Mixture of grouped experts for efficient sparsity.arXiv preprint arXiv:2505.21411, 2025

Yehui Tang, Xiaosong Li, Fangcheng Liu, Wei Guo, Hang Zhou, Yaoyuan Wang, Kai Han, Xianzhi Yu, Jinpeng Li, Hui Zang, et al. Pangu pro moe: Mixture of grouped experts for efficient sparsity.arXiv preprint arXiv:2505.21411, 2025

work page arXiv 2025
[7]

Pangu ultra moe: How to train your big moe on ascend npus.arXiv preprint arXiv:2505.04519, 2025

Yehui Tang, Yichun Yin, Yaoyuan Wang, Hang Zhou, Yu Pan, Wei Guo, Ziyang Zhang, Miao Rang, Fangcheng Liu, Naifu Zhang, et al. Pangu ultra moe: How to train your big moe on ascend npus.arXiv preprint arXiv:2505.04519, 2025

work page arXiv 2025
[8]

Longcat-flash technical report.arXiv preprint arXiv:2509.01322, 2025

Meituan LongCat Team, Bei Li, Bingye Lei, Bo Wang, Bolin Rong, Chao Wang, Chao Zhang, Chen Gao, Chen Zhang, Cheng Sun, et al. Longcat-flash technical report.arXiv preprint arXiv:2509.01322, 2025

work page arXiv 2025
[9]

Tensor programs v: tuning large neural networks via zero-shot hyperparameter transfer

Greg Yang, Edward J Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tensor programs v: tuning large neural networks via zero-shot hyperparameter transfer. InProceedings of the 35th International Conference on Neural Information Processing Systems, pages 17084–17097, 2021

work page 2021
[10]

Mu- parametrization for mixture of experts

Jan Mała´snicki, Kamil Ciebiera, Mateusz Boru ´n, Maciej Pióro, Jan Ludziejewski, Maciej Stefaniak, Michał Krutul, Sebastian Jaszczur, Marek Cygan, Kamil Adamczewski, et al. Mu- parametrization for mixture of experts. InES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models

work page
[11]

On the sdes and scaling rules for adaptive gradient algorithms.Advances in Neural Information Processing Systems, 35:7697–7711, 2022

Sadhika Malladi, Kaifeng Lyu, Abhishek Panigrahi, and Sanjeev Arora. On the sdes and scaling rules for adaptive gradient algorithms.Advances in Neural Information Processing Systems, 35:7697–7711, 2022

work page 2022
[12]

Don’t be lazy: Completep enables compute- efficient deep transformers

Nolan Simran Dey, Bin Claire Zhang, Lorenzo Noci, Mufan Li, Blake Bordelon, Shane Bergsma, Cengiz Pehlevan, Boris Hanin, and Joel Hestness. Don’t be lazy: Completep enables compute- efficient deep transformers. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems

work page
[13]

Completed hyperparameter transfer across modules, width, depth, batch and duration.arXiv preprint arXiv:2512.22382, 2025

Bruno Mlodozeniec, Pierre Ablin, Louis Béthune, Dan Busbridge, Michal Klein, Jason Rama- puram, and Marco Cuturi. Completed hyperparameter transfer across modules, width, depth, batch and duration.arXiv preprint arXiv:2512.22382, 2025

work page arXiv 2025
[14]

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with condi- tional computation and automatic sharding.arXiv preprint arXiv:2006.16668, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2006
[15]

Ec-dit: Scaling diffusion transformers with adaptive expert-choice routing.arXiv preprint arXiv:2410.02098, 2024

Haotian Sun, Tao Lei, Bowen Zhang, Yanghao Li, Haoshuo Huang, Ruoming Pang, Bo Dai, and Nan Du. Ec-dit: Scaling diffusion transformers with adaptive expert-choice routing.arXiv preprint arXiv:2410.02098, 2024

work page arXiv 2024
[16]

Nucleus-Image: Sparse MoE for Image Generation

Chandan Akiti, Ajay Modukuri, Murali Nandan Nagarapu, Gunavardhan Akiti, and Haozhe Liu. Nucleus-image: Sparse moe for image generation.arXiv preprint arXiv:2604.12163, 2026. 13

work page internal anchor Pith review Pith/arXiv arXiv 2026
[17]

Feature learning in infinite-depth neural networks

Greg Yang, Dingli Yu, Chen Zhu, and Soufiane Hayou. Feature learning in infinite-depth neural networks. InNeurIPS 2023 Workshop on Mathematics of Modern Machine Learning, 2023

work page 2023
[18]

Infinite limits of multi-head trans- former dynamics.Advances in Neural Information Processing Systems, 37:35824–35878, 2024

Blake Bordelon, Hamza Chaudhry, and Cengiz Pehlevan. Infinite limits of multi-head trans- former dynamics.Advances in Neural Information Processing Systems, 37:35824–35878, 2024

work page 2024
[19]

Scaling diffusion transformers efficiently viaµP

Chenyu Zheng, Xinyu Zhang, Rongzhen Wang, Wei Huang, Zhi Tian, Weilin Huang, Jun Zhu, and Chongxuan Li. Scaling diffusion transformers efficiently viaµP. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems

work page
[20]

Hyperparameter Transfer with Mixture-of-Expert Layers

Tianze Jiang, Blake Bordelon, Cengiz Pehlevan, and Boris Hanin. Hyperparameter transfer with mixture-of-expert layers.arXiv preprint arXiv:2601.20205, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[21]

Rethinking Language Model Scaling under Transferable Hypersphere Optimization

Liliang Ren, Yang Liu, Yelong Shen, and Weizhu Chen. Rethinking language model scaling under transferable hypersphere optimization.arXiv preprint arXiv:2603.28743, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[22]

Mixture-of-experts with expert choice routing.Advances in Neural Information Processing Systems, 35:7103–7114, 2022

Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M Dai, Quoc V Le, James Laudon, et al. Mixture-of-experts with expert choice routing.Advances in Neural Information Processing Systems, 35:7103–7114, 2022

work page 2022
[23]

Gpt-neox-20b: An open- source autoregressive language model

Sidney Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, et al. Gpt-neox-20b: An open- source autoregressive language model. InProceedings of BigScience Episode# 5–Workshop on Challenges & Perspectives in Creating Large Language Models, pages 95–136, 2022

work page 2022
[24]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[26]

Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free

Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, et al. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems

work page
[27]

GLU Variants Improve Transformer

Noam Shazeer. Glu variants improve transformer.arXiv preprint arXiv:2002.05202, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2002
[28]

In search of adam’s secret sauce

Antonio Orvieto and Robert M Gower. In search of adam’s secret sauce. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems

work page
[29]

Why Adam Works Better with $\beta_1 = \beta_2$: The Missing Gradient Scale Invariance Principle

Alberto Fernández-Hernández, Cristian Pérez-Corral, Jose I Mestre, Manuel F Dolz, and Enrique S Quintana-Ortí. Why adam works better with β1 =β 2: The missing gradient scale invariance principle.arXiv preprint arXiv:2601.21739, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[30]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

work page 2023
[31]

Minicpm: Unveiling the potential of small language models with scalable training strategies

Shengding Hu, Yuge Tu, Xu Han, Ganqu Cui, Chaoqun He, Weilin Zhao, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, et al. Minicpm: Unveiling the potential of small language models with scalable training strategies. InFirst Conference on Language Modeling

work page
[32]

Scaling laws and compute-optimal training beyond fixed training durations.Advances in Neural Information Processing Systems, 37:76232–76264, 2024

Alexander Hägele, Elie Bakouch, Atli Kosson, Loubna B Allal, Leandro V on Werra, and Martin Jaggi. Scaling laws and compute-optimal training beyond fixed training durations.Advances in Neural Information Processing Systems, 37:76232–76264, 2024

work page 2024
[33]

Scaling law with learning rate annealing

Howe Tissue, Venus Wang, and Lu Wang. Scaling law with learning rate annealing. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems

work page
[34]

Sonicmoe: Accelerating moe with io and tile-aware optimizations.arXiv preprint arXiv:2512.14080, 2025

Wentao Guo, Mayank Mishra, Xinle Cheng, Ion Stoica, and Tri Dao. Sonicmoe: Accelerating moe with io and tile-aware optimizations.arXiv preprint arXiv:2512.14080, 2025. 14

work page arXiv 2025
[35]

Knapformer: An online load balancer for efficient diffusion transformers training.arXiv preprint arXiv:2508.06001, 2025

Kai Zhang, Peng Wang, Sai Bi, Jianming Zhang, and Yuanjun Xiong. Knapformer: An online load balancer for efficient diffusion transformers training.arXiv preprint arXiv:2508.06001, 2025

work page arXiv 2025
[36]

Documenting large webtext corpora: A case study on the colossal clean crawled corpus

Jesse Dodge, Maarten Sap, Ana Marasovi´c, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, and Matt Gardner. Documenting large webtext corpora: A case study on the colossal clean crawled corpus. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1286–1305, 2021

work page 2021
[37]

Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 21(140):1–67, 2020

work page 2020
[38]

Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are NLP models really able to solve simple math word problems? In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou, editors,Proceedings of the 2021 Conference of the North American Chapter of the Associ...

work page 2021
[39]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. InInternational Conference on Learning Representations, 2021

work page 2021
[40]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[41]

SemEval-2012 task 7: Choice of plausible alternatives: An evaluation of commonsense causal reasoning

Andrew Gordon, Zornitsa Kozareva, and Melissa Roemmele. SemEval-2012 task 7: Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In Eneko Agirre, Johan Bos, Mona Diab, Suresh Manandhar, Yuval Marton, and Deniz Yuret, editors,*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of...

work page 2012
[42]

Association for Computational Linguistics

work page
[43]

Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7432–7439, 2020.

[44]

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, 2019.

[45]

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.

[46]

Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Katrin Erk and Noah A. Smith, editors, Proceedings of the 54th Annual Meeting of the Association for Computational Linguis- ...

[47]

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers),...

[48]

Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. AGIEval: A human-centric benchmark for evaluating foundation models. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Findings of the Association for Computational Linguistics: NAACL 2024, pages 2299–2314, Mexico City, Mexico, June

[49]

Association for Computational Linguistics.

Table 5: Notation used in the method and appendix.

Symbol: Meaning
$\rho_d$, $\rho_H$: residual/model-width ratio $d/d_\star$ and active-width ratio $H/d$.
$\rho_L$, $\rho_B$, $\rho_D$: depth, global batch, and global token-budget ratios relative to the reference model.
$\rho_{H_a}$, $\rho_N$: active-width ratio induced by activated experts, $\rho_{H_a} = ah/d$, and total-exp...

[50]

Dense-width step. Transfer the auxiliary Dense MoE from total width $Nh$ to $N'h$ via Bridge I. The output multiplier and initialization become $c^{(1)} = \frac{d}{N'h}$, $\sigma^{(1)}_{\mathrm{down}} = \sqrt{\frac{N'h}{d}}\,\sigma^{(1)}_{\mathrm{down},\star}$, and the auxiliary route scale becomes $r_{N'} = N'$.

[51]

Reverse-sparsity step. At fixed total experts $N'$, reduce the activated count from $N'$ back to $a$ via Bridge II. The active-width rule at $H_a = ah$ gives $c^{(2)} = \frac{d}{ah}$, $\sigma^{(2)}_{\mathrm{down}} = \sqrt{\frac{ah}{d}}\,\sigma^{(1)}_{\mathrm{down},\star}$, and restores the route scale $r_a = a$. The extra dense-width factor from Step 1 is exactly reversed by Step 2. Because the sparsity rule exactly reverses the extra act...
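
The dense-width and reverse-sparsity steps above can be checked numerically with a minimal sketch. It assumes the initialization factors are square roots of the width ratios (one reading of the extracted formulas), and the names `DownProjScaling`, `active_width_rule`, `dense_width_step`, and `reverse_sparsity_step`, as well as the sizes `d`, `h`, `n_prime`, `a`, and `sigma_ref`, are hypothetical illustrations rather than the authors' code.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DownProjScaling:
    """Scaling of the expert down-projection at a given active width."""
    output_multiplier: float  # c
    sigma_down: float         # initialization std of the down projection
    route_scale: int          # r, the router scale


def active_width_rule(d: int, width: int, sigma_ref: float, route_scale: int) -> DownProjScaling:
    # Assumed form: c = d / width and sigma = sqrt(width / d) * sigma_ref.
    return DownProjScaling(
        output_multiplier=d / width,
        sigma_down=math.sqrt(width / d) * sigma_ref,
        route_scale=route_scale,
    )


def dense_width_step(d: int, h: int, n_prime: int, sigma_ref: float) -> DownProjScaling:
    # Step 1: auxiliary Dense MoE with all N' experts active (width N' * h).
    return active_width_rule(d, n_prime * h, sigma_ref, route_scale=n_prime)


def reverse_sparsity_step(d: int, h: int, a: int, sigma_ref: float) -> DownProjScaling:
    # Step 2: back to a activated experts (width a * h); N' no longer appears.
    return active_width_rule(d, a * h, sigma_ref, route_scale=a)


# Hypothetical sizes chosen only for illustration.
d, h, a, sigma_ref = 1024, 256, 8, 0.02
dense_64 = dense_width_step(d, h, n_prime=64, sigma_ref=sigma_ref)
dense_128 = dense_width_step(d, h, n_prime=128, sigma_ref=sigma_ref)
sparse = reverse_sparsity_step(d, h, a, sigma_ref)

print(dense_64)
print(dense_128)
print(sparse)
```

In this sketch the intermediate scalings differ with `n_prime`, but the result of Step 2 depends only on `a`, `h`, and `d`, which mirrors the statement that the extra dense-width factor from Step 1 is reversed by Step 2.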
