Post-Trained MoE Can Skip Half Experts via Self-Distillation
Pith reviewed 2026-05-20 11:56 UTC · model grok-4.3
The pith
Post-trained static MoE models can be converted into dynamic ones that skip more than half of their expert computation with almost no accuracy loss.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Zero-Expert Self-Distillation Adaptation (ZEDA) transforms post-trained static MoE models into efficient dynamic ones by injecting parameter-free zero-output experts into each MoE layer and adapting the augmented model through two-stage self-distillation, utilizing the original MoE as a frozen teacher and applying a group-level balancing loss. On Qwen3-30B-A3B and GLM-4.7-Flash across 11 benchmarks spanning math, code, and instruction following, ZEDA eliminates over 50% of expert FLOPs at marginal accuracy loss, outperforming the strongest dynamic MoE baseline by 6.1 and 4.0 points on the two models, and delivers ~1.20 times end-to-end inference speedup.
What carries the argument
The injection of parameter-free zero-output experts into each MoE layer together with two-stage self-distillation from the frozen original static MoE as teacher and a group-level balancing loss.
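A minimal sketch of what zero-output experts mean at the routing level, assuming a standard softmax top-k router. The function name `route_token`, the slot layout (real experts first, zero experts last), and the toy sizes are hypothetical; the paper's actual router, weight normalization, and kernels are not described in this summary.

```python
import math

def route_token(logits, num_real, k):
    """Pick top-k slots over real + zero experts; return the real experts to run.

    logits: router scores; the first `num_real` entries belong to real experts,
            the rest to parameter-free zero-output experts (assumed layout).
    """
    # Softmax over all slots: real and zero experts compete in the same router.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k selection by router probability.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

    # A zero expert outputs 0, so it contributes nothing to the layer output
    # and needs no FFN computation: only selected real experts are executed.
    active = [(i, probs[i]) for i in top if i < num_real]
    skipped = k - len(active)
    return active, skipped

# 4 real experts + 2 zero experts, top-2 routing (toy sizes, not the paper's).
active, skipped = route_token([0.1, 2.0, -1.0, 0.3, 1.5, -0.5], num_real=4, k=2)
print(active, skipped)
```

In this toy call one of the two selected slots is a zero expert, so the token runs a single real expert instead of two. Whether the remaining weights are renormalized is left open here.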
If this is right
- Already-trained static MoE models can be converted to dynamic routing without full retraining from scratch.
- Expert FLOPs can be reduced by more than half while accuracy stays nearly the same on math, code, and instruction tasks.
- The resulting dynamic models outperform the strongest earlier dynamic MoE baseline by 6.1 and 4.0 points on the two tested models.
- End-to-end inference is about 1.2 times faster after the adaptation.
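The gap between "more than half of expert FLOPs removed" and "about 1.2 times faster" is what one would expect if expert FFNs account for only part of the total cost. A rough Amdahl-style sketch, assuming cost scales linearly with FLOPs and ignoring routing and communication overhead; the function names and the expert-share values are illustrative assumptions, not measurements from the paper.

```python
def end_to_end_speedup(expert_share, skipped_fraction):
    """Amdahl-style estimate: only the expert share of the cost shrinks."""
    if not (0.0 <= expert_share <= 1.0 and 0.0 <= skipped_fraction <= 1.0):
        raise ValueError("both arguments must lie in [0, 1]")
    remaining = 1.0 - expert_share * skipped_fraction
    return 1.0 / remaining

def implied_expert_share(speedup, skipped_fraction):
    """Invert the estimate: which expert share would produce this speedup?"""
    return (1.0 - 1.0 / speedup) / skipped_fraction

for share in (0.2, 0.4, 0.6):  # illustrative expert shares of total cost
    print(share, round(end_to_end_speedup(share, 0.5), 3))

print(round(implied_expert_share(1.20, 0.5), 3))
```

Under these simplifying assumptions, a 1.20 times speedup with half of the expert work skipped corresponds to experts taking roughly one third of the baseline cost.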
Where Pith is reading between the lines
- Many computations performed by experts in a trained MoE may be redundant for a large fraction of inputs, allowing safe bypass at serving time.
- The same zero-expert injection plus distillation pattern could be tested on other sparse architectures beyond standard MoE layers.
- If the balancing loss proves robust, similar post-training routes might reduce activation costs in future larger MoE variants without changing pre-training.
Load-bearing premise
Adding zero-output experts and distilling from the frozen original model will preserve the original capabilities without major degradation.
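A minimal sketch of the distillation signal this premise relies on, assuming a token-level forward KL divergence between the frozen teacher's and the student's next-token distributions. The direction of the KL, the function names, and the plain-list inputs are assumptions for illustration; the summary does not specify the paper's exact objective or how its two stages differ.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_distill_loss(teacher_logits, student_logits):
    """Mean over positions of KL(teacher || student) on next-token distributions.

    teacher_logits: [positions][vocab] from the frozen original static MoE
    student_logits: [positions][vocab] from the MoE augmented with zero experts
    """
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        p = softmax(t_row)  # teacher distribution, no gradient in practice
        q = softmax(s_row)  # student distribution
        total += sum(
            pi * (math.log(pi) - math.log(qi))
            for pi, qi in zip(p, q)
            if pi > 0.0
        )
    return total / len(teacher_logits)

same = kl_distill_loss([[1.0, 2.0, 3.0]], [[1.0, 2.0, 3.0]])
shifted = kl_distill_loss([[1.0, 2.0, 3.0]], [[3.0, 2.0, 1.0]])
print(same, shifted)
```

The loss is zero when the student reproduces the teacher exactly and grows as skipping experts shifts the student's predictions, which is the quantity the premise needs to stay small.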
What would settle it
An experiment on the same models and benchmarks showing that accuracy falls by more than a few points after ZEDA adaptation, or that measured wall-clock speedup falls below 1.1 times because of routing overhead.
Original abstract
Mixture-of-Experts (MoE) scales language models efficiently through sparse expert activation, and its dynamic variant further reduces computation by adjusting the activated experts in an input-dependent manner. Existing dynamic MoE methods usually rely on pre-training from scratch or task-specific adaptation, leaving the practical conversion of fully trained MoE underexplored. Enabling such adaptation would directly alleviate the inference costs by allowing easy tokens to bypass unnecessary experts during serving. This paper introduces Zero-Expert Self-Distillation Adaptation (ZEDA), a low-cost framework that transforms post-trained static MoE models into efficient dynamic ones. To stabilize this architectural conversion, ZEDA injects parameter-free zero-output experts into each MoE layer and adapts the augmented model through two-stage self-distillation, utilizing the original MoE as a frozen teacher and applying a group-level balancing loss. On Qwen3-30B-A3B and GLM-4.7-Flash across 11 benchmarks spanning math, code, and instruction following, ZEDA eliminates over 50% of expert FLOPs at marginal accuracy loss. It outperforms the strongest dynamic MoE baseline by 6.1 and 4.0 points on the two models, and delivers ~1.20$\times$ end-to-end inference speedup.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Zero-Expert Self-Distillation Adaptation (ZEDA), a framework to convert post-trained static MoE models into dynamic ones. It injects parameter-free zero-output experts into each MoE layer and adapts the model via two-stage self-distillation (using the original static MoE as frozen teacher) plus a group-level balancing loss. On Qwen3-30B-A3B and GLM-4.7-Flash across 11 benchmarks, ZEDA is reported to eliminate over 50% of expert FLOPs at marginal accuracy loss while outperforming the strongest dynamic MoE baseline by 6.1 and 4.0 points and achieving ~1.20x end-to-end speedup.
Significance. If the empirical results hold under rigorous validation, the work would provide a practical low-cost route to retrofit existing post-trained MoE models for dynamic expert skipping, directly addressing inference cost in large-scale deployments without requiring pre-training from scratch or task-specific fine-tuning.
major comments (3)
- Abstract and experimental summary: the headline claim of ≥50% expert-FLOP reduction at marginal accuracy loss (and the 6.1/4.0-point gains) is presented only as a high-level quantitative summary; no error bars, ablation tables, or precise data-exclusion rules are referenced, so the central empirical result cannot be assessed for robustness.
- Method description (ZEDA framework): the necessity of injecting parameter-free zero-output experts is asserted to stabilize the router's ability to route easy tokens while preserving output on hard tokens, yet no ablation is reported that compares this choice against simply lowering top-k or adding a learned skip token; this directly bears on whether the reported FLOP savings are attributable to the proposed mechanism.
- Results on Qwen3-30B-A3B and GLM-4.7-Flash: the outperformance over the strongest dynamic baseline is stated without per-benchmark tables, statistical significance tests, or breakdown by task category (math/code/instruction), leaving open whether the gains are uniform or concentrated on easier subsets where zero-expert routing is trivial.
minor comments (2)
- Clarify the exact formulation of the group-level balancing loss (e.g., provide its equation and hyper-parameter schedule) so that the two-stage distillation procedure can be reproduced.
- Add a short paragraph contrasting ZEDA with prior dynamic-MoE adaptation methods that also use distillation, to better position the novelty of the zero-expert injection step.
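To make the first minor comment concrete, here is one plausible shape such a loss could take: the familiar Switch-Transformer-style auxiliary term (number of units times the sum of dispatch fraction times mean router probability), computed over groups of experts rather than individual experts. This is an assumption for illustration only, not the paper's formulation; the function name, the grouping, and the inputs are hypothetical.

```python
def group_balance_loss(probs, assignments, groups):
    """Switch-style auxiliary loss computed over expert groups.

    probs:       per-token router probabilities, shape [tokens][experts]
    assignments: per-token list of selected expert indices (top-k)
    groups:      list of expert-index lists, one per group (hypothetical grouping)
    """
    num_tokens = len(probs)
    total_slots = sum(len(a) for a in assignments)
    loss = 0.0
    for members in groups:
        member_set = set(members)
        # f_g: fraction of routed slots that landed in this group.
        f_g = sum(1 for a in assignments for e in a if e in member_set) / total_slots
        # p_g: mean router probability mass assigned to this group.
        p_g = sum(sum(row[e] for e in members) for row in probs) / num_tokens
        loss += f_g * p_g
    return len(groups) * loss

groups = [[0, 1], [2, 3]]  # e.g. two groups of two experts each
balanced = group_balance_loss([[0.25] * 4, [0.25] * 4], [[0], [2]], groups)
skewed = group_balance_loss([[0.45, 0.45, 0.05, 0.05]] * 2, [[0], [1]], groups)
print(balanced, skewed)
```

The term is smallest when routed slots and probability mass are spread evenly across groups and grows when one group dominates. A loss that targets a specific ratio between real and zero experts would need a different target than uniformity.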
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, proposing specific revisions to strengthen the empirical presentation and methodological justification while preserving the core contributions of ZEDA.
Point-by-point responses
-
Referee: Abstract and experimental summary: the headline claim of ≥50% expert-FLOP reduction at marginal accuracy loss (and the 6.1/4.0-point gains) is presented only as a high-level quantitative summary; no error bars, ablation tables, or precise data-exclusion rules are referenced, so the central empirical result cannot be assessed for robustness.
Authors: We agree that additional details are needed to allow rigorous assessment of robustness. In the revision we will expand the experimental section and appendix to include (i) error bars computed over multiple random seeds for the main results where compute permits, (ii) explicit ablation tables for the two-stage distillation and balancing loss, and (iii) a precise statement of any data-exclusion or filtering rules applied to the 11 benchmarks. These additions will make the ≥50% FLOP reduction and the 6.1/4.0-point gains directly verifiable.
revision: yes
-
Referee: Method description (ZEDA framework): the necessity of injecting parameter-free zero-output experts is asserted to stabilize the router's ability to route easy tokens while preserving output on hard tokens, yet no ablation is reported that compares this choice against simply lowering top-k or adding a learned skip token; this directly bears on whether the reported FLOP savings are attributable to the proposed mechanism.
Authors: We acknowledge that an explicit ablation would strengthen the causal link between the zero-output expert design and the observed savings. We will add a new ablation subsection that directly compares (a) our parameter-free zero-output experts, (b) simply lowering top-k on the original model, and (c) introducing a learned skip token. The results will quantify how each variant affects router stability, FLOP reduction, and accuracy, thereby clarifying the contribution of the proposed mechanism.
revision: yes
-
Referee: Results on Qwen3-30B-A3B and GLM-4.7-Flash: the outperformance over the strongest dynamic baseline is stated without per-benchmark tables, statistical significance tests, or breakdown by task category (math/code/instruction), leaving open whether the gains are uniform or concentrated on easier subsets where zero-expert routing is trivial.
Authors: We agree that per-benchmark granularity and task-category analysis are important for interpreting the gains. In the revised manuscript we will (i) move the full per-benchmark accuracy and FLOP tables to the main body or a prominent appendix, (ii) report statistical significance (paired t-tests or Wilcoxon tests) between ZEDA and the strongest baseline, and (iii) provide a breakdown by task category (math, code, instruction following) showing that the 6.1- and 4.0-point improvements hold across categories rather than being driven solely by easier subsets.
revision: yes
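The paired t-test mentioned in the last response can be sketched with the standard library alone. The helper name `paired_t_statistic` is hypothetical, and the scores below are placeholders chosen for illustration; they are not results from the paper. Converting the statistic to a p-value needs a t-distribution table or a statistics package, which this sketch leaves out.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """t statistic and degrees of freedom for paired per-benchmark scores."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0.0:
        raise ValueError("all differences are identical; t is undefined")
    n = len(diffs)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1

# Placeholder scores for illustration only; NOT results from the paper.
method = [70.0, 62.0, 81.0, 55.0]
baseline = [66.0, 60.0, 74.0, 54.0]
t, dof = paired_t_statistic(method, baseline)
print(round(t, 3), dof)
```

With only 11 benchmarks per model, a paired test of this kind has limited power, which is one reason the per-category breakdown matters alongside it.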
Circularity Check
No circularity: empirical adaptation procedure with independent benchmark validation
full rationale
The paper describes ZEDA as an engineering adaptation: parameter-free zero-output experts are injected, followed by two-stage self-distillation against a frozen teacher plus a group-level balancing loss. Reported gains (≥50% expert-FLOP reduction at marginal accuracy loss on 11 benchmarks) are measured directly on held-out tasks for Qwen3-30B-A3B and GLM-4.7-Flash. No equations, fitted parameters, or self-citations are presented as load-bearing derivations that reduce the headline result to a tautology or to quantities defined by the inputs themselves. The method is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Self-distillation from a frozen original MoE teacher preserves downstream capability after architectural augmentation
invented entities (1)
- parameter-free zero-output experts: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: ZEDA injects parameter-free zero-output experts ... group-level balancing loss ... eliminates over 50% of expert FLOPs
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: two-stage self-distillation ... Group Auxiliary Loss L_GA regulates relative activation frequency
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Excerpts from the paper
The speedup gradually diminishes as sequence length increases. Nevertheless, even at 8k, a commonly used long-context setting, ZEDA still achieves approximately 20% speedup, demonstrating its practical usability. Furthermore, ZEDA exhibits greater potential for advanced communication frameworks like DeepEP [Zhao et al., 2025], which we aim to integrate in ...
D.1. Shared MoE Cost Decomposition
The MoE FFN and router costs have the same form in both stages; the only difference is the number of tokens processed in the current forward pass. Let 𝑛 denote that token count. For the original ...
Table 11 | Notation used in the theoretical FLOP analysis. Symbol ...
Table 12 | Architectural parameters of Qwen3-30B-A3B used in the FLOP analysis.
Symbol  H     H_attn  g_kv  H_e  N    N_Z  K
Value   2048  4096    1/8   768  128  64   8
To facilitate direct comparison with empirical measurements, we convert the FLOP ratios in Equations (14) and (18) into theoretical speedups by taking their reciprocals. Table 13 reports the resulting prefill and de...
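A rough per-token accounting for the MoE sublayer using the Table 12 values. The symbol meanings are an assumed reading, since Table 11 is truncated here: H as hidden size, H_e as expert intermediate size, N as real experts, N_Z as zero experts, K as top-k. The gated three-matrix expert, the 2-FLOPs-per-multiply-add convention, and the "half of the slots hit zero experts" scenario are also assumptions. Equations (14) and (18) are not reproduced in this copy, so this does not recompute the paper's ratios.

```python
# Values as listed in Table 12 for Qwen3-30B-A3B (symbol meanings assumed).
H, H_E, N, N_Z, K = 2048, 768, 128, 64, 8

def expert_ffn_flops(hidden, expert_hidden):
    """FLOPs per token for one gated FFN expert (assumed: 3 matmuls, 2 FLOPs/MAC)."""
    return 3 * 2 * hidden * expert_hidden

def router_flops(hidden, num_slots):
    """FLOPs per token for a linear router over all expert slots."""
    return 2 * hidden * num_slots

def moe_flops_per_token(active_real_experts, num_slots):
    """Expert FFN cost for the executed experts plus the router cost."""
    return active_real_experts * expert_ffn_flops(H, H_E) + router_flops(H, num_slots)

def speedup_from_flop_ratio(ratio):
    """Theoretical speedup is the reciprocal of the (new / original) FLOP ratio."""
    return 1.0 / ratio

static = moe_flops_per_token(K, N)             # original: K real experts, N slots
dynamic = moe_flops_per_token(K // 2, N + N_Z) # assumed: half the slots are zero experts
print(static, dynamic, round(speedup_from_flop_ratio(dynamic / static), 3))
```

This counts only the MoE sublayer, so the resulting ratio is more optimistic than a full-model figure that also includes attention and other fixed costs.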
Table 13 | Comparison between theoretical speedups derived from the FLOP analysis and measured empirical speedups on Qwen3-30B-A3B across different sequence lengths.
Length  Prefill theoretical  Prefill empirical  Decode theoretical  Decode empirical
1024    1.403x               1.141x             1.443x              1.233x
2048    1.341x               1...
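A small sketch for reading the one complete row of Table 13 (length 1024). It recovers the FLOP ratio implied by each theoretical speedup and estimates what share of the theoretical time saving shows up in measurement. The `realized_fraction` metric is an illustrative choice, not one the paper defines, and the later rows are truncated in this copy, so they are left out.

```python
# First row of Table 13 (sequence length 1024); later rows are truncated here.
row_1024 = {
    "prefill": {"theoretical": 1.403, "empirical": 1.141},
    "decode": {"theoretical": 1.443, "empirical": 1.233},
}

def realized_fraction(theoretical, empirical):
    """Share of the theoretical time saving that appears in the measurement."""
    saved_theory = 1.0 - 1.0 / theoretical
    saved_measured = 1.0 - 1.0 / empirical
    return saved_measured / saved_theory

results = {}
for phase, s in row_1024.items():
    flop_ratio = 1.0 / s["theoretical"]  # reciprocal recovers the FLOP ratio
    frac = realized_fraction(s["theoretical"], s["empirical"])
    results[phase] = (flop_ratio, frac)
    print(phase, round(flop_ratio, 3), round(frac, 3))
```

On this row, decode realizes a larger share of its theoretical saving than prefill. A single row cannot show whether that pattern holds at other sequence lengths.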