
arxiv: 2605.18643 · v1 · pith:74DINWF6new · submitted 2026-05-18 · 💻 cs.LG · cs.AI · cs.CL

Post-Trained MoE Can Skip Half Experts via Self-Distillation

Pith reviewed 2026-05-20 11:56 UTC · model grok-4.3

keywords: Mixture-of-Experts, self-distillation, dynamic routing, inference efficiency, post-training adaptation, expert skipping, sparse activation, model compression

The pith

Post-trained static MoE models can be turned into dynamic ones that skip more than half of their expert computation with almost no accuracy loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ZEDA, a framework that converts already-trained static Mixture-of-Experts models into dynamic versions that skip unnecessary experts at inference time. It does this by injecting parameter-free zero-output experts into each layer and running two-stage self-distillation with the original model as a frozen teacher, plus a group-level balancing loss to stabilize training. On Qwen3-30B-A3B and GLM-4.7-Flash, the method cuts expert FLOPs by more than 50 percent across 11 benchmarks in math, code, and instruction following while keeping accuracy nearly unchanged. It also beats the strongest prior dynamic MoE baseline by 6.1 and 4.0 points on the two models and yields roughly 1.2 times faster end-to-end inference.

Core claim

Zero-Expert Self-Distillation Adaptation (ZEDA) transforms post-trained static MoE models into efficient dynamic ones by injecting parameter-free zero-output experts into each MoE layer and adapting the augmented model through two-stage self-distillation, utilizing the original MoE as a frozen teacher and applying a group-level balancing loss. On Qwen3-30B-A3B and GLM-4.7-Flash across 11 benchmarks spanning math, code, and instruction following, ZEDA eliminates over 50% of expert FLOPs at marginal accuracy loss, outperforming the strongest dynamic MoE baseline by 6.1 and 4.0 points on the two models, and delivers ~1.20 times end-to-end inference speedup.
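To make the "frozen teacher" idea concrete, here is a minimal sketch of a token-level distillation objective. It assumes a forward KL divergence between the teacher's and the student's next-token distributions; the page does not state which divergence the paper uses or how its two stages differ, so the function names (`softmax`, `kl_divergence`, `self_distillation_loss`) and the `temperature` parameter are illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps)) for pi, qi in zip(p, q)
    )


def self_distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Mean per-token KL(teacher || student) over a sequence.

    teacher_logits: outputs of the frozen original MoE (no gradient in practice).
    student_logits: outputs of the augmented model being adapted.
    Both are lists of per-token logit lists. Forward KL is an assumption here.
    """
    per_token = [
        kl_divergence(softmax(t, temperature), softmax(s, temperature))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token)
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the augmented model drifts, which is the property the adaptation relies on to preserve the original capabilities.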

What carries the argument

The injection of parameter-free zero-output experts into each MoE layer together with two-stage self-distillation from the frozen original static MoE as teacher and a group-level balancing loss.
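The skipping mechanism can be sketched as follows. This is an illustrative toy, assuming a softmax router over real plus zero experts with plain top-k selection and no renormalization of gate weights; none of these details are confirmed by the page, and `route_token` and `moe_layer` are hypothetical names.

```python
import math


def route_token(router_logits, num_real, top_k):
    """Pick top-k experts; indices >= num_real denote zero-output experts.

    Returns (index, gate_weight) pairs for the selected REAL experts only.
    """
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    return [(i, probs[i]) for i in top if i < num_real]


def moe_layer(x, router_logits, experts, top_k):
    """Apply a toy MoE layer to one token.

    x: list of floats. experts: list of callables (real experts only).
    A selected zero expert contributes nothing and is never executed,
    so each zero-expert slot in the top-k is one expert FFN saved.
    Returns (output, number_of_real_experts_executed).
    """
    out = [0.0] * len(x)
    selected = route_token(router_logits, len(experts), top_k)
    for idx, weight in selected:
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, len(selected)
```

Because the zero experts have no parameters, the only added cost in this sketch is the wider router projection; the savings come from every top-k slot that lands on a zero expert.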

If this is right

  • Already-trained static MoE models can be converted to dynamic routing without full retraining from scratch.
  • Expert FLOPs can be reduced by more than half while accuracy stays nearly the same on math, code, and instruction tasks.
  • The resulting dynamic models outperform the strongest earlier dynamic MoE baseline by 6.1 and 4.0 points on the two tested models.
  • End-to-end inference speed improves by about 1.2 times after the adaptation.
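A short calculation shows why halving expert FLOPs does not halve latency. The sketch below uses an Amdahl's-law style model and assumes, purely for illustration, that runtime is proportional to FLOPs and that routing overhead is negligible; the expert time fraction it derives is an implication of that simple model, not a figure reported by the paper.

```python
def end_to_end_speedup(expert_time_fraction, expert_flop_reduction):
    """Amdahl-style speedup when only the expert share of runtime shrinks.

    expert_time_fraction: share of baseline runtime spent in expert FFNs.
    expert_flop_reduction: fraction of expert FLOPs removed (0.5 = half).
    """
    remaining = (1.0 - expert_time_fraction) + expert_time_fraction * (
        1.0 - expert_flop_reduction
    )
    return 1.0 / remaining


def implied_expert_fraction(speedup, expert_flop_reduction):
    """Invert the model: expert runtime share consistent with a given speedup."""
    return (1.0 - 1.0 / speedup) / expert_flop_reduction
```

Under these assumptions, a 1.2 times speedup from a 50% expert-FLOP cut corresponds to experts accounting for about one third of baseline runtime, with attention, routing, and communication making up the rest.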

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Many computations performed by experts in a trained MoE may be redundant for a large fraction of inputs, allowing safe bypass at serving time.
  • The same zero-expert injection plus distillation pattern could be tested on other sparse architectures beyond standard MoE layers.
  • If the balancing loss proves robust, similar post-training routes might reduce activation costs in future larger MoE variants without changing pre-training.

Load-bearing premise

Adding zero-output experts and distilling from the frozen original model will preserve the original capabilities without major degradation.

What would settle it

An experiment on the same models and benchmarks showing either that accuracy falls by more than a few points after ZEDA adaptation, or that measured wall-clock speedup drops below 1.1 times because of routing overhead.


Mixture-of-Experts (MoE) scales language models efficiently through sparse expert activation, and its dynamic variant further reduces computation by adjusting the activated experts in an input-dependent manner. Existing dynamic MoE methods usually rely on pre-training from scratch or task-specific adaptation, leaving the practical conversion of fully trained MoE underexplored. Enabling such adaptation would directly alleviate inference costs by allowing easy tokens to bypass unnecessary experts during serving. This paper introduces Zero-Expert Self-Distillation Adaptation (ZEDA), a low-cost framework that transforms post-trained static MoE models into efficient dynamic ones. To stabilize this architectural conversion, ZEDA injects parameter-free zero-output experts into each MoE layer and adapts the augmented model through two-stage self-distillation, utilizing the original MoE as a frozen teacher and applying a group-level balancing loss. On Qwen3-30B-A3B and GLM-4.7-Flash across 11 benchmarks spanning math, code, and instruction following, ZEDA eliminates over 50% of expert FLOPs at marginal accuracy loss. It outperforms the strongest dynamic MoE baseline by 6.1 and 4.0 points on the two models, and delivers ~1.20× end-to-end inference speedup.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Zero-Expert Self-Distillation Adaptation (ZEDA), a framework to convert post-trained static MoE models into dynamic ones. It injects parameter-free zero-output experts into each MoE layer and adapts the model via two-stage self-distillation (using the original static MoE as frozen teacher) plus a group-level balancing loss. On Qwen3-30B-A3B and GLM-4.7-Flash across 11 benchmarks, ZEDA is reported to eliminate over 50% of expert FLOPs at marginal accuracy loss while outperforming the strongest dynamic MoE baseline by 6.1 and 4.0 points and achieving ~1.20x end-to-end speedup.

Significance. If the empirical results hold under rigorous validation, the work would provide a practical low-cost route to retrofit existing post-trained MoE models for dynamic expert skipping, directly addressing inference cost in large-scale deployments without requiring pre-training from scratch or task-specific fine-tuning.

major comments (3)
  1. Abstract and experimental summary: the headline claim of ≥50% expert-FLOP reduction at marginal accuracy loss (and the 6.1/4.0-point gains) is presented only as a high-level quantitative summary; no error bars, ablation tables, or precise data-exclusion rules are referenced, so the central empirical result cannot be assessed for robustness.
  2. Method description (ZEDA framework): the necessity of injecting parameter-free zero-output experts is asserted to stabilize the router's ability to route easy tokens while preserving output on hard tokens, yet no ablation is reported that compares this choice against simply lowering top-k or adding a learned skip token; this directly bears on whether the reported FLOP savings are attributable to the proposed mechanism.
  3. Results on Qwen3-30B-A3B and GLM-4.7-Flash: the outperformance over the strongest dynamic baseline is stated without per-benchmark tables, statistical significance tests, or breakdown by task category (math/code/instruction), leaving open whether the gains are uniform or concentrated on easier subsets where zero-expert routing is trivial.
minor comments (2)
  1. Clarify the exact formulation of the group-level balancing loss (e.g., provide its equation and hyper-parameter schedule) so that the two-stage distillation procedure can be reproduced.
  2. Add a short paragraph contrasting ZEDA with prior dynamic-MoE adaptation methods that also use distillation, to better position the novelty of the zero-expert injection step.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, proposing specific revisions to strengthen the empirical presentation and methodological justification while preserving the core contributions of ZEDA.

  1. Referee: Abstract and experimental summary: the headline claim of ≥50% expert-FLOP reduction at marginal accuracy loss (and the 6.1/4.0-point gains) is presented only as a high-level quantitative summary; no error bars, ablation tables, or precise data-exclusion rules are referenced, so the central empirical result cannot be assessed for robustness.

    Authors: We agree that additional details are needed to allow rigorous assessment of robustness. In the revision we will expand the experimental section and appendix to include (i) error bars computed over multiple random seeds for the main results where compute permits, (ii) explicit ablation tables for the two-stage distillation and balancing loss, and (iii) a precise statement of any data-exclusion or filtering rules applied to the 11 benchmarks. These additions will make the ≥50% FLOP reduction and the 6.1/4.0-point gains directly verifiable. revision: yes

  2. Referee: Method description (ZEDA framework): the necessity of injecting parameter-free zero-output experts is asserted to stabilize the router's ability to route easy tokens while preserving output on hard tokens, yet no ablation is reported that compares this choice against simply lowering top-k or adding a learned skip token; this directly bears on whether the reported FLOP savings are attributable to the proposed mechanism.

    Authors: We acknowledge that an explicit ablation would strengthen the causal link between the zero-output expert design and the observed savings. We will add a new ablation subsection that directly compares (a) our parameter-free zero-output experts, (b) simply lowering top-k on the original model, and (c) introducing a learned skip token. The results will quantify how each variant affects router stability, FLOP reduction, and accuracy, thereby clarifying the contribution of the proposed mechanism. revision: yes

  3. Referee: Results on Qwen3-30B-A3B and GLM-4.7-Flash: the outperformance over the strongest dynamic baseline is stated without per-benchmark tables, statistical significance tests, or breakdown by task category (math/code/instruction), leaving open whether the gains are uniform or concentrated on easier subsets where zero-expert routing is trivial.

    Authors: We agree that per-benchmark granularity and task-category analysis are important for interpreting the gains. In the revised manuscript we will (i) move the full per-benchmark accuracy and FLOP tables to the main body or a prominent appendix, (ii) report statistical significance (paired t-tests or Wilcoxon tests) between ZEDA and the strongest baseline, and (iii) provide a breakdown by task category (math, code, instruction following) showing that the 6.1- and 4.0-point improvements hold across categories rather than being driven solely by easier subsets. revision: yes
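The paired comparison promised in the third response can be illustrated with a small significance test. The rebuttal names paired t-tests or Wilcoxon tests; the sketch below instead uses an exact sign-flip permutation test, swapped in because it needs only the standard library and makes no normality assumption. The function name and inputs are hypothetical, and no benchmark scores from the paper are used.

```python
from itertools import product


def paired_permutation_pvalue(scores_a, scores_b):
    """Two-sided exact sign-flip permutation test on paired scores.

    scores_a, scores_b: per-benchmark accuracies of two methods, same order.
    Under the null hypothesis each paired difference is equally likely to
    have either sign, so all 2**n sign assignments are enumerated.
    With 11 benchmarks that is 2048 assignments, which is cheap.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        permuted = abs(sum(s * d for s, d in zip(signs, diffs)))
        if permuted >= observed - 1e-12:
            hits += 1
    return hits / total
```

A test like this addresses only whether the average gap could arise by chance across benchmarks; the per-category breakdown the referee asks for still has to be reported separately.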

Circularity Check

0 steps flagged

No circularity: empirical adaptation procedure with independent benchmark validation


The paper describes ZEDA as an engineering adaptation: parameter-free zero-output experts are injected, followed by two-stage self-distillation against a frozen teacher plus a group-level balancing loss. Reported gains (≥50% expert-FLOP reduction at marginal accuracy loss on 11 benchmarks) are measured directly on held-out tasks for Qwen3-30B-A3B and GLM-4.7-Flash. No equations, fitted parameters, or self-citations are presented as load-bearing derivations that reduce the headline result to a tautology or to quantities defined by the inputs themselves. The method is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the untested premise that self-distillation from a frozen teacher plus zero-output experts will not introduce new failure modes; no free parameters or invented entities beyond the zero-output experts are quantified in the abstract.

axioms (1)
  • domain assumption: Self-distillation from a frozen original MoE teacher preserves downstream capability after architectural augmentation.
    Invoked when the original model is used as teacher in the two-stage adaptation.
invented entities (1)
  • parameter-free zero-output experts (no independent evidence)
    purpose: Stabilize architectural conversion by allowing tokens to bypass experts without changing output distribution.
    Introduced in each MoE layer to enable skipping; no independent evidence provided outside the adaptation results.



Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 20 internal anchors

  1. [1]

    GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

    Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668,

  2. [2]

    Mixtral of Experts

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts.arXiv preprint arXiv:2401.04088,

  3. [3]

    Moe++: Accelerating mixture-of-experts methods with zero-computation experts

    Peng Jin, Bo Zhu, Li Yuan, and Shuicheng Yan. Moe++: Accelerating mixture-of-experts methods with zero-computation experts. arXiv preprint arXiv:2410.07348,

  4. [4]

    Longcat-flash technical report

    Meituan LongCat Team, Bei Li, Bingye Lei, Bo Wang, Bolin Rong, Chao Wang, Chao Zhang, Chen Gao, Chen Zhang, Cheng Sun, et al. Longcat-flash technical report. arXiv preprint arXiv:2509.01322,

  5. [5]

    Grove moe: Towards efficient and superior moe llms with adjugate experts

    Haoyuan Wu, Haoxing Chen, Xiaodong Chen, Zhanchao Zhou, Tieyuan Chen, Yihong Zhuang, Guoshan Lu, Zenan Huang, Junbo Zhao, Lin Liu, et al. Grove moe: Towards efficient and superior moe llms with adjugate experts. arXiv preprint arXiv:2508.07785,

  6. [6]

    Dynamic mixture of experts: An auto-tuning approach for efficient transformer models

    Yongxin Guo, Zhenglin Cheng, Xiaoying Tang, Zhaopeng Tu, and Tao Lin. Dynamic mixture of experts: An auto-tuning approach for efficient transformer models. arXiv preprint arXiv:2405.14297,

  7. [7]

    Moe lens–an expert is all you need

    Marmik Chaudhari, Idhant Gulati, Nishkal Hundia, Pranav Karra, and Shivam Raval. Moe lens–an expert is all you need. arXiv preprint arXiv:2603.05806,

  8. [8]

    Adamoe: Token-adaptive routing with null experts for mixture-of-experts language models

    Zihao Zeng, Yibo Miao, Hongcheng Gao, Hao Zhang, and Zhijie Deng. Adamoe: Token-adaptive routing with null experts for mixture-of-experts language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 6223–6235,

  9. [9]

    GLM-5: from Vibe Coding to Agentic Engineering

    Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, et al. Glm-5: from vibe coding to agentic engineering.arXiv preprint arXiv:2602.15763,

  10. [10]

    MiniLLM: On-Policy Distillation of Large Language Models

    Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: Knowledge distillation of large language models.arXiv preprint arXiv:2306.08543,

  11. [11]

    Qwen3 Technical Report

    doi: 10.64434/tml.20251026. https://thinkingmachines.ai/blog/on-policy-distillation. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388,

  12. [12]

    GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

    Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, et al. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models. arXiv preprint arXiv:2508.06471,

  13. [13]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,

  14. [14]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974,

  15. [15]

    Instruction-Following Evaluation for Large Language Models

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models.arXiv preprint arXiv:2311.07911,

  16. [16]

    Generalizing Verifiable Instruction Following

    Valentina Pyatkin, Saumya Malik, Victoria Graf, Hamish Ivison, Shengyi Huang, Pradeep Dasigi, Nathan Lambert, and Hannaneh Hajishirzi. Generalizing verifiable instruction following.arXiv preprint arXiv:2507.02833,

  17. [17]

    Acereason-nemotron 1.1: Advancing math and code reasoning through sft and rl synergy

    Zihan Liu, Zhuolin Yang, Yang Chen, Chankyu Lee, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Acereason-nemotron 1.1: Advancing math and code reasoning through sft and rl synergy. arXiv preprint arXiv:2506.13284,

  18. [18]

    Llama-nemotron: Efficient reasoning models, 2025

    Akhiad Bercovich, Itay Levy, Izik Golan, Mohammad Dabbah, Ran El-Yaniv, Omri Puny, Ido Galil, Zach Moshe, Tomer Ronen, Najeeb Nabwani, et al. Llama-nemotron: Efficient reasoning models. arXiv preprint arXiv:2505.00949,

  19. [19]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053,

  20. [20]

    Scaling reasoning efficiently via relaxed on-policy distillation

    Jongwoo Ko, Sara Abdali, Young Jin Kim, Tianyi Chen, and Pashmina Cameron. Scaling reasoning efficiently via relaxed on-policy distillation. arXiv preprint arXiv:2603.11137, 2026

  21. [21]

    Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

    Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, et al. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning.arXiv preprint arXiv:2506.01939,

  22. [22]

    Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, et al. Are we done with mmlu? In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologi...

  23. [23]

    GPQA: A Graduate-Level Google-Proof Q&A Benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022,

  24. [24]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538,

  25. [25]

    ST-MoE: Designing Stable and Transferable Sparse Expert Models

    Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. St-moe: Designing stable and transferable sparse expert models.arXiv preprint arXiv:2202.08906,

  26. [26]

    Efficient expert pruning for sparse mixture-of-experts language models: Enhancing performance and reducing inference costs

    Enshu Liu, Junyi Zhu, Zinan Lin, Xuefei Ning, Matthew B Blaschko, Shengen Yan, Guohao Dai, Huazhong Yang, and Yu Wang. Efficient expert pruning for sparse mixture-of-experts language models: Enhancing performance and reducing inference costs. arXiv preprint arXiv:2407.00945,

  27. [27]

    Merge, then compress: Demystify efficient smoe with hints from its routing policy

    Pingzhi Li, Zhenyu Zhang, Prateek Yadav, Yi-Lin Sung, Yu Cheng, Mohit Bansal, and Tianlong Chen. Merge, then compress: Demystify efficient smoe with hints from its routing policy. arXiv preprint arXiv:2310.01334,

  28. [28]

    Retraining-free merging of sparse moe via hierarchical clustering

    I Chen, Hsu-Shen Liu, Wei-Fang Sun, Chen-Hao Chao, Yen-Chang Hsu, Chun-Yi Lee, et al. Retraining-free merging of sparse moe via hierarchical clustering. arXiv preprint arXiv:2410.08589,

  29. [29]

    Diversifying the expert knowledge for task-agnostic pruning in sparse mixture-of-experts

    Zeliang Zhang, Xiaodong Liu, Hao Cheng, Chenliang Xu, and Jianfeng Gao. Diversifying the expert knowledge for task-agnostic pruning in sparse mixture-of-experts. In Findings of the Association for Computational Linguistics: ACL 2025, pages 86–102,

  30. [30]

    Lightmoe: Reducing mixture-of-experts redundancy through expert replacing

    Jiawei Hao, Zhiwei Hao, Jianyuan Guo, Li Shen, Yong Luo, Han Hu, and Dan Zeng. Lightmoe: Reducing mixture-of-experts redundancy through expert replacing. arXiv preprint arXiv:2603.12645,

  31. [31]

    Expert threshold routing for autoregressive language modeling with dynamic computation allocation and load balancing

    Hanchi Sun, Yixin Liu, Yonghui Wu, and Lichao Sun. Expert threshold routing for autoregressive language modeling with dynamic computation allocation and load balancing. arXiv preprint arXiv:2603.11535,

  32. [32]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531,

  33. [33]

    Sequence-level knowledge distillation

    Yoon Kim and Alexander M Rush. Sequence-level knowledge distillation. In Proceedings of the 2016 conference on empirical methods in natural language processing, pages 1317–1327,

  34. [34]

    DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

    Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter.arXiv preprint arXiv:1910.01108,

  35. [35]

    Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes

    Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alex Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8003–8017,

  36. [36]

    Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

    Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models.arXiv preprint arXiv:2601.18734,

  37. [37]

    Self-Distillation Enables Continual Learning

    Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learning.arXiv preprint arXiv:2601.19897,

  38. [38]

    Reinforcement Learning via Self-Distillation

    Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation.arXiv preprint arXiv:2601.20802,

  39. [39]

    Rad: Redundancy-aware distillation for hybrid models via self-speculative decoding

    Yuichiro Hoshino, Hideyuki Tachibana, Muneyoshi Inahara, and Hiroto Takegawa. Rad: Redundancy-aware distillation for hybrid models via self-speculative decoding. arXiv preprint arXiv:2505.22135,

  40. [40]

    Hybrid linear attention done right: Efficient distillation and effective architectures for extremely long contexts

    Yingfa Chen, Zhen Leng Thai, Zihan Zhou, Zhu Zhang, Xingyu Shen, Shuo Wang, Chaojun Xiao, Xu Han, and Zhiyuan Liu. Hybrid linear attention done right: Efficient distillation and effective architectures for extremely long contexts. arXiv preprint arXiv:2601.22156,

  41. [41]

    Ladimo: Layer-wise distillation inspired moefier

    Sungyoon Kim, Youngjun Kim, Kihyo Moon, and Minsung Jang. Ladimo: Layer-wise distillation inspired moefier. arXiv preprint arXiv:2408.04278,

  42. [42]

    Gqa: Training generalized multi-query transformer models from multi-head checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4895–4901,

  43. [43]

    The speedup gradually diminishes as sequence length increases. Nevertheless, even at 8k, a commonly used long-context setting, ZEDA still achieves approximately 20% speedup, demonstrating its practical usability. Furthermore, ZEDA exhibits greater potential for advanced communication frameworks like DeepEP [Zhao et al., 2025], which we aim to integrate in ...

  44. [44]

    D.1. Shared MoE Cost Decomposition

    The MoE FFN and router costs have the same form in both stages; the only difference is the number of tokens processed in the current forward pass. Let n denote that token count. For the original ... Table 11 | Notation used in the theoretical FLOP analysis. Symbol ...

  45. [45]

    Table 12 | Architectural parameters of Qwen3-30B-A3B used in the FLOP analysis.

    Symbol  H     H_attn  g_kv  H_e  N    N_Z  K
    Value   2048  4096    1/8   768  128  64   8

    To facilitate direct comparison with empirical measurements, we convert the FLOP ratios in Equations (14) and (18) into theoretical speedups by taking their reciprocals. Table 13 reports the resulting prefill and de...

  46. [46]

    Table 13 | Comparison between theoretical speedups derived from the FLOP analysis and measured empirical speedups on Qwen3-30B-A3B across different sequence lengths.

    Length  Prefill speedup           Decode speedup
            Theoretical  Empirical    Theoretical  Empirical
    1024    1.403x       1.141x       1.443x       1.233x
    2048    1.341x       1...
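As a rough companion to the Table 12 parameters quoted in entry 45, the sketch below counts per-token FLOPs for the expert FFNs and the router. It is an assumption-laden estimate, not the paper's Equations (14) or (18), which are not reproduced on this page. It reads H as hidden size, H_e as expert intermediate size, N as the number of real experts, N_Z as the number of zero experts, and K as top-k (an interpretation, since the notation table is truncated), and it assumes each expert is a gated FFN with three projections at 2 FLOPs per multiply-accumulate.

```python
# Values taken from Table 12 as quoted above; symbol meanings are assumed.
H = 2048    # hidden size
H_E = 768   # expert intermediate size
N = 128     # real experts per layer
N_Z = 64    # zero-output experts per layer
K = 8       # experts selected per token


def expert_flops_per_token(active_real_experts, hidden=H, inter=H_E):
    """FLOPs for the executed experts of one token in one layer.

    Assumes gate, up, and down projections, each costing 2 * hidden * inter.
    """
    return active_real_experts * 3 * 2 * hidden * inter


def router_flops_per_token(num_experts, hidden=H):
    """FLOPs for one linear router projection over num_experts outputs."""
    return 2 * hidden * num_experts


baseline = expert_flops_per_token(K) + router_flops_per_token(N)
# Hypothetical case: on average half of the K slots land on zero experts.
adapted = expert_flops_per_token(K / 2) + router_flops_per_token(N + N_Z)
```

Under these assumptions the wider router adds about 0.26 MFLOPs per token per layer, less than 3% of a single expert's cost, so the router overhead is small relative to the expert FLOPs saved.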