pith. machine review for the scientific record.

arxiv: 2604.21330 · v1 · submitted 2026-04-23 · 💻 cs.CV

Recognition: unknown

Teacher-Guided Routing for Sparse Vision Mixture-of-Experts

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 21:48 UTC · model grok-4.3

classification 💻 cs.CV
keywords mixture of experts · sparse models · teacher-student learning · routing stability · vision transformers · image classification

The pith

A pretrained dense teacher supplies routing pseudo-labels to stabilize the student router in sparse vision Mixture-of-Experts models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Sparse Mixture-of-Experts networks receive gradients only through the experts chosen in the forward pass, leaving the router with sparse, localized signals that produce fluctuating expert assignments and slow convergence. TGR-MoE builds a teacher router directly from the intermediate representations of a pretrained dense model and treats its routing outputs as supervision targets for the student router. This external guidance reduces assignment fluctuations from the earliest training steps and yields higher accuracy on ImageNet-1K and CIFAR-100 while preserving the inference speed of the sparse architecture.
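
To make the mechanism concrete, here is a minimal sketch of how such pseudo-supervision could attach to a standard top-k router, assuming the guidance is a KL term between the teacher's and the student's routing distributions; the module names and the weighting coefficient are illustrative, not taken from the paper.

```python
# Minimal sketch (PyTorch-style) of teacher-guided routing supervision.
# Assumption: the guidance is a KL term between the teacher's and the student's
# routing distributions, weighted by a user-chosen coefficient alpha.
# All names here are illustrative, not the paper's API.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedRouter(nn.Module):
    def __init__(self, dim, num_experts, top_k=2):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)  # student router
        self.top_k = top_k

    def forward(self, tokens, teacher_logits=None):
        logits = self.gate(tokens)                                # [num_tokens, num_experts]
        weights, indices = torch.topk(logits.softmax(dim=-1), self.top_k, dim=-1)

        guidance_loss = logits.new_zeros(())
        if teacher_logits is not None:
            # Pseudo-supervision: pull the student's routing distribution toward
            # the teacher's; the teacher is detached so no gradient reaches it.
            guidance_loss = F.kl_div(
                F.log_softmax(logits, dim=-1),
                F.softmax(teacher_logits.detach(), dim=-1),
                reduction="batchmean",
            )
        return weights, indices, guidance_loss

# usage (alpha is a hypothetical weight): total = task_loss + alpha * guidance_loss
```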

Core claim

Constructing a teacher router from the intermediate representations of a pretrained dense model and using its routing outputs as pseudo-supervision for the student router suppresses frequent routing fluctuations and enables knowledge-guided expert selection from the early stages of training.

What carries the argument

Teacher router built from the pretrained dense model's intermediate representations, whose routing decisions serve as pseudo-supervision signals for the sparse student router.
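
A hedged sketch of that construction, assuming the teacher router is a lightweight gating head over frozen intermediate features of the dense model (the wording used in the simulated rebuttal below); which layer is tapped and how the head itself is fit are not specified in the abstract.

```python
# Sketch of the teacher-router idea: a small gating head on top of frozen
# intermediate features of the pretrained dense model. Which layer is tapped,
# the head's shape, and how the head is fit are all assumptions here.
import torch
import torch.nn as nn

class TeacherRouter(nn.Module):
    def __init__(self, dense_backbone: nn.Module, feat_dim: int, num_experts: int):
        super().__init__()
        self.backbone = dense_backbone.eval()         # pretrained dense teacher, frozen
        for p in self.backbone.parameters():
            p.requires_grad_(False)
        self.gate = nn.Linear(feat_dim, num_experts)  # lightweight gating head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            # Stand-in for extracting an intermediate representation; in practice
            # this would be a forward hook on a chosen block of the dense model.
            feats = self.backbone(x)
        return self.gate(feats)                       # teacher routing logits
```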

If this is right

  • Expert assignments remain consistent across training epochs instead of oscillating.
  • Classification accuracy increases on ImageNet-1K and CIFAR-100 under the same sparsity budget.
  • Stable convergence is observed even when only a very small fraction of experts is activated per token.
  • The student router inherits useful selection patterns from the dense teacher without additional inference cost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same intermediate-representation supervision could be tested on language-model MoE architectures to check whether routing instability is modality-specific.
  • Annealing the strength of the teacher signal over training might allow the student to eventually surpass the teacher's routing choices (a hypothetical schedule is sketched after this list).
  • Measuring how closely the student's final routing decisions match the teacher's could serve as a diagnostic for whether the dense model encodes transferable selection knowledge.
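
For the annealing extension above, a purely hypothetical schedule for the teacher-guidance weight; this is an editorial illustration, not a mechanism the paper describes.

```python
# Hypothetical cosine anneal of the teacher-guidance weight, decaying from
# alpha_max to alpha_min over training. Not something the paper proposes.
import math

def teacher_weight(step, total_steps, alpha_max=1.0, alpha_min=0.0):
    progress = min(step / max(total_steps, 1), 1.0)
    return alpha_min + 0.5 * (alpha_max - alpha_min) * (1.0 + math.cos(math.pi * progress))

# e.g. total = task_loss + teacher_weight(step, total_steps) * guidance_loss
```

A cosine decay would keep the guidance strong early, when routing is most unstable, and let the task loss dominate later; whether that actually lets the student surpass the teacher is exactly the open question the bullet raises.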

Load-bearing premise

Routing scores produced by the dense teacher's intermediate layers constitute suitable and unbiased targets for training the sparse student's router.

What would settle it

A controlled experiment in which a sparse MoE trained with TGR supervision exhibits the same or higher routing fluctuation rate (measured by assignment entropy or switch frequency) and no accuracy gain over an identical baseline trained without the teacher signals.
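
Both diagnostics named above (switch frequency and assignment entropy) can be computed from per-epoch top-1 expert assignments on a fixed probe set; a minimal sketch follows, with the exact definitions assumed rather than taken from the paper.

```python
# Sketch of two routing-stability diagnostics over a fixed probe set of tokens.
# Exact definitions are assumptions; the paper may define its metrics differently.
import numpy as np

def switch_frequency(prev_assign, curr_assign):
    """Fraction of tokens whose top-1 expert changed between two epochs.

    prev_assign, curr_assign: integer arrays of top-1 expert ids per token.
    """
    return float(np.mean(prev_assign != curr_assign))

def assignment_entropy(assign, num_experts):
    """Entropy of the empirical distribution of tokens over experts in one epoch."""
    counts = np.bincount(assign, minlength=num_experts).astype(float)
    probs = counts / counts.sum()
    probs = probs[probs > 0]
    return float(-(probs * np.log(probs)).sum())
```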

Figures

Figures reproduced from arXiv: 2604.21330 by Ikuro Sato, Masahiro Kada, Rei Kawakami, Ryota Yoshihashi, Satoshi Ikehata.

Figure 1. Comparison between standard sparse MoE routing (left) …
Figure 2. Detailed architecture of the proposed TGR-MoE, illus…
Figure 3. Comparison of routing consistency during training.
Original abstract

Recent progress in deep learning has been driven by increasingly large-scale models, but the resulting computational cost has become a critical bottleneck. Sparse Mixture of Experts (MoE) offers an effective solution by activating only a small subset of experts for each input, achieving high scalability without sacrificing inference speed. Although effective, sparse MoE training exhibits characteristic optimization difficulties. Because the router receives informative gradients only through the experts selected in the forward pass, it suffers from gradient blocking and obtains little information from unselected routes. This limited, highly localized feedback makes it difficult for the router to learn appropriate expert-selection scores and often leads to unstable routing dynamics, such as fluctuating expert assignments during training. To address this issue, we propose TGR-MoE: Teacher-Guided Routing for Sparse Vision Mixture-of-Experts, a simple yet effective method that stabilizes router learning using supervision derived from a pretrained dense teacher model. TGR-MoE constructs a teacher router from the teacher's intermediate representations and uses its routing outputs as pseudo-supervision for the student router, suppressing frequent routing fluctuations during training and enabling knowledge-guided expert selection from the early stages of training. Extensive experiments on ImageNet-1K and CIFAR-100 demonstrate that TGR consistently improves both accuracy and routing consistency, while maintaining stable training even under highly sparse configurations.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, simulated author's rebuttal, circularity audit, and axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces TGR-MoE, a method for training sparse vision Mixture-of-Experts models that constructs a teacher router from the intermediate representations of a pretrained dense model and uses its routing outputs as pseudo-supervision for the student router. This is intended to mitigate gradient blocking, suppress routing fluctuations, and enable stable, knowledge-guided expert selection from early training stages. The authors claim that the approach yields consistent gains in accuracy and routing consistency on ImageNet-1K and CIFAR-100 while supporting stable training under high sparsity.

Significance. If the empirical results hold and the teacher guidance proves unbiased, TGR-MoE would provide a practical mechanism to address a core optimization difficulty in sparse MoE training, potentially enabling higher sparsity levels in vision models without instability. The method is simple and leverages existing dense pretrained models, which is a strength for reproducibility. Its significance would be higher if ablations showed that the gains are specifically attributable to the teacher supervision rather than generic regularization and if the student router demonstrably diverges productively from the teacher targets.

major comments (2)
  1. Abstract: The claim of 'consistent improvements in accuracy and routing consistency' is central but unsupported by any quantitative results, baselines, ablation details, or statistical tests in the abstract. Without these, the magnitude of the benefit and its dependence on the teacher guidance cannot be evaluated.
  2. Method (teacher router construction): The load-bearing assumption that routing decisions extracted from the dense teacher's intermediate representations constitute high-quality, unbiased pseudo-supervision is not accompanied by evidence that the student router can deviate from these targets when they are suboptimal or that performance degrades without the supervision. The different computation graph of the dense teacher (no expert gating) risks domain mismatch that could either over-constrain or misalign expert selection.
minor comments (1)
  1. The abstract would be strengthened by including at least one key quantitative result (e.g., top-1 accuracy delta or routing consistency metric) to allow readers to gauge the effect size immediately.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and describe the revisions we will make to improve the clarity and evidential support of the manuscript.

read point-by-point responses
  1. Referee: Abstract: The claim of 'consistent improvements in accuracy and routing consistency' is central but unsupported by any quantitative results, baselines, ablation details, or statistical tests in the abstract. Without these, the magnitude of the benefit and its dependence on the teacher guidance cannot be evaluated.

    Authors: We agree that the abstract would be strengthened by including concrete quantitative results. In the revised manuscript we will update the abstract to report specific accuracy gains on ImageNet-1K and CIFAR-100, the corresponding improvements in routing consistency metrics, and brief references to the main baselines and ablation settings used in the experiments. revision: yes

  2. Referee: Method (teacher router construction): The load-bearing assumption that routing decisions extracted from the dense teacher's intermediate representations constitute high-quality, unbiased pseudo-supervision is not accompanied by evidence that the student router can deviate from these targets when they are suboptimal or that performance degrades without the supervision. The different computation graph of the dense teacher (no expert gating) risks domain mismatch that could either over-constrain or misalign expert selection.

    Authors: This point is well taken. While the primary experimental results demonstrate consistent gains with TGR-MoE, we acknowledge that direct evidence of productive divergence and the necessity of the teacher signal is currently implicit rather than explicit. In the revision we will add (i) an ablation that removes the teacher pseudo-supervision entirely and reports the resulting drop in both accuracy and routing stability, and (ii) quantitative analysis of the divergence between teacher and student routing decisions across training epochs. We will also expand the method section to clarify the construction of the teacher router (a lightweight gating head applied to the dense model's intermediate features) and discuss how this design reduces, though does not eliminate, the computation-graph mismatch. revision: partial
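
If the divergence analysis promised in (ii) is reported as per-token agreement between teacher and student top-k selections, it could be computed as in the following sketch; the overlap metric is an assumption, not the authors' stated protocol.

```python
# Sketch of a teacher-student routing agreement metric: the mean fraction of the
# student's top-k experts that also appear in the teacher's top-k for each token.
# Illustrative only; the authors' planned divergence analysis may differ.
import torch

def topk_agreement(student_logits, teacher_logits, k=2):
    s_idx = student_logits.topk(k, dim=-1).indices          # [num_tokens, k]
    t_idx = teacher_logits.topk(k, dim=-1).indices          # [num_tokens, k]
    # For each student pick, check whether it appears among the teacher's picks.
    matches = (s_idx.unsqueeze(-1) == t_idx.unsqueeze(-2)).any(dim=-1)  # [num_tokens, k]
    return float(matches.float().mean())
```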

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's core proposal constructs a teacher router from a fixed, pretrained dense model's intermediate representations and applies its outputs as pseudo-supervision to the student router. This is an external knowledge-transfer mechanism rather than a self-referential loop. No equations, parameter fits, or derivation steps in the abstract or described method reduce the claimed stabilization or accuracy gains to quantities defined by the student itself. The teacher is pretrained independently on the same data distribution but with a different (dense) architecture, providing an independent signal. No self-citation chains, ansatz smuggling, or renaming of known results appear as load-bearing elements. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review based on abstract only; the central claim rests on the domain assumption that teacher-derived routing signals are beneficial and transferable.

axioms (1)
  • domain assumption Routing outputs from a pretrained dense model's intermediate representations provide useful and non-misleading pseudo-supervision for a sparse student MoE router.
    This assumption underpins the entire teacher-guided approach and is not derived or proven in the abstract.

pith-pipeline@v0.9.0 · 5543 in / 1183 out tokens · 32248 ms · 2026-05-09T21:48:55.214018+00:00 · methodology

