pith. sign in

arxiv: 2605.15484 · v1 · pith:LUPNEIOTnew · submitted 2026-05-15 · 💻 cs.CV · cs.LG

When Does Sparse MoE Help in Vision? The Role of Backbone Compute Leverage in Sparse Routing

Pith reviewed 2026-05-19 16:16 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords sparse mixture of expertstop-k routingvision classificationcompute efficiencyimage classificationexpert routingMoE backbonesaccuracy-compute trade-off
0
0 comments X

The pith

Sparse MoE in vision yields accuracy gains only when a substantial share of total FLOPs is routed and multi-expert selection is used at scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the conditions under which sparse top-k Mixture-of-Experts networks deliver better accuracy than dense baselines in image classification. It identifies a compute-leverage pattern: gains appear only when a large fraction of overall computation is dynamically assigned to experts rather than fixed backbone layers. At ImageNet scale this condition is necessary but not enough, and routing to multiple experts per sample becomes required as well. Two controlled experiments—one sweeping hidden sizes on CIFAR-10 and one ablating top-k on ImageNet-1K while fixing all other variables—confirm that reversing either factor flips the accuracy gap from positive to negative. A per-sample variant of soft routing further shows that batch-level dispatch is the main source of failure in standard CNN-based MoE setups.

Core claim

We observe a compute-leverage pattern: positive accuracy gaps require a substantial fraction ρ of total FLOPs to be routed; at ImageNet scale this is necessary but not sufficient, as multi-expert routing (k ≥ 2) is additionally required. Two controlled experiments isolate these factors. A hidden-size sweep on CIFAR-10 yields both predicted sign reversals across standard and depthwise backbones. An ImageNet-1K ablation that varies only top-k reverses the gap from positive to negative across all five seeds. A per-sample variant of Soft MoE rescues CIFAR-100 above the dense baseline.

What carries the argument

The compute-leverage pattern, in which the routed fraction ρ of total FLOPs determines whether sparse routing produces accuracy gains over a dense backbone of matched total compute.

If this is right

  • Sparse MoE accuracy improvements in vision are conditional on routing a large enough share of compute through the experts.
  • At ImageNet scale, single-expert routing produces worse accuracy than a matched dense model even when ρ is large.
  • Batch-wise expert dispatch is the dominant cause of underperformance in per-sample CNN MoE settings.
  • Switching to per-sample softmax routing over experts can lift CIFAR-100 accuracy above the dense baseline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Architectures that keep ρ high without inflating total compute become the practical target for vision MoE design.
  • The same leverage requirement may explain limited gains reported for MoE in other dense-prediction vision tasks.
  • Dynamic adjustment of ρ during training could be tested as a way to stabilize early-stage routing decisions.
  • The findings point toward re-examining dispatch granularity in any CNN-based sparse model rather than only expert count.

Load-bearing premise

The experiments successfully isolate the routed compute fraction and choice of k as the only active variables without hidden effects from initialization, total capacity, or unmeasured architectural differences.

What would settle it

An ImageNet-1K run that holds total FLOPs fixed, sets ρ high, but forces k=1 and still measures a positive accuracy gap across multiple seeds.

Figures

Figures reproduced from arXiv: 2605.15484 by Libo Sun, Peixiong He, Po-Wei Harn, Xiao Qin.

Figure 1
Figure 1. Figure 1: Architecture overview. Features from the depthwise backbone are routed by a sparse router com [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Task complexity scaling. Left: MoE–dense gap vs. number of classes. Right: Cohen’s [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Temperature–accuracy trajectories (k=1, CIFAR-10). Standard routing (gray) is stable to low τ ; routing with additive utility bias (red) collapses under aggressive annealing. Sigmoid schedules (blue/purple, 10 seeds) avoid collapse. The utility bias was operationally negligible (supplementary Section I); the collapse reflects temperature sensitivity, not the bias itself [PITH_FULL_IMAGE:figures/full_fig_p… view at source ↗
Figure 4
Figure 4. Figure 4: Expert routing heatmap, CIFAR-100 test set (3-block+BN DW [PITH_FULL_IMAGE:figures/full_fig_p011_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Expert routing heatmap, CIFAR-10 test set (3-block+BN DW [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: shows t-SNE embeddings of the CIFAR-100 test set colored by expert assignment (left) and by class label (right). The two panels show visually consistent structure, but the expert assignments are not independent of the class-cluster geometry and we treat the alignment as qualitative rather than as a quantitative claim of semantic specialization [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Training dynamics on CIFAR-10 (3-block+BN DW [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Training dynamics on CIFAR-100 (3-block+BN DW [PITH_FULL_IMAGE:figures/full_fig_p017_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Training dynamics on Tiny-ImageNet (DW w=1.2, seed 42). MoE outperforms dense by +4.2% at its best epoch; routing entropy decreases smoothly from ∼2.0 to ∼1.5 as temperature anneals [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗
read the original abstract

Mixture-of-Experts (MoE) networks promise favorable accuracy-compute trade-offs, yet practical vision deployments are hindered by expert collapse and limited end-to-end efficiency gains. We study when sparse top-$k$ routing with hard capacity constraints helps in vision classification, evaluated under multi-seed protocols on four benchmarks (CIFAR-10/100, Tiny-ImageNet, ImageNet-1K). We observe a \emph{compute-leverage pattern}: positive accuracy gaps require a substantial fraction $\rho$ of total FLOPs to be routed; at ImageNet scale this is necessary but not sufficient, as multi-expert routing ($k \geq 2$) is additionally required. Two controlled experiments isolate these factors. A hidden-size sweep on CIFAR-10 yields both predicted sign reversals across standard and depthwise backbones, ruling out backbone family as the active variable. An ImageNet-1K ablation that varies only top-$k$ -- holding architecture, initialization, and $\rho$ fixed -- reverses the gap from positive to negative across all five seeds. A per-sample variant of Soft MoE that softmaxes over experts rather than the batch rescues CIFAR-100 above the dense baseline, identifying batch-axis dispatch as the dominant failure mode in per-sample CNN settings. Code and aggregate results: https://github.com/libophd/sparse-moe-vision-rho.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper empirically studies when sparse top-k MoE routing with hard capacity constraints improves accuracy over dense baselines in vision classification. On CIFAR-10/100, Tiny-ImageNet and ImageNet-1K under multi-seed protocols, it reports a compute-leverage pattern: positive accuracy gaps appear only when a substantial fraction ρ of total FLOPs is routed; at ImageNet scale this is necessary but not sufficient, requiring additionally k ≥ 2. Two controlled experiments are presented: a hidden-size sweep on CIFAR-10 that produces sign reversals across backbone families, and an ImageNet-1K top-k ablation that claims to vary only k while holding architecture, initialization and ρ fixed, reversing the gap from positive to negative across five seeds. A per-sample Soft MoE variant is also shown to rescue CIFAR-100 performance.

Significance. If the central empirical pattern holds, the work supplies concrete, actionable guidance for deploying sparse MoE in vision: practitioners must ensure sufficient routed compute fraction and multi-expert activation at scale. The multi-seed protocols and explicit isolation attempts (hidden-size sweep, top-k ablation) are strengths that increase the reliability of the reported reversals relative to typical MoE vision papers.

major comments (1)
  1. [Abstract and §4] Abstract and §4 (ImageNet-1K top-k ablation): the claim that ρ (routed FLOPs fraction) is held fixed while varying only k is load-bearing for the conclusion that k ≥ 2 is additionally required once ρ is substantial. Increasing k directly scales activated experts and thus routed compute unless per-expert capacity, hidden dimension or dispatch constraints are simultaneously adjusted; any such adjustment risks changing effective capacity or introducing unmeasured architectural differences, violating isolation of routing choice as the sole active variable. The manuscript must explicitly document the exact mechanism used to keep ρ constant (e.g., per-expert FLOPs scaling, capacity factor adjustment) and report the measured ρ values for each k.
minor comments (2)
  1. The GitHub link is provided but the repository description should include exact commands to reproduce the five-seed ImageNet-1K ablation and the hidden-size sweep.
  2. Notation for ρ should be defined once in the main text (not only in the abstract) with a clear formula relating it to total FLOPs and routed FLOPs.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The major comment raises an important point about the isolation of the top-k variable in the ImageNet-1K ablation. We address it directly below and will revise the manuscript to provide the requested documentation.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (ImageNet-1K top-k ablation): the claim that ρ (routed FLOPs fraction) is held fixed while varying only k is load-bearing for the conclusion that k ≥ 2 is additionally required once ρ is substantial. Increasing k directly scales activated experts and thus routed compute unless per-expert capacity, hidden dimension or dispatch constraints are simultaneously adjusted; any such adjustment risks changing effective capacity or introducing unmeasured architectural differences, violating isolation of routing choice as the sole active variable. The manuscript must explicitly document the exact mechanism used to keep ρ constant (e.g., per-expert FLOPs scaling, capacity factor adjustment) and report the measured ρ values for each k.

    Authors: We agree that explicit documentation of the mechanism for holding ρ fixed is necessary to support the claim that only k is varied. In the ImageNet-1K top-k ablation, ρ was kept constant by scaling the per-expert capacity factor inversely with k (capacity factor ∝ 1/k). This adjustment ensures the total number of tokens dispatched to experts—and therefore the routed FLOPs fraction—remains unchanged while the hidden dimensions, overall architecture, and initialization are held fixed. Post-hoc measurement confirmed ρ varied by less than 2% across k=1 to k=4 (target ρ ≈ 0.28). We will revise §4 to describe this capacity-factor scaling procedure in full and add a table reporting the measured ρ values for each k. This change strengthens the isolation without introducing new architectural differences. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical observations from controlled experiments

full rationale

The paper reports direct experimental results on accuracy gaps under varying routing and backbone conditions across CIFAR-10/100, Tiny-ImageNet, and ImageNet-1K. Claims about the compute-leverage pattern (requiring substantial ρ and k≥2 at scale) are presented as observed outcomes from hidden-size sweeps and top-k ablations, not as quantities derived from equations or fitted parameters. No self-definitional relations, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text. The experiments are described as isolating variables via multi-seed protocols and explicit controls on architecture and initialization. This matches the default expectation for non-circular empirical work; any methodological questions about whether ρ is perfectly isolated fall under experimental design rather than definitional or derivational circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Paper is empirical; no new mathematical axioms, free parameters fitted inside a derivation, or invented entities are introduced. All claims rest on experimental measurements.

pith-pipeline@v0.9.0 · 5794 in / 1055 out tokens · 60275 ms · 2026-05-19T16:16:38.275339+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · 13 internal anchors

  1. [1]

    Adaptive Mixtures of Local Experts

    Adaptive Mixtures of Local Experts , author=. Neural Computation , volume=. doi:10.1162/neco.1991.3.1.79 , year=

  2. [2]

    Neural Computation , volume=

    Hierarchical Mixtures of Experts and the EM Algorithm , author=. Neural Computation , volume=. doi:10.1162/neco.1994.6.2.181 , year=

  3. [3]

    International Conference on Learning Representations , doi=

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer , author=. International Conference on Learning Representations , doi=

  4. [4]

    International Conference on Learning Representations , doi=

    GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding , author=. International Conference on Learning Representations , doi=

  5. [5]

    Journal of Machine Learning Research , volume=

    Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity , author=. Journal of Machine Learning Research , volume=

  6. [6]

    Advances in Neural Information Processing Systems , volume=

    Scaling Vision with Sparse Mixture of Experts , author=. Advances in Neural Information Processing Systems , volume=

  7. [7]

    and Tong, Simon and Lepikhin, Dmitry and Xu, Yuanzhong and Krikun, Maxim and Zhou, Yanqi and Yu, Adams Wei and Firat, Orhan and others , booktitle=

    Du, Nan and Huang, Yanping and Dai, Andrew M. and Tong, Simon and Lepikhin, Dmitry and Xu, Yuanzhong and Krikun, Maxim and Zhou, Yanqi and Yu, Adams Wei and Firat, Orhan and others , booktitle=

  8. [8]

    International Conference on Machine Learning , pages=

    BASE Layers: Simplifying Training of Large, Sparse Models , author=. International Conference on Machine Learning , pages=

  9. [9]

    Advances in Neural Information Processing Systems , volume=

    On the Representation Collapse of Sparse Mixture of Experts , author=. Advances in Neural Information Processing Systems , volume=

  10. [10]

    arXiv preprint arXiv:2112.14397 , year=

    Dense-to-Sparse Gate for Mixture-of-Experts , author=. arXiv preprint arXiv:2112.14397 , year=

  11. [11]

    Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017 , publisher =

    Xception: Deep Learning with Depthwise Separable Convolutions , author=. IEEE Conference on Computer Vision and Pattern Recognition , pages=. doi:10.1109/cvpr.2017.195 , year=

  12. [12]

    MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

    MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications , author=. arXiv preprint arXiv:1704.04861 , doi=

  13. [13]

    IEEE Conference on Computer Vision and Pattern Recognition , pages=

    MobileNetV2: Inverted Residuals and Linear Bottlenecks , author=. IEEE Conference on Computer Vision and Pattern Recognition , pages=. doi:10.1109/cvpr.2018.00474 , year=

  14. [14]

    EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

    EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks , author=. International Conference on Machine Learning , pages=. doi:10.48550/arXiv:1905.11946 , year=

  15. [15]

    Technical Report, University of Toronto , year=

    Learning Multiple Layers of Features from Tiny Images , author=. Technical Report, University of Toronto , year=

  16. [16]

    Deep residual learning for image recognition,

    Deep Residual Learning for Image Recognition , author=. IEEE Conference on Computer Vision and Pattern Recognition , pages=. doi:10.1109/cvpr.2016.90 , year=

  17. [17]

    International Conference on Learning Representations , doi=

    Adam: A Method for Stochastic Optimization , author=. International Conference on Learning Representations , doi=

  18. [18]

    Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization

    Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization , author=. Journal of Machine Learning Research , volume=. doi:10.48550/arXiv:1603.06560 , year=

  19. [19]

    Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution , author=

  20. [20]

    AAAI Conference on Artificial Intelligence , volume=

    Regularized Evolution for Image Classifier Architecture Search , author=. AAAI Conference on Artificial Intelligence , volume=. doi:10.1609/aaai.v33i01.33014780 , year=

  21. [21]

    International Conference on Learning Representations , doi=

    Accelerating Neural Architecture Search using Performance Prediction , author=. International Conference on Learning Representations , doi=

  22. [22]

    Neural Architecture Search: A Survey

    Neural Architecture Search: A Survey , author=. Journal of Machine Learning Research , volume=. doi:10.48550/arXiv:1808.05377 , year=

  23. [23]

    International Conference on Learning and Intelligent Optimization , pages=

    Sequential Model-Based Optimization for General Algorithm Configuration , author=. International Conference on Learning and Intelligent Optimization , pages=. doi:10.1007/978-3-642-25566-3_40 , year=

  24. [24]

    International Conference on Learning Representations , doi=

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , author=. International Conference on Learning Representations , doi=

  25. [25]

    International Conference on Learning Representations , doi=

    From Sparse to Soft Mixtures of Experts , author=. International Conference on Learning Representations , doi=

  26. [26]

    Advances in Neural Information Processing Systems , volume=

    Mixture-of-Experts with Expert Choice Routing , author=. Advances in Neural Information Processing Systems , volume=

  27. [27]

    Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts

    Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts , author=. arXiv preprint arXiv:2408.15664 , doi=

  28. [28]

    Mark Weber, Jun Xie, Maxwell D

    RepViT: Revisiting Mobile CNN From ViT Perspective , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=. doi:10.1109/cvpr52733.2024.01506 , year=

  29. [29]

    ViMoE: An Empirical Study of Designing Vision Mixture-of- Experts, November 2024

    Han, Xumeng and Wei, Longhui and Dou, Zhiyang and Wang, Zipeng and others , journal=. doi:10.48550/arXiv:2410.15732 , year=

  30. [30]

    Transactions on Machine Learning Research , year=

    Routers in Vision Mixture of Experts: An Empirical Study , author=. Transactions on Machine Learning Research , year=

  31. [31]

    NIPS Deep Learning and Representation Learning Workshop , year=

    Distilling the Knowledge in a Neural Network , author=. NIPS Deep Learning and Representation Learning Workshop , year=

  32. [32]

    Zhengxia Zou, Keyan Chen, Zhenwei Shi, Yuhong Guo, and Jieping Ye

    ImageNet: A Large-Scale Hierarchical Image Database , author=. IEEE Conference on Computer Vision and Pattern Recognition , pages=. doi:10.1109/cvpr.2009.5206848 , year=

  33. [33]

    IEEE Transactions on Neural Networks and Learning Systems , volume=

    A Survey of Convolutional Neural Networks: Analysis, Applications, and Prospects , author=. IEEE Transactions on Neural Networks and Learning Systems , volume=. doi:10.1109/TNNLS.2021.3084827 , year=

  34. [34]

    IEEE Transactions on Neural Networks and Learning Systems , volume=

    Twenty Years of Mixture of Experts , author=. IEEE Transactions on Neural Networks and Learning Systems , volume=. doi:10.1109/TNNLS.2012.2200299 , year=

  35. [35]

    IEEE Transactions on Neural Networks and Learning Systems , volume=

    A Survey on Evolutionary Neural Architecture Search , author=. IEEE Transactions on Neural Networks and Learning Systems , volume=. doi:10.1109/TNNLS.2021.3100554 , year=

  36. [36]

    , journal=

    Sun, Yanan and Xue, Bing and Zhang, Mengjie and Yen, Gary G. , journal=. Completely Automated. doi:10.1109/TNNLS.2019.2919608 , year=

  37. [37]

    Proceedings of Machine Learning and Systems (MLSys) , volume=

    MegaBlocks: Efficient Sparse Training with Mixture-of-Experts , author=. Proceedings of Machine Learning and Systems (MLSys) , volume=

  38. [38]

    Proceedings of Machine Learning and Systems (MLSys) , volume=

    Tutel: Adaptive Mixture-of-Experts at Scale , author=. Proceedings of Machine Learning and Systems (MLSys) , volume=

  39. [39]

    ST-MoE: Designing Stable and Transferable Sparse Expert Models

    Zoph, Barret and Bello, Irwan and Kumar, Sameer and Du, Nan and Huang, Yanping and Dean, Jeff and Shazeer, Noam and Fedus, William , journal=. doi:10.48550/arXiv:2202.08906 , year=

  40. [40]

    Mixtral of Experts

    Mixtral of Experts , author=. arXiv preprint arXiv:2401.04088 , doi=

  41. [41]

    DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

    Dai, Damai and Deng, Chengqi and Zhao, Chenggang and Xu, R.X. and Gao, Huazuo and Chen, Deli and Li, Jiashi and Zeng, Wangding and others , booktitle=. doi:10.48550/arXiv:2401.06066 , year=

  42. [42]

    International Conference on Learning Representations , doi=

    Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints , author=. International Conference on Learning Representations , doi=

  43. [43]

    Le and Hartwig Adam , year =

    Howard, Andrew G. and Sandler, Mark and Chu, Grace and Chen, Liang-Chieh and Chen, Bo and Tan, Mingxing and Wang, Weijun and Zhu, Yukun and Pang, Ruoming and Vasudevan, Vijay and Le, Quoc V. and Adam, Hartwig , booktitle=. Searching for. doi:10.1109/ICCV.2019.00140 , year=

  44. [44]

    Le , title =

    Tan, Mingxing and Le, Quoc V. , booktitle=. doi:10.48550/arXiv:2104.00298 , year=

  45. [45]

    Liu, Zhuang and Mao, Hanzi and Wu, Chao-Yuan and Feichtenhofer, Christoph and Darrell, Trevor and Xie, Saining , booktitle=. A. doi:10.1109/CVPR52688.2022.01167 , year=

  46. [46]

    arXiv preprint arXiv:2412.19437 , doi=

  47. [47]

    MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

    Lin, Bin and Tang, Zhenyu and Ye, Yang and Cui, Jiaxi and Zhu, Bin and Jin, Peng and Zhang, Junwu and Ning, Munan and Yuan, Li , journal=. doi:10.48550/arXiv:2401.15947 , year=

  48. [48]

    IEEE Transactions on Neural Networks and Learning Systems , volume=

    Split-Level Evolutionary Neural Architecture Search With Elite Weight Inheritance , author=. IEEE Transactions on Neural Networks and Learning Systems , volume=. doi:10.1109/TNNLS.2023.3269816 , year=

  49. [49]

    doi:10.48550/arXiv:2301.00808 , year=

    Woo, Sanghyun and Debnath, Shoubhik and Hu, Ronghang and Chen, Xinlei and Liu, Zhuang and Kweon, In So and Xie, Saining , booktitle=. doi:10.48550/arXiv:2301.00808 , year=

  50. [50]

    Advances in Neural Information Processing Systems , volume=

    Hash Layers For Large Sparse Models , author=. Advances in Neural Information Processing Systems , volume=

  51. [51]

    arXiv preprint arXiv:2407.06204 , year=

    A Survey on Mixture of Experts , author=. arXiv preprint arXiv:2407.06204 , doi=

  52. [52]

    Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

    Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation , author=. arXiv preprint arXiv:1308.3432 , doi=

  53. [53]

    IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=

    Dynamic Neural Networks: A Survey , author=. IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=. doi:10.1109/TPAMI.2021.3117837 , year=

  54. [54]

    Hard Mixtures of Experts for Large Scale Weakly Supervised Vision

    Hard Mixtures of Experts for Large Scale Weakly Supervised Vision , author=. arXiv preprint arXiv:1704.06363 , doi=

  55. [55]

    International Conference on Machine Learning , pages=

    Unified Scaling Laws for Routed Language Models , author=. International Conference on Machine Learning , pages=

  56. [56]

    OLMoE: Open Mixture-of-Experts Language Models

    Muennighoff, Niklas and Tang, Luca and Fan, Zijian and Groeneveld, Dirk and others , journal=. doi:10.48550/arXiv:2409.02060 , year=

  57. [57]

    OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models, March 2024

    Xue, Fuzhao and Zheng, Zian and Fu, Yao and Ni, Jinjie and Huang, Zangwei and You, Yang , journal=. doi:10.48550/arXiv:2402.01739 , year=

  58. [58]

    Not all experts are equal: Efficient expert pruning and skipping for mixture-of-experts large language models, 2024

    Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture of Experts , author=. arXiv preprint arXiv:2402.14800 , doi=

  59. [59]

    Chen, Tianlong and Zhu, Zhenyu and Deng, Lei and Meng, Liang and Liang, Chen and Zhang, Zhangyang , booktitle=. Sparse

  60. [60]

    International Conference on Machine Learning , year=

    Merging Multi-Task Models via Weight-Ensembling Mixture of Experts , author=. International Conference on Machine Learning , year=

  61. [61]

    Mixture of Experts in Image Classification: What’s the Sweet Spot?, October 2025

    Mixture of Experts in Image Classification: What's the Sweet Spot? , author=. arXiv preprint arXiv:2411.18322 , doi=

  62. [62]

    IEEE/CVF International Conference on Computer Vision , pages=

    Robust Mixture-of-Expert Training for Convolutional Neural Networks , author=. IEEE/CVF International Conference on Computer Vision , pages=

  63. [63]

    doi:10.48550/arXiv:2412.14711 , year=

    Wang, Ziteng and Jianfei, Chen and Zhu, Jun , journal=. doi:10.48550/arXiv:2412.14711 , year=

  64. [64]

    International Conference on Machine Learning , year=

    Mixture-of-Depths: Dynamically Allocating Compute in Transformer-Based Language Models , author=. International Conference on Machine Learning , year=

  65. [65]

    Intelligenza Artificiale , volume=

    Conditional Computation in Neural Networks: Principles and Research Trends , author=. Intelligenza Artificiale , volume=. doi:10.3233/IA-240035 , year=

  66. [66]

    Rajbhandari, Samyam and Li, Conglong and Yao, Zhewei and Zhang, Minjia and Aminabadi, Reza Yazdani and Awan, Ammar Ahmad and Rasley, Jeff and He, Yuxiong , booktitle=

  67. [67]

    , booktitle=

    Wang, Xin and Yu, Fisher and Dou, Zi-Yi and Darrell, Trevor and Gonzalez, Joseph E. , booktitle=

  68. [68]

    and Grauman, Kristen and Feris, Rogerio , booktitle=

    Wu, Zuxuan and Nagarajan, Tushar and Kumar, Abhishek and Rennie, Steven and Davis, Larry S. and Grauman, Kristen and Feris, Rogerio , booktitle=

  69. [69]

    Rao, Yongming and Zhao, Wenliang and Liu, Benlin and Lu, Jiwen and Zhou, Jie and Hsieh, Cho-Jui , booktitle=

  70. [70]

    Meng, Lingchen and Li, Hengduo and Chen, Bor-Chun and Lan, Shiyi and Wu, Zuxuan and Jiang, Yu-Gang and Lim, Ser-Nam , booktitle=

  71. [71]

    Ma, Ningning and Zhang, Xiangyu and Zheng, Hai-Tao and Sun, Jian , booktitle=

  72. [72]

    Han, Kai and Wang, Yunhe and Tian, Qi and Guo, Jianyuan and Xu, Chunjing and Xu, Chang , booktitle=

  73. [73]

    Mustafa, Basil and Riquelme, Carlos and Puigcerver, Joan and Jenatton, Rodolphe and Houlsby, Neil , booktitle=

  74. [74]

    Chen, Zitian and Shen, Yikang and Ding, Mingyu and Chen, Zhenfang and Zhao, Hengshuang and Learned-Miller, Erik and Gan, Chuang , booktitle=