Enhancing Mixture-of-Experts Specialization via Cluster-Aware Upcycling

Bohyung Han; Gwangmo Song; Honglak Lee; Pyunghwan Ahn; Sanghyeok Chu; SeungHwan Kim

arxiv: 2604.13508 · v2 · submitted 2026-04-15 · 💻 cs.CV

Sanghyeok Chu, Pyunghwan Ahn, Gwangmo Song, SeungHwan Kim, Honglak Lee, Bohyung Han

Pith reviewed 2026-05-10 14:20 UTC · model grok-4.3

classification 💻 cs.CV

keywords Mixture of Experts · Sparse Upcycling · Expert Specialization · Model Initialization · CLIP Vision Transformers · Self-Distillation · Routing Behavior

The pith

Cluster-aware Upcycling initializes MoE experts from semantic activation clusters to break symmetry and accelerate specialization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the symmetry problem in sparse upcycling, where all experts in a Mixture-of-Experts model begin with identical weights copied from a pretrained dense network. It partitions the dense model's input activations into semantic clusters, then initializes each expert using the truncated SVD subspace of activations from its assigned cluster while setting the router weights to the corresponding cluster centroids. An expert-ensemble self-distillation loss is added to provide stable routing signals during early training. On CLIP ViT-B/32 and ViT-B/16 models, this produces higher zero-shot and few-shot accuracy than prior upcycling baselines while yielding more diverse expert representations and more decisive routing.
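
The initialization described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`assign`, `kmeans`, `cluster_aware_init`) are hypothetical, plain Lloyd's k-means stands in for whatever clustering the paper uses, and the way an expert weight is built from the subspace (projecting the dense weight onto the cluster's top right singular vectors) is an assumption, since the summary does not specify the exact construction.

```python
import numpy as np


def assign(x, centroids):
    """Index of the nearest centroid (squared Euclidean) for every row of x."""
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = assign(x, centroids)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids, assign(x, centroids)


def cluster_aware_init(acts, w_dense, num_experts, rank, seed=0):
    """acts: (n, d_in) dense-layer inputs; w_dense: (d_out, d_in) dense weight."""
    centroids, labels = kmeans(acts, num_experts, seed=seed)
    expert_weights = []
    for j in range(num_experts):
        cluster = acts[labels == j]
        if len(cluster) < rank:
            # Too few samples for a rank-`rank` subspace: fall back to a plain copy.
            expert_weights.append(w_dense.copy())
            continue
        # Top right singular vectors span the input directions this cluster uses.
        _, _, vt = np.linalg.svd(cluster, full_matrices=False)
        basis = vt[:rank]                       # (rank, d_in), orthonormal rows
        projector = basis.T @ basis             # (d_in, d_in)
        # ASSUMPTION: restrict the dense weight to the cluster subspace.
        expert_weights.append(w_dense @ projector)
    router_w = centroids                        # router logits = x @ router_w.T
    return router_w, np.stack(expert_weights)
```

Because each expert sees a different projector, the experts no longer start as identical copies, and a token near centroid `j` initially receives its largest router logit from a row derived from its own cluster.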

Core claim

By partitioning dense-model activations into semantic clusters and initializing each expert from the truncated SVD subspace of its cluster (with router weights set to cluster centroids), Cluster-aware Upcycling breaks the initial symmetry among experts and aligns their early specialization with the underlying data distribution, which in turn yields more diverse and disentangled representations together with improved downstream performance.

What carries the argument

Cluster-aware initialization: semantic partitioning of dense activations followed by truncated SVD subspaces for expert weights and centroids for the router, plus an expert-ensemble self-distillation loss.

If this is right

MoE models reach higher zero-shot and few-shot accuracy on vision tasks than standard sparse upcycling.
Expert representations become more diverse and less similar to one another.
The router exhibits more confident, less uniform routing decisions.
Training stability improves through the added self-distillation signal without changing the final architecture.
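
Two of the consequences listed above, lower inter-expert similarity and more decisive routing, can be checked with simple diagnostics. The sketch below is illustrative only: the function names are hypothetical, and mean pairwise cosine similarity and mean routing entropy are generic proxies chosen here, not the metrics the paper reports (its Relative Compactness measure is not reproduced).

```python
import numpy as np


def mean_pairwise_cosine(expert_weights):
    """Average cosine similarity over all distinct expert pairs.

    expert_weights: (E, ...) array, one weight tensor per expert.
    """
    flat = expert_weights.reshape(len(expert_weights), -1)
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    sim = flat @ flat.T
    upper = np.triu_indices(len(flat), k=1)  # each unordered pair once
    return float(sim[upper].mean())


def mean_routing_entropy(logits):
    """Average Shannon entropy (nats) of the softmax routing distribution.

    logits: (n_tokens, E) router logits. Lower means more confident routing.
    """
    z = logits - logits.max(axis=1, keepdims=True)  # numerically stable softmax
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    entropy = -(p * np.log(p + 1e-12)).sum(axis=1)
    return float(entropy.mean())
```

Identical experts give a cosine value of 1, and a uniform router over `E` experts gives an entropy of `log(E)`, so both numbers have natural reference points for comparing initializations.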

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same cluster-derived initialization could reduce the number of training steps needed before MoE models surpass dense baselines.
The method may transfer to language-model upcycling if token embeddings are clustered instead of image activations.
Dynamic re-clustering of activations during later training stages could further reduce expert interference.

Load-bearing premise

Partitioning activations into semantic clusters and seeding experts with the corresponding SVD subspaces will produce stable early specialization that matches the data distribution without new instabilities or heavy hyperparameter tuning.

What would settle it

A controlled run in which the same dense model is upcycled using random subspaces instead of cluster-derived SVD subspaces. If the resulting zero-shot accuracy on the CLIP benchmarks were statistically indistinguishable from that of the cluster-aware version, the claim that semantic clustering drives the gains would be undermined.
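
A matched-rank random subspace for such a control could be built as follows. This is a hedged sketch: `svd_basis`, `random_basis`, and `captured_energy` are hypothetical helper names, and the energy comparison is only a sanity check on the two bases, not a substitute for the downstream accuracy comparison.

```python
import numpy as np


def svd_basis(acts, rank):
    """Top-`rank` right singular vectors of the activations, shape (rank, d)."""
    _, _, vt = np.linalg.svd(acts, full_matrices=False)
    return vt[:rank]


def random_basis(dim, rank, seed=0):
    """Random orthonormal basis of the same rank, via QR of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, rank)))
    return q.T  # (rank, dim), orthonormal rows


def captured_energy(acts, basis):
    """Fraction of squared activation norm that lies inside the subspace."""
    proj = acts @ basis.T
    return float((proj ** 2).sum() / (acts ** 2).sum())
```

By the Eckart-Young theorem the SVD basis captures at least as much activation energy as any other basis of the same rank, so the random control isolates whether that alignment with the data, rather than mere symmetry breaking, matters for accuracy.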

Figures

Figures reproduced from arXiv: 2604.13508 by Bohyung Han, Gwangmo Song, Honglak Lee, Pyunghwan Ahn, Sanghyeok Chu, SeungHwan Kim.

**Figure 2.** An illustration of how Cluster-aware Upcycling initializes the MoE layer. (a) Input activations are clustered to obtain whitening …

**Figure 3.** Expert-ensemble self-distillation (EESD). The dense …

**Figure 4.** (a) Relative Compactness measures the overlap between intra- and inter-expert variance, where lower values indicate more …

**Figure 5.** Detailed expert utilization across mixture-of-experts lay…

Original abstract

Sparse Upcycling provides an efficient way to initialize a Mixture-of-Experts (MoE) model from pretrained dense weights instead of training from scratch. However, since all experts start from identical weights and the router is randomly initialized, the model suffers from expert symmetry and limited early specialization. We propose Cluster-aware Upcycling, a strategy that incorporates semantic structure into MoE initialization. Our method first partitions the dense model's input activations into semantic clusters. Each expert is then initialized using the subspace representations of its corresponding cluster via truncated SVD, while setting the router's initial weights to the cluster centroids. This cluster-aware initialization breaks expert symmetry and encourages early specialization aligned with the data distribution. Furthermore, we introduce an expert-ensemble self-distillation loss that stabilizes training by providing reliable routing guidance using an ensemble teacher. When evaluated on CLIP ViT-B/32 and ViT-B/16, Cluster-aware Upcycling consistently outperforms existing methods across both zero-shot and few-shot benchmarks. The proposed method also produces more diverse and disentangled expert representations, reduces inter-expert similarity, and leads to more confident routing behavior. Project page: https://sanghyeokchu.github.io/cluster-aware-upcycling/
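
The abstract does not spell out the form of the expert-ensemble self-distillation loss, so the following is only one plausible reading, written as a sketch. It assumes the teacher is the gate-weighted output of all experts (a dense ensemble), the student is the usual sparse top-k output, and the penalty is a mean squared error between them; the function names and every one of these choices are assumptions made here for illustration.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def eesd_loss(x, router_w, expert_w, top_k=1):
    """Distance between a dense ensemble teacher and the sparse student.

    x: (n, d_in) tokens; router_w: (E, d_in); expert_w: (E, d_out, d_in).
    """
    gates = softmax(x @ router_w.T)                    # (n, E)
    outs = np.einsum("eod,nd->neo", expert_w, x)       # (n, E, d_out)

    # ASSUMED teacher: every expert contributes, weighted by its gate.
    teacher = (gates[:, :, None] * outs).sum(axis=1)

    # Student: keep only the top_k gates per token and renormalize them.
    idx = np.argsort(-gates, axis=1)[:, :top_k]
    mask = np.zeros_like(gates)
    np.put_along_axis(mask, idx, 1.0, axis=1)
    sparse = gates * mask
    sparse /= sparse.sum(axis=1, keepdims=True)
    student = (sparse[:, :, None] * outs).sum(axis=1)

    # In a real training loop the teacher would be treated as a constant target.
    return float(((student - teacher) ** 2).mean())
```

Under this reading the loss vanishes when all experts are identical or when `top_k` equals the number of experts, and it grows as experts diverge while the router remains undecided, which is one way an ensemble teacher could supply a routing signal early in training.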

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper's main contribution is a cluster-based SVD initialization for MoE experts that delivers measurable gains over plain upcycling on CLIP tasks, though the source of those gains still needs tighter controls.

The punchline is that Cluster-aware Upcycling gives a concrete, low-overhead way to break symmetry when converting a dense ViT into an MoE. By clustering activations, seeding each expert from its cluster's truncated SVD subspace, and initializing the router from centroids, plus adding an ensemble self-distillation loss, the method produces more diverse experts and better zero- and few-shot numbers than standard upcycling on the ViT-B/32 and B/16 CLIP setups. That combination is new relative to prior upcycling work and directly targets the early-training symmetry problem.

The empirical side looks solid enough on the surface: consistent outperformance, qualitative evidence of reduced inter-expert similarity, and more confident routing. Those are useful signals for anyone training sparse vision models from dense checkpoints.

The soft spots are mostly about verification. The abstract and available description do not include error bars, full ablation tables, or statistical tests, so it is still unclear how much of the lift comes from the cluster-SVD step versus the added distillation loss or generic regularization. The central assumption, that activation clusters will be distinct enough and that the retained SVD directions will align with downstream needs, could fail on data where the dense model's activations lack clear semantic structure. If that happens, the experts may collapse back to symmetry after a few steps and the headline gains would shrink. Experiments are also limited to two modest ViT sizes, so scaling behavior and hyperparameter sensitivity remain open questions.

This paper is for researchers working on efficient MoE training in computer vision who already use upcycling and want a drop-in initialization tweak. It is worth sending to peer review because the idea is practical, the results are positive, and the gaps are fixable with more controls rather than fundamental.

Referee Report

2 major / 2 minor

Summary. The paper proposes Cluster-aware Upcycling as an initialization strategy for Mixture-of-Experts (MoE) models derived from pretrained dense weights. It partitions the dense model's input activations into semantic clusters, initializes each expert from the truncated-SVD subspace of its cluster, sets the router weights to the cluster centroids, and adds an expert-ensemble self-distillation loss to stabilize training. The central claim is that this breaks expert symmetry, promotes early specialization aligned with the data distribution, and yields consistent outperformance over existing upcycling methods on CLIP ViT-B/32 and ViT-B/16 for both zero-shot and few-shot benchmarks, along with improved expert diversity, reduced inter-expert similarity, and more confident routing.

Significance. If the empirical gains are robust and causally attributable to the cluster-aware SVD initialization (rather than the self-distillation loss or generic regularization), the method would offer a practical, low-cost way to improve MoE specialization in vision models without training from scratch. It directly targets a known limitation of standard Sparse Upcycling. No machine-checked proofs or parameter-free derivations are present, but the project page is referenced as a potential source of further details.

major comments (2)

[Abstract] The claim that Cluster-aware Upcycling 'consistently outperforms existing methods across both zero-shot and few-shot benchmarks' is load-bearing for the paper's contribution, yet the abstract provides no quantitative results, error bars, ablation controls, or statistical tests. This makes it impossible to verify whether the reported gains arise from the semantic clustering and truncated-SVD subspaces or from the added self-distillation loss.
[Experimental evaluation] (Implied by the abstract's benchmark claims.) The weakest assumption, that partitioning activations into semantic clusters and initializing via truncated SVD reliably produces early expert specialization without instabilities or heavy hyperparameter tuning, requires explicit ablations (e.g., a random clustering baseline, full vs. truncated SVD, or a comparison to standard upcycling with only the distillation loss). Without these, the headline improvements on CLIP benchmarks cannot be confidently attributed to the proposed initialization.

minor comments (2)

[Method] The description of how expert weights are constructed from the truncated-SVD subspaces (e.g., exact projection or scaling) could be clarified for reproducibility.
No mention of code or data release beyond the project page; adding a reproducibility statement would strengthen the submission.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating planned revisions to the manuscript.

Referee: [Abstract] The claim that Cluster-aware Upcycling 'consistently outperforms existing methods across both zero-shot and few-shot benchmarks' is load-bearing for the paper's contribution, yet the abstract provides no quantitative results, error bars, ablation controls, or statistical tests. This makes it impossible to verify whether the reported gains arise from the semantic clustering and truncated-SVD subspaces or from the added self-distillation loss.

Authors: We agree that the abstract would be strengthened by including quantitative results. In the revised manuscript we will update the abstract to report the key performance deltas (e.g., average zero-shot and few-shot accuracy gains on ViT-B/32 and ViT-B/16) together with references to the corresponding tables. The full paper already contains component-wise comparisons that separate the initialization from the self-distillation loss; we will make these distinctions more explicit in the abstract and add a short sentence summarizing the ablation evidence.

Revision: yes

Referee: [Experimental evaluation] (Implied by the abstract's benchmark claims.) The weakest assumption, that partitioning activations into semantic clusters and initializing via truncated SVD reliably produces early expert specialization without instabilities or heavy hyperparameter tuning, requires explicit ablations (e.g., a random clustering baseline, full vs. truncated SVD, or a comparison to standard upcycling with only the distillation loss). Without these, the headline improvements on CLIP benchmarks cannot be confidently attributed to the proposed initialization.

Authors: We acknowledge the value of the suggested controls. The current experiments demonstrate reduced inter-expert similarity and more confident routing under our initialization, but we did not include a random-clustering baseline or an explicit full-versus-truncated SVD comparison. In the revision we will add these ablations, along with a direct comparison of standard upcycling trained with only the self-distillation loss, to isolate the contribution of the cluster-aware SVD initialization and to confirm that the method does not require extensive hyperparameter search.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical initialization method with independent evaluation

The paper proposes Cluster-aware Upcycling as a new MoE initialization procedure (partition activations, truncated SVD per cluster, router from centroids) plus a self-distillation loss, then reports empirical gains on CLIP zero-shot/few-shot benchmarks. No derivation chain exists that reduces a claimed result to its own fitted parameters or self-citations by construction. The central claims rest on experimental comparisons rather than algebraic identity or load-bearing self-reference. Minor self-citations to prior upcycling work are present but not used to justify uniqueness or forbid alternatives; they serve as background. This is the common case of a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical effectiveness of the proposed initialization; no new physical entities are introduced. The method assumes standard linear-algebra tools suffice for semantic subspace capture.

axioms (1)

Domain assumption: truncated SVD on cluster activations yields useful low-dimensional subspaces for expert weight initialization.
Invoked when stating that each expert is initialized using the subspace representations of its corresponding cluster via truncated SVD.
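
One way to probe this assumption on a given cluster is to check how much of the activation energy the leading singular directions retain as the truncation rank grows. The helper below is a hypothetical diagnostic written for illustration; the paper's own criterion for choosing the rank is not stated in this summary.

```python
import numpy as np


def retained_energy(acts, rank):
    """Share of squared singular values kept by a rank-`rank` truncation.

    acts: (n, d) activations assigned to one cluster.
    """
    s = np.linalg.svd(acts, compute_uv=False)  # singular values, descending
    return float((s[:rank] ** 2).sum() / (s ** 2).sum())
```

If this ratio stays low until the rank approaches the full dimension, the cluster has no dominant low-dimensional structure and the truncated subspace is unlikely to be more informative than an arbitrary one.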


[1] [1]

Qwen2.5-vl technical report.arXiv, 2025

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report.arXiv, 2025. 8

work page 2025

[2] [2]

Towards understanding the mixture-of-experts layer in deep learning.NeurIPS, 2022

Zixiang Chen, Yihe Deng, Yue Wu, Quanquan Gu, and Yuanzhi Li. Towards understanding the mixture-of-experts layer in deep learning.NeurIPS, 2022. 8

work page 2022

[3] [3]

CLIP benchmark,

Mehdi Cherti and Romain Beaumont. CLIP benchmark,

work page

[4] [4]

On the representation collapse of sparse mixture of experts.NeurIPS, 2022

Zewen Chi, Li Dong, Shaohan Huang, Damai Dai, Shum- ing Ma, Barun Patra, Saksham Singhal, Payal Bajaj, Xia Song, Xian-Ling Mao, et al. On the representation collapse of sparse mixture of experts.NeurIPS, 2022. 1, 8

work page 2022

[5] [5]

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv,

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv,

work page

[6] [6]

DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models

Damai Dai, Chengqi Deng, Chenggang Zhao, Rx Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y Wu, et al. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. In ACL, 2024. 8

work page 2024

[7] [7]

On the benefits of learn- ing to route in mixture-of-experts models

Nishanth Dikkala, Nikhil Ghosh, Raghu Meka, Rina Pani- grahy, Nikhil Vyas, and Xin Wang. On the benefits of learn- ing to route in mixture-of-experts models. InEMNLP, 2023. 8

work page 2023

[8] [8]

The LLaMA 3 herd of models.arXiv, 2024

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Ab- hishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The LLaMA 3 herd of models.arXiv, 2024. 1

work page 2024

[9] [9]

Make LoRA great again: Boosting LoRA with adaptive singular values and mixture- of-experts optimization alignment

Chenghao Fan, Zhenyi Lu, Sichen Liu, Chengfeng Gu, Xi- aoye Qu, Wei Wei, and Yu Cheng. Make LoRA great again: Boosting LoRA with adaptive singular values and mixture- of-experts optimization alignment. InICML, 2025. 8

work page 2025

[10] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. JMLR, 2022.

[11] Nikolas Gritsch, Qizhen Zhang, Acyr Locatelli, Sara Hooker, and Ahmet Üstün. Nexus: Specialization meets adaptability for efficiently training mixture of experts. arXiv, 2024.

[12] Hao Gu, Wei Li, Lujun Li, Zhu Qiyuan, Mark G Lee, Shengjie Sun, Wei Xue, and Yike Guo. Delta decompression for MoE-based LLMs compression. In ICML, 2025.

[13] Hongcan Guo, Haolang Lu, Guoshun Nan, Bolun Chu, Jialin Zhuang, Yuan Yang, Wenhao Che, Sicong Leng, Qimei Cui, and Xudong Jiang. Advancing expert specialization for better MoE. arXiv, 2025.

[14] Ailin Huang, Ang Li, Aobo Kong, Bin Wang, Binxing Jiao, Bo Dong, Bojun Wang, Boyu Chen, Brian Li, Buyun Ma, et al. Step 3.5 Flash: Open frontier-level intelligence with 11B active parameters. arXiv, 2026.

[15] Yongqi Huang, Peng Ye, Chenyu Huang, Jianjian Cao, Lin Zhang, Baopu Li, Gang Yu, and Tao Chen. DeRS: Towards extremely efficient upcycled mixture-of-experts models. In CVPR, 2025.

[16] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv, 2024.

[17] Shibo Jie, Yehui Tang, Kai Han, Yitong Li, Duyu Tang, Zhi-Hong Deng, and Yunhe Wang. Mixture of lookup experts. In ICML, 2025.

[18] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 2019.

[19] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv, 2020.

[20] Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay, Mostafa Dehghani, and Neil Houlsby. Sparse Upcycling: Training mixture-of-experts from dense checkpoints. In ICLR, 2023.

[21] Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling giant models with conditional computation and automatic sharding. In ICLR, 2021.

[22] Wei Li, Lujun Li, Hao Gu, You-Liang Huang, Mark G Lee, Shengjie Sun, Wei Xue, and Yike Guo. MoE-SVD: Structured mixture-of-experts LLMs compression via singular value decomposition. In ICML, 2025.

[23] Yanghao Li, Haoqi Fan, Ronghang Hu, Christoph Feichtenhofer, and Kaiming He. Scaling language-image pre-training via masking. In CVPR, 2023.

[24] Seng Pei Liew, Takuya Kato, and Sho Takase. Scaling laws for upcycling mixture-of-experts language models. In ICML,

[25] Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Junwu Zhang, Munan Ning, and Li Yuan. MoE-LLaVA: Mixture of experts for large vision-language models. CoRR,

[26] Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv, 2025.

[27] Ka Man Lo, Zeyu Huang, Zihan Qiu, Zili Wang, and Jie Fu. A closer look into mixture-of-experts in large language models. In NAACL Findings, 2025.

[28] Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Sewon Min, Weijia Shi, Evan Pete Walsh, Oyvind Tafjord, Nathan Lambert, Yuling Gu, Shane Arora, Akshita Bhagia, Dustin Schwenk, David Wadden, Alexander Wettig, Binyuan Hui, Tim Dettmers, Douwe Kiela, Ali Farhadi, Noah A. Smith, Pang Wei Koh, Amanpreet Singh, and Hannaneh Hajishirzi. ...

[29] Jonghwan Mun, Kimin Lee, Jinwoo Shin, and Bohyung Han. Learning to specialize with knowledge distillation for visual question answering. In NIPS, 2018.

[30] Taishi Nakamura, Takuya Akiba, Kazuki Fujii, Yusuke Oda, Rio Yokota, and Jun Suzuki. Drop-upcycling: Training sparse mixture of experts with partial re-initialization. In ICLR, 2025.

[31] Stefan Nielsen, Rachel Teo, Laziz Abdullaev, and Tan Minh Nguyen. Tight clusters make specialized experts. In ICLR,

[32] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.

[33] Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale. In ICML, 2022.

[34] Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, André Susano Pinto, Daniel Keysers, and Neil Houlsby. Scaling vision with sparse mixture of experts. NeurIPS, 2021.

[35] Christoph Schuhmann, Robert Kaczmarczyk, Aran Komatsuzaki, Aarush Katta, Richard Vencu, Romain Beaumont, Jenia Jitsev, Theo Coombes, and Clayton Mullis. LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. In NeurIPS Workshop Datacentric AI, 2021.

[36] Sainbayar Sukhbaatar, Olga Golovneva, Vasu Sharma, Hu Xu, Xi Victoria Lin, Baptiste Roziere, Jacob Kahn, Shang-Wen Li, Wen-tau Yih, Jason E Weston, et al. Branch-Train-MiX: Mixing expert LLMs into a mixture-of-experts LLM. In CoLM, 2024.

[37] Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-VL technical report. arXiv,

[38] Qwen Team et al. Qwen2 technical report. arXiv, 2024.

[39] Lean Wang, Huazuo Gao, Chenggang Zhao, Xu Sun, and Damai Dai. Auxiliary-loss-free load balancing strategy for mixture-of-experts. arXiv, 2024.

[40] Xin Wang, Samiul Alam, Zhongwei Wan, Hui Shen, and Mi Zhang. SVD-LLM V2: Optimizing singular value truncation for large language model compression. In NAACL, 2025.

[41] Xinze Wang, Chen Chen, Yinfei Yang, Hong-You Chen, Bowen Zhang, Aditya Pal, Xiangxin Zhu, and Xianzhi Du. CLIP-UP: A simple and efficient mixture-of-experts CLIP training recipe with sparse upcycling. arXiv, 2025.

[42] Xin Wang, Yu Zheng, Zhongwei Wan, and Mi Zhang. SVD-LLM: Truncation-aware singular value decomposition for large language model compression. In ICLR, 2025.

[43] Tianwen Wei, Bo Zhu, Liang Zhao, Cheng Cheng, Biye Li, Weiwei Lü, Peng Cheng, Jianhao Zhang, Xiaoyu Zhang, Liang Zeng, et al. Skywork-MoE: A deep dive into training techniques for mixture-of-experts language models. arXiv,

[44] Bangjun Xiao, Bingquan Xia, Bo Yang, Bofei Gao, Bowen Shen, Chen Zhang, Chenhong He, Chiheng Lou, Fuli Luo, Gang Wang, et al. MiMo-V2-Flash technical report. arXiv,

[45] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv,

[46] Shen Yuan, Yin Zheng, Taifeng Wang, Binbin Liu, and Hongteng Xu. MoORE: SVD-based model MoE-ization for conflict- and oblivion-resistant multi-task adaptation. In NeurIPS, 2025.

[47] Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, et al. GLM-4.5: Agentic, reasoning, and coding (ARC) foundation models. arXiv, 2025.

[48] Jihai Zhang, Xiaoye Qu, Tong Zhu, and Yu Cheng. CLIP-MoE: Towards building mixture of experts for CLIP with diversified multiplet upcycling. In EMNLP, 2025.

[49] Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. ST-MoE: Designing stable and transferable sparse expert models. arXiv, 2022.

Enhancing Mixture-of-Experts Specialization via Cluster-Aware Upcycling
Supplementary Material

A. Additional Quantitative Results

To complement the comparisons presented...