Enhancing Mixture-of-Experts Specialization via Cluster-Aware Upcycling
Pith reviewed 2026-05-10 14:20 UTC · model grok-4.3
The pith
Cluster-aware Upcycling initializes MoE experts from semantic activation clusters to break symmetry and accelerate specialization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Cluster-aware Upcycling partitions the dense model's input activations into semantic clusters, initializes each expert from the truncated SVD subspace of its cluster, and sets the router weights to the cluster centroids. This breaks the initial symmetry among experts and aligns their early specialization with the underlying data distribution, which in turn yields more diverse and disentangled representations and improved downstream performance.
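The initialization described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `kmeans` and `cluster_aware_init` are hypothetical, plain k-means stands in for whatever clustering the paper uses, and the expert weight is formed by projecting the dense weight onto the cluster's top-r right singular subspace, which is only one plausible reading because the source does not spell out the exact projection or scaling.

```python
import numpy as np


def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means (a stand-in clustering). Returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every point to every centroid: shape (n, k).
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        for j in range(k):
            if np.any(labels == j):  # keep the old centroid if a cluster empties
                centroids[j] = x[labels == j].mean(0)
    # Final assignment against the final centroids.
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return centroids, d2.argmin(1)


def cluster_aware_init(w_dense, acts, num_experts, rank, seed=0):
    """w_dense: (d_in, d_out) dense weight; acts: (n, d_in) input activations."""
    centroids, labels = kmeans(acts, num_experts, seed=seed)
    experts = []
    for j in range(num_experts):
        cluster_acts = acts[labels == j]
        # Right singular vectors span the input directions this cluster occupies.
        _, _, vt = np.linalg.svd(cluster_acts, full_matrices=False)
        v = vt[:rank].T  # (d_in, r), orthonormal columns
        # ASSUMPTION: the expert keeps only the part of the dense weight
        # that acts on the cluster's rank-r input subspace.
        experts.append(v @ (v.T @ w_dense))
    router = centroids  # (num_experts, d_in); routing logits would be x @ router.T
    return experts, router


rng = np.random.default_rng(0)
acts = np.concatenate([rng.normal(loc=c, size=(50, 16)) for c in (-4.0, 0.0, 4.0)])
w = rng.normal(size=(16, 8))
experts, router = cluster_aware_init(w, acts, num_experts=3, rank=4)
print(len(experts), experts[0].shape, router.shape)
```

Because each expert is derived from a different cluster, the experts no longer start from identical weights, which is the symmetry-breaking effect the claim relies on.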
What carries the argument
Cluster-aware initialization: semantic partitioning of dense activations followed by truncated SVD subspaces for expert weights and centroids for the router, plus an expert-ensemble self-distillation loss.
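To make the expert-ensemble self-distillation idea concrete, here is a hedged sketch of one way such a loss could look. The source only says that an ensemble teacher provides routing guidance, so everything specific here is an assumption: the teacher is a uniform average over all experts, the student is the top-k routed mixture, the outputs are treated as logits, and the loss is KL(teacher || student). The names `softmax` and `ensemble_distill_loss` are hypothetical.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def ensemble_distill_loss(expert_logits, gate_probs, top_k=1, eps=1e-12):
    """expert_logits: (E, n, c) output of every expert on every token.
    gate_probs: (n, E) strictly positive router probabilities.
    Returns the mean KL(teacher || student) over tokens."""
    # ASSUMPTION: the teacher is a uniform ensemble over all experts.
    # In real training it would be treated as a constant (no gradient).
    teacher = softmax(expert_logits.mean(0))
    # Student: sparse top-k mixture, gates renormalized over the selected experts.
    top = np.argsort(-gate_probs, axis=1)[:, :top_k]
    mask = np.zeros_like(gate_probs)
    np.put_along_axis(mask, top, 1.0, axis=1)
    g = gate_probs * mask
    g = g / g.sum(1, keepdims=True)
    student = softmax(np.einsum("ne,enc->nc", g, expert_logits))
    kl = (teacher * (np.log(teacher + eps) - np.log(student + eps))).sum(-1)
    return kl.mean()


rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 5, 7))       # 4 experts, 5 tokens, 7 output dims
gates = softmax(rng.normal(size=(5, 4)))  # router probabilities per token
print(ensemble_distill_loss(logits, gates, top_k=2))
```

Under this formulation the loss vanishes when all experts agree and grows as the routed subset drifts away from the ensemble, which matches the stabilizing role the review attributes to it.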
If this is right
- MoE models reach higher zero-shot and few-shot accuracy on vision tasks than standard sparse upcycling.
- Expert representations become more diverse and less similar to one another.
- The router exhibits more confident, less uniform routing decisions.
- Training stability improves through the added self-distillation signal without changing the final architecture.
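Two of the consequences above, lower inter-expert similarity and more confident routing, can be checked with simple diagnostics. The sketch below is an assumption about how one might measure them (mean pairwise cosine similarity of flattened expert weights, and mean Shannon entropy of per-token routing distributions); the paper may use different metrics, and both function names are hypothetical.

```python
import numpy as np


def mean_pairwise_cosine(expert_weights):
    """Mean cosine similarity over all distinct pairs of flattened expert weights."""
    flat = np.stack([w.ravel() for w in expert_weights])
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    sim = flat @ flat.T
    upper = np.triu_indices(len(flat), k=1)  # each unordered pair once
    return sim[upper].mean()


def mean_routing_entropy(gate_probs, eps=1e-12):
    """Average entropy in nats of routing distributions, gate_probs shape (n, E).
    Lower values mean more confident, less uniform routing."""
    return -(gate_probs * np.log(gate_probs + eps)).sum(1).mean()


rng = np.random.default_rng(0)
experts = [rng.normal(size=(16, 8)) for _ in range(4)]
gates = rng.dirichlet(np.ones(4), size=10)
print(mean_pairwise_cosine(experts), mean_routing_entropy(gates))
```

Identical experts give a similarity of 1, and a uniform router over E experts gives the maximum entropy log(E), so both numbers have easily interpreted reference points.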
Where Pith is reading between the lines
- The same cluster-derived initialization could reduce the number of training steps needed before MoE models surpass dense baselines.
- The method may transfer to language-model upcycling if token embeddings are clustered instead of image activations.
- Dynamic re-clustering of activations during later training stages could further reduce expert interference.
Load-bearing premise
Partitioning activations into semantic clusters and seeding experts with the corresponding SVD subspaces will produce stable early specialization that matches the data distribution without new instabilities or heavy hyperparameter tuning.
What would settle it
A controlled run in which the same dense model is upcycled using random subspaces instead of cluster-derived SVD subspaces and the resulting zero-shot accuracy on the CLIP benchmarks is statistically indistinguishable from the cluster-aware version.
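The random-subspace control in this test could be constructed as follows. This is a sketch under assumptions: each expert is built by projecting the dense weight onto a random rank-r subspace drawn via QR of a Gaussian matrix, so that rank and dimensionality match a cluster-derived expert while the semantic content is removed. The function names are hypothetical, and the projection form mirrors an assumed reading of the method rather than a documented one.

```python
import numpy as np


def random_subspace_projector(d_in, rank, seed=0):
    """Orthogonal projector onto a random rank-dimensional subspace of R^d_in."""
    rng = np.random.default_rng(seed)
    # Reduced QR of a Gaussian matrix gives orthonormal columns q: (d_in, rank).
    q, _ = np.linalg.qr(rng.normal(size=(d_in, rank)))
    return q @ q.T


def random_subspace_experts(w_dense, num_experts, rank, seed=0):
    """Control arm: one independent random subspace per expert."""
    d_in = w_dense.shape[0]
    return [
        random_subspace_projector(d_in, rank, seed + j) @ w_dense
        for j in range(num_experts)
    ]


w = np.random.default_rng(0).normal(size=(16, 8))
control = random_subspace_experts(w, num_experts=4, rank=4)
print(len(control), control[0].shape)
```

Keeping the rank identical across both arms matters here: otherwise a difference in accuracy could reflect capacity at initialization rather than the use of cluster structure.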
Original abstract
Sparse Upcycling provides an efficient way to initialize a Mixture-of-Experts (MoE) model from pretrained dense weights instead of training from scratch. However, since all experts start from identical weights and the router is randomly initialized, the model suffers from expert symmetry and limited early specialization. We propose Cluster-aware Upcycling, a strategy that incorporates semantic structure into MoE initialization. Our method first partitions the dense model's input activations into semantic clusters. Each expert is then initialized using the subspace representations of its corresponding cluster via truncated SVD, while setting the router's initial weights to the cluster centroids. This cluster-aware initialization breaks expert symmetry and encourages early specialization aligned with the data distribution. Furthermore, we introduce an expert-ensemble self-distillation loss that stabilizes training by providing reliable routing guidance using an ensemble teacher. When evaluated on CLIP ViT-B/32 and ViT-B/16, Cluster-aware Upcycling consistently outperforms existing methods across both zero-shot and few-shot benchmarks. The proposed method also produces more diverse and disentangled expert representations, reduces inter-expert similarity, and leads to more confident routing behavior. Project page: https://sanghyeokchu.github.io/cluster-aware-upcycling/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Cluster-aware Upcycling as an initialization strategy for Mixture-of-Experts (MoE) models derived from pretrained dense weights. It partitions the dense model's input activations into semantic clusters, initializes each expert from the truncated-SVD subspace of its cluster, sets the router weights to the cluster centroids, and adds an expert-ensemble self-distillation loss to stabilize training. The central claim is that this breaks expert symmetry, promotes early specialization aligned with the data distribution, and yields consistent outperformance over existing upcycling methods on CLIP ViT-B/32 and ViT-B/16 for both zero-shot and few-shot benchmarks, along with improved expert diversity, reduced inter-expert similarity, and more confident routing.
Significance. If the empirical gains are robust and causally attributable to the cluster-aware SVD initialization (rather than the self-distillation loss or generic regularization), the method would offer a practical, low-cost way to improve MoE specialization in vision models without training from scratch. It directly targets a known limitation of standard Sparse Upcycling. No machine-checked proofs or parameter-free derivations are present, but the project page is referenced as a potential source of further details.
major comments (2)
- [Abstract] The claim that Cluster-aware Upcycling 'consistently outperforms existing methods across both zero-shot and few-shot benchmarks' is load-bearing for the paper's contribution, yet the abstract provides no quantitative results, error bars, ablation controls, or statistical tests. From the abstract alone it is impossible to verify whether the reported gains arise from the semantic clustering and truncated-SVD subspaces or from the added self-distillation loss.
- [Experimental evaluation] The benchmark claims in the abstract rest on the paper's weakest assumption: that partitioning activations into semantic clusters and initializing via truncated SVD reliably produces early expert specialization without instabilities or heavy hyperparameter tuning. This assumption requires explicit ablations (e.g., a random-clustering baseline, full vs. truncated SVD, or standard upcycling trained with only the distillation loss). Without these, the headline improvements on CLIP benchmarks cannot be confidently attributed to the proposed initialization.
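For the full vs. truncated SVD ablation suggested above, a useful first diagnostic is how much of a cluster's activation energy a rank-r truncation retains. The sketch below uses the standard identity that the best rank-r approximation keeps the fraction sum of the top r squared singular values over the sum of all squared singular values. The function names and the 0.95 default target are illustrative assumptions, not values from the paper.

```python
import numpy as np


def retained_energy(acts, rank):
    """Fraction of squared Frobenius norm kept by the best rank-`rank` approximation."""
    s = np.linalg.svd(acts, compute_uv=False)  # singular values, descending
    return float((s[:rank] ** 2).sum() / (s ** 2).sum())


def smallest_rank_for(acts, target=0.95):
    """Smallest rank whose retained energy reaches `target`."""
    s2 = np.linalg.svd(acts, compute_uv=False) ** 2
    cumulative = np.cumsum(s2) / s2.sum()
    # First index where cumulative >= target, clamped to the full rank.
    return int(min(np.searchsorted(cumulative, target) + 1, len(s2)))


rng = np.random.default_rng(0)
# Synthetic activations with a dominant 3-dimensional structure plus small noise.
acts = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 32)) + 0.05 * rng.normal(size=(200, 32))
print(retained_energy(acts, 3), smallest_rank_for(acts, 0.95))
```

If the retained energy is already close to 1 at the chosen rank, a full-SVD arm would be expected to differ little from the truncated one, which helps decide where that ablation is informative.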
minor comments (2)
- [Method] The description of how expert weights are constructed from the truncated-SVD subspaces (e.g., exact projection or scaling) could be clarified for reproducibility.
- No mention of code or data release beyond the project page; adding a reproducibility statement would strengthen the submission.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating planned revisions to the manuscript.
- Referee: [Abstract] The claim that Cluster-aware Upcycling 'consistently outperforms existing methods across both zero-shot and few-shot benchmarks' is load-bearing for the paper's contribution, yet the abstract provides no quantitative results, error bars, ablation controls, or statistical tests. This makes it impossible to verify whether the reported gains arise from the semantic clustering + truncated-SVD subspaces or from the added self-distillation loss.
  Authors: We agree that the abstract would be strengthened by including quantitative results. In the revised manuscript we will update the abstract to report the key performance deltas (e.g., average zero-shot and few-shot accuracy gains on ViT-B/32 and ViT-B/16) together with references to the corresponding tables. The full paper already contains component-wise comparisons that separate the initialization from the self-distillation loss; we will make these distinctions more explicit in the abstract and add a short sentence summarizing the ablation evidence.
  Revision: yes
- Referee: [Experimental evaluation] The weakest assumption, implied by the abstract's benchmark claims, is that partitioning activations into semantic clusters and initializing via truncated SVD reliably produces early expert specialization without instabilities or heavy hyperparameter tuning. It requires explicit ablations (e.g., random clustering baseline, full vs. truncated SVD, or comparison to standard upcycling with only the distillation loss). Without these, the headline improvements on CLIP benchmarks cannot be confidently attributed to the proposed initialization.
  Authors: We acknowledge the value of the suggested controls. The current experiments demonstrate reduced inter-expert similarity and more confident routing under our initialization, but we did not include a random-clustering baseline or an explicit full-versus-truncated SVD comparison. In the revision we will add these ablations, along with a direct comparison of standard upcycling trained with only the self-distillation loss, to isolate the contribution of the cluster-aware SVD initialization and to confirm that the method does not require extensive hyperparameter search.
  Revision: yes
Circularity Check
No circularity: empirical initialization method with independent evaluation
The paper proposes Cluster-aware Upcycling as a new MoE initialization procedure (partition activations, truncated SVD per cluster, router from centroids) plus a self-distillation loss, then reports empirical gains on CLIP zero-shot/few-shot benchmarks. No derivation chain exists that reduces a claimed result to its own fitted parameters or self-citations by construction. The central claims rest on experimental comparisons rather than algebraic identity or load-bearing self-reference. Minor self-citations to prior upcycling work are present but not used to justify uniqueness or forbid alternatives; they serve as background. This is the common case of a self-contained empirical contribution.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: truncated SVD on cluster activations yields useful low-dimensional subspaces for expert weight initialization.