
arxiv: 2604.13508 · v2 · submitted 2026-04-15 · 💻 cs.CV

Enhancing Mixture-of-Experts Specialization via Cluster-Aware Upcycling

Pith reviewed 2026-05-10 14:20 UTC · model grok-4.3

classification 💻 cs.CV
keywords Mixture of Experts · Sparse Upcycling · Expert Specialization · Model Initialization · CLIP Vision Transformers · Self-Distillation · Routing Behavior

The pith

Cluster-aware Upcycling initializes MoE experts from semantic activation clusters to break symmetry and accelerate specialization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the symmetry problem in sparse upcycling, where all experts in a Mixture-of-Experts model begin with identical weights copied from a pretrained dense network. It partitions the dense model's input activations into semantic clusters, then initializes each expert using the truncated SVD subspace of activations from its assigned cluster while setting the router weights to the corresponding cluster centroids. An expert-ensemble self-distillation loss is added to provide stable routing signals during early training. On CLIP ViT-B/32 and ViT-B/16 models, this produces higher zero-shot and few-shot accuracy than prior upcycling baselines while yielding more diverse expert representations and more decisive routing.

Core claim

Cluster-aware Upcycling partitions dense-model activations into semantic clusters, initializes each expert from the truncated SVD subspace of its cluster, and sets the router weights to the cluster centroids. This breaks the initial symmetry among experts and aligns their early specialization with the underlying data distribution, which the paper reports yields more diverse and disentangled representations together with improved downstream performance.

What carries the argument

Cluster-aware initialization: semantic partitioning of dense activations followed by truncated SVD subspaces for expert weights and centroids for the router, plus an expert-ensemble self-distillation loss.
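
The initialization described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation. The helper names `assign`, `kmeans`, and `cluster_aware_init` are hypothetical, plain k-means stands in for whatever clustering the paper uses, and projecting the dense weight matrix onto each cluster's top singular directions is an assumed stand-in for the paper's expert construction, whose exact projection and scaling this page does not specify. The whitening step mentioned in the Figure 2 caption is omitted.

```python
import numpy as np


def assign(x, centroids):
    """Index of the nearest centroid (squared Euclidean) for every row of x."""
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means (assumed clustering). Returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = assign(x, centroids)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids, assign(x, centroids)


def cluster_aware_init(x, w_dense, num_experts, rank):
    """x: (n, d_in) dense-layer input activations; w_dense: (d_in, d_out)."""
    centroids, labels = kmeans(x, num_experts)
    expert_weights = []
    for j in range(num_experts):
        xj = x[labels == j]
        if len(xj) == 0:
            # Fallback for an empty cluster: plain copy of the dense weights.
            expert_weights.append(w_dense.copy())
            continue
        # Top-`rank` right singular vectors span the cluster's dominant
        # input subspace (truncated SVD of the cluster's activations).
        _, _, vt = np.linalg.svd(xj, full_matrices=False)
        basis = vt[:rank]                      # (<=rank, d_in)
        projector = basis.T @ basis            # (d_in, d_in)
        # ASSUMPTION: experts are the dense weights restricted to that subspace.
        expert_weights.append(projector @ w_dense)
    router = centroids                         # (num_experts, d_in)
    return np.stack(expert_weights), router, labels
```

Under this sketch the routing logits would be `x @ router.T`, so each token starts out favoring the expert whose centroid has the largest inner product with it, and no two experts begin from the same weights.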

If this is right

  • MoE models reach higher zero-shot and few-shot accuracy on vision tasks than standard sparse upcycling.
  • Expert representations become more diverse and less similar to one another.
  • The router exhibits more confident, less uniform routing decisions.
  • Training stability improves through the added self-distillation signal without changing the final architecture.
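
Two of the consequences above, less similar experts and more decisive routing, can be checked with generic diagnostics. The sketch below is an assumption-laden illustration: the function names are hypothetical, and these are standard measures (normalized softmax entropy and mean pairwise cosine similarity), not the paper's own metrics such as the Relative Compactness shown in Figure 4.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_routing_entropy(logits):
    """Router entropy divided by log(num_experts).

    1.0 means uniform (undecided) routing; values near 0.0 mean the router
    puts almost all probability on a single expert.
    """
    probs = softmax(logits)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all distinct pairs of expert vectors
    (for example, flattened expert weights or mean expert outputs)."""
    pairs = [
        cosine(vectors[i], vectors[j])
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
    ]
    return sum(pairs) / len(pairs)
```

Averaging `normalized_routing_entropy` over tokens and tracking `mean_pairwise_cosine` over training steps would show whether routing becomes less uniform and experts drift apart, which is the behavior the bullets describe.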

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same cluster-derived initialization could reduce the number of training steps needed before MoE models surpass dense baselines.
  • The method may transfer to language-model upcycling if token embeddings are clustered instead of image activations.
  • Dynamic re-clustering of activations during later training stages could further reduce expert interference.

Load-bearing premise

Partitioning activations into semantic clusters and seeding experts with the corresponding SVD subspaces will produce stable early specialization that matches the data distribution without new instabilities or heavy hyperparameter tuning.

What would settle it

A controlled run in which the same dense model is upcycled using random subspaces instead of cluster-derived SVD subspaces and the resulting zero-shot accuracy on the CLIP benchmarks is statistically indistinguishable from the cluster-aware version.
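
One way to make "statistically indistinguishable" concrete is a paired test over matched runs, for example the same benchmarks or seeds evaluated under both initializations. The sketch below uses an exact sign-flip permutation test. The function name and the choice of test are assumptions for illustration, not something the paper or this review specifies.

```python
from itertools import product


def paired_sign_flip_pvalue(acc_a, acc_b):
    """Exact two-sided sign-flip permutation test on paired accuracies.

    acc_a[i] and acc_b[i] must come from the same benchmark or seed under
    the two initializations being compared. Under the null hypothesis that
    the initializations are interchangeable, each paired difference is
    equally likely to have either sign. Enumerates 2**n sign patterns, so
    it is only practical for a modest number of pairs.
    """
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        # Small tolerance so float rounding does not drop exact ties.
        extreme += stat >= observed - 1e-12
        total += 1
    return extreme / total
```

A large p-value for cluster-derived versus random subspaces would support the falsifying outcome described above, while a small one would indicate that the cluster structure itself contributes to the gain.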

Figures

Figures reproduced from arXiv: 2604.13508 by Bohyung Han, Gwangmo Song, Honglak Lee, Pyunghwan Ahn, Sanghyeok Chu, SeungHwan Kim.

Figure 1: Comparison of Sparse Upcycling and Cluster-aware Up…
Figure 2: An illustration of how Cluster-aware Upcycling initializes the MoE layer. (a) Input activations are clustered to obtain whitening…
Figure 3: Expert-ensemble self-distillation (EESD). The dense…
Figure 4: (a) Relative Compactness measures the overlap between intra- and inter-expert variance, where lower values indicate more…
Figure 5: Detailed expert utilization across mixture-of-experts lay…
Original abstract

Sparse Upcycling provides an efficient way to initialize a Mixture-of-Experts (MoE) model from pretrained dense weights instead of training from scratch. However, since all experts start from identical weights and the router is randomly initialized, the model suffers from expert symmetry and limited early specialization. We propose Cluster-aware Upcycling, a strategy that incorporates semantic structure into MoE initialization. Our method first partitions the dense model's input activations into semantic clusters. Each expert is then initialized using the subspace representations of its corresponding cluster via truncated SVD, while setting the router's initial weights to the cluster centroids. This cluster-aware initialization breaks expert symmetry and encourages early specialization aligned with the data distribution. Furthermore, we introduce an expert-ensemble self-distillation loss that stabilizes training by providing reliable routing guidance using an ensemble teacher. When evaluated on CLIP ViT-B/32 and ViT-B/16, Cluster-aware Upcycling consistently outperforms existing methods across both zero-shot and few-shot benchmarks. The proposed method also produces more diverse and disentangled expert representations, reduces inter-expert similarity, and leads to more confident routing behavior. Project page: https://sanghyeokchu.github.io/cluster-aware-upcycling/
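
The abstract does not give the form of the expert-ensemble self-distillation loss, so the following is only a guess at its general shape, written to make the idea tangible. Everything here is an assumption: the function name `eesd_loss`, the use of linear experts, a uniform average over all experts as the ensemble teacher, softmax top-k gating for the student, and mean squared error as the distance. In a real training loop the teacher would be treated as a constant (no gradient).

```python
import numpy as np


def eesd_loss(x, expert_weights, router_weights, top_k=1):
    """Hypothetical ensemble-teacher distillation loss (forward pass only).

    x: (n, d_in) tokens; expert_weights: (E, d_in, d_out) linear experts;
    router_weights: (E, d_in).
    """
    # Output of every expert on every token: (E, n, d_out).
    outputs = np.einsum("nd,edo->eno", x, expert_weights)

    # ASSUMED teacher: uniform ensemble of all experts.
    teacher = outputs.mean(axis=0)

    # Router probabilities via a numerically stable softmax: (n, E).
    logits = x @ router_weights.T
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)

    # Keep only the top_k experts per token and renormalize the gates.
    top = np.argsort(-probs, axis=1)[:, :top_k]
    mask = np.zeros_like(probs)
    np.put_along_axis(mask, top, 1.0, axis=1)
    gates = probs * mask
    gates /= gates.sum(axis=1, keepdims=True)

    # ASSUMED student: sparse, gate-weighted mixture of the selected experts.
    student = np.einsum("ne,eno->no", gates, outputs)
    return float(((student - teacher) ** 2).mean())
```

With identical experts this loss is essentially zero, since every routing choice reproduces the ensemble. It becomes informative only once experts differ, which is the regime a symmetry-breaking initialization creates.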

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Cluster-aware Upcycling as an initialization strategy for Mixture-of-Experts (MoE) models derived from pretrained dense weights. It partitions the dense model's input activations into semantic clusters, initializes each expert from the truncated-SVD subspace of its cluster, sets the router weights to the cluster centroids, and adds an expert-ensemble self-distillation loss to stabilize training. The central claim is that this breaks expert symmetry, promotes early specialization aligned with the data distribution, and yields consistent outperformance over existing upcycling methods on CLIP ViT-B/32 and ViT-B/16 for both zero-shot and few-shot benchmarks, along with improved expert diversity, reduced inter-expert similarity, and more confident routing.

Significance. If the empirical gains are robust and causally attributable to the cluster-aware SVD initialization (rather than the self-distillation loss or generic regularization), the method would offer a practical, low-cost way to improve MoE specialization in vision models without training from scratch. It directly targets a known limitation of standard Sparse Upcycling. No machine-checked proofs or parameter-free derivations are present, but the project page is referenced as a potential source of further details.

major comments (2)
  1. [Abstract] Abstract: the claim that Cluster-aware Upcycling 'consistently outperforms existing methods across both zero-shot and few-shot benchmarks' is load-bearing for the paper's contribution, yet the abstract provides no quantitative results, error bars, ablation controls, or statistical tests. This makes it impossible to verify whether the reported gains arise from the semantic clustering + truncated-SVD subspaces or from the added self-distillation loss.
  2. [Experimental evaluation] Experimental evaluation (implied by the abstract's benchmark claims): the weakest assumption—that partitioning activations into semantic clusters and initializing via truncated SVD reliably produces early expert specialization without instabilities or heavy hyperparameter tuning—requires explicit ablations (e.g., random clustering baseline, full vs. truncated SVD, or comparison to standard upcycling with only the distillation loss). Without these, the headline improvements on CLIP benchmarks cannot be confidently attributed to the proposed initialization.
minor comments (2)
  1. [Method] The description of how expert weights are constructed from the truncated-SVD subspaces (e.g., exact projection or scaling) could be clarified for reproducibility.
  2. No mention of code or data release beyond the project page; adding a reproducibility statement would strengthen the submission.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating planned revisions to the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that Cluster-aware Upcycling 'consistently outperforms existing methods across both zero-shot and few-shot benchmarks' is load-bearing for the paper's contribution, yet the abstract provides no quantitative results, error bars, ablation controls, or statistical tests. This makes it impossible to verify whether the reported gains arise from the semantic clustering + truncated-SVD subspaces or from the added self-distillation loss.

    Authors: We agree that the abstract would be strengthened by including quantitative results. In the revised manuscript we will update the abstract to report the key performance deltas (e.g., average zero-shot and few-shot accuracy gains on ViT-B/32 and ViT-B/16) together with references to the corresponding tables. The full paper already contains component-wise comparisons that separate the initialization from the self-distillation loss; we will make these distinctions more explicit in the abstract and add a short sentence summarizing the ablation evidence. revision: yes

  2. Referee: [Experimental evaluation] Experimental evaluation (implied by the abstract's benchmark claims): the weakest assumption—that partitioning activations into semantic clusters and initializing via truncated SVD reliably produces early expert specialization without instabilities or heavy hyperparameter tuning—requires explicit ablations (e.g., random clustering baseline, full vs. truncated SVD, or comparison to standard upcycling with only the distillation loss). Without these, the headline improvements on CLIP benchmarks cannot be confidently attributed to the proposed initialization.

    Authors: We acknowledge the value of the suggested controls. The current experiments demonstrate reduced inter-expert similarity and more confident routing under our initialization, but we did not include a random-clustering baseline or an explicit full-versus-truncated SVD comparison. In the revision we will add these ablations, along with a direct comparison of standard upcycling trained with only the self-distillation loss, to isolate the contribution of the cluster-aware SVD initialization and to confirm that the method does not require extensive hyperparameter search. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical initialization method with independent evaluation

Full rationale

The paper proposes Cluster-aware Upcycling as a new MoE initialization procedure (partition activations, truncated SVD per cluster, router from centroids) plus a self-distillation loss, then reports empirical gains on CLIP zero-shot/few-shot benchmarks. No derivation chain exists that reduces a claimed result to its own fitted parameters or self-citations by construction. The central claims rest on experimental comparisons rather than algebraic identity or load-bearing self-reference. Minor self-citations to prior upcycling work are present but not used to justify uniqueness or forbid alternatives; they serve as background. This is the common case of a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical effectiveness of the proposed initialization; no new physical entities are introduced. The method assumes standard linear-algebra tools suffice for semantic subspace capture.

axioms (1)
  • Domain assumption: truncated SVD on cluster activations yields useful low-dimensional subspaces for expert weight initialization.
    Invoked when stating that each expert is initialized using the subspace representations of its corresponding cluster via truncated SVD.
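
Whether a truncated SVD captures a cluster well can be probed by asking how much of the activations' energy the leading singular directions retain. The helper below is a hypothetical diagnostic, not part of the paper. It works on uncentered activations and assumes the rank is chosen by the user.

```python
import numpy as np


def explained_energy_ratio(x, rank):
    """Fraction of squared singular-value mass kept by a rank-`rank`
    truncated SVD of the (uncentered) activation matrix x of shape (n, d)."""
    singular_values = np.linalg.svd(x, compute_uv=False)
    energy = singular_values ** 2
    return float(energy[:rank].sum() / energy.sum())
```

A ratio close to 1 at a small rank suggests the cluster lives in a low-dimensional subspace, which is the condition this axiom relies on. A ratio that grows slowly with rank would indicate that truncation discards a substantial share of the signal.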


