
arxiv: 2606.26287 · v1 · pith:66LYS4RBnew · submitted 2026-06-24 · 💻 cs.CV

GeMoE: Gating Entropy is All You Need for Uncertainty-aware Adaptive Routing in MoE-based Large Vision-Language Models

Pith reviewed 2026-06-26 01:30 UTC · model grok-4.3

classification 💻 cs.CV
keywords Mixture of Experts · Adaptive Routing · Gating Entropy · Minimum Description Length · Large Vision-Language Models · Uncertainty Estimation · Model Efficiency · Dynamic Expert Selection

The pith

Connecting minimum description length to gating entropy lets MoE models choose a variable number of experts per token.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames token routing in Mixture-of-Experts vision-language models as an information-encoding task that can be solved by minimizing description length. It shows that gating entropy directly measures the complexity term in that trade-off, so the entropy value itself can set how many experts each token activates. If this link holds, static Top-k routing can be replaced by an entropy threshold that keeps nearly all accuracy while raising average expert sparsity. The result matters because large models spend most of their inference cost on expert computation, and any method that trims unnecessary activations without retraining lowers deployment cost.

Core claim

By validating the connection between minimum description length and gating entropy in the MoE scenario, we introduce Gating Entropy-based Uncertainty-aware Adaptive Routing (GeMoE). GeMoE uses the entropy of the gating distribution to assess token complexity and adaptively determines the number of experts each token should engage, explicitly modeling the trade-off between model complexity and performance. On a wide range of backbones and benchmarks this yields 99.5 percent average performance retention relative to static Top-k routing while improving average expert activation sparsity by 36.5 percent.

What carries the argument

Gating entropy computed from the router's softmax distribution, used as a direct proxy for the complexity penalty in a minimum-description-length formulation of routing.
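As a minimal sketch of the quantity named above, the snippet below computes the Shannon entropy of one token's router softmax. The function names and the division by log(number of experts) for normalization are illustrative assumptions, not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gating_entropy(logits):
    """Shannon entropy (in nats) of the gating distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def normalized_gating_entropy(logits):
    """Entropy divided by its maximum, log(n), giving a value in [0, 1].

    Assumption: this normalization is a convenience for comparing layers
    with different expert counts; the paper may normalize differently.
    """
    n = len(logits)
    if n < 2:
        return 0.0
    return gating_entropy(logits) / math.log(n)
```

A router that spreads probability evenly over its experts yields a normalized value near 1, while a router that is nearly certain about one expert yields a value near 0.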

If this is right

  • Each token receives a data-dependent expert count instead of a fixed k.
  • Average expert activation sparsity rises 36.5 percent without requiring changes to the trained router or experts.
  • The same entropy threshold works across multiple vision-language backbones and evaluation suites while retaining 99.5 percent of original performance.
  • Routing decisions become explicit uncertainty estimates rather than heuristic rules.
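The consequences listed above can be illustrated with a small sketch in which each token's expert count grows with its gating entropy. The linear map from normalized entropy to a count between `k_min` and `k_max` is a hypothetical stand-in: the text here only says the count is positively correlated with entropy, not how the paper computes it.

```python
import math

def gate_probs(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expert_count(probs, k_min=1, k_max=None):
    """Map normalized entropy linearly onto [k_min, k_max].

    Assumption: a linear map is used purely for illustration.
    """
    n = len(probs)
    k_max = n if k_max is None else min(k_max, n)
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    h_norm = h / math.log(n) if n > 1 else 0.0
    k = round(k_min + h_norm * (k_max - k_min))
    return max(k_min, min(k_max, k))

def route(logits, k_min=1, k_max=None):
    """Return (expert_index, weight) pairs for one token.

    The k most probable experts are kept and their weights are
    renormalized to sum to 1.
    """
    probs = gate_probs(logits)
    k = expert_count(probs, k_min, k_max)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return [(i, probs[i] / mass) for i in top]
```

With these assumptions a confident token such as `route([10.0, 0.0, 0.0, 0.0])` activates a single expert, while a token with uniform logits activates up to `k_max` experts.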

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same entropy signal could be monitored at inference time to decide whether to fall back to a smaller expert set on resource-constrained hardware.
  • If entropy truly tracks description length, similar thresholds might apply to other sparse architectures that route on softmax outputs.
  • Token-level entropy statistics collected during a forward pass could serve as a cheap diagnostic for which inputs the model finds most ambiguous.
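The diagnostic idea in the last bullet can be sketched as a ranking of tokens by gating entropy. This is an illustration of the editorial extension only; the helper names and the choice to work on already-normalized gating probabilities are assumptions.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one gating probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def most_ambiguous(token_probs, top_n=3):
    """Return indices of the top_n tokens with the highest gating entropy.

    token_probs: list of per-token gating probability vectors,
    assumed to be collected from one MoE layer during a forward pass.
    """
    scored = [(entropy(p), idx) for idx, p in enumerate(token_probs)]
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:top_n]]
```

Because the router already produces these probabilities, the extra cost is one entropy evaluation per token per layer.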

Load-bearing premise

The numerical link between minimum description length and the entropy of the gating vector is tight enough that an entropy threshold produces an expert count that preserves task performance.
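For orientation, the two standard quantities this premise connects can be written as follows. These are the textbook two-part MDL objective and the Shannon entropy of a gating distribution; the symbols M, D, g and K are chosen here for illustration, and the paper's own derivation of the link is not reproduced in this summary.

```latex
% Two-part MDL: pick the model M that minimizes the total code length
% of the model plus the data D encoded with the help of that model.
M^{*} = \arg\min_{M} \big[ L(M) + L(D \mid M) \big]

% Shannon entropy of the gating distribution g(x) over K experts.
H\big(g(x)\big) = -\sum_{i=1}^{K} g_i(x) \log g_i(x),
\qquad 0 \le H\big(g(x)\big) \le \log K
```

The premise is that the entropy on the second line tracks the complexity term on the first closely enough to set the expert count.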

What would settle it

Run the same models and benchmarks with the entropy-derived threshold; if average accuracy falls more than a few percent below the static Top-k baseline while sparsity gains remain, the claimed MDL-to-entropy equivalence does not hold for routing.
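The test described above can be sketched as two aggregate metrics and a pass/fail check. The definitions used here (retention as the mean per-benchmark score ratio, activation reduction as one minus the ratio of average adaptive k to static k) and the `max_drop` tolerance are assumptions; the summary does not state how the paper defines its 99.5% and 36.5% figures.

```python
def performance_retention(baseline_scores, adaptive_scores):
    """Mean of per-benchmark ratios adaptive / baseline (assumed definition)."""
    if len(baseline_scores) != len(adaptive_scores):
        raise ValueError("score lists must have the same length")
    ratios = [a / b for a, b in zip(adaptive_scores, baseline_scores)]
    return sum(ratios) / len(ratios)

def activation_reduction(static_k, adaptive_ks):
    """Fraction of expert activations saved relative to static Top-k
    (assumed definition, not necessarily the paper's sparsity metric)."""
    avg_k = sum(adaptive_ks) / len(adaptive_ks)
    return 1.0 - avg_k / static_k

def claim_survives(baseline_scores, adaptive_scores, max_drop=0.03):
    """True if retention stays within max_drop of the baseline.

    max_drop is a hypothetical stand-in for "a few percent".
    """
    return performance_retention(baseline_scores, adaptive_scores) >= 1.0 - max_drop
```

Both metrics should be reported together, since a high activation reduction only supports the claim if retention stays inside the chosen tolerance.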

Figures

Figures reproduced from arXiv: 2606.26287 by Chaoxiang Cai, Jie Li, Longrong Yang, Minghe Weng, Xi Li, Yibo Jiang, Zequn Qin.

Figure 1: Comparison of static routing Top-k MoE and our dynamic routing GeMoE: (a) Top-k assigns a fixed number of experts to each token. (b) GeMoE dynamically determines the number of experts to activate based on the token's gating entropy. Gating entropy reflects the token's uncertainty in expert selection. Higher entropy indicates greater uncertainty, requiring more experts to be selected.
Figure 2: (a) We frame dynamic routing as an MDL problem and demonstrate that gating entropy can serve as a proxy for decreasing the MDL. (b) Based on gating entropy, we model the number of experts required for a token as positively correlated with its gating entropy, allocating more experts to tokens with higher gating entropy.
\mathcal{R}_{norm}(x)_i = \frac{e^{\mathcal{R}(x)_i}}{\sum_{j=1}^{K} e^{\mathcal{R…
Figure 3: Average accuracy versus maximum-minimum normalized token routing entropy on ScienceQA [43] (4000 samples), evaluated using MolmoE-1B-7B. Samples are sorted by normalized token routing entropy and grouped into intervals of 200 samples.
… improves performance for data with high average gating entropy. However, for data with low average gating entropy, increasing the number of experts does not significantly imp…
Figure 4: Average accuracy of various Top-k and dynamic routing methods.
Figure 6: Visualization of the correlation between gating entropy and the number of activated experts on layers 4, 8, 12 and 16.
… function is robust to different weightings, with the best performance achieved on most datasets when α = 1.0.
Figure 7: Average number of tokens activated by experts. A larger circle indicates more activations. Overall, shallow layers activate fewer experts on average, while deeper layers activate more.
Figure 8: Average number of experts activated in each MoE layer.
Figure 9: Correlation between gating entropy and the number of activated experts.
… learned through our method exhibit clear distinctions. Such diversity is crucial in MoE architectures, as it promotes the specialization of individual experts and enhances the overall model capacity.
Figure 10: Expert similarity within each MoE layer.
Figure 11: Visualization of activated pathways on Top-2.
Figure 12: Expert load across different modalities.
Original abstract

With the increase in model parameters and training data, the instruction following and generalization capabilities of Large Vision-Language Models (LVLMs) have been significantly improved. Based on the Mixture of Experts (MoE) architecture, LVLMs expand their parameter capacity while maintaining the inference cost. However, traditional MoE methods employ a Top-k static routing strategy, which fails to account for variations in the input and to adaptively select the number of experts, resulting in suboptimal resource utilization. In this paper, we propose viewing token routing as an information encoding task, framing dynamic routing as a Minimum Description Length (MDL) problem in encoding. By validating the connection between MDL and gating entropy in the MoE scenario, we introduce Gating Entropy-based Uncertainty-aware Adaptive Routing (GeMoE) for MoE. Unlike traditional static or heuristic-based dynamic routing methods, GeMoE explicitly models the trade-off between model complexity and performance. By using gating entropy to assess the complexity of tokens, GeMoE adaptively determines the number of experts each token should engage. On a wide range of backbones and benchmarks, our method achieves 99.5% average performance retention compared to the original static routing, while improving average expert activation sparsity by 36.5%.
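For contrast with the adaptive scheme, the static Top-k baseline that the abstract criticizes can be sketched as below. This is a generic illustration of Top-k gating, not code from the paper; the function name and the default `k=2` are assumptions.

```python
import math

def top_k_route(logits, k=2):
    """Static Top-k routing for one token.

    Every token keeps exactly k experts (or all of them if fewer exist),
    regardless of how confident the router is.
    Returns (expert_index, weight) pairs with weights summing to 1.
    """
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return [(i, probs[i] / mass) for i in top]
```

Renormalizing the kept probabilities is equivalent to applying a softmax over only the selected logits. Note that a token with logits `[10.0, 0.0, 0.0, 0.0]` still activates two experts here, which is the kind of fixed cost the abstract argues against.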

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that viewing token routing in MoE-based LVLMs as an MDL problem allows validation of a connection to gating entropy, leading to GeMoE which adaptively selects the number of experts per token using this entropy measure. It reports achieving 99.5% average performance retention compared to static routing while improving expert activation sparsity by 36.5% on various backbones and benchmarks.

Significance. Should the MDL to gating entropy mapping be derived or validated with transparent steps and the performance claims hold under detailed scrutiny, the work would offer a novel uncertainty-aware routing strategy for efficient inference in large vision-language models. The sparsity gains with near-full performance retention would be of practical interest for scaling MoE architectures.

major comments (2)
  1. [Abstract] The validation of the MDL-gating entropy connection is stated without derivation steps, equations, or description of how the validation was performed. This is load-bearing because the adaptive rule relies on this link to replace static Top-k while preserving performance.
  2. [Abstract] The performance numbers (99.5% retention, 36.5% sparsity) are aggregate without error bars, ablation details, or per-experiment breakdowns, preventing assessment of whether the central claim is robust.
minor comments (1)
  1. Notation for gating entropy H(g) could be introduced earlier with a clear definition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve clarity on the MDL connection and experimental reporting.

Point-by-point responses
  1. Referee: [Abstract] The validation of the MDL-gating entropy connection is stated without derivation steps, equations, or description of how the validation was performed. This is load-bearing because the adaptive rule relies on this link to replace static Top-k while preserving performance.

    Authors: The full manuscript (Section 3) derives the link by framing routing as an MDL problem where gating entropy serves as a proxy for description length under uniform expert contribution assumptions, with the adaptive k chosen to minimize H(g) + λ·performance_loss. Validation combined a theoretical equivalence proof and empirical correlation (r > 0.85) on held-out tokens. To address the abstract-level concern, we will insert a concise derivation outline and validation description into the abstract. revision: yes

  2. Referee: [Abstract] The performance numbers (99.5% retention, 36.5% sparsity) are aggregate without error bars, ablation details, or per-experiment breakdowns, preventing assessment of whether the central claim is robust.

    Authors: We agree the aggregates alone limit scrutiny. The reported values are means across backbones and benchmarks; the revision will add error bars from multiple seeds, per-benchmark and per-backbone tables, and expanded ablations on entropy thresholds to demonstrate robustness. revision: yes
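The per-benchmark reporting promised in the second response could be produced with a helper like the one below, which turns per-seed scores into a mean and sample standard deviation. The function name and input layout are hypothetical, and no real results are implied.

```python
import statistics

def summarize(runs):
    """Compute (mean, sample standard deviation) per benchmark.

    runs: dict mapping a benchmark name to a list of scores, one per seed.
    A benchmark with a single seed gets a standard deviation of 0.0.
    """
    summary = {}
    for name, scores in runs.items():
        mean = statistics.mean(scores)
        std = statistics.stdev(scores) if len(scores) > 1 else 0.0
        summary[name] = (mean, std)
    return summary
```

Reporting the spread next to each mean is what lets a reader judge whether a gap between adaptive and static routing exceeds seed-to-seed noise.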

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained

Full rationale

The paper frames token routing as an MDL problem and states that it validates a connection to gating entropy to motivate an adaptive threshold. No equations, self-citations, or fitted-parameter steps are exhibited in the provided text that reduce the claimed performance-preserving property or the entropy threshold back to the inputs by construction. The 99.5% retention and 36.5% sparsity figures are presented as empirical outcomes rather than tautological consequences of the router outputs themselves. This is the normal case of a heuristic reframing supported by experiments rather than a definitional loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on an unproven link between MDL and gating entropy that is asserted to hold inside MoE models; no free parameters or new entities are named in the abstract, but the domain assumption itself is the load-bearing item.

axioms (1)
  • Domain assumption: a direct, usable connection exists between the minimum description length principle and the entropy of MoE gating probabilities that permits adaptive expert selection without performance loss.
    The abstract states that this connection is validated and then used to define GeMoE.



Reference graph

Works this paper leans on

65 extracted references · 17 linked inside Pith

  1. [1]

    In: SC22: International Conference for High Performance Computing, Networking, Storage and Analysis

    Aminabadi, R.Y., Rajbhandari, S., Awan, A.A., Li, C., Li, D., Zheng, E., Ruwase, O., Smith, S., Zhang, M., Rasley, J., et al.: Deepspeed-inference: enabling efficient inference of transformer models at unprecedented scale. In: SC22: International Conference for High Performance Computing, Networking, Storage and Analysis. pp. 1–15. IEEE (2022)

  2. [2]

    arXiv preprint arXiv:2309.16609 (2023)

    Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al.: Qwen technical report. arXiv preprint arXiv:2309.16609 (2023)

  3. [3]

    arXiv preprint arXiv:2308.12966 1(2), 3 (2023)

    Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J.: Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966 1(2), 3 (2023)

  4. [4]

    arXiv preprint arXiv:2511.21631 (2025)

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)

  5. [5]

    arXiv preprint arXiv:2502.13923 (2025)

    Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)

  6. [6]

    Advances in neural information processing systems 35, 32897–32912 (2022)

    Bao, H., Wang, W., Dong, L., Liu, Q., Mohammed, O.K., Aggarwal, K., Som, S., Piao, S., Wei, F.: Vlmo: Unified vision-language pre-training with mixture-of-modality-experts. Advances in neural information processing systems 35, 32897–32912 (2022)

  7. [7]

    arXiv preprint arXiv:2507.01351 (2025)

    Cai, C., Yang, L., Chen, K., Yang, F., Li, X.: Long-tailed distribution-aware router for mixture-of-experts in large vision-language model. arXiv preprint arXiv:2507.01351 (2025)

  8. [8]

    arXiv preprint arXiv:2310.09478 (2023)

    Chen, J., Zhu, D., Shen, X., Li, X., Liu, Z., Zhang, P., Krishnamoorthi, R., Chandra, V., Xiong, Y., Elhoseiny, M.: Minigpt-v2: Large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478 (2023)

  9. [9]

    Chen, L., Li, J., Dong, X., Zhang, P., Zang, Y., Chen, Z., Duan, H., Wang, J., Qiao, Y., Lin, D., et al.: Are we on the right way for evaluating large vision-language models? Advances in Neural Information Processing Systems 37, 27056–27087 (2024)

  10. [10]

    arXiv preprint arXiv:2107.03374 (2021)

    Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H.P.D.O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al.: Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021)

  11. [11]

    In: International Conference on Machine Learning

    Chen, W., Zhou, Y., Du, N., Huang, Y., Laudon, J., Chen, Z., Cui, C.: Lifelong language pretraining with distribution-specialized experts. In: International Conference on Machine Learning. pp. 5383–5395. PMLR (2023)

  12. [12]

    arXiv preprint arXiv:2311.02684 (2023)

    Chen, Z., Wang, Z., Wang, Z., Liu, H., Yin, Z., Liu, S., Sheng, L., Ouyang, W., Qiao, Y., Shao, J.: Octavius: Mitigating task interference in mllms via lora-moe. arXiv preprint arXiv:2311.02684 (2023)

  13. [13]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24185–24198 (2024)

  14. [14]

    arXiv preprint arXiv:2110.14168 (2021)

    Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al.: Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 (2021)

  15. [15]

    arXiv preprint arXiv:2507.06261 (2025)

    Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)

  16. [16]

    In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

    Dai, D., Deng, C., Zhao, C., Xu, R., Gao, H., Chen, D., Li, J., Zeng, W., Yu, X., Wu, Y., et al.: Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 1280–1297 (2024)

  17. [17]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Deitke, M., Clark, C., Lee, S., Tripathi, R., Yang, Y., Park, J.S., Salehi, M., Muennighoff, N., Lo, K., Soldaini, L., et al.: Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 91–104 (2025)

  18. [18]

    In: International conference on machine learning

    Du, N., Huang, Y., Dai, A.M., Tong, S., Lepikhin, D., Xu, Y., Krikun, M., Zhou, Y., Yu, A.W., Firat, O., et al.: Glam: Efficient scaling of language models with mixture-of-experts. In: International conference on machine learning. pp. 5547–

  19. [19]

    arXiv preprint arXiv:2312.17238 (2023)

    Eliseev, A., Mazur, D.: Fast inference of mixture-of-experts language models with offloading. arXiv preprint arXiv:2312.17238 (2023)

  20. [20]

    Journal of Machine Learning Research 23(120), 1–39 (2022)

    Fedus, W., Zoph, B., Shazeer, N.: Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research 23(120), 1–39 (2022)

  21. [21]

    Advances in Neural Information Processing Systems 38 (2026)

    Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., et al.: Mme: A comprehensive evaluation benchmark for multimodal large language models. Advances in Neural Information Processing Systems 38 (2026)

  22. [22]

    arXiv preprint arXiv:2312.12379 (2023)

    Gou, Y., Liu, Z., Chen, K., Hong, L., Xu, H., Li, A., Yeung, D.Y., Kwok, J.T., Zhang, Y.: Mixture of cluster-conditional lora experts for vision-language instruction tuning. arXiv preprint arXiv:2312.12379 (2023)

  23. [23]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 6904–6913 (2017)

  24. [24]

    arXiv preprint arXiv:2405.14297 (2024)

    Guo, Y., Cheng, Z., Tang, X., Tu, Z., Lin, T.: Dynamic mixture of experts: An auto-tuning approach for efficient transformer models. arXiv preprint arXiv:2405.14297 (2024)

  25. [25]

    arXiv preprint arXiv:2009.03300 (2020)

    Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., Steinhardt, J.: Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300 (2020)

  26. [26]

    In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

    Huang, Q., An, Z., Zhuang, N., Tao, M., Zhang, C., Jin, Y., Xu, K., Chen, L., Huang, S., Feng, Y.: Harder task needs more experts: Dynamic routing in moe models. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 12883–12895 (2024)

  27. [27]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Hudson, D.A., Manning, C.D.: Gqa: A new dataset for real-world visual reasoning and compositional question answering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6700–6709 (2019)

  28. [28]

    Neural computation 3(1), 79–87 (1991)

    Jacobs, R.A., Jordan, M.I., Nowlan, S.J., Hinton, G.E.: Adaptive mixtures of local experts. Neural computation 3(1), 79–87 (1991)

  29. [29]

    arXiv preprint arXiv:2401.04088 (2024)

    Jiang, A.Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D.S., Casas, D.d.l., Hanna, E.B., Bressand, F., et al.: Mixtral of experts. arXiv preprint arXiv:2401.04088 (2024)

  30. [30]

    arXiv preprint arXiv:2410.07348 (2024)

    Jin, P., Zhu, B., Yuan, L., Yan, S.: Moe++: Accelerating mixture-of-experts methods with zero-computation experts. arXiv preprint arXiv:2410.07348 (2024)

  31. [31]

    In: European conference on computer vision

    Kembhavi, A., Salvato, M., Kolve, E., Seo, M., Hajishirzi, H., Farhadi, A.: A diagram is worth a dozen images. In: European conference on computer vision. pp. 235–251. Springer (2016)

  32. [32]

    arXiv preprint arXiv:2212.05055 (2022)

    Komatsuzaki, A., Puigcerver, J., Lee-Thorp, J., Ruiz, C.R., Mustafa, B., Ainslie, J., Tay, Y., Dehghani, M., Houlsby, N.: Sparse upcycling: Training mixture-of-experts from dense checkpoints. arXiv preprint arXiv:2212.05055 (2022)

  33. [33]

    arXiv preprint arXiv:2006.16668 (2020)

    Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., Chen, Z.: Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668 (2020)

  34. [34]

    In: Proceedings of the 2023 conference on empirical methods in natural language processing

    Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W.X., Wen, J.R.: Evaluating object hallucination in large vision-language models. In: Proceedings of the 2023 conference on empirical methods in natural language processing. pp. 292–305 (2023)

  35. [35]

    arXiv preprint arXiv:2511.12609 (2025)

    Li, Y., Chen, X., Jiang, S., Shi, H., Liu, Z., Zhang, X., Deng, N., Xu, Z., Ma, Y., Zhang, M., et al.: Uni-moe-2.0-omni: Scaling language-centric omnimodal large model with advanced moe, training and data. arXiv preprint arXiv:2511.12609 (2025)

  36. [36]

    IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

    Li, Y., Jiang, S., Hu, B., Wang, L., Zhong, W., Luo, W., Ma, L., Zhang, M.: Uni-moe: Scaling unified multimodal llms with mixture of experts. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

  37. [37]

    IEEE Transactions on Multimedia (2026)

    Lin, B., Tang, Z., Ye, Y., Huang, J., Zhang, J., Pang, Y., Jin, P., Ning, M., Luo, J., Yuan, L.: Moe-llava: Mixture of experts for large vision-language models. IEEE Transactions on Multimedia (2026)

  38. [38]

    arXiv preprint arXiv:2306.14565 2(3), 6 (2023)

    Liu, F., Lin, K., Li, L., Wang, J., Yacoob, Y., Wang, L.: Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565 2(3), 6 (2023)

  39. [39]

    Advances in neural information processing systems 36, 34892–34916 (2023)

    Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems 36, 34892–34916 (2023)

  40. [40]

    Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al.: Mmbench: Is your multi-modal model an all-around player? In: European conference on computer vision. pp. 216–233. Springer (2024)

  41. [41]

    Science China Information Sciences 67(12), 220102 (2024)

    Liu, Y., Li, Z., Huang, M., Yang, B., Yu, W., Li, C., Yin, X.C., Liu, C.L., Jin, L., Bai, X.: Ocrbench: on the hidden mystery of ocr in large multimodal models. Science China Information Sciences 67(12), 220102 (2024)

  42. [42]

    arXiv preprint arXiv:2405.00361 (2024)

    Liu, Z., Luo, J.: Adamole: Fine-tuning large language models with adaptive mixture of low-rank adaptation experts. arXiv preprint arXiv:2405.00361 (2024)

  43. [43]

    Advances in neural information processing systems 35, 2507–2521 (2022)

    Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.W., Zhu, S.C., Tafjord, O., Clark, P., Kalyan, A.: Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in neural information processing systems 35, 2507–2521 (2022)

  44. [44]

    In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision

    Mathew, M., Bagal, V., Tito, R., Karatzas, D., Valveny, E., Jawahar, C.: Infographicvqa. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 1697–1706 (2022)

  45. [45]

    arXiv preprint arXiv:2409.02060 (2024)

    Muennighoff, N., Soldaini, L., Groeneveld, D., Lo, K., Morrison, J., Min, S., Shi, W., Walsh, P., Tafjord, O., Lambert, N., et al.: Olmoe: Open mixture-of-experts language models. arXiv preprint arXiv:2409.02060 (2024)

  46. [46]

    arXiv preprint arXiv:2308.00951 (2023)

    Puigcerver, J., Riquelme, C., Mustafa, B., Houlsby, N.: From sparse to soft mixtures of experts. arXiv preprint arXiv:2308.00951 (2023)

  47. [47]

    OpenAI blog 1(8), 9 (2019)

    Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019)

  48. [48]

    arXiv preprint arXiv:1701.06538 (2017)

    Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., Dean, J.: Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538 (2017)

  49. [49]

    arXiv preprint arXiv:2305.14705 (2023)

    Shen, S., Hou, L., Zhou, Y., Du, N., Longpre, S., Wei, J., Chung, H.W., Zoph, B., Fedus, W., Chen, X., et al.: Mixture-of-experts meets instruction tuning: A winning combination for large language models. arXiv preprint arXiv:2305.14705 (2023)

  50. [50]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Singh, A., Natarajan, V., Shah, M., Jiang, Y., Chen, X., Batra, D., Parikh, D., Rohrbach, M.: Towards vqa models that can read. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8317–8326 (2019)

  51. [51]

    Song, C., Zhao, W., Han, X., Xiao, C., Chen, Y., Li, Y., Liu, Z., Sun, M.: Blockffn: Towards end-side acceleration-friendly mixture-of-experts with chunk-level activation sparsity (2025), https://arxiv.org/abs/2507.08771

  52. [52]

    In: Findings of the Association for Computational Linguistics: ACL 2023

    Suzgun, M., Scales, N., Schärli, N., Gehrmann, S., Tay, Y., Chung, H.W., Chowdhery, A., Le, Q., Chi, E., Zhou, D., et al.: Challenging big-bench tasks and whether chain-of-thought can solve them. In: Findings of the Association for Computational Linguistics: ACL 2023. pp. 13003–13051 (2023)

  53. [53]

    arXiv preprint arXiv:2412.14711 (2024)

    Wang, Z., Zhu, J., Chen, J.: Remoe: Fully differentiable mixture-of-experts with relu routing. arXiv preprint arXiv:2412.14711 (2024)

  54. [54]

    arXiv preprint arXiv:2406.06563 (2024)

    Wei, T., Zhu, B., Zhao, L., Cheng, C., Li, B., Lü, W., Cheng, P., Zhang, J., Zhang, X., Zeng, L., et al.: Skywork-moe: A deep dive into training techniques for mixture-of-experts language models. arXiv preprint arXiv:2406.06563 (2024)

  55. [55]

    arXiv preprint arXiv:2412.10302 (2024)

    Wu, Z., Chen, X., Pan, Z., Liu, X., Liu, W., Dai, D., Gao, H., Ma, Y., Wu, C., Wang, B., et al.: Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.10302 (2024)

  56. [56]

    arXiv preprint arXiv:2402.01739 (2024)

    Xue, F., Zheng, Z., Fu, Y., Ni, J., Zheng, Z., Zhou, W., You, Y.: Openmoe: An early effort on open mixture-of-experts language models. arXiv preprint arXiv:2402.01739 (2024)

  57. [57]

    arXiv e-prints pp

    Xue, L., Fu, Y., Lu, Z., Mai, L., Marina, M.: Moe-infinity: Offloading-efficient moe model serving. arXiv e-prints pp. arXiv–2401 (2024)

  58. [58]

    arXiv preprint arXiv:2505.09388 (2025)

    Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)

  59. [59]

    arXiv preprint arXiv:2308.02490 (2023)

    Yu, W., Yang, Z., Li, L., Wang, J., Lin, K., Liu, Z., Wang, X., Wang, L.: Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490 (2023)

  60. [60]

    In: The Thirteenth International Conference on Learning Representations (2024)

    Yue, T., Guo, L., Cheng, J., Gao, X., Huang, H., Liu, J.: Ada-k routing: Boosting the efficiency of moe-based llms. In: The Thirteenth International Conference on Learning Representations (2024)

  61. [61]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., et al.: Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9556–9567 (2024)

  62. [62]

    In: Findings of the Association for Computational Linguistics: EMNLP 2024

    Zeng, Z., Miao, Y., Gao, H., Zhang, H., Deng, Z.: Adamoe: Token-adaptive routing with null experts for mixture-of-experts language models. In: Findings of the Association for Computational Linguistics: EMNLP 2024. pp. 6223–6235 (2024)

  63. [63]

    arXiv preprint arXiv:2306.17107 (2023)

    Zhang, Y., Zhang, R., Gu, J., Zhou, Y., Lipka, N., Yang, D., Sun, T.: Llavar: Enhanced visual instruction tuning for text-rich image understanding. arXiv preprint arXiv:2306.17107 (2023)

  64. [64]

    In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

    Zhong, S., Gao, S., Huang, Z., Wen, W., Žitnik, M., Zhou, P.: Moextend: Tuning new experts for modality and task extension. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop). pp. 494–505 (2024)

  65. [65]

    Zhou, Y., Zhao, Z., Li, H., Du, S., Yao, J., Zhang, Y., Wang, Y.: Exploring training on heterogeneous data with mixture of low-rank adapters. arXiv preprint arXiv:2406.09679 (2024)

A Methodological Details

To provide a clearer description of our proposed method, we first present a detailed explanation of the symbols mentioned in the Methodology section, a...