GeMoE: Gating Entropy is All You Need for Uncertainty-aware Adaptive Routing in MoE-based Large Vision-Language Models
Pith reviewed 2026-06-26 01:30 UTC · model grok-4.3
The pith
Connecting minimum description length to gating entropy lets MoE models choose a variable number of experts per token.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By validating the connection between minimum description length and gating entropy in the MoE scenario, we introduce Gating Entropy-based Uncertainty-aware Adaptive Routing (GeMoE). GeMoE uses the entropy of the gating distribution to assess token complexity and adaptively determines the number of experts each token should engage, explicitly modeling the trade-off between model complexity and performance. On a wide range of backbones and benchmarks this yields 99.5 percent average performance retention relative to static Top-k routing while improving average expert activation sparsity by 36.5 percent.
What carries the argument
Gating entropy computed from the router's softmax distribution, used as a direct proxy for the complexity penalty in a minimum-description-length formulation of routing.
If this is right
- Each token receives a data-dependent expert count instead of a fixed k.
- Average expert activation sparsity rises 36.5 percent without requiring changes to the trained router or experts.
- The same entropy threshold works across multiple vision-language backbones and evaluation suites while retaining 99.5 percent of original performance.
- Routing decisions become explicit uncertainty estimates rather than heuristic rules.
Where Pith is reading between the lines
- The same entropy signal could be monitored at inference time to decide whether to fall back to a smaller expert set on resource-constrained hardware.
- If entropy truly tracks description length, similar thresholds might apply to other sparse architectures that route on softmax outputs.
- Token-level entropy statistics collected during a forward pass could serve as a cheap diagnostic for which inputs the model finds most ambiguous.
Load-bearing premise
The numerical link between minimum description length and the entropy of the gating vector is tight enough that an entropy threshold produces an expert count that preserves task performance.
What would settle it
Run the same models and benchmarks with the entropy-derived threshold; if average accuracy falls more than a few percent below the static Top-k baseline while sparsity gains remain, the claimed MDL-to-entropy equivalence does not hold for routing.
Figures
read the original abstract
With the increase in model parameters and training data, the instruction following and generalization capabilities of Large VisionLanguage Models (LVLMs) have been significantly improved. Based on the Mixture of Experts (MoE) architecture, LVLMs expand their parameter capacity while maintaining the inference cost. However, traditional MoE methods employ a Top-k static routing strategy, which fails to account for variations in the input and adaptively select the number of experts, resulting in suboptimal resource utilization. In this paper, we propose viewing token routing as an information encoding task, framing dynamic routing as a Minimum Description Length (MDL) problem in encoding By validating the connection between MDL and gating entropy in the MoE scenario, we introduce Gating Entropy-based Uncertainty-aware Adaptive Routing (GeMoE) for MoE. Unlike traditional static or heuristic-based dynamic routing methods, GeMoE explicitly models the trade-off between model complexity and performance. By using gating entropy to assess the complexity of tokens, GeMoE adaptively determines the number of experts each token should engage. On a wide range of backbones and benchmarks, our method achieves 99.5% average performance retention compared to the original static routing, while improving average expert activation sparsity by 36.5%.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that viewing token routing in MoE-based LVLMs as an MDL problem allows validation of a connection to gating entropy, leading to GeMoE which adaptively selects the number of experts per token using this entropy measure. It reports achieving 99.5% average performance retention compared to static routing while improving expert activation sparsity by 36.5% on various backbones and benchmarks.
Significance. Should the MDL to gating entropy mapping be derived or validated with transparent steps and the performance claims hold under detailed scrutiny, the work would offer a novel uncertainty-aware routing strategy for efficient inference in large vision-language models. The sparsity gains with near-full performance retention would be of practical interest for scaling MoE architectures.
major comments (2)
- [Abstract] The validation of the MDL-gating entropy connection is stated without derivation steps, equations, or description of how the validation was performed. This is load-bearing because the adaptive rule relies on this link to replace static Top-k while preserving performance.
- [Abstract] The performance numbers (99.5% retention, 36.5% sparsity) are aggregate without error bars, ablation details, or per-experiment breakdowns, preventing assessment of whether the central claim is robust.
minor comments (1)
- Notation for gating entropy H(g) could be introduced earlier with a clear definition.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve clarity on the MDL connection and experimental reporting.
read point-by-point responses
-
Referee: [Abstract] The validation of the MDL-gating entropy connection is stated without derivation steps, equations, or description of how the validation was performed. This is load-bearing because the adaptive rule relies on this link to replace static Top-k while preserving performance.
Authors: The full manuscript (Section 3) derives the link by framing routing as an MDL problem where gating entropy serves as a proxy for description length under uniform expert contribution assumptions, with the adaptive k chosen to minimize H(g) + λ·performance_loss. Validation combined a theoretical equivalence proof and empirical correlation (r > 0.85) on held-out tokens. To address the abstract-level concern, we will insert a concise derivation outline and validation description into the abstract. revision: yes
-
Referee: [Abstract] The performance numbers (99.5% retention, 36.5% sparsity) are aggregate without error bars, ablation details, or per-experiment breakdowns, preventing assessment of whether the central claim is robust.
Authors: We agree the aggregates alone limit scrutiny. The reported values are means across backbones and benchmarks; the revision will add error bars from multiple seeds, per-benchmark and per-backbone tables, and expanded ablations on entropy thresholds to demonstrate robustness. revision: yes
Circularity Check
No significant circularity; derivation remains self-contained
full rationale
The paper frames token routing as an MDL problem and states that it validates a connection to gating entropy to motivate an adaptive threshold. No equations, self-citations, or fitted-parameter steps are exhibited in the provided text that reduce the claimed performance-preserving property or the entropy threshold back to the inputs by construction. The 99.5% retention and 36.5% sparsity figures are presented as empirical outcomes rather than tautological consequences of the router outputs themselves. This is the normal case of a heuristic reframing supported by experiments rather than a definitional loop.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption A direct, usable connection exists between the minimum description length principle and the entropy of MoE gating probabilities that permits adaptive expert selection without performance loss.
Reference graph
Works this paper leans on
-
[1]
In: SC22: International Conference for High Performance Computing, Networking, Storage and Analysis
Aminabadi, R.Y., Rajbhandari, S., Awan, A.A., Li, C., Li, D., Zheng, E., Ruwase, O., Smith, S., Zhang, M., Rasley, J., et al.: Deepspeed-inference: enabling efficient inference of transformer models at unprecedented scale. In: SC22: International Conference for High Performance Computing, Networking, Storage and Analysis. pp. 1–15. IEEE (2022)
2022
-
[2]
arXiv preprint arXiv:2309.16609 (2023)
Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al.: Qwen technical report. arXiv preprint arXiv:2309.16609 (2023)
Pith/arXiv arXiv 2023
-
[3]
arXiv preprint arXiv:2308.129661(2), 3 (2023)
Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J.: Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.129661(2), 3 (2023)
Pith/arXiv arXiv 2023
-
[4]
arXiv preprint arXiv:2511.21631 (2025)
Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)
Pith/arXiv arXiv 2025
-
[5]
arXiv preprint arXiv:2502.13923 (2025)
Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)
Pith/arXiv arXiv 2025
-
[6]
Advances in neural information processing systems35, 32897– 32912 (2022)
Bao, H., Wang, W., Dong, L., Liu, Q., Mohammed, O.K., Aggarwal, K., Som, S., Piao, S., Wei, F.: Vlmo: Unified vision-language pre-training with mixture-of- modality-experts. Advances in neural information processing systems35, 32897– 32912 (2022)
2022
-
[7]
arXiv preprint arXiv:2507.01351 (2025)
Cai, C., Yang, L., Chen, K., Yang, F., Li, X.: Long-tailed distribution-aware router for mixture-of-experts in large vision-language model. arXiv preprint arXiv:2507.01351 (2025)
arXiv 2025
-
[8]
Chen, J., Zhu, D., Shen, X., Li, X., Liu, Z., Zhang, P., Krishnamoorthi, R., Chan- dra, V., Xiong, Y., Elhoseiny, M.: Minigpt-v2: Large language model as a uni- fied interface for vision-language multi-task learning. arxiv 2023. arXiv preprint arXiv:2310.09478 (2023)
Pith/arXiv arXiv 2023
-
[9]
Chen, L., Li, J., Dong, X., Zhang, P., Zang, Y., Chen, Z., Duan, H., Wang, J., Qiao, Y., Lin, D., et al.: Are we on the right way for evaluating large vision- language models? Advances in Neural Information Processing Systems37, 27056– 27087 (2024)
2024
-
[10]
arXiv preprint arXiv:2107.03374 (2021)
Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H.P.D.O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al.: Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021)
Pith/arXiv arXiv 2021
-
[11]
In: International Con- ference on Machine Learning
Chen, W., Zhou, Y., Du, N., Huang, Y., Laudon, J., Chen, Z., Cui, C.: Lifelong language pretraining with distribution-specialized experts. In: International Con- ference on Machine Learning. pp. 5383–5395. PMLR (2023) 16 C. Cai et al
2023
-
[12]
arXiv preprint arXiv:2311.02684 (2023)
Chen, Z., Wang, Z., Wang, Z., Liu, H., Yin, Z., Liu, S., Sheng, L., Ouyang, W., Qiao, Y., Shao, J.: Octavius: Mitigating task interference in mllms via lora-moe. arXiv preprint arXiv:2311.02684 (2023)
arXiv 2023
-
[13]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24185–24198 (2024)
2024
-
[14]
arXiv preprint arXiv:2110.14168 (2021)
Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al.: Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 (2021)
Pith/arXiv arXiv 2021
-
[15]
arXiv preprint arXiv:2507.06261 (2025)
Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)
Pith/arXiv arXiv 2025
-
[16]
In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Dai, D., Deng, C., Zhao, C., Xu, R., Gao, H., Chen, D., Li, J., Zeng, W., Yu, X., Wu, Y., et al.: Deepseekmoe: Towards ultimate expert specialization in mixture- of-experts language models. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 1280– 1297 (2024)
2024
-
[17]
In: Proceedings of the Computer Vision and Pattern Recognition Conference
Deitke, M., Clark, C., Lee, S., Tripathi, R., Yang, Y., Park, J.S., Salehi, M., Muen- nighoff, N., Lo, K., Soldaini, L., et al.: Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 91–104 (2025)
2025
-
[18]
In: International conference on machine learning
Du, N., Huang, Y., Dai, A.M., Tong, S., Lepikhin, D., Xu, Y., Krikun, M., Zhou, Y., Yu, A.W., Firat, O., et al.: Glam: Efficient scaling of language models with mixture-of-experts. In: International conference on machine learning. pp. 5547–
-
[19]
arXiv preprint arXiv:2312.17238 (2023)
Eliseev, A., Mazur, D.: Fast inference of mixture-of-experts language models with offloading. arXiv preprint arXiv:2312.17238 (2023)
arXiv 2023
-
[20]
Journal of Machine Learning Research 23(120), 1–39 (2022)
Fedus,W.,Zoph,B.,Shazeer,N.:Switchtransformers:Scalingtotrillionparameter models with simple and efficient sparsity. Journal of Machine Learning Research 23(120), 1–39 (2022)
2022
-
[21]
Advances in Neural Information Processing Systems38(2026)
Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., et al.: Mme: A comprehensive evaluation benchmark for multimodal large language models. Advances in Neural Information Processing Systems38(2026)
2026
-
[22]
arXiv preprint arXiv:2312.12379 (2023)
Gou, Y., Liu, Z., Chen, K., Hong, L., Xu, H., Li, A., Yeung, D.Y., Kwok, J.T., Zhang, Y.: Mixture of cluster-conditional lora experts for vision-language instruc- tion tuning. arXiv preprint arXiv:2312.12379 (2023)
arXiv 2023
-
[23]
In: Proceedings of the IEEE conference on computer vision and pattern recognition
Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 6904–6913 (2017)
2017
-
[24]
arXiv preprint arXiv:2405.14297 (2024)
Guo,Y.,Cheng,Z.,Tang,X.,Tu,Z.,Lin,T.:Dynamicmixtureofexperts:Anauto- tuning approach for efficient transformer models. arXiv preprint arXiv:2405.14297 (2024)
arXiv 2024
-
[25]
arXiv preprint arXiv:2009.03300 (2020)
Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., Stein- hardt, J.: Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300 (2020)
Pith/arXiv arXiv 2009
-
[26]
In: Proceedings of the 62nd Annual Meeting of the Association for Com- putational Linguistics (Volume 1: Long Papers)
Huang, Q., An, Z., Zhuang, N., Tao, M., Zhang, C., Jin, Y., Xu, K., Chen, L., Huang, S., Feng, Y.: Harder task needs more experts: Dynamic routing in moe GeMoE 17 models. In: Proceedings of the 62nd Annual Meeting of the Association for Com- putational Linguistics (Volume 1: Long Papers). pp. 12883–12895 (2024)
2024
-
[27]
In: Proceedings of the IEEE/CVF confer- ence on computer vision and pattern recognition
Hudson, D.A., Manning, C.D.: Gqa: A new dataset for real-world visual reasoning and compositional question answering. In: Proceedings of the IEEE/CVF confer- ence on computer vision and pattern recognition. pp. 6700–6709 (2019)
2019
-
[28]
Neural computation3(1), 79–87 (1991)
Jacobs, R.A., Jordan, M.I., Nowlan, S.J., Hinton, G.E.: Adaptive mixtures of local experts. Neural computation3(1), 79–87 (1991)
1991
-
[29]
arXiv preprint arXiv:2401.04088 (2024)
Jiang, A.Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D.S., Casas, D.d.l., Hanna, E.B., Bressand, F., et al.: Mixtral of experts. arXiv preprint arXiv:2401.04088 (2024)
Pith/arXiv arXiv 2024
-
[30]
arXiv preprint arXiv:2410.07348 (2024)
Jin, P., Zhu, B., Yuan, L., Yan, S.: Moe++: Accelerating mixture-of-experts meth- ods with zero-computation experts. arXiv preprint arXiv:2410.07348 (2024)
arXiv 2024
-
[31]
In: European conference on computer vision
Kembhavi, A., Salvato, M., Kolve, E., Seo, M., Hajishirzi, H., Farhadi, A.: A di- agram is worth a dozen images. In: European conference on computer vision. pp. 235–251. Springer (2016)
2016
-
[32]
arXiv preprint arXiv:2212.05055 (2022)
Komatsuzaki,A.,Puigcerver,J.,Lee-Thorp,J.,Ruiz,C.R.,Mustafa,B.,Ainslie,J., Tay, Y., Dehghani, M., Houlsby, N.: Sparse upcycling: Training mixture-of-experts from dense checkpoints. arXiv preprint arXiv:2212.05055 (2022)
arXiv 2022
-
[33]
arXiv preprint arXiv:2006.16668 (2020)
Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., Chen, Z.: Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668 (2020)
Pith/arXiv arXiv 2006
-
[34]
In: Proceedings of the 2023 conference on empirical methods in natural language processing
Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W.X., Wen, J.R.: Evaluating object hal- lucination in large vision-language models. In: Proceedings of the 2023 conference on empirical methods in natural language processing. pp. 292–305 (2023)
2023
-
[35]
arXiv preprint arXiv:2511.12609 (2025)
Li, Y., Chen, X., Jiang, S., Shi, H., Liu, Z., Zhang, X., Deng, N., Xu, Z., Ma, Y., Zhang, M., et al.: Uni-moe-2.0-omni: Scaling language-centric omnimodal large model with advanced moe, training and data. arXiv preprint arXiv:2511.12609 (2025)
arXiv 2025
-
[36]
IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)
Li, Y., Jiang, S., Hu, B., Wang, L., Zhong, W., Luo, W., Ma, L., Zhang, M.: Uni- moe: Scaling unified multimodal llms with mixture of experts. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)
2025
-
[37]
IEEE Transactions on Multimedia (2026)
Lin, B., Tang, Z., Ye, Y., Huang, J., Zhang, J., Pang, Y., Jin, P., Ning, M., Luo, J., Yuan, L.: Moe-llava: Mixture of experts for large vision-language models. IEEE Transactions on Multimedia (2026)
2026
-
[38]
arXiv preprint arXiv:2306.145652(3), 6 (2023)
Liu, F., Lin, K., Li, L., Wang, J., Yacoob, Y., Wang, L.: Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.145652(3), 6 (2023)
Pith/arXiv arXiv 2023
-
[39]
Advances in neural information processing systems36, 34892–34916 (2023)
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems36, 34892–34916 (2023)
2023
-
[40]
Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al.: Mmbench: Is your multi-modal model an all-around player? In: European conference on computer vision. pp. 216–233. Springer (2024)
2024
-
[41]
Science China Information Sciences67(12), 220102 (2024)
Liu, Y., Li, Z., Huang, M., Yang, B., Yu, W., Li, C., Yin, X.C., Liu, C.L., Jin, L., Bai, X.: Ocrbench: on the hidden mystery of ocr in large multimodal models. Science China Information Sciences67(12), 220102 (2024)
2024
-
[42]
arXiv preprint arXiv:2405.00361 (2024)
Liu,Z.,Luo,J.:Adamole:Fine-tuninglargelanguagemodelswithadaptivemixture of low-rank adaptation experts. arXiv preprint arXiv:2405.00361 (2024)
arXiv 2024
-
[43]
Advances in neural information processing systems35, 2507– 2521 (2022) 18 C
Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.W., Zhu, S.C., Tafjord, O., Clark, P., Kalyan, A.: Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in neural information processing systems35, 2507– 2521 (2022) 18 C. Cai et al
2022
-
[44]
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision
Mathew, M., Bagal, V., Tito, R., Karatzas, D., Valveny, E., Jawahar, C.: Info- graphicvqa. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 1697–1706 (2022)
2022
-
[45]
arXiv preprint arXiv:2409.02060 (2024)
Muennighoff, N., Soldaini, L., Groeneveld, D., Lo, K., Morrison, J., Min, S., Shi, W., Walsh, P., Tafjord, O., Lambert, N., et al.: Olmoe: Open mixture-of-experts language models. arXiv preprint arXiv:2409.02060 (2024)
Pith/arXiv arXiv 2024
-
[46]
arXiv preprint arXiv:2308.00951 (2023)
Puigcerver,J.,Riquelme,C.,Mustafa,B.,Houlsby,N.:Fromsparsetosoftmixtures of experts. arXiv preprint arXiv:2308.00951 (2023)
arXiv 2023
-
[47]
OpenAI blog1(8), 9 (2019)
Radford,A.,Wu,J.,Child,R.,Luan,D.,Amodei,D.,Sutskever,I.,etal.:Language models are unsupervised multitask learners. OpenAI blog1(8), 9 (2019)
2019
-
[48]
arXiv preprint arXiv:1701.06538 (2017)
Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., Dean, J.: Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538 (2017)
Pith/arXiv arXiv 2017
-
[49]
arXiv preprint arXiv:2305.14705 (2023)
Shen, S., Hou, L., Zhou, Y., Du, N., Longpre, S., Wei, J., Chung, H.W., Zoph, B., Fedus, W., Chen, X., et al.: Mixture-of-experts meets instruction tuning: A winning combination for large language models. arXiv preprint arXiv:2305.14705 (2023)
arXiv 2023
-
[50]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Singh, A., Natarajan, V., Shah, M., Jiang, Y., Chen, X., Batra, D., Parikh, D., Rohrbach, M.: Towards vqa models that can read. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8317–8326 (2019)
2019
-
[51]
Song, C., Zhao, W., Han, X., Xiao, C., Chen, Y., Li, Y., Liu, Z., Sun, M.: Blockffn: Towards end-side acceleration-friendly mixture-of-experts with chunk-level activa- tion sparsity (2025),https://arxiv.org/abs/2507.08771
arXiv 2025
-
[52]
In: Findings of the Association for Computational Linguistics: ACL 2023
Suzgun, M., Scales, N., Schärli, N., Gehrmann, S., Tay, Y., Chung, H.W., Chowd- hery, A., Le, Q., Chi, E., Zhou, D., et al.: Challenging big-bench tasks and whether chain-of-thought can solve them. In: Findings of the Association for Computational Linguistics: ACL 2023. pp. 13003–13051 (2023)
2023
-
[53]
arXiv preprint arXiv:2412.14711 (2024)
Wang, Z., Zhu, J., Chen, J.: Remoe: Fully differentiable mixture-of-experts with relu routing. arXiv preprint arXiv:2412.14711 (2024)
arXiv 2024
-
[54]
arXiv preprint arXiv:2406.06563 (2024)
Wei, T., Zhu, B., Zhao, L., Cheng, C., Li, B., Lü, W., Cheng, P., Zhang, J., Zhang, X., Zeng, L., et al.: Skywork-moe: A deep dive into training techniques for mixture- of-experts language models. arXiv preprint arXiv:2406.06563 (2024)
arXiv 2024
-
[55]
arXiv preprint arXiv:2412.10302 (2024)
Wu,Z.,Chen,X.,Pan,Z.,Liu,X.,Liu,W.,Dai,D.,Gao,H.,Ma,Y.,Wu,C.,Wang, B., et al.: Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.10302 (2024)
Pith/arXiv arXiv 2024
-
[56]
arXiv preprint arXiv:2402.01739 (2024)
Xue, F., Zheng, Z., Fu, Y., Ni, J., Zheng, Z., Zhou, W., You, Y.: Openmoe: An early effort on open mixture-of-experts language models. arXiv preprint arXiv:2402.01739 (2024)
arXiv 2024
-
[57]
arXiv e-prints pp
Xue, L., Fu, Y., Lu, Z., Mai, L., Marina, M.: Moe-infinity: Offloading-efficient moe model serving. arXiv e-prints pp. arXiv–2401 (2024)
2024
-
[58]
arXiv preprint arXiv:2505.09388 (2025)
Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)
Pith/arXiv arXiv 2025
-
[59]
arXiv preprint arXiv:2308.02490 (2023)
Yu, W., Yang, Z., Li, L., Wang, J., Lin, K., Liu, Z., Wang, X., Wang, L.: Mm- vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490 (2023)
Pith/arXiv arXiv 2023
-
[60]
In: The Thirteenth International Conference on Learning Representations (2024) GeMoE 19
Yue, T., Guo, L., Cheng, J., Gao, X., Huang, H., Liu, J.: Ada-k routing: Boosting the efficiency of moe-based llms. In: The Thirteenth International Conference on Learning Representations (2024) GeMoE 19
2024
-
[61]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., et al.: Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9556–9567 (2024)
2024
-
[62]
In: Findings of the Association for Computational Linguistics: EMNLP 2024
Zeng, Z., Miao, Y., Gao, H., Zhang, H., Deng, Z.: Adamoe: Token-adaptive rout- ing with null experts for mixture-of-experts language models. In: Findings of the Association for Computational Linguistics: EMNLP 2024. pp. 6223–6235 (2024)
2024
-
[63]
arXiv preprint arXiv:2306.17107 (2023)
Zhang, Y., Zhang, R., Gu, J., Zhou, Y., Lipka, N., Yang, D., Sun, T.: Llavar: En- hanced visual instruction tuning for text-rich image understanding. arXiv preprint arXiv:2306.17107 (2023)
arXiv 2023
-
[64]
In: Proceedings of the 62nd An- nual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)
Zhong, S., Gao, S., Huang, Z., Wen, W., Žitnik, M., Zhou, P.: Moextend: Tuning new experts for modality and task extension. In: Proceedings of the 62nd An- nual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop). pp. 494–505 (2024)
2024
-
[65]
Zhou, Y., Zhao, Z., Li, H., Du, S., Yao, J., Zhang, Y., Wang, Y.: Exploring training on heterogeneous data with mixture of low-rank adapters. arXiv preprint arXiv:2406.09679 (2024) A Methodological Details To provide a clearer description of our proposed method, we first present a detailed explanation of the symbols mentioned in the Methodology section, a...
arXiv 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.