Pith · machine review for the scientific record

arxiv: 2401.15947 · v5 · submitted 2024-01-29 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 02:28 UTC · model grok-4.3

classification 💻 cs.CV
keywords mixture of experts · vision-language models · sparse activation · large multimodal models · object hallucination · efficient scaling · MoE-Tuning

The pith

A sparse vision-language model activates only 3 billion parameters yet matches the performance of a 7 billion parameter dense model on visual tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MoE-Tuning, a training approach that lets large vision-language models adopt sparse expert activation without the usual loss in capability. This produces models with a huge total parameter count but fixed computational demands, since only the top-k experts are used for any given token. The resulting MoE-LLaVA architecture routes inputs to a small active subset of experts while leaving the rest inactive. On standard visual understanding benchmarks the 3B-active-parameter version performs at the level of the dense LLaVA-1.5-7B model and exceeds the LLaVA-1.5-13B model on object hallucination tests. The work therefore offers a concrete route to scaling multi-modal capacity without proportional growth in training or inference cost.
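
To make the cost argument concrete, the arithmetic below sketches how total parameters grow with the expert count while active parameters stay fixed under top-k routing. The numbers are illustrative round figures, not the paper's exact configuration.

    # Illustrative MoE parameter accounting; all sizes are hypothetical
    # round numbers, not the paper's released configurations.
    def moe_param_counts(base_params, ffn_params, num_experts, top_k):
        """Total vs. active parameters when each FFN block is replaced by
        num_experts expert copies, of which top_k run per token."""
        shared = base_params - ffn_params           # attention, embeddings, norms
        total = shared + num_experts * ffn_params   # all experts held in memory
        active = shared + top_k * ffn_params        # only top-k experts compute
        return total, active

    # A ~1.6B dense model whose FFN layers hold ~1.0B of its parameters:
    for experts in (4, 8, 16):
        total, active = moe_param_counts(1.6e9, 1.0e9, experts, top_k=2)
        print(f"{experts} experts: total {total/1e9:.1f}B, active {active/1e9:.1f}B")
    # Total scales with the expert count; active stays at 2.6B, so per-token
    # FLOPs are constant no matter how large the expert pool grows.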

Core claim

MoE-LLaVA is a sparse large vision-language model constructed via the MoE-Tuning strategy, which overcomes the performance degradation normally seen when sparsity is applied to multi-modal models. During inference only the top-k experts are activated through learned routers, keeping the remaining experts inactive and thereby holding computational cost constant regardless of total parameter count. With roughly 3B sparsely activated parameters the model reaches performance comparable to LLaVA-1.5-7B across visual understanding datasets and surpasses LLaVA-1.5-13B on object hallucination benchmarks.

What carries the argument

MoE-Tuning training strategy combined with router-based top-k expert selection that keeps only a fixed number of experts active per token.
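
A minimal sketch of that mechanism in PyTorch follows. Layer names, sizes, and the dense per-expert loop are illustrative assumptions for clarity, not the released MoE-LLaVA implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TopKMoELayer(nn.Module):
        """Sketch of a sparse MoE feed-forward layer: a linear router scores
        every expert per token, and only the top-k experts execute."""
        def __init__(self, d_model, d_ffn, num_experts, top_k):
            super().__init__()
            self.top_k = top_k
            self.router = nn.Linear(d_model, num_experts, bias=False)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_ffn), nn.GELU(),
                              nn.Linear(d_ffn, d_model))
                for _ in range(num_experts)
            )

        def forward(self, x):
            # x: (tokens, d_model). Router produces a distribution over experts.
            logits = self.router(x)                       # (tokens, num_experts)
            weights = F.softmax(logits, dim=-1)
            topk_w, topk_idx = weights.topk(self.top_k, dim=-1)
            topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)  # renormalize

            out = torch.zeros_like(x)
            for slot in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = topk_idx[:, slot] == e         # tokens routed here
                    if mask.any():
                        out[mask] += topk_w[mask, slot].unsqueeze(-1) * expert(x[mask])
            return out  # experts outside the top-k never run for this token

    layer = TopKMoELayer(d_model=64, d_ffn=256, num_experts=4, top_k=2)
    tokens = torch.randn(10, 64)
    print(layer(tokens).shape)  # torch.Size([10, 64])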

If this is right

  • Vision-language models can scale total parameters far beyond active compute without proportional increases in training or inference cost.
  • Expert specialization under MoE-Tuning can reduce object hallucination more effectively than simply enlarging a dense model.
  • Constant computational cost at inference time enables deployment of models with very large expert pools, provided memory can hold the full pool and only top-k experts are activated per token.
  • The same sparsity pattern can be applied to other multi-modal tasks while retaining dense-model accuracy levels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If routing decisions prove sensitive to input modality, the architecture could naturally support mixed vision-text-audio inputs with minimal extra overhead.
  • Extending MoE-Tuning to video or 3D understanding would test whether the same stability holds when temporal or spatial structure is added.
  • The fixed active-parameter budget suggests that future versions could increase total experts while keeping inference latency unchanged, provided router quality scales.

Load-bearing premise

The MoE-Tuning strategy reliably prevents the performance degradation that sparsity normally causes in multi-modal models.

What would settle it

Training an equivalent dense LLaVA-style model and an MoE-Tuned sparse version on the same data and observing that the sparse version falls measurably behind on the reported visual understanding and hallucination benchmarks would falsify the central claim.

Original abstract

Recent advances demonstrate that scaling Large Vision-Language Models (LVLMs) effectively improves downstream task performances. However, existing scaling methods enable all model parameters to be active for each token in the calculation, which brings massive training and inferring costs. In this work, we propose a simple yet effective training strategy MoE-Tuning for LVLMs. This strategy innovatively addresses the common issue of performance degradation in multi-modal sparsity learning, consequently constructing a sparse model with an outrageous number of parameters but a constant computational cost. Furthermore, we present the MoE-LLaVA, a MoE-based sparse LVLM architecture, which uniquely activates only the top-k experts through routers during deployment, keeping the remaining experts inactive. Extensive experiments show the significant performance of MoE-LLaVA in a variety of visual understanding and object hallucination benchmarks. Remarkably, with only approximately 3B sparsely activated parameters, MoE-LLaVA demonstrates performance comparable to the LLaVA-1.5-7B on various visual understanding datasets and even surpasses the LLaVA-1.5-13B in object hallucination benchmark. Through MoE-LLaVA, we aim to establish a baseline for sparse LVLMs and provide valuable insights for future research in developing more efficient and effective multi-modal learning systems. Code is released at https://github.com/PKU-YuanGroup/MoE-LLaVA.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes MoE-Tuning, a training strategy claimed to prevent performance degradation when applying sparsity to large vision-language models. It introduces the MoE-LLaVA architecture that activates only the top-k experts via routers at inference time. With approximately 3B sparsely activated parameters, the model is reported to match LLaVA-1.5-7B on visual understanding benchmarks and surpass LLaVA-1.5-13B on object hallucination benchmarks, while releasing code for reproducibility.

Significance. If the central claims hold after additional controls, the work would provide a practical baseline for sparse LVLMs that increase total parameter count without raising inference FLOPs. The public code release supports reproducibility and follow-on research in efficient multi-modal scaling.

major comments (2)
  1. [3.2] §3.2: The claim that MoE-Tuning addresses the common issue of performance degradation in multi-modal sparsity learning is load-bearing for the headline result, yet the manuscript provides no controlled ablation of an otherwise identical sparse architecture trained without the proposed MoE-Tuning schedule on the same VQA, GQA, and POPE splits.
  2. [4] §4: Benchmark tables report point estimates for MoE-LLaVA versus LLaVA-1.5-7B/13B without error bars, standard deviations across runs, or explicit confirmation that baseline models were re-implemented with identical data, token counts, and hyperparameters, preventing verification that the ~3B-active-parameter model truly retains dense-model capability.
minor comments (2)
  1. Abstract: The phrase 'outrageous number of parameters' is informal; replace with a precise statement of total versus active parameter counts.
  2. [3] Notation: The router and expert-capacity details in the architecture description would benefit from an explicit equation or pseudocode block to clarify top-k selection and load balancing.
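
For reference, the pseudocode requested in minor comment 2 usually takes the following form in the MoE literature. This is the standard Switch/ST-MoE-style formulation, assumed here as what the paper likely uses rather than quoted from it.

    % Router distribution over N experts for token x (W_r is the learned router):
    \mathcal{P}(\mathbf{x}) = \mathrm{softmax}(W_r \mathbf{x})
    % Top-k gating: experts outside the top-k get zero weight and do not execute.
    \mathcal{G}(\mathbf{x})_i =
      \begin{cases}
        \mathcal{P}(\mathbf{x})_i, & i \in \operatorname{top\text{-}k}\big(\mathcal{P}(\mathbf{x})\big) \\
        0, & \text{otherwise}
      \end{cases}
    % Layer output mixes the selected experts E_i:
    \mathbf{y} = \sum_{i=1}^{N} \mathcal{G}(\mathbf{x})_i \, E_i(\mathbf{x})
    % Auxiliary load-balancing loss: f_i is the fraction of tokens dispatched
    % to expert i, \bar{p}_i the mean router probability assigned to it.
    \mathcal{L}_{\mathrm{aux}} = N \sum_{i=1}^{N} f_i \, \bar{p}_i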

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us improve the clarity and rigor of our work. We address the major comments below and have made revisions to the manuscript accordingly.

Point-by-point responses
  1. Referee: [3.2] §3.2: The claim that MoE-Tuning addresses the common issue of performance degradation in multi-modal sparsity learning is load-bearing for the headline result, yet the manuscript provides no controlled ablation of an otherwise identical sparse architecture trained without the proposed MoE-Tuning schedule on the same VQA, GQA, and POPE splits.

    Authors: We agree that a controlled ablation would provide stronger support for the effectiveness of MoE-Tuning in preventing performance degradation. In the revised manuscript, we will include an ablation study comparing the performance of the sparse architecture trained with and without the MoE-Tuning schedule on the VQA, GQA, and POPE benchmarks. revision: yes

  2. Referee: [4] §4: Benchmark tables report point estimates for MoE-LLaVA versus LLaVA-1.5-7B/13B without error bars, standard deviations across runs, or explicit confirmation that baseline models were re-implemented with identical data, token counts, and hyperparameters, preventing verification that the ~3B-active-parameter model truly retains dense-model capability.

    Authors: The LLaVA-1.5 baseline results are taken from the original publication to ensure consistent evaluation protocols. We have revised Section 4 to explicitly state that our training setup, including data and hyperparameters, follows the LLaVA-1.5 configuration. Due to the substantial computational resources required for multiple training runs, we report single-run results; however, we will add standard deviations from repeated inference evaluations on the test sets in the revised tables. revision: partial
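
For scale, the promised revision amounts to reporting a mean and sample standard deviation over repeated inference evaluations, along these lines. The scores below are made-up placeholders, not the paper's results.

    import statistics

    # Hypothetical accuracy scores from five repeated inference evaluations
    # (placeholder values only; not reported numbers).
    runs = [76.4, 76.1, 76.7, 76.3, 76.5]
    mean = statistics.mean(runs)
    std = statistics.stdev(runs)        # sample standard deviation (n - 1)
    print(f"{mean:.1f} ± {std:.1f}")    # 76.4 ± 0.2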

Circularity Check

0 steps flagged

No circularity: empirical architecture and benchmark results are self-contained

Full rationale

The paper introduces an MoE architecture and MoE-Tuning training procedure for LVLMs, then reports measured performance on standard held-out visual understanding and hallucination benchmarks. No mathematical derivation chain exists that reduces a claimed prediction or first-principles result to its own inputs by construction. No self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims rest on direct experimental outcomes rather than internal re-derivations, satisfying the criteria for a non-circular empirical study.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard transformer and MoE assumptions plus the empirical effectiveness of the two-stage tuning schedule; no new physical constants or invented entities are introduced.

free parameters (2)
  • number of experts
    Chosen architecture hyperparameter that determines total parameter count while active count stays fixed.
  • top-k value
    Routing hyperparameter controlling sparsity level during inference.
axioms (2)
  • standard math — Standard transformer attention and feed-forward blocks remain unchanged except for replacement of FFN with MoE layers.
    Invoked when describing the base LLaVA architecture.
  • domain assumption — Top-k expert selection via learned router produces stable training when combined with MoE-Tuning.
    Central to avoiding the performance drop mentioned in the abstract.
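
Both free parameters typically surface as a pair of config fields; the paper's main configuration uses four experts with top-2 routing. The sketch below is hypothetical (field names assumed, not the repository's actual schema).

    from dataclasses import dataclass

    @dataclass
    class MoEConfig:
        # The ledger's two free parameters.
        num_experts: int = 4   # grows the total parameter count
        top_k: int = 2         # pins active compute per token

    cfg = MoEConfig()
    assert cfg.top_k <= cfg.num_experts  # routing cannot select more experts than exist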

pith-pipeline@v0.9.0 · 5578 in / 1380 out tokens · 52654 ms · 2026-05-16T02:28:13.774116+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    cs.CV 2024-09 accept novelty 8.0

    Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

  2. On the Optimality of Hierarchical Secure Aggregation with Arbitrary Heterogeneous Data Assignment

    cs.IT 2026-04 unverdicted novelty 7.0

    A hierarchical secure aggregation scheme with arbitrary heterogeneous data assignment achieves optimal two-layer communication loads under information-theoretic security against collusions and dropouts.

  3. VisMMoE: Exploiting Visual-Expert Affinity for Efficient Visual-Language MoE Offloading

    cs.LG 2026-05 unverdicted novelty 6.0

    VisMMoE exploits visual-expert affinity via token pruning to achieve up to 2.68x faster VL-MoE inference on memory-constrained hardware while keeping accuracy competitive.

  4. Learngene Search Across Multiple Datasets for Building Variable-Sized Models

    cs.LG 2026-05 unverdicted novelty 6.0

    LSAMD searches a multi-dataset super Ans-Net to extract frequently selected base blocks as learngenes that initialize variable-sized Des-Nets with performance comparable to full pretrain-finetune at lower storage and ...

  5. State Beyond Appearance: Diagnosing and Improving State Consistency in Dial-Based Measurement Reading

    cs.CV 2026-04 unverdicted novelty 6.0

    MLLMs ignore dial state geometry and cluster by appearance, causing inconsistency under variations; TriSCA's state-distance alignment, metadata supervision, and objective alignment improve robustness on clock and gaug...

  6. SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    SMoES improves MoE-VLM performance and efficiency via soft modality-guided expert routing and inter-bin mutual information regularization, yielding 0.9-4.2% task gains and 56% communication reduction.

  7. Adapting 2D Multi-Modal Large Language Model for 3D CT Image Analysis

    cs.CV 2026-04 unverdicted novelty 6.0

    Transferring a 2D MLLM to 3D CT inputs via parameter reuse, a Text-Guided Hierarchical MoE framework, and two-stage training yields better performance than prior 3D medical MLLMs on medical report generation and visua...

  8. MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    MP-ISMoE uses Gaussian noise perturbed iterative quantization and interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter an...

  9. CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

    cs.CV 2026-04 unverdicted novelty 6.0

    CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.

  10. ImgEdit: A Unified Image Editing Dataset and Benchmark

    cs.CV 2025-05 conditional novelty 6.0

    ImgEdit supplies 1.2 million curated edit pairs and a three-part benchmark that let a VLM-based model outperform prior open-source editors on adherence, quality, and detail preservation.

  11. Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    cs.CV 2023-11 unverdicted novelty 6.0

    Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.

  12. An Efficient Token Compression Framework for Visual Object Tracking

    cs.CV 2026-05 unverdicted novelty 5.0

    ETCTrack compresses template tokens by 60% in visual trackers via an adaptive compressor and hierarchical interaction, cutting MACs 21.4% with 0.4% accuracy drop on seven benchmarks.

  13. SemLT3D: Semantic-Guided Expert Distillation for Camera-only Long-Tailed 3D Object Detection

    cs.CV 2026-04 unverdicted novelty 5.0

    SemLT3D introduces semantic-guided expert distillation with a language MoE module and CLIP projection to enrich features for long-tailed classes in camera-only 3D detection.

  14. CoGR-MoE: Concept-Guided Expert Routing with Consistent Selection and Flexible Reasoning for Visual Question Answering

    cs.CV 2026-04 unverdicted novelty 5.0

    CoGR-MoE improves VQA by using concept-guided expert routing with option feature reweighting and contrastive learning to achieve consistent yet flexible reasoning across answer options.

  15. Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation

    cs.CV 2026-04 unverdicted novelty 5.0

    Firebolt-VL introduces an LFM-based decoder and token-grid correlation to achieve linear-time vision-language inference with improved fine-grained grounding.

  16. UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

    cs.CV 2025-06 unverdicted novelty 5.0

    UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.

  17. PaliGemma: A versatile 3B VLM for transfer

    cs.CV 2024-07 unverdicted novelty 4.0

    PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.

  18. How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    cs.CV 2024-04 unverdicted novelty 4.0

    InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

  19. A Survey on Multimodal Large Language Models

    cs.CV 2023-06 accept novelty 3.0

    This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.


    Exhibition Board of MoE-LLaV A.MoE-LLaV A demonstrates the ability to detect and answer challenging questions when prompted to verify them. Visual input example, Tricky Question and Image: User If there are factual errors in the questions, point it out; if not, proceed answering the question. What’s happening in the desert? LLaV A-1.5 There are no deserts...