Pith · machine review for the scientific record

arxiv: 2401.15947 · v5 · submitted 2024-01-29 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 02:28 UTC · model grok-4.3

classification 💻 cs.CV
keywords mixture of experts · vision-language models · sparse activation · large multimodal models · object hallucination · efficient scaling · MoE-Tuning

The pith

A sparse vision-language model activates only 3 billion parameters yet matches the performance of a 7 billion parameter dense model on visual tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MoE-Tuning, a training approach that lets large vision-language models adopt sparse expert activation without the usual loss in capability. This produces models with a huge total parameter count but fixed computational demands, since only the top-k experts are used for any given token. The resulting MoE-LLaVA architecture routes inputs to a small active subset of experts while leaving the rest inactive. On standard visual understanding benchmarks the 3B-active-parameter version performs at the level of the dense LLaVA-1.5-7B model and exceeds the LLaVA-1.5-13B model on object hallucination tests. The work therefore offers a concrete route to scaling multi-modal capacity without proportional growth in training or inference cost.
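
To make the cost argument concrete, the arithmetic below sketches how total parameters grow with the expert count while active parameters stay fixed under top-k routing. The numbers are illustrative round figures, not the paper's exact configuration.

    # Illustrative MoE parameter accounting; all sizes are hypothetical
    # round numbers, not the paper's released configurations.
    def moe_param_counts(base_params, ffn_params, num_experts, top_k):
        """Total vs. active parameters when each FFN block is replaced by
        num_experts expert copies, of which top_k run per token."""
        shared = base_params - ffn_params           # attention, embeddings, norms
        total = shared + num_experts * ffn_params   # all experts held in memory
        active = shared + top_k * ffn_params        # only top-k experts compute
        return total, active

    # A ~1.6B dense model whose FFN layers hold ~1.0B of its parameters:
    for experts in (4, 8, 16):
        total, active = moe_param_counts(1.6e9, 1.0e9, experts, top_k=2)
        print(f"{experts} experts: total {total/1e9:.1f}B, active {active/1e9:.1f}B")
    # Total scales with the expert count; active stays at 2.6B, so per-token
    # FLOPs are constant no matter how large the expert pool grows.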

Core claim

MoE-LLaVA is a sparse large vision-language model constructed via the MoE-Tuning strategy, which overcomes the performance degradation normally seen when sparsity is applied to multi-modal models. During inference only the top-k experts are activated through learned routers, keeping the remaining experts inactive and thereby holding computational cost constant regardless of total parameter count. With roughly 3B sparsely activated parameters the model reaches performance comparable to LLaVA-1.5-7B across visual understanding datasets and surpasses LLaVA-1.5-13B on object hallucination benchmarks.

What carries the argument

MoE-Tuning training strategy combined with router-based top-k expert selection that keeps only a fixed number of experts active per token.
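
A minimal sketch of that mechanism in PyTorch follows. Layer names, sizes, and the dense per-expert loop are illustrative assumptions for clarity, not the released MoE-LLaVA implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TopKMoELayer(nn.Module):
        """Sketch of a sparse MoE feed-forward layer: a linear router scores
        every expert per token, and only the top-k experts execute."""
        def __init__(self, d_model, d_ffn, num_experts, top_k):
            super().__init__()
            self.top_k = top_k
            self.router = nn.Linear(d_model, num_experts, bias=False)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_ffn), nn.GELU(),
                              nn.Linear(d_ffn, d_model))
                for _ in range(num_experts)
            )

        def forward(self, x):
            # x: (tokens, d_model). Router produces a distribution over experts.
            logits = self.router(x)                       # (tokens, num_experts)
            weights = F.softmax(logits, dim=-1)
            topk_w, topk_idx = weights.topk(self.top_k, dim=-1)
            topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)  # renormalize

            out = torch.zeros_like(x)
            for slot in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = topk_idx[:, slot] == e         # tokens routed here
                    if mask.any():
                        out[mask] += topk_w[mask, slot].unsqueeze(-1) * expert(x[mask])
            return out  # experts outside the top-k never run for this token

    layer = TopKMoELayer(d_model=64, d_ffn=256, num_experts=4, top_k=2)
    tokens = torch.randn(10, 64)
    print(layer(tokens).shape)  # torch.Size([10, 64])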

If this is right

  • Vision-language models can scale total parameters far beyond active compute without proportional increases in training or inference cost.
  • Expert specialization under MoE-Tuning can reduce object hallucination more effectively than simply enlarging a dense model.
  • Constant computational cost at inference time enables deployment of models with very large expert pools, provided memory can hold the full pool and only top-k experts are activated per token.
  • The same sparsity pattern can be applied to other multi-modal tasks while retaining dense-model accuracy levels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If routing decisions prove sensitive to input modality, the architecture could naturally support mixed vision-text-audio inputs with minimal extra overhead.
  • Extending MoE-Tuning to video or 3D understanding would test whether the same stability holds when temporal or spatial structure is added.
  • The fixed active-parameter budget suggests that future versions could increase total experts while keeping inference latency unchanged, provided router quality scales.

Load-bearing premise

The MoE-Tuning strategy reliably prevents the performance degradation that sparsity normally causes in multi-modal models.

What would settle it

Training an equivalent dense LLaVA-style model and an MoE-Tuned sparse version on the same data and observing that the sparse version falls measurably behind on the reported visual understanding and hallucination benchmarks would falsify the central claim.

Original abstract

Recent advances demonstrate that scaling Large Vision-Language Models (LVLMs) effectively improves downstream task performances. However, existing scaling methods enable all model parameters to be active for each token in the calculation, which brings massive training and inferring costs. In this work, we propose a simple yet effective training strategy MoE-Tuning for LVLMs. This strategy innovatively addresses the common issue of performance degradation in multi-modal sparsity learning, consequently constructing a sparse model with an outrageous number of parameters but a constant computational cost. Furthermore, we present the MoE-LLaVA, a MoE-based sparse LVLM architecture, which uniquely activates only the top-k experts through routers during deployment, keeping the remaining experts inactive. Extensive experiments show the significant performance of MoE-LLaVA in a variety of visual understanding and object hallucination benchmarks. Remarkably, with only approximately 3B sparsely activated parameters, MoE-LLaVA demonstrates performance comparable to the LLaVA-1.5-7B on various visual understanding datasets and even surpasses the LLaVA-1.5-13B in object hallucination benchmark. Through MoE-LLaVA, we aim to establish a baseline for sparse LVLMs and provide valuable insights for future research in developing more efficient and effective multi-modal learning systems. Code is released at https://github.com/PKU-YuanGroup/MoE-LLaVA.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes MoE-Tuning, a training strategy claimed to prevent performance degradation when applying sparsity to large vision-language models. It introduces the MoE-LLaVA architecture that activates only the top-k experts via routers at inference time. With approximately 3B sparsely activated parameters, the model is reported to match LLaVA-1.5-7B on visual understanding benchmarks and surpass LLaVA-1.5-13B on object hallucination benchmarks, while releasing code for reproducibility.

Significance. If the central claims hold after additional controls, the work would provide a practical baseline for sparse LVLMs that increase total parameter count without raising inference FLOPs. The public code release supports reproducibility and follow-on research in efficient multi-modal scaling.

major comments (2)
  1. [3.2] §3.2: The claim that MoE-Tuning addresses the common issue of performance degradation in multi-modal sparsity learning is load-bearing for the headline result, yet the manuscript provides no controlled ablation of an otherwise identical sparse architecture trained without the proposed MoE-Tuning schedule on the same VQA, GQA, and POPE splits.
  2. [4] §4: Benchmark tables report point estimates for MoE-LLaVA versus LLaVA-1.5-7B/13B without error bars, standard deviations across runs, or explicit confirmation that baseline models were re-implemented with identical data, token counts, and hyperparameters, preventing verification that the ~3B-active-parameter model truly retains dense-model capability.
minor comments (2)
  1. Abstract: The phrase 'outrageous number of parameters' is informal; replace with a precise statement of total versus active parameter counts.
  2. [3] Notation: The router and expert-capacity details in the architecture description would benefit from an explicit equation or pseudocode block to clarify top-k selection and load balancing.
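
For reference, the pseudocode requested in minor comment 2 usually takes the following form in the MoE literature. This is the standard Switch/ST-MoE-style formulation, assumed here as what the paper likely uses rather than quoted from it.

    % Router distribution over N experts for token x (W_r is the learned router):
    \mathcal{P}(\mathbf{x}) = \mathrm{softmax}(W_r \mathbf{x})
    % Top-k gating: experts outside the top-k get zero weight and do not execute.
    \mathcal{G}(\mathbf{x})_i =
      \begin{cases}
        \mathcal{P}(\mathbf{x})_i, & i \in \operatorname{top\text{-}k}\big(\mathcal{P}(\mathbf{x})\big) \\
        0, & \text{otherwise}
      \end{cases}
    % Layer output mixes the selected experts E_i:
    \mathbf{y} = \sum_{i=1}^{N} \mathcal{G}(\mathbf{x})_i \, E_i(\mathbf{x})
    % Auxiliary load-balancing loss: f_i is the fraction of tokens dispatched
    % to expert i, \bar{p}_i the mean router probability assigned to it.
    \mathcal{L}_{\mathrm{aux}} = N \sum_{i=1}^{N} f_i \, \bar{p}_i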

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us improve the clarity and rigor of our work. We address the major comments below and have made revisions to the manuscript accordingly.

Point-by-point responses
  1. Referee: [3.2] §3.2: The claim that MoE-Tuning addresses the common issue of performance degradation in multi-modal sparsity learning is load-bearing for the headline result, yet the manuscript provides no controlled ablation of an otherwise identical sparse architecture trained without the proposed MoE-Tuning schedule on the same VQA, GQA, and POPE splits.

    Authors: We agree that a controlled ablation would provide stronger support for the effectiveness of MoE-Tuning in preventing performance degradation. In the revised manuscript, we will include an ablation study comparing the performance of the sparse architecture trained with and without the MoE-Tuning schedule on the VQA, GQA, and POPE benchmarks. revision: yes

  2. Referee: [4] §4: Benchmark tables report point estimates for MoE-LLaVA versus LLaVA-1.5-7B/13B without error bars, standard deviations across runs, or explicit confirmation that baseline models were re-implemented with identical data, token counts, and hyperparameters, preventing verification that the ~3B-active-parameter model truly retains dense-model capability.

    Authors: The LLaVA-1.5 baseline results are taken from the original publication to ensure consistent evaluation protocols. We have revised Section 4 to explicitly state that our training setup, including data and hyperparameters, follows the LLaVA-1.5 configuration. Due to the substantial computational resources required for multiple training runs, we report single-run results; however, we will add standard deviations from repeated inference evaluations on the test sets in the revised tables. revision: partial
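
For scale, the promised revision amounts to reporting a mean and sample standard deviation over repeated inference evaluations, along these lines. The scores below are made-up placeholders, not the paper's results.

    import statistics

    # Hypothetical accuracy scores from five repeated inference evaluations
    # (placeholder values only; not reported numbers).
    runs = [76.4, 76.1, 76.7, 76.3, 76.5]
    mean = statistics.mean(runs)
    std = statistics.stdev(runs)        # sample standard deviation (n - 1)
    print(f"{mean:.1f} ± {std:.1f}")    # 76.4 ± 0.2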

Circularity Check

0 steps flagged

No circularity: empirical architecture and benchmark results are self-contained

Full rationale

The paper introduces an MoE architecture and MoE-Tuning training procedure for LVLMs, then reports measured performance on standard held-out visual understanding and hallucination benchmarks. No mathematical derivation chain exists that reduces a claimed prediction or first-principles result to its own inputs by construction. No self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims rest on direct experimental outcomes rather than internal re-derivations, satisfying the criteria for a non-circular empirical study.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard transformer and MoE assumptions plus the empirical effectiveness of the two-stage tuning schedule; no new physical constants or invented entities are introduced.

free parameters (2)
  • number of experts
    Chosen architecture hyperparameter that determines total parameter count while active count stays fixed.
  • top-k value
    Routing hyperparameter controlling sparsity level during inference.
axioms (2)
  • standard math — Standard transformer attention and feed-forward blocks remain unchanged except for replacement of FFN with MoE layers.
    Invoked when describing the base LLaVA architecture.
  • domain assumption — Top-k expert selection via learned router produces stable training when combined with MoE-Tuning.
    Central to avoiding the performance drop mentioned in the abstract.
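
Both free parameters typically surface as a pair of config fields; the paper's main configuration uses four experts with top-2 routing. The sketch below is hypothetical (field names assumed, not the repository's actual schema).

    from dataclasses import dataclass

    @dataclass
    class MoEConfig:
        # The ledger's two free parameters.
        num_experts: int = 4   # grows the total parameter count
        top_k: int = 2         # pins active compute per token

    cfg = MoEConfig()
    assert cfg.top_k <= cfg.num_experts  # routing cannot select more experts than exist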

pith-pipeline@v0.9.0 · 5578 in / 1380 out tokens · 52654 ms · 2026-05-16T02:28:13.774116+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    cs.CV 2024-09 accept novelty 8.0

    Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

  2. On the Optimality of Hierarchical Secure Aggregation with Arbitrary Heterogeneous Data Assignment

    cs.IT 2026-04 unverdicted novelty 7.0

    A hierarchical secure aggregation scheme with arbitrary heterogeneous data assignment achieves optimal two-layer communication loads under information-theoretic security against collusions and dropouts.

  3. VisMMoE: Exploiting Visual-Expert Affinity for Efficient Visual-Language MoE Offloading

    cs.LG 2026-05 unverdicted novelty 6.0

    VisMMoE exploits visual-expert affinity via token pruning to achieve up to 2.68x faster VL-MoE inference on memory-constrained hardware while keeping accuracy competitive.

  4. Learngene Search Across Multiple Datasets for Building Variable-Sized Models

    cs.LG 2026-05 unverdicted novelty 6.0

    LSAMD searches a multi-dataset super Ans-Net to extract frequently selected base blocks as learngenes that initialize variable-sized Des-Nets with performance comparable to full pretrain-finetune at lower storage and ...

  5. State Beyond Appearance: Diagnosing and Improving State Consistency in Dial-Based Measurement Reading

    cs.CV 2026-04 unverdicted novelty 6.0

    MLLMs ignore dial state geometry and cluster by appearance, causing inconsistency under variations; TriSCA's state-distance alignment, metadata supervision, and objective alignment improve robustness on clock and gaug...

  6. SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    SMoES improves MoE-VLM performance and efficiency via soft modality-guided expert routing and inter-bin mutual information regularization, yielding 0.9-4.2% task gains and 56% communication reduction.

  7. Adapting 2D Multi-Modal Large Language Model for 3D CT Image Analysis

    cs.CV 2026-04 unverdicted novelty 6.0

    Transferring a 2D MLLM to 3D CT inputs via parameter reuse, a Text-Guided Hierarchical MoE framework, and two-stage training yields better performance than prior 3D medical MLLMs on medical report generation and visua...

  8. MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    MP-ISMoE uses Gaussian noise perturbed iterative quantization and interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter an...

  9. CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

    cs.CV 2026-04 unverdicted novelty 6.0

    CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.

  10. ImgEdit: A Unified Image Editing Dataset and Benchmark

    cs.CV 2025-05 conditional novelty 6.0

    ImgEdit supplies 1.2 million curated edit pairs and a three-part benchmark that let a VLM-based model outperform prior open-source editors on adherence, quality, and detail preservation.

  11. Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    cs.CV 2023-11 unverdicted novelty 6.0

    Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.

  12. An Efficient Token Compression Framework for Visual Object Tracking

    cs.CV 2026-05 unverdicted novelty 5.0

    ETCTrack compresses template tokens by 60% in visual trackers via an adaptive compressor and hierarchical interaction, cutting MACs 21.4% with 0.4% accuracy drop on seven benchmarks.

  13. SemLT3D: Semantic-Guided Expert Distillation for Camera-only Long-Tailed 3D Object Detection

    cs.CV 2026-04 unverdicted novelty 5.0

    SemLT3D introduces semantic-guided expert distillation with a language MoE module and CLIP projection to enrich features for long-tailed classes in camera-only 3D detection.

  14. CoGR-MoE: Concept-Guided Expert Routing with Consistent Selection and Flexible Reasoning for Visual Question Answering

    cs.CV 2026-04 unverdicted novelty 5.0

    CoGR-MoE improves VQA by using concept-guided expert routing with option feature reweighting and contrastive learning to achieve consistent yet flexible reasoning across answer options.

  15. Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation

    cs.CV 2026-04 unverdicted novelty 5.0

    Firebolt-VL introduces an LFM-based decoder and token-grid correlation to achieve linear-time vision-language inference with improved fine-grained grounding.

  16. UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

    cs.CV 2025-06 unverdicted novelty 5.0

    UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.

  17. PaliGemma: A versatile 3B VLM for transfer

    cs.CV 2024-07 unverdicted novelty 4.0

    PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.

  18. How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    cs.CV 2024-04 unverdicted novelty 4.0

    InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

  19. A Survey on Multimodal Large Language Models

    cs.CV 2023-06 accept novelty 3.0

    This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.


    Exhibition Board of MoE-LLaV A.MoE-LLaV A demonstrates the ability to detect and answer challenging questions when prompted to verify them. Visual input example, Tricky Question and Image: User If there are factual errors in the questions, point it out; if not, proceed answering the question. What’s happening in the desert? LLaV A-1.5 There are no deserts...