arxiv: 2311.04257 · v2 · pith:Q3FUYAY4new · submitted 2023-11-07 · 💻 cs.CL · cs.CV

mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou

Pith reviewed 2026-05-18 03:15 UTC · model grok-4.3

classification 💻 cs.CL cs.CV

keywords multi-modal large language models · modality collaboration · modularized network design · state-of-the-art performance · text and multi-modal tasks · shared functional modules

The pith

mPLUG-Owl2 reaches state-of-the-art results on text and multi-modal tasks by using shared modules for modality collaboration within a single model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents mPLUG-Owl2 as a versatile multi-modal large language model that improves results on both text-only and multi-modal tasks by enabling collaboration between modalities. It relies on a modularized network where the language decoder serves as a common interface, shared functional modules promote interaction across modalities, and a modality-adaptive module keeps each modality's distinct characteristics intact. Experiments demonstrate that this single generic model generalizes effectively to both kinds of tasks and delivers top performance levels. The work also claims to be the first to observe the modality collaboration phenomenon appearing in pure-text settings as well as multi-modal ones. This design points toward building future foundation models around cross-modality sharing rather than isolated enhancements.

Core claim

mPLUG-Owl2 leverages modality collaboration to improve performance on both text and multi-modal tasks through a modularized network design in which the language decoder acts as a universal interface. Shared functional modules facilitate modality collaboration, while a modality-adaptive module preserves modality-specific features. Extensive experiments show that mPLUG-Owl2 generalizes to both text tasks and multi-modal tasks, achieves state-of-the-art performance with a single generic model, and is the first MLLM to demonstrate the modality collaboration phenomenon in both pure-text and multi-modal scenarios.

What carries the argument

Modularized network design with shared functional modules for collaboration and a modality-adaptive module for preserving specific features, using the language decoder as universal interface.
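The split between shared and modality-specific computation can be illustrated with a toy forward pass. This is a minimal sketch, not the paper's implementation: the text above does not say which operations are shared and which are adaptive, so the sketch assumes the adaptive part is a per-modality normalization gain and the shared part is one projection matrix. The class name `ModalityAdaptiveBlock`, the gains, and the weights are all hypothetical.

```python
import math

# Hypothetical modality ids; nothing in this block comes from the paper.
TEXT, IMAGE = 0, 1

def layer_norm(x, gain, eps=1e-5):
    """Normalize one token vector, then scale it by a per-modality gain."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [gain * (v - mean) / math.sqrt(var + eps) for v in x]

def linear(x, weight):
    """Multiply a vector by a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

class ModalityAdaptiveBlock:
    def __init__(self, shared_weight, gains):
        self.shared_weight = shared_weight  # one matrix used by every modality
        self.gains = gains                  # one normalization gain per modality

    def forward(self, tokens, modality_ids):
        out = []
        for x, m in zip(tokens, modality_ids):
            h = layer_norm(x, self.gains[m])           # modality-specific path
            out.append(linear(h, self.shared_weight))  # shared path
        return out

block = ModalityAdaptiveBlock(
    shared_weight=[[1.0, 0.0], [0.0, 1.0]],  # identity, to keep the effect visible
    gains={TEXT: 1.0, IMAGE: 0.5},
)
tokens = [[1.0, 3.0], [1.0, 3.0]]            # same vector, tagged with two modalities
outputs = block.forward(tokens, [TEXT, IMAGE])
```

Feeding the same vector under two modality tags makes the point of the design visible: the outputs differ only through the modality-specific gain, while every token passes through the same shared weights, which is where any cross-modality transfer would have to occur.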

If this is right

A single model can match or exceed specialist models on both text and multi-modal benchmarks.
Modality collaboration appears in pure-text scenarios as well as when images are present.
The design avoids the usual trade-offs when combining modalities in one architecture.
Future multi-modal foundation models can be built by extending this shared-module approach.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same modular pattern could be tested with additional modalities such as audio or video to check whether collaboration scales.
Training might become more sample-efficient because modules are reused rather than duplicated per modality.
If the collaboration effect holds at larger scales, it could reduce the need for separate text-only and multi-modal pretraining runs.

Load-bearing premise

The modularized design with shared modules creates real modality collaboration and the adaptive module avoids performance trade-offs between text and multi-modal tasks.

What would settle it

An ablation study that removes or replaces the shared functional modules and shows clear drops in performance on both pure-text and multi-modal benchmarks compared with the full model.
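The criterion above can be made operational as a simple check. The sketch below assumes scores are grouped into "text" and "multimodal" benchmarks and that a drop counts only if it exceeds a noise margin; the function name, the grouping keys, the margin, and all numbers are hypothetical placeholders, not results from the paper.

```python
def supports_collaboration(full, ablated, margin=0.0):
    """True only if the ablated model is worse on every benchmark in both groups."""
    for group in ("text", "multimodal"):
        for name, full_score in full[group].items():
            drop = full_score - ablated[group][name]
            if drop <= margin:  # no clear drop on this benchmark
                return False
    return True

# Placeholder scores for illustration only.
full = {"text": {"bench_a": 50.0}, "multimodal": {"bench_b": 60.0}}
ablated_both = {"text": {"bench_a": 47.0}, "multimodal": {"bench_b": 55.0}}
ablated_mm_only = {"text": {"bench_a": 50.5}, "multimodal": {"bench_b": 55.0}}

both_drop = supports_collaboration(full, ablated_both, margin=1.0)
only_mm_drops = supports_collaboration(full, ablated_mm_only, margin=1.0)
```

A drop confined to multi-modal benchmarks would not settle the question, because the distinctive claim is that sharing also helps in pure-text settings.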

Original abstract

Multi-modal Large Language Models (MLLMs) have demonstrated impressive instruction abilities across various open-ended tasks. However, previous methods primarily focus on enhancing multi-modal capabilities. In this work, we introduce a versatile multi-modal large language model, mPLUG-Owl2, which effectively leverages modality collaboration to improve performance in both text and multi-modal tasks. mPLUG-Owl2 utilizes a modularized network design, with the language decoder acting as a universal interface for managing different modalities. Specifically, mPLUG-Owl2 incorporates shared functional modules to facilitate modality collaboration and introduces a modality-adaptive module that preserves modality-specific features. Extensive experiments reveal that mPLUG-Owl2 is capable of generalizing both text tasks and multi-modal tasks and achieving state-of-the-art performances with a single generic model. Notably, mPLUG-Owl2 is the first MLLM model that demonstrates the modality collaboration phenomenon in both pure-text and multi-modal scenarios, setting a pioneering path in the development of future multi-modal foundation models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

mPLUG-Owl2 splits an MLLM into shared modules for cross-modality collaboration and adaptive modules for modality-specific features, claiming gains on both text and multi-modal tasks from one model, but the causal link between those gains and the sharing itself has not been isolated.

The main point is that mPLUG-Owl2 uses a modular network where shared functional modules let modalities interact and help each other, while a modality-adaptive module keeps features distinct. The language decoder serves as the common interface. They report that this single model improves on pure text tasks as well as multi-modal ones and that it is the first to show the collaboration effect even in text-only settings.

Referee Report

1 major / 2 minor

Summary. The paper introduces mPLUG-Owl2, a versatile multi-modal large language model that employs a modularized network design with the language decoder as a universal interface. It incorporates shared functional modules to promote modality collaboration and a modality-adaptive module to retain modality-specific features. The central claims are that this architecture enables generalization across both pure-text and multi-modal tasks, achieves state-of-the-art performance with a single generic model, and is the first MLLM to demonstrate the modality collaboration phenomenon in both scenarios.

Significance. If the performance gains and collaboration effects are causally linked to the shared modules rather than joint training or capacity increases, the work would offer a meaningful advance toward multi-modal foundation models that improve rather than trade off across modalities. The extension of collaboration benefits to pure-text tasks would be a notable empirical observation if isolated from confounds.

major comments (1)

[§4 and §4.3] §4 (Experiments) and §4.3 (Ablation Studies): the reported improvements on text and multi-modal benchmarks are presented as evidence of modality collaboration, yet no controlled ablation keeps total parameter count, training data, and optimization identical while disabling cross-modality sharing (e.g., by duplicating functional modules per modality). Without this isolation, the results cannot distinguish genuine collaboration from standard multi-task learning effects, which directly undermines the load-bearing claim that shared modules produce the observed collaboration phenomenon.

minor comments (2)

[Abstract] Abstract: the claim of 'extensive experiments' and 'state-of-the-art performances' is stated without naming the specific baselines, metrics, data splits, or significance tests used, reducing the reader's ability to evaluate the strength of the empirical support from the opening paragraph.
[§3] Notation and figures: the description of the modality-adaptive module would benefit from an explicit equation or diagram showing how it interacts with the shared modules during forward passes for both text-only and image-text inputs.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below and will incorporate revisions to strengthen the evidence for modality collaboration.

Referee: [§4 and §4.3] §4 (Experiments) and §4.3 (Ablation Studies): the reported improvements on text and multi-modal benchmarks are presented as evidence of modality collaboration, yet no controlled ablation keeps total parameter count, training data, and optimization identical while disabling cross-modality sharing (e.g., by duplicating functional modules per modality). Without this isolation, the results cannot distinguish genuine collaboration from standard multi-task learning effects, which directly undermines the load-bearing claim that shared modules produce the observed collaboration phenomenon.

Authors: We agree that a strictly controlled ablation maintaining identical total parameter count, training data, and optimization while disabling cross-modality sharing would better isolate the collaboration effect from multi-task learning. Our Section 4.3 ablations already show performance drops when shared functional modules are ablated or replaced with modality-specific alternatives, providing supporting evidence. However, these do not fully match the referee's suggested control. In the revised manuscript, we will add a new experiment comparing the shared model against a variant with duplicated per-modality functional modules whose dimensions are scaled to preserve the exact total parameter count, trained on the same data with identical optimization. This will help distinguish genuine collaboration from confounds and strengthen the central claims.

revision: yes
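The parameter-matched control proposed in the rebuttal involves some arithmetic worth spelling out. The sketch below assumes each functional module is a plain two-layer feed-forward block with biases; a gated variant would have a different count. The sizes and the helper names `ffn_params` and `matched_hidden` are hypothetical. It finds the widest per-modality hidden size whose duplicated total stays within the shared module's budget.

```python
def ffn_params(d_model, hidden):
    """Parameter count of a two-layer feed-forward block with biases."""
    return d_model * hidden + hidden + hidden * d_model + d_model

def matched_hidden(d_model, shared_hidden, n_modalities):
    """Largest per-modality hidden width that fits the shared parameter budget."""
    budget = ffn_params(d_model, shared_hidden)
    per_copy = budget // n_modalities
    # Solve hidden * (2 * d_model + 1) + d_model <= per_copy for hidden.
    return (per_copy - d_model) // (2 * d_model + 1)

# Hypothetical sizes, chosen only to make the arithmetic concrete.
d_model, shared_hidden, n_modalities = 4096, 11008, 2

hidden = matched_hidden(d_model, shared_hidden, n_modalities)
shared_total = ffn_params(d_model, shared_hidden)
duplicated_total = n_modalities * ffn_params(d_model, hidden)
gap = shared_total - duplicated_total  # parameters left unmatched after rounding
```

Because hidden widths are integers, an exact match is generally not attainable by scaling one dimension alone. The residual `gap` is bounded by `n_modalities * (2 * d_model + 1)`, so a report of such a control would do well to state the remaining difference rather than claim exact equality.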

Circularity Check

0 steps flagged

No circularity: empirical architecture and results

The paper introduces mPLUG-Owl2 via a modularized network with shared functional modules and a modality-adaptive module, then reports generalization and SOTA performance from experiments. No mathematical derivation, first-principles prediction, or fitted parameter is presented that reduces to its own inputs by construction. Claims rest on observed empirical outcomes rather than self-definitional equivalences or self-citation chains, rendering the work self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard domain assumptions in multi-modal machine learning about the benefits of modular designs; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

domain assumption: A language decoder can function as a universal interface for managing inputs from different modalities. Invoked in the description of the modularized network design.

pith-pipeline@v0.9.0 · 5739 in / 1197 out tokens · 33259 ms · 2026-05-18T03:15:00.864484+00:00 · methodology

Forward citations

Cited by 16 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
cs.CL 2024-09 accept novelty 8.0

MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
cs.CL 2023-11 unverdicted novelty 8.0

MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.

MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models
cs.CV 2026-04 unverdicted novelty 7.0

MMR-AD is a new benchmark dataset showing that current generalist MLLMs lag industrial needs for anomaly detection, with Anomaly-R1 delivering better results through reasoning and RL.

Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing
cs.CR 2026-05 unverdicted novelty 6.0

DR-Smoothing introduces a disrupt-then-rectify prompt processing scheme into smoothing defenses, delivering tight theoretical bounds on success probability against both token- and prompt-level jailbreaks.

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
cs.CV 2025-08 unverdicted novelty 6.0

InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
cs.CV 2025-04 conditional novelty 6.0

InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
cs.CV 2024-12 unverdicted novelty 6.0

InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
cs.CV 2024-03 unverdicted novelty 6.0

MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.

Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception
cs.CL 2024-01 conditional novelty 6.0

Mobile-Agent is a vision-centric autonomous agent that uses MLLMs to perceive UI elements, plan complex multi-step tasks, and operate mobile apps without relying on XML or system metadata, showing strong results on th...

CogVLM: Visual Expert for Pretrained Language Models
cs.CV 2023-11 conditional novelty 6.0

CogVLM adds a trainable visual expert inside frozen language model layers for deep vision-language fusion and reports state-of-the-art results on ten cross-modal benchmarks while preserving NLP performance.

MMBench: Is Your Multi-modal Model an All-around Player?
cs.CV 2023-07 accept novelty 6.0

MMBench is a new bilingual benchmark that uses curated questions, CircularEval, and LLM-assisted answer conversion to provide objective, fine-grained evaluation of vision-language models.

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
cs.CV 2023-06 unverdicted novelty 6.0

MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.

Mitigating Hallucinations in Large Vision-Language Models without Performance Degradation
cs.CV 2026-04 unverdicted novelty 5.0

MPD reduces hallucinations in LVLMs by 23.4% while retaining 97.4% of general capability through semantic disentanglement and selective parameter updates.

From Attenuation to Attention: Variational Information Flow Manipulation for Fine-Grained Visual Perception
cs.CV 2026-04 unverdicted novelty 5.0

A CVAE-based Variational Information Flow module is proposed to counteract visual attenuation in MLLMs and improve fine-grained perception on VQA and grounding tasks.

Hallucination of Multimodal Large Language Models: A Survey
cs.CV 2024-04 accept novelty 5.0

The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
cs.CV 2024-04 unverdicted novelty 4.0

InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · cited by 16 Pith papers · 37 internal anchors

[1]

http://sharegpt.com, 2023

Sharegpt. http://sharegpt.com, 2023. 2, 5, 17

work page 2023
[2]

Flamingo: a visual language model for few-shot learning.Advances in Neural Information Processing Systems, 35:23716–23736,

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in Neural Information Processing Systems, 35:23716–23736,

work page
[3]

OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

AnasAwadalla,IrenaGao,JoshGardner,JackHessel,Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, SamirGadre, ShioriSagawa, etal. Openflamingo: Anopen- source framework for training large autoregressive vision- language models. arXiv preprint arXiv:2308.01390, 2023. 16

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

Layer Normalization

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hin- ton. Layer normalization.arXiv preprint arXiv:1607.06450,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shĳie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: Afrontierlargevision-languagemodelwith versatile abilities.ArXiv, abs/2308.12966, 2023. 2, 5, 6, 16

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, PranavShyam,GirishSastry,AmandaAskell,SandhiniAgar- wal, Ariel Herbert-Voss, Gretchen Krueger, T. J. Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeff Wu, ClemensWinter,ChristopherHesse,MarkChen,EricSigler, Mateusz Litwin, Scott Gra...

work page internal anchor Pith review Pith/arXiv arXiv 2005
[7]

Coyo- 700m: Image-text pair dataset

Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo- 700m: Image-text pair dataset. https://github.com/ kakaobrain/coyo-dataset, 2022. 5

work page 2022
[8]

End-to- end object detection with transformers

NicolasCarion,FranciscoMassa,GabrielSynnaeve,Nicolas Usunier,AlexanderKirillov,andSergeyZagoruyko. End-to- end object detection with transformers. InEuropean confer- ence on computer vision, pages 213–229. Springer, 2020. 2, 3

work page 2020
[9]

Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568, 2021. 5

work page 2021
[10]

Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

Ke Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic.ArXiv, abs/2306.15195, 2023. 2, 5, 15, 16

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

PaLI: A Jointly-Scaled Multilingual Language-Image Model

Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. Pali: A jointly- scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022. 5

work page internal anchor Pith review Pith/arXiv arXiv 2022
[12]

Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Benton C

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann,ParkerSchuh,KensenShi,SashaTsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam M. Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Benton C. Hutchinson, Reiner Pope, James Br...

work page
[13]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoningchallenge. arXivpreprintarXiv:1803.05457 ,2018. 6

work page internal anchor Pith review Pith/arXiv arXiv 2018
[14]

Opencompass: A univer- sal evaluation platform for foundation models

OpenCompass Contributors. Opencompass: A univer- sal evaluation platform for foundation models. https: //github.com/open-compass/opencompass, 2023. 6

work page 2023
[15]

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Albert Li, Pas- cale Fung, and Steven C. H. Hoi. Instructblip: Towards general-purposevision-languagemodelswithinstructiontun- ing. ArXiv, abs/2305.06500, 2023. 2, 3, 5, 6, 13, 14, 15, 16

work page internal anchor Pith review Pith/arXiv arXiv 2023
[16]

Xia, Mehdi S

Danny Driess, F. Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Ho Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duck- worth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Peter R. Florence. Palm-e: An e...

work page 2023
[17]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

ChaoyouFu,PeixianChen,YunhangShen,YuleiQin,Meng- danZhang,XuLin,ZhenyuQiu,WeiLin,JinruiYang,Xiawu Zheng, et al. Mme: A comprehensive evaluation bench- mark for multimodal large language models.arXiv preprint arXiv:2306.13394, 2023. 5, 6

work page internal anchor Pith review Pith/arXiv arXiv 2023
[18]

Koala: A dialogue model for academic research,

Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Dat- acomp: In search of the next generation of multimodal datasets. arXiv preprint arXiv:2304.14108, 2023. 5

work page arXiv 2023
[19]

LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shĳie Geng, Aojun Zhou, W. Zhang, Pan Lu, Conghui He, Xi- angyuYue,HongshengLi,andYuJiaoQiao. Llama-adapter v2: Parameter-efficient visual instruction model. ArXiv, abs/2304.15010, 2023. 5, 6, 15, 16

work page internal anchor Pith review Pith/arXiv arXiv 2023
[20]

Multimodal-gpt: A vision and lan- guage model for dialogue with humans

Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and lan- guage model for dialogue with humans. arXiv preprint arXiv:2305.04790, 2023. 15

work page arXiv 2023
[21]

Making the V in VQA matter: Ele- vating the role of image understanding in Visual Question Answering

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Ba- tra, and Devi Parikh. Making the V in VQA matter: Ele- vating the role of image understanding in Visual Question Answering. InConference on Computer Vision and Pattern Recognition (CVPR), 2017. 5, 13, 17

work page 2017
[22]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, MantasMazeika,DawnSong,andJacobSteinhardt. Measur- ingmassivemultitasklanguageunderstanding. arXivpreprint arXiv:2009.03300, 2020. 6

work page internal anchor Pith review Pith/arXiv arXiv 2009
[23]

Language Is Not All You Need: Aligning Perception with Language Models

ShaohanHuang,LiDong,WenhuiWang,YaruHao,Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, Kriti Aggarwal, Zewen Chi, Johan Bjorck, Vishrav Chaudhary, Subhojit Som, Xia Song, and FuruWei. Languageisnotallyouneed: Aligningperception with language models.ArXiv, abs/2302.14045, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

Gqa: A new dataset for real-world visual reasoning and compositional question answering

Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 6700–6709, 2019. 5, 13, 17

work page 2019
[25]

Tgif-qa: Toward spatio-temporal reasoning in visual question answering

Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. Tgif-qa: Toward spatio-temporal reasoning in visual question answering. InProceedings of the IEEE conferenceoncomputervisionandpatternrecognition ,pages 2758–2766, 2017. 6

work page 2017
[26]

Visual genome: Connecting language and vision using crowdsourced dense imageannotations

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalan- tidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense imageannotations. Internationaljournalofcomputervision , 123:32–73, 2017. 5, 15, 17

work page 2017
[27]

Masked vision andlanguagemodelingformulti-modalrepresentationlearn- ing

Gukyeong Kwon, Zhaowei Cai, Avinash Ravichandran, Er- han Bas, Rahul Bhotika, and Stefano Soatto. Masked vision andlanguagemodelingformulti-modalrepresentationlearn- ing. arXiv preprint arXiv:2208.02131, 2022. 2

work page arXiv 2022
[28]

Rush, Douwe Kiela, Matthieu Cord, and Victor Sanh

Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bek- man, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, and Victor Sanh. Obelics: An open web- scale filtered dataset of interleaved image-text documents,

work page
[29]

SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yix- iao Ge, and Ying Shan. Seed-bench: Benchmarking multi- modal llms with generative comprehension.arXiv preprint arXiv:2307.16125, 2023. 5, 6

work page internal anchor Pith review Pith/arXiv arXiv 2023
[30]

Otter: A Multi-Modal Model with In-Context Instruction Tuning

Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning.ArXiv, abs/2305.03726,

work page internal anchor Pith review Pith/arXiv arXiv
[31]

Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi.Blip-2: Bootstrappinglanguage-imagepre-trainingwith frozen image encoders and large language models.ArXiv, abs/2301.12597, 2023. 2, 5, 6, 16

work page internal anchor Pith review Pith/arXiv arXiv 2023
[32]

VideoChat: Chat-Centric Video Understanding

KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centricvideounderstanding. arXivpreprint arXiv:2305.06355, 2023. 6

work page internal anchor Pith review Pith/arXiv arXiv 2023
[33]

Evaluating Object Hallucination in Large Vision-Language Models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji rong Wen. Evaluating object hallucination in largevision-languagemodels. ArXiv,abs/2305.10355,2023. 13 10

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

Slimorca: An open dataset of gpt-4 augmented flan reasoning traces, with verification, 2023

Wing Lian, Guan Wang, Bleys Goodson, Eugene Pentland, Austin Cook, Chanvichet Vong, and "Teknium". Slimorca: An open dataset of gpt-4 augmented flan reasoning traces, with verification, 2023. 5, 17

work page 2023
[35]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014. 5, 13, 17

work page 2014
[36]

Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lĳuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[37]

Improved Baselines with Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning.ArXiv, abs/2310.03744, 2023. 3, 5, 14, 16

work page internal anchor Pith review Pith/arXiv arXiv 2023
[38]

Visual Instruction Tuning

HaotianLiu,ChunyuanLi,QingyangWu,andYongJaeLee. Visual instruction tuning.ArXiv, abs/2304.08485, 2023. 2, 3, 4, 5, 13, 14, 15, 16, 17

work page internal anchor Pith review Pith/arXiv arXiv 2023
[39]

MMBench: Is Your Multi-modal Model an All-around Player?

YuanLiu,HaodongDuan,YuanhanZhang,BoLi,Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023. 5, 6, 13, 15

work page internal anchor Pith review Pith/arXiv arXiv 2023
[40]

Fixingweightdecayregu- larization in adam

IlyaLoshchilovandFrankHutter. Fixingweightdecayregu- larization in adam. 2018. 5

work page 2018
[41]

Unified-io: A unified model for vision, language, and multi-modal tasks.ArXiv, abs/2206.08916, 2022

Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mot- taghi, and Aniruddha Kembhavi. Unified-io: A unified model for vision, language, and multi-modal tasks.ArXiv, abs/2206.08916, 2022. 5

work page arXiv 2022
[42]

Muhammad Maaz, Hanoona Rasheed, Salman Khan, and FahadShahbazKhan.Video-chatgpt: Towardsdetailedvideo understanding via large vision and language models.ArXiv, abs/2306.05424, 2023. 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 2023
[43]

Ok-vqa: A visual question answering benchmark requiring external knowledge

Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition, pages 3195–3204, 2019. 5, 13, 17

work page 2019
[44]

Ocr-vqa: Visual question answering by reading text in images

Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In2019 international conference on document analysis and recognition (ICDAR), pages 947–

work page
[45]

5, 13, 17

IEEE, 2019. 5, 13, 17

work page 2019
[46]

Gpt-4v(ision) system card

OpenAI. Gpt-4v(ision) system card. 2023. 1

work page 2023
[47]

GPT-4 Technical Report

OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv
[48]

Kosmos-2: Grounding Multimodal Large Language Models to the World

Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Ground- ing multimodal large language models to the world.ArXiv, abs/2306.14824, 2023. 13, 16

work page internal anchor Pith review Pith/arXiv arXiv 2023
[49]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, AmandaAskell,PamelaMishkin,JackClark,etal. Learning transferable visual models from natural language supervi- sion. InInternationalconferenceonmachinelearning ,pages 8748–8763. PMLR, 2021. 3, 5, 17

work page 2021
[50]

Laion-5b: An open large-scale dataset for training next generation image-text models

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022. 5

[51]

A-okvqa: A benchmark for visual question answering using world knowledge

Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. In European Conference on Computer Vision, pages 146–. Springer, 2022. 5, 13, 17

[53]

GLU Variants Improve Transformer

Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020. 3

[54]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019. 15

[55]

Textcaps: a dataset for image captioning with reading comprehension

Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image captioning with reading comprehension. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 742–758. Springer,

[56]

Aligning Large Multimodal Models with Factually Augmented RLHF

Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liangyan Gui, Yu-Xiong Wang, Yiming Yang, Kurt Keutzer, and Trevor Darrell. Aligning large multimodal models with factually augmented rlhf. ArXiv, abs/2309.14525, 2023. 13, 15, 16

[57]

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022. 6

[58]

Stanford alpaca: An instruction-following llama model

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023. 2

[59]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. ArXiv, abs/2302.13971, 2023. 1, 3

[60]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Daniel M. Bikel, Lukas Blecher, Cristian Cantón Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony S. Hartshorn, ...

[61]

GIT: A Generative Image-to-text Transformer for Vision and Language

Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. Git: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100, 2022. 5

[62]

Q-bench: A benchmark for general-purpose foundation models on low-level vision

Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Annan Wang, Chunyi Li, Wenxiu Sun, Qiong Yan, Guangtao Zhai, and Weisi Lin. Q-bench: A benchmark for general-purpose foundation models on low-level vision. arXiv preprint arXiv:2309.14181, 2023. 5, 6, 13, 16

[63]

WizardLM: Empowering large pre-trained language models to follow complex instructions

Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. Wizardlm: Empowering large language models to follow complex instructions. ArXiv, abs/2304.12244, 2023. 2, 6

[64]

Video question answering via gradually refined attention over appearance and motion

Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. Video question answering via gradually refined attention over appearance and motion. In Proceedings of the 25th ACM international conference on Multimedia, pages 1645–1653, 2017. 6

[65]

mplug-2: A modularized multi-modal foundation model across text, image and video

Haiyang Xu, Qinghao Ye, Ming Yan, Yaya Shi, Jiabo Ye, Yuanhong Xu, Chenliang Li, Bin Bi, Qi Qian, Wei Wang, et al. mplug-2: A modularized multi-modal foundation model across text, image and video. In International Conference on Machine Learning, 2023. 2, 4

[66]

Zero-shot video question answering via frozen bidirectional language models

Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, and Cordelia Schmid. Zero-shot video question answering via frozen bidirectional language models. In NeurIPS, 2022. 6

[67]

mplug-docowl: Modularized multimodal large language model for document understanding

Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Yuhao Dan, Chenlin Zhao, Guohai Xu, Chenliang Li, Junfeng Tian, Qian Qi, Ji Zhang, and Fei Huang. mplug-docowl: Modularized multimodal large language model for document understanding. CoRR, abs/2307.02499, 2023. 2

[68]

Ureader: Universal ocr-free visually-situated language understanding with multimodal large language model

Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Guohai Xu, Chenliang Li, Junfeng Tian, Qi Qian, Ji Zhang, et al. Ureader: Universal ocr-free visually-situated language understanding with multimodal large language model. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023. 2

[69]

Hitea: Hierarchical temporal-aware video-language pre-training

Qinghao Ye, Guohai Xu, Ming Yan, Haiyang Xu, Qi Qian, Ji Zhang, and Fei Huang. Hitea: Hierarchical temporal-aware video-language pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15405–15416, 2023. 6

[70]

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023. 2, 3, 4, 5, 15, 16

[71]

Modeling context in referring expressions

Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expressions. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pages 69–85. Springer, 2016. 5, 15, 17

[72]

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023. 5, 6, 16

[73]

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. ArXiv, abs/2306.02858, 2023. 6

[74]

Svit: Scaling up visual instruction tuning

Bo Zhao, Boya Wu, and Tiejun Huang. Svit: Scaling up visual instruction tuning. arXiv preprint arXiv:2307.04087,

[75]

Judging llm-as-a-judge with mt-bench and chatbot arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023. 6

[76]

AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models

Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. Agieval: A human-centric benchmark for evaluating foundation models. arXiv preprint arXiv:2304.06364, 2023. 6

[77]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. ArXiv, abs/2304.10592, 2023. 2, 3, 5, 14, 15, 16

A. Additional Experimental Results

In this section, we provide more experimental results for the completeness of our proposed method.

A.1. Hallucination Eva...