mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
Pith reviewed 2026-05-18 03:15 UTC · model grok-4.3
The pith
mPLUG-Owl2 reports state-of-the-art results on text and multi-modal tasks by sharing modules across modalities within a single model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
mPLUG-Owl2 uses modality collaboration to improve performance on both text and multi-modal tasks. It does so through a modularized network design in which the language decoder acts as a universal interface. Shared functional modules facilitate modality collaboration, while a modality-adaptive module preserves modality-specific features. According to the paper's experiments, mPLUG-Owl2 generalizes across both text and multi-modal tasks, achieves state-of-the-art performance with a single generic model, and is the first MLLM to demonstrate the modality collaboration phenomenon in both pure-text and multi-modal scenarios.
What carries the argument
A modularized network design: shared functional modules drive collaboration across modalities, a modality-adaptive module preserves modality-specific features, and the language decoder serves as the universal interface.
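To make that division of labor concrete, here is a minimal Python sketch. It is not the paper's implementation. It assumes the modality-specific part is a per-modality normalization and the shared part is a single feed-forward map; the class name `ModalityAdaptiveLayer`, the helper names, and all parameter values are hypothetical and chosen only for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def layer_norm(x: Sequence[float], gain: float, bias: float, eps: float = 1e-5) -> Vector:
    """Normalize one token vector to zero mean and unit variance, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [gain * (v - mean) / math.sqrt(var + eps) + bias for v in x]


def shared_ffn(x: Sequence[float], weight: List[List[float]]) -> Vector:
    """Shared functional module: one linear map followed by ReLU, used for every modality."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weight]


class ModalityAdaptiveLayer:
    """Illustrative layer: modality-specific normalization, shared feed-forward weights."""

    def __init__(self, dim: int) -> None:
        # Assumption: each modality owns its own (gain, bias) pair.
        self.norm_params: Dict[str, Tuple[float, float]] = {
            "text": (1.0, 0.0),
            "image": (0.5, 0.1),
        }
        # Assumption: a single weight matrix (identity here) is reused by all modalities.
        self.shared_weight = [
            [1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)
        ]

    def forward(self, tokens: List[Vector], modalities: List[str]) -> List[Vector]:
        """Route each token through its modality's norm, then through the shared FFN."""
        outputs = []
        for token, modality in zip(tokens, modalities):
            gain, bias = self.norm_params[modality]
            outputs.append(shared_ffn(layer_norm(token, gain, bias), self.shared_weight))
        return outputs
```

Feeding the same vector once tagged as `"text"` and once as `"image"` yields different outputs, because only the normalization parameters differ while the feed-forward weights are reused. A pure-text input simply never touches the `"image"` parameters.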
If this is right
- A single model can match or exceed specialist models on both text and multi-modal benchmarks.
- Modality collaboration appears in pure-text scenarios as well as when images are present.
- The design avoids the usual trade-offs when combining modalities in one architecture.
- Future multi-modal foundation models can be built by extending this shared-module approach.
Where Pith is reading between the lines
- The same modular pattern could be tested with additional modalities such as audio or video to check whether collaboration scales.
- Training might become more sample-efficient because modules are reused rather than duplicated per modality.
- If the collaboration effect holds at larger scales, it could reduce the need for separate text-only and multi-modal pretraining runs.
Load-bearing premise
The modularized design with shared modules creates real modality collaboration and the adaptive module avoids performance trade-offs between text and multi-modal tasks.
What would settle it
An ablation study that removes or replaces the shared functional modules and shows clear drops in performance on both pure-text and multi-modal benchmarks compared with the full model.
Original abstract
Multi-modal Large Language Models (MLLMs) have demonstrated impressive instruction abilities across various open-ended tasks. However, previous methods primarily focus on enhancing multi-modal capabilities. In this work, we introduce a versatile multi-modal large language model, mPLUG-Owl2, which effectively leverages modality collaboration to improve performance in both text and multi-modal tasks. mPLUG-Owl2 utilizes a modularized network design, with the language decoder acting as a universal interface for managing different modalities. Specifically, mPLUG-Owl2 incorporates shared functional modules to facilitate modality collaboration and introduces a modality-adaptive module that preserves modality-specific features. Extensive experiments reveal that mPLUG-Owl2 is capable of generalizing both text tasks and multi-modal tasks and achieving state-of-the-art performances with a single generic model. Notably, mPLUG-Owl2 is the first MLLM model that demonstrates the modality collaboration phenomenon in both pure-text and multi-modal scenarios, setting a pioneering path in the development of future multi-modal foundation models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces mPLUG-Owl2, a versatile multi-modal large language model that employs a modularized network design with the language decoder as a universal interface. It incorporates shared functional modules to promote modality collaboration and a modality-adaptive module to retain modality-specific features. The central claims are that this architecture enables generalization across both pure-text and multi-modal tasks, achieves state-of-the-art performance with a single generic model, and is the first MLLM to demonstrate the modality collaboration phenomenon in both scenarios.
Significance. If the performance gains and collaboration effects are causally linked to the shared modules rather than joint training or capacity increases, the work would offer a meaningful advance toward multi-modal foundation models that improve rather than trade off across modalities. The extension of collaboration benefits to pure-text tasks would be a notable empirical observation if isolated from confounds.
major comments (1)
- [§4 and §4.3] §4 (Experiments) and §4.3 (Ablation Studies): the reported improvements on text and multi-modal benchmarks are presented as evidence of modality collaboration, yet no controlled ablation keeps total parameter count, training data, and optimization identical while disabling cross-modality sharing (e.g., by duplicating functional modules per modality). Without this isolation, the results cannot distinguish genuine collaboration from standard multi-task learning effects, which directly undermines the load-bearing claim that shared modules produce the observed collaboration phenomenon.
minor comments (2)
- [Abstract] Abstract: the claim of 'extensive experiments' and 'state-of-the-art performances' is stated without naming the specific baselines, metrics, data splits, or significance tests used, reducing the reader's ability to evaluate the strength of the empirical support from the opening paragraph.
- [§3] Notation and figures: the description of the modality-adaptive module would benefit from an explicit equation or diagram showing how it interacts with the shared modules during forward passes for both text-only and image-text inputs.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comment below and will incorporate revisions to strengthen the evidence for modality collaboration.
Point-by-point responses
-
Referee: [§4 and §4.3] §4 (Experiments) and §4.3 (Ablation Studies): the reported improvements on text and multi-modal benchmarks are presented as evidence of modality collaboration, yet no controlled ablation keeps total parameter count, training data, and optimization identical while disabling cross-modality sharing (e.g., by duplicating functional modules per modality). Without this isolation, the results cannot distinguish genuine collaboration from standard multi-task learning effects, which directly undermines the load-bearing claim that shared modules produce the observed collaboration phenomenon.
Authors: We agree that a strictly controlled ablation maintaining identical total parameter count, training data, and optimization while disabling cross-modality sharing would better isolate the collaboration effect from multi-task learning. Our Section 4.3 ablations already show performance drops when shared functional modules are ablated or replaced with modality-specific alternatives, providing supporting evidence. However, these do not fully match the referee's suggested control. In the revised manuscript, we will add a new experiment comparing the shared model against a variant with duplicated per-modality functional modules whose dimensions are scaled to preserve the exact total parameter count, trained on the same data with identical optimization. This will help distinguish genuine collaboration from confounds and strengthen the central claims.
Revision: yes
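The parameter-matched control proposed in this response comes down to simple arithmetic, sketched here in Python. The sketch assumes each functional module is a two-layer feed-forward block with biases, which the source does not specify; the function names are hypothetical. Because widths are integers, an exact match is not always possible, so the sketch returns the largest per-modality width that stays within the shared model's budget.

```python
def ffn_params(d_model: int, d_hidden: int) -> int:
    """Parameter count of a two-layer feed-forward block with biases (assumed module shape)."""
    return d_model * d_hidden + d_hidden + d_hidden * d_model + d_model


def matched_hidden_width(d_model: int, d_hidden_shared: int, n_modalities: int) -> int:
    """Largest hidden width for each duplicated block that does not exceed the shared budget."""
    budget = ffn_params(d_model, d_hidden_shared)
    # One duplicated block costs d_hidden * (2 * d_model + 1) + d_model parameters.
    cost_per_hidden_unit = 2 * d_model + 1
    width = (budget - n_modalities * d_model) // (n_modalities * cost_per_hidden_unit)
    if width < 1:
        raise ValueError("Budget too small to duplicate the block per modality.")
    return width


def parameter_gap(d_model: int, d_hidden_shared: int, n_modalities: int) -> int:
    """Parameters left unused by the duplicated variant relative to the shared model."""
    width = matched_hidden_width(d_model, d_hidden_shared, n_modalities)
    duplicated_total = n_modalities * ffn_params(d_model, width)
    return ffn_params(d_model, d_hidden_shared) - duplicated_total
```

For example, `matched_hidden_width(4, 16, 2)` gives the width of each of two per-modality blocks under these toy dimensions, and `parameter_gap(4, 16, 2)` reports how far the duplicated variant falls short of the shared budget. Any residual gap would need to be reported alongside the ablation results.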
Circularity Check
No circularity: empirical architecture and results
Full rationale
The paper introduces mPLUG-Owl2 via a modularized network with shared functional modules and a modality-adaptive module, then reports generalization and SOTA performance from experiments. No mathematical derivation, first-principles prediction, or fitted parameter is presented that reduces to its own inputs by construction. Claims rest on observed empirical outcomes rather than self-definitional equivalences or self-citation chains, rendering the work self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: A language decoder can function as a universal interface for managing inputs from different modalities.
Forward citations
Cited by 16 Pith papers
-
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
-
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
-
MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models
MMR-AD is a new benchmark dataset showing that current generalist MLLMs lag industrial needs for anomaly detection, with Anomaly-R1 delivering better results through reasoning and RL.
-
Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing
DR-Smoothing introduces a disrupt-then-rectify prompt processing scheme into smoothing defenses, delivering tight theoretical bounds on success probability against both token- and prompt-level jailbreaks.
-
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
-
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
-
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
-
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.
-
Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception
Mobile-Agent is a vision-centric autonomous agent that uses MLLMs to perceive UI elements, plan complex multi-step tasks, and operate mobile apps without relying on XML or system metadata, showing strong results on th...
-
CogVLM: Visual Expert for Pretrained Language Models
CogVLM adds a trainable visual expert inside frozen language model layers for deep vision-language fusion and reports state-of-the-art results on ten cross-modal benchmarks while preserving NLP performance.
-
MMBench: Is Your Multi-modal Model an All-around Player?
MMBench is a new bilingual benchmark that uses curated questions, CircularEval, and LLM-assisted answer conversion to provide objective, fine-grained evaluation of vision-language models.
-
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.
-
Mitigating Hallucinations in Large Vision-Language Models without Performance Degradation
MPD reduces hallucinations in LVLMs by 23.4% while retaining 97.4% of general capability through semantic disentanglement and selective parameter updates.
-
From Attenuation to Attention: Variational Information Flow Manipulation for Fine-Grained Visual Perception
A CVAE-based Variational Information Flow module is proposed to counteract visual attenuation in MLLMs and improve fine-grained perception on VQA and grounding tasks.
-
Hallucination of Multimodal Large Language Models: A Survey
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
-
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
Reference graph
Works this paper leans on
- [1]
-
[2]
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in Neural Information Processing Systems, 35:23716–23736,
-
[3]
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
AnasAwadalla,IrenaGao,JoshGardner,JackHessel,Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, SamirGadre, ShioriSagawa, etal. Openflamingo: Anopen- source framework for training large autoregressive vision- language models. arXiv preprint arXiv:2308.01390, 2023. 16
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[4]
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hin- ton. Layer normalization.arXiv preprint arXiv:1607.06450,
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: Afrontierlargevision-languagemodelwith versatile abilities.ArXiv, abs/2308.12966, 2023. 2, 5, 6, 16
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[6]
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, PranavShyam,GirishSastry,AmandaAskell,SandhiniAgar- wal, Ariel Herbert-Voss, Gretchen Krueger, T. J. Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeff Wu, ClemensWinter,ChristopherHesse,MarkChen,EricSigler, Mateusz Litwin, Scott Gra...
work page internal anchor Pith review Pith/arXiv arXiv 2005
-
[7]
Coyo- 700m: Image-text pair dataset
Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo- 700m: Image-text pair dataset. https://github.com/ kakaobrain/coyo-dataset, 2022. 5
work page 2022
-
[8]
End-to- end object detection with transformers
NicolasCarion,FranciscoMassa,GabrielSynnaeve,Nicolas Usunier,AlexanderKirillov,andSergeyZagoruyko. End-to- end object detection with transformers. InEuropean confer- ence on computer vision, pages 213–229. Springer, 2020. 2, 3
work page 2020
-
[9]
Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts
Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568, 2021. 5
work page 2021
-
[10]
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
Ke Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic.ArXiv, abs/2306.15195, 2023. 2, 5, 15, 16
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[11]
PaLI: A Jointly-Scaled Multilingual Language-Image Model
Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. Pali: A jointly- scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022. 5
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[12]
Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Benton C
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann,ParkerSchuh,KensenShi,SashaTsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam M. Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Benton C. Hutchinson, Reiner Pope, James Br...
-
[13]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoningchallenge. arXivpreprintarXiv:1803.05457 ,2018. 6
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[14]
Opencompass: A univer- sal evaluation platform for foundation models
OpenCompass Contributors. Opencompass: A univer- sal evaluation platform for foundation models. https: //github.com/open-compass/opencompass, 2023. 6
work page 2023
-
[15]
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Albert Li, Pas- cale Fung, and Steven C. H. Hoi. Instructblip: Towards general-purposevision-languagemodelswithinstructiontun- ing. ArXiv, abs/2305.06500, 2023. 2, 3, 5, 6, 13, 14, 15, 16
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[16]
Danny Driess, F. Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Ho Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duck- worth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Peter R. Florence. Palm-e: An e...
work page 2023
-
[17]
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
ChaoyouFu,PeixianChen,YunhangShen,YuleiQin,Meng- danZhang,XuLin,ZhenyuQiu,WeiLin,JinruiYang,Xiawu Zheng, et al. Mme: A comprehensive evaluation bench- mark for multimodal large language models.arXiv preprint arXiv:2306.13394, 2023. 5, 6
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[18]
Koala: A dialogue model for academic research,
Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Dat- acomp: In search of the next generation of multimodal datasets. arXiv preprint arXiv:2304.14108, 2023. 5
-
[19]
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, W. Zhang, Pan Lu, Conghui He, Xi- angyuYue,HongshengLi,andYuJiaoQiao. Llama-adapter v2: Parameter-efficient visual instruction model. ArXiv, abs/2304.15010, 2023. 5, 6, 15, 16
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[20]
Multimodal-gpt: A vision and lan- guage model for dialogue with humans
Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and lan- guage model for dialogue with humans. arXiv preprint arXiv:2305.04790, 2023. 15
-
[21]
Making the V in VQA matter: Ele- vating the role of image understanding in Visual Question Answering
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Ba- tra, and Devi Parikh. Making the V in VQA matter: Ele- vating the role of image understanding in Visual Question Answering. InConference on Computer Vision and Pattern Recognition (CVPR), 2017. 5, 13, 17
work page 2017
-
[22]
Measuring Massive Multitask Language Understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, MantasMazeika,DawnSong,andJacobSteinhardt. Measur- ingmassivemultitasklanguageunderstanding. arXivpreprint arXiv:2009.03300, 2020. 6
work page internal anchor Pith review Pith/arXiv arXiv 2009
-
[23]
Language Is Not All You Need: Aligning Perception with Language Models
ShaohanHuang,LiDong,WenhuiWang,YaruHao,Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, Kriti Aggarwal, Zewen Chi, Johan Bjorck, Vishrav Chaudhary, Subhojit Som, Xia Song, and FuruWei. Languageisnotallyouneed: Aligningperception with language models.ArXiv, abs/2302.14045, 2023. 2
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[24]
Gqa: A new dataset for real-world visual reasoning and compositional question answering
Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 6700–6709, 2019. 5, 13, 17
work page 2019
-
[25]
Tgif-qa: Toward spatio-temporal reasoning in visual question answering
Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. Tgif-qa: Toward spatio-temporal reasoning in visual question answering. InProceedings of the IEEE conferenceoncomputervisionandpatternrecognition ,pages 2758–2766, 2017. 6
work page 2017
-
[26]
Visual genome: Connecting language and vision using crowdsourced dense imageannotations
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalan- tidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense imageannotations. Internationaljournalofcomputervision , 123:32–73, 2017. 5, 15, 17
work page 2017
-
[27]
Masked vision andlanguagemodelingformulti-modalrepresentationlearn- ing
Gukyeong Kwon, Zhaowei Cai, Avinash Ravichandran, Er- han Bas, Rahul Bhotika, and Stefano Soatto. Masked vision andlanguagemodelingformulti-modalrepresentationlearn- ing. arXiv preprint arXiv:2208.02131, 2022. 2
-
[28]
Rush, Douwe Kiela, Matthieu Cord, and Victor Sanh
Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bek- man, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, and Victor Sanh. Obelics: An open web- scale filtered dataset of interleaved image-text documents,
-
[29]
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yix- iao Ge, and Ying Shan. Seed-bench: Benchmarking multi- modal llms with generative comprehension.arXiv preprint arXiv:2307.16125, 2023. 5, 6
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[30]
Otter: A Multi-Modal Model with In-Context Instruction Tuning
Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning.ArXiv, abs/2305.03726,
work page internal anchor Pith review Pith/arXiv arXiv
-
[31]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi.Blip-2: Bootstrappinglanguage-imagepre-trainingwith frozen image encoders and large language models.ArXiv, abs/2301.12597, 2023. 2, 5, 6, 16
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[32]
VideoChat: Chat-Centric Video Understanding
KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centricvideounderstanding. arXivpreprint arXiv:2305.06355, 2023. 6
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[33]
Evaluating Object Hallucination in Large Vision-Language Models
Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji rong Wen. Evaluating object hallucination in largevision-languagemodels. ArXiv,abs/2305.10355,2023. 13 10
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[34]
Slimorca: An open dataset of gpt-4 augmented flan reasoning traces, with verification, 2023
Wing Lian, Guan Wang, Bleys Goodson, Eugene Pentland, Austin Cook, Chanvichet Vong, and "Teknium". Slimorca: An open dataset of gpt-4 augmented flan reasoning traces, with verification, 2023. 5, 17
work page 2023
-
[35]
Microsoft coco: Common objects in context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014. 5, 13, 17
work page 2014
-
[36]
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023. 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[37]
Improved Baselines with Visual Instruction Tuning
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning.ArXiv, abs/2310.03744, 2023. 3, 5, 14, 16
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[38]
HaotianLiu,ChunyuanLi,QingyangWu,andYongJaeLee. Visual instruction tuning.ArXiv, abs/2304.08485, 2023. 2, 3, 4, 5, 13, 14, 15, 16, 17
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[39]
MMBench: Is Your Multi-modal Model an All-around Player?
YuanLiu,HaodongDuan,YuanhanZhang,BoLi,Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023. 5, 6, 13, 15
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[40]
Fixingweightdecayregu- larization in adam
IlyaLoshchilovandFrankHutter. Fixingweightdecayregu- larization in adam. 2018. 5
work page 2018
-
[41]
Unified-io: A unified model for vision, language, and multi-modal tasks.ArXiv, abs/2206.08916, 2022
Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mot- taghi, and Aniruddha Kembhavi. Unified-io: A unified model for vision, language, and multi-modal tasks.ArXiv, abs/2206.08916, 2022. 5
-
[42]
Muhammad Maaz, Hanoona Rasheed, Salman Khan, and FahadShahbazKhan.Video-chatgpt: Towardsdetailedvideo understanding via large vision and language models.ArXiv, abs/2306.05424, 2023. 6, 7
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[43]
Ok-vqa: A visual question answering benchmark requiring external knowledge
Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition, pages 3195–3204, 2019. 5, 13, 17
work page 2019
-
[44]
Ocr-vqa: Visual question answering by reading text in images
Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In2019 international conference on document analysis and recognition (ICDAR), pages 947–
- [45]
- [46]
-
[47]
OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774,
work page internal anchor Pith review Pith/arXiv arXiv
-
[48]
Kosmos-2: Grounding Multimodal Large Language Models to the World
Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Ground- ing multimodal large language models to the world.ArXiv, abs/2306.14824, 2023. 13, 16
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[49]
Learning transferable visual models from natural language supervi- sion
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, AmandaAskell,PamelaMishkin,JackClark,etal. Learning transferable visual models from natural language supervi- sion. InInternationalconferenceonmachinelearning ,pages 8748–8763. PMLR, 2021. 3, 5, 17
work page 2021
-
[50]
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Worts- man,etal. Laion-5b: Anopenlarge-scaledatasetfortraining next generation image-text models.Advances in Neural In- formation Processing Systems, 35:25278–25294, 2022. 5
work page 2022
-
[51]
A-okvqa: Abench- mark for visual question answering using world knowledge
Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, KennethMarino,andRoozbehMottaghi. A-okvqa: Abench- mark for visual question answering using world knowledge. In European Conference on Computer Vision, pages 146–
- [52]
-
[53]
GLU Variants Improve Transformer
Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020. 3
work page internal anchor Pith review Pith/arXiv arXiv 2002
-
[54]
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron- lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019. 15
work page internal anchor Pith review Pith/arXiv arXiv 1909
-
[55]
Textcaps: a dataset for image caption- ingwithreadingcomprehension
Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image caption- ingwithreadingcomprehension. In ComputerVision–ECCV 2020: 16thEuropeanConference,Glasgow,UK,August23– 28, 2020, Proceedings, Part II 16, pages 742–758. Springer,
work page 2020
-
[56]
Aligning Large Multimodal Models with Factually Augmented RLHF
ZhiqingSun,ShengShen,ShengcaoCao,HaotianLiu,Chun- yuan Li, Yikang Shen, Chuang Gan, Liangyan Gui, Yu- XiongWang,YimingYang,KurtKeutzer,andTrevorDarrell. Aligning large multimodal models with factually augmented rlhf. ArXiv, abs/2309.14525, 2023. 13, 15, 16
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[57]
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowd- hery, Quoc V Le, Ed H Chi, Denny Zhou, et al. Challenging big-benchtasksandwhetherchain-of-thoughtcansolvethem. arXiv preprint arXiv:2210.09261, 2022. 6
work page internal anchor Pith review Pith/arXiv arXiv 2022
- [58]
-
[59]
LLaMA: Open and Efficient Foundation Language Models
HugoTouvron, ThibautLavril, Gautier Izacard, XavierMar- tinet,Marie-AnneLachaux,TimothéeLacroix,BaptisteRoz- ière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. ArXiv, abs/2302.13971, 2023. 1, 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[60]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Al- bert,AmjadAlmahairi,YasmineBabaei,NikolayBashlykov, SoumyaBatra,PrajjwalBhargava,ShrutiBhosale,DanielM. Bikel, Lukas Blecher, Cristian Cantón Ferrer, Moya Chen, GuillemCucurull,DavidEsiobu,JudeFernandes,JeremyFu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony S. Hartshorn, ...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[61]
GIT: A Generative Image-to-text Transformer for Vision and Language
Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, KevinLin,ZheGan,ZichengLiu,CeLiu,andLijuanWang. Git: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100, 2022. 5
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[62]
Q-bench: A benchmark for general-purpose foundation models on low-level vision
Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Annan Wang, Chunyi Li, Wenxiu Sun, Qiong Yan, Guangtao Zhai, and Weisi Lin. Q-bench: A benchmark for general-purpose foundation models on low-level vision. arXiv preprint arXiv:2309.14181, 2023. 5, 6, 13, 16
-
[63]
WizardLM: Empowering large pre-trained language models to follow complex instructions
Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. Wizardlm: Empowering large language models to follow complex in- structions. ArXiv, abs/2304.12244, 2023. 2, 6
-
[64]
Video question answering via gradually refined attention over appearance and motion
Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. Video question answering via gradually refined attention over appearance and motion. In Proceedings of the 25th ACM international conference on Multimedia, pages 1645–1653, 2017. 6
-
[65]
mPLUG-2: A modularized multi-modal foundation model across text, image and video
Haiyang Xu, Qinghao Ye, Ming Yan, Yaya Shi, Jiabo Ye, Yuanhong Xu, Chenliang Li, Bin Bi, Qi Qian, Wei Wang, et al. mplug-2: A modularized multi-modal foundation model across text, image and video. In International Conference on Machine Learning, 2023. 2, 4
-
[66]
Zero-shot video question answering via frozen bidirectional language models
Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, and Cordelia Schmid. Zero-shot video question answering via frozen bidirectional language models. In NeurIPS, 2022. 6
-
[67]
Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Yuhao Dan, Chenlin Zhao, Guohai Xu, Chenliang Li, Junfeng Tian, Qian Qi, Ji Zhang, and Fei Huang. mplug-docowl: Modularized multimodal large language model for document understanding. CoRR, abs/2307.02499, 2023. 2
-
[68]
Ureader: Universal ocr-free visually-situated language understanding with multimodal large language model
Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Guohai Xu, Chenliang Li, Junfeng Tian, Qi Qian, Ji Zhang, et al. Ureader: Universal ocr-free visually-situated language understanding with multimodal large language model. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023. 2
-
[69]
Hitea: Hierarchical temporal-aware video-language pre-training
Qinghao Ye, Guohai Xu, Ming Yan, Haiyang Xu, Qi Qian, Ji Zhang, and Fei Huang. Hitea: Hierarchical temporal-aware video-language pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15405–15416, 2023. 6
-
[70]
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023. 2, 3, 4, 5, 15, 16
-
[71]
Modeling context in referring expressions
Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expressions. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pages 69–85. Springer, 2016. 5, 15, 17
-
[72]
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023. 5, 6, 16
-
[73]
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. ArXiv, abs/2306.02858, 2023. 6
-
[74]
Svit: Scaling up visual instruction tuning
Bo Zhao, Boya Wu, and Tiejun Huang. Svit: Scaling up visual instruction tuning. arXiv preprint arXiv:2307.04087,
-
[75]
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023. 6
-
[76]
AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. Agieval: A human-centric benchmark for evaluating foundation models. arXiv preprint arXiv:2304.06364, 2023. 6
-
[77]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. ArXiv, abs/2304.10592, 2023. 2, 3, 5, 14, 15, 16
A. Additional Experimental Results
In this section, we provide more experimental results for the completeness of our proposed method.
A.1. Hallucination Eva...