pith. machine review for the scientific record.

arxiv: 2401.16420 · v1 · submitted 2024-01-29 · 💻 cs.CV · cs.CL


InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 05:24 UTC · model grok-4.3

classification 💻 cs.CV · cs.CL
keywords: vision-language model · text-image composition · partial LoRA · multimodal generation · free-form content creation · InternLM-XComposer2 · interleaved text and images

The pith

InternLM-XComposer2 generates custom interleaved text-image content by applying LoRA parameters only to image tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents InternLM-XComposer2 as a vision-language model that creates free-form text-image compositions from inputs such as outlines, detailed text, or reference images. It introduces a Partial LoRA method that adds adaptation parameters solely to image tokens while leaving the pre-trained language model untouched. This design aims to deliver both accurate vision understanding and fluent, high-quality text composition in long multimodal outputs. The model, based on a 7B InternLM2 backbone, is shown to exceed prior multimodal systems on benchmarks and to match or exceed GPT-4V and Gemini Pro on selected tasks. The central argument is that selective adaptation of vision components enables strong multimodal generation without sacrificing linguistic capability.

Core claim

InternLM-XComposer2 demonstrates that applying additional LoRA parameters exclusively to image tokens produces a model capable of high-quality free-form text-image composition and comprehension, outperforming existing multimodal models and matching or surpassing GPT-4V and Gemini Pro on certain benchmarks while preserving the integrity of the pre-trained language knowledge.

What carries the argument

Partial LoRA (PLoRA) that applies LoRA parameters exclusively to image tokens to balance precise vision understanding with literary-quality text composition.
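
A minimal sketch of how such position-gated adaptation could be implemented, assuming a PyTorch linear projection and a boolean image-token mask; the layer structure, rank, and scaling below are illustrative choices, not the authors' released code.

    import torch
    import torch.nn as nn

    class PartialLoRALinear(nn.Module):
        # Frozen base projection plus a low-rank update that is applied only at
        # positions flagged as image tokens, so pure-text computation stays on
        # the original pre-trained weights (the PLoRA idea, sketched).
        def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False          # language weights stay untouched
            self.lora_a = nn.Linear(base.in_features, rank, bias=False)
            self.lora_b = nn.Linear(rank, base.out_features, bias=False)
            nn.init.zeros_(self.lora_b.weight)   # adaptation starts as a zero update
            self.scaling = alpha / rank

        def forward(self, x: torch.Tensor, image_mask: torch.Tensor) -> torch.Tensor:
            # x: (batch, seq, d_in); image_mask: (batch, seq) bool, True at image tokens
            out = self.base(x)
            delta = self.lora_b(self.lora_a(x)) * self.scaling
            return out + delta * image_mask.unsqueeze(-1).to(delta.dtype)

    # Toy usage: only the first four positions receive the adapted projection.
    layer = PartialLoRALinear(nn.Linear(4096, 4096))
    x = torch.randn(1, 10, 4096)
    mask = torch.zeros(1, 10, dtype=torch.bool)
    mask[:, :4] = True
    y = layer(x, mask)

The design choice the paper emphasizes is visible in the last line of forward: text positions receive exactly the frozen base projection, while image positions get the additional low-rank term.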

If this is right

  • The model can produce long, interleaved multimodal documents from outlines or reference images.
  • Vision-language understanding reaches or exceeds GPT-4V and Gemini Pro levels on selected evaluations.
  • High-quality content creation becomes possible without full fine-tuning of the language backbone.
  • The same PLoRA pattern may extend to other base language models of similar size.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Selective tuning of vision components could reduce the risk of language degradation seen in full multimodal fine-tuning.
  • This separation of adaptation might allow smaller teams to build capable multimodal systems on top of existing open language models.
  • Testing PLoRA on tasks that require very long context or creative writing would clarify how far the preserved language skill extends.

Load-bearing premise

Adding LoRA parameters only to image tokens preserves the original language model's knowledge while still enabling strong vision understanding and text-image generation.

What would settle it

A measurable drop on language-only benchmarks after PLoRA training would show that language knowledge was not preserved; stable scores before and after adaptation would support the load-bearing premise.
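
One hedged way such a check could be scripted, comparing text-only loss of the original backbone and the PLoRA-adapted checkpoint on held-out prose; the model paths and the assumption that the adapted checkpoint can be scored on text-only inputs through a causal-LM interface are placeholders, not the paper's protocol.

    # Sketch: does text-only loss move after PLoRA adaptation?
    # The "path/to/..." identifiers are hypothetical placeholders.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def mean_text_nll(model_path: str, texts: list[str]) -> float:
        tok = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
        model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).eval()
        losses = []
        with torch.no_grad():
            for t in texts:
                ids = tok(t, return_tensors="pt").input_ids
                losses.append(model(ids, labels=ids).loss.item())  # mean token NLL
        return sum(losses) / len(losses)

    held_out = [
        "The Treaty of Westphalia ended the Thirty Years' War in 1648.",
        "A prime number has exactly two positive divisors.",
    ]
    base_nll = mean_text_nll("path/to/base-llm", held_out)
    plora_nll = mean_text_nll("path/to/plora-adapted-model", held_out)
    print(f"base {base_nll:.3f} vs adapted {plora_nll:.3f}")  # a large gap signals degradation

Benchmark suites such as MMLU and GSM8K, which the referee report names, would be the stronger version of the same comparison.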

Original abstract

We introduce InternLM-XComposer2, a cutting-edge vision-language model excelling in free-form text-image composition and comprehension. This model goes beyond conventional vision-language understanding, adeptly crafting interleaved text-image content from diverse inputs like outlines, detailed textual specifications, and reference images, enabling highly customizable content creation. InternLM-XComposer2 proposes a Partial LoRA (PLoRA) approach that applies additional LoRA parameters exclusively to image tokens to preserve the integrity of pre-trained language knowledge, striking a balance between precise vision understanding and text composition with literary talent. Experimental results demonstrate the superiority of InternLM-XComposer2 based on InternLM2-7B in producing high-quality long-text multi-modal content and its exceptional vision-language understanding performance across various benchmarks, where it not only significantly outperforms existing multimodal models but also matches or even surpasses GPT-4V and Gemini Pro in certain assessments. This highlights its remarkable proficiency in the realm of multimodal understanding. The InternLM-XComposer2 model series with 7B parameters are publicly available at https://github.com/InternLM/InternLM-XComposer.

Editorial analysis

A structured set of objections, weighed in public.

A desk editor's note, referee report, simulated authors' rebuttal, and circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper presents InternLM-XComposer2, a 7B-parameter vision-language model built on InternLM2 that introduces Partial LoRA (PLoRA) to apply additional LoRA parameters exclusively to image tokens. This design is claimed to preserve the base model's pre-trained language knowledge while enabling high-quality free-form interleaved text-image generation and comprehension from inputs such as outlines, textual specifications, and reference images. The manuscript reports that the model significantly outperforms prior multimodal systems and matches or exceeds GPT-4V and Gemini Pro on selected vision-language benchmarks, with the model weights publicly released.

Significance. If the central performance claims and the PLoRA preservation hypothesis are substantiated, the work would be significant for providing a lightweight, modular route to extend strong language models into multimodal composition tasks without full fine-tuning. The public release of the 7B model series would further enable reproducible research on controllable text-image generation.

major comments (1)
  1. [§3.2] §3.2 and abstract: The central design claim that PLoRA (LoRA applied only to image tokens) preserves InternLM2-7B's pre-trained language knowledge while adding vision capabilities is asserted without supporting ablation evidence. No results are shown for language-only benchmarks (e.g., MMLU, GSM8K) before versus after PLoRA, nor any direct comparison of PLoRA versus standard LoRA applied to all tokens. Because the attention layers still mix modalities, this isolation assumption is not guaranteed by architecture alone and is load-bearing for the claimed balance between vision understanding and literary text composition.
minor comments (1)
  1. [Abstract] Abstract and experimental section: The superiority claims reference various benchmarks but provide no details on data splits, evaluation protocols, statistical significance, or exact metric definitions, making it difficult to assess the strength of the reported gains over baselines.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment below and describe the changes planned for the revised manuscript.

Point-by-point responses
  1. Referee: [§3.2] §3.2 and abstract: The central design claim that PLoRA (LoRA applied only to image tokens) preserves InternLM2-7B's pre-trained language knowledge while adding vision capabilities is asserted without supporting ablation evidence. No results are shown for language-only benchmarks (e.g., MMLU, GSM8K) before versus after PLoRA, nor any direct comparison of PLoRA versus standard LoRA applied to all tokens. Because the attention layers still mix modalities, this isolation assumption is not guaranteed by architecture alone and is load-bearing for the claimed balance between vision understanding and literary text composition.

    Authors: We appreciate the referee highlighting the need for stronger empirical support. The PLoRA design applies LoRA updates exclusively to image tokens while keeping base InternLM2 weights frozen for text tokens, which is intended to limit interference with pre-trained language abilities. We agree that direct ablations would strengthen the manuscript. In the revision we will add language-only benchmark results (MMLU, GSM8K) comparing the original InternLM2-7B to the PLoRA-adapted model to quantify preservation. We will also include a side-by-side comparison of PLoRA versus standard LoRA applied to all tokens, showing advantages for text composition quality. Regarding modality mixing through attention layers, although cross-modal interactions exist, the position-specific LoRA application ensures that core language parameters and the modeling head for pure text sequences remain unchanged, which is consistent with the observed high-quality long-text generation performance. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external benchmarks

Full rationale

The paper introduces InternLM-XComposer2 with a Partial LoRA (PLoRA) mechanism applied selectively to image tokens on top of InternLM2-7B. Its central claims of superior free-form text-image composition and comprehension are supported by reported experimental results on various benchmarks, including direct comparisons to GPT-4V and Gemini Pro. The provided text contains no equations, self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations that would reduce the reported outcomes to the paper's own inputs by construction. The approach is presented as an architectural proposal with empirical validation against independent external references.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The central claim depends on the empirical effectiveness of Partial LoRA for modality balance, an approach chosen without detailed theoretical derivation in the provided abstract.

free parameters (1)
  • Partial LoRA rank and scaling
    Hyperparameters controlling the added adaptation matrices applied only to image tokens; values chosen to preserve language capabilities (see the formula sketched below).
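
For orientation, this is the standard LoRA parameterization these two hyperparameters control (rank $r$ and scaling $\alpha$, following the original LoRA formulation the paper builds on); the restriction to image-token positions is the PLoRA-specific part, and the piecewise form is a reading of the abstract rather than a quoted equation.

$$
h_i =
\begin{cases}
W_0\,x_i + \dfrac{\alpha}{r}\,B A\,x_i, & i \in \text{image tokens},\\[4pt]
W_0\,x_i, & i \in \text{text tokens},
\end{cases}
\qquad A \in \mathbb{R}^{r \times d},\;\; B \in \mathbb{R}^{d' \times r},\;\; W_0 \text{ frozen}.
$$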

pith-pipeline@v0.9.0 · 5577 in / 1080 out tokens · 51771 ms · 2026-05-17T05:24:56.204955+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

    cs.CL 2023-11 unverdicted novelty 8.0

    MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.

  2. MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    MMR-AD is a new benchmark dataset showing that current generalist MLLMs lag industrial needs for anomaly detection, with Anomaly-R1 delivering better results through reasoning and RL.

  3. CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning

    cs.CV 2026-01 unverdicted novelty 7.0

    CamReasoner uses structured O-T-A reasoning and RL on 56k samples to lift camera movement classification from 73.8% to 78.4% and VQA from 60.9% to 74.5% on Qwen2.5-VL-7B.

  4. We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

    cs.AI 2024-07 accept novelty 7.0

    WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.

  5. MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

    cs.CV 2024-03 conditional novelty 7.0

    MathVerse is a benchmark that tests multi-modal LLMs on visual math by providing each problem in six versions with progressively less diagram and text information to measure true visual understanding.

  6. RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data

    cs.RO 2026-05 unverdicted novelty 6.0

    A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.

  7. Towards Design Compositing

    cs.CV 2026-04 unverdicted novelty 6.0

    GIST is a training-free identity-preserving image compositor that improves visual harmony when integrating disparate elements into design pipelines.

  8. Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

    cs.CV 2025-01 conditional novelty 6.0

    Sa2VA unifies SAM-2 segmentation with MLLM reasoning into a single model for referring segmentation and conversation on images and videos, supported by a new 72k-expression Ref-SAV dataset.

  9. LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

    cs.CV 2024-10 unverdicted novelty 6.0

    LongVU adaptively compresses long video tokens using DINOv2-based frame deduplication, text-guided cross-modal selection, and temporal spatial reduction to improve video-language understanding in MLLMs with minimal de...

  10. BLINK: Multimodal Large Language Models Can See but Not Perceive

    cs.CV 2024-04 accept novelty 6.0

    BLINK benchmark shows multimodal LLMs reach only 45-51 percent accuracy on core visual perception tasks where humans achieve 95 percent, indicating these abilities have not emerged.

  11. Are We on the Right Way for Evaluating Large Vision-Language Models?

    cs.CV 2024-03 conditional novelty 6.0

    Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...

  12. MMBench: Is Your Multi-modal Model an All-around Player?

    cs.CV 2023-07 accept novelty 6.0

    MMBench is a new bilingual benchmark that uses curated questions, CircularEval, and LLM-assisted answer conversion to provide objective, fine-grained evaluation of vision-language models.

  13. Less Detail, Better Answers: Degradation-Driven Prompting for VQA

    cs.CV 2026-04 unverdicted novelty 5.0

    Degradation-Driven Prompting improves VQA by intentionally reducing image detail and using masks, lines, and examples to guide models toward essential structures.

  14. Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

    cs.CL 2025-03 unverdicted novelty 5.0

    Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.

  15. InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

    cs.CV 2024-07 conditional novelty 5.0

    InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.

  16. Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

    cs.CV 2024-03 unverdicted novelty 5.0

    Mini-Gemini enhances VLMs via high-resolution visual refinement, curated reasoning data, and self-guided generation to reach leading zero-shot benchmark results across 2B-34B LLMs.

  17. VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    cs.CV 2024-06 unverdicted novelty 4.0

    VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.

  18. How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    cs.CV 2024-04 unverdicted novelty 4.0

    InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

  19. DeepSeek-VL: Towards Real-World Vision-Language Understanding

    cs.AI 2024-03 unverdicted novelty 4.0

    DeepSeek-VL develops open-source 1.3B and 7B vision-language models that achieve competitive or state-of-the-art results on real-world visual-language benchmarks through diverse data curation, a hybrid vision encoder,...

Reference graph

Works this paper leans on

105 extracted references · 105 canonical work pages · cited by 19 Pith papers · 16 internal anchors
