pith. machine review for the scientific record.

arxiv: 2401.16420 · v1 · submitted 2024-01-29 · 💻 cs.CV · cs.CL


InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 05:24 UTC · model grok-4.3

classification 💻 cs.CV · cs.CL
keywords: vision-language model · text-image composition · partial LoRA · multimodal generation · free-form content creation · InternLM-XComposer2 · interleaved text and images

The pith

InternLM-XComposer2 generates custom interleaved text-image content by applying LoRA parameters only to image tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents InternLM-XComposer2 as a vision-language model that creates free-form text-image compositions from inputs such as outlines, detailed text, or reference images. It introduces a Partial LoRA method that adds adaptation parameters solely to image tokens while leaving the pre-trained language model untouched. This design aims to deliver both accurate vision understanding and fluent, high-quality text composition in long multimodal outputs. The model, based on a 7B InternLM2 backbone, is shown to exceed prior multimodal systems on benchmarks and to match or exceed GPT-4V and Gemini Pro on selected tasks. The central argument is that selective adaptation of vision components enables strong multimodal generation without sacrificing linguistic capability.

Core claim

InternLM-XComposer2 demonstrates that applying additional LoRA parameters exclusively to image tokens produces a model capable of high-quality free-form text-image composition and comprehension, outperforming existing multimodal models and matching or surpassing GPT-4V and Gemini Pro on certain benchmarks while preserving the integrity of the pre-trained language knowledge.

What carries the argument

Partial LoRA (PLoRA) that applies LoRA parameters exclusively to image tokens to balance precise vision understanding with literary-quality text composition.
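
A minimal sketch of how such position-gated adaptation could be implemented, assuming a PyTorch linear projection and a boolean image-token mask; the layer structure, rank, and scaling below are illustrative choices, not the authors' released code.

    import torch
    import torch.nn as nn

    class PartialLoRALinear(nn.Module):
        # Frozen base projection plus a low-rank update that is applied only at
        # positions flagged as image tokens, so pure-text computation stays on
        # the original pre-trained weights (the PLoRA idea, sketched).
        def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False          # language weights stay untouched
            self.lora_a = nn.Linear(base.in_features, rank, bias=False)
            self.lora_b = nn.Linear(rank, base.out_features, bias=False)
            nn.init.zeros_(self.lora_b.weight)   # adaptation starts as a zero update
            self.scaling = alpha / rank

        def forward(self, x: torch.Tensor, image_mask: torch.Tensor) -> torch.Tensor:
            # x: (batch, seq, d_in); image_mask: (batch, seq) bool, True at image tokens
            out = self.base(x)
            delta = self.lora_b(self.lora_a(x)) * self.scaling
            return out + delta * image_mask.unsqueeze(-1).to(delta.dtype)

    # Toy usage: only the first four positions receive the adapted projection.
    layer = PartialLoRALinear(nn.Linear(4096, 4096))
    x = torch.randn(1, 10, 4096)
    mask = torch.zeros(1, 10, dtype=torch.bool)
    mask[:, :4] = True
    y = layer(x, mask)

The design choice the paper emphasizes is visible in the last line of forward: text positions receive exactly the frozen base projection, while image positions get the additional low-rank term.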

If this is right

  • The model can produce long, interleaved multimodal documents from outlines or reference images.
  • Vision-language understanding reaches or exceeds GPT-4V and Gemini Pro levels on selected evaluations.
  • High-quality content creation becomes possible without full fine-tuning of the language backbone.
  • The same PLoRA pattern may extend to other base language models of similar size.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Selective tuning of vision components could reduce the risk of language degradation seen in full multimodal fine-tuning.
  • This separation of adaptation might allow smaller teams to build capable multimodal systems on top of existing open language models.
  • Testing PLoRA on tasks that require very long context or creative writing would clarify how far the preserved language skill extends.

Load-bearing premise

Adding LoRA parameters only to image tokens preserves the original language model's knowledge while still enabling strong vision understanding and text-image generation.

What would settle it

A measurable drop on language-only benchmarks after PLoRA training would show that language knowledge was not preserved; stable scores before and after adaptation would support the load-bearing premise.
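
One hedged way such a check could be scripted, comparing text-only loss of the original backbone and the PLoRA-adapted checkpoint on held-out prose; the model paths and the assumption that the adapted checkpoint can be scored on text-only inputs through a causal-LM interface are placeholders, not the paper's protocol.

    # Sketch: does text-only loss move after PLoRA adaptation?
    # The "path/to/..." identifiers are hypothetical placeholders.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def mean_text_nll(model_path: str, texts: list[str]) -> float:
        tok = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
        model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).eval()
        losses = []
        with torch.no_grad():
            for t in texts:
                ids = tok(t, return_tensors="pt").input_ids
                losses.append(model(ids, labels=ids).loss.item())  # mean token NLL
        return sum(losses) / len(losses)

    held_out = [
        "The Treaty of Westphalia ended the Thirty Years' War in 1648.",
        "A prime number has exactly two positive divisors.",
    ]
    base_nll = mean_text_nll("path/to/base-llm", held_out)
    plora_nll = mean_text_nll("path/to/plora-adapted-model", held_out)
    print(f"base {base_nll:.3f} vs adapted {plora_nll:.3f}")  # a large gap signals degradation

Benchmark suites such as MMLU and GSM8K, which the referee report names, would be the stronger version of the same comparison.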

Original abstract

We introduce InternLM-XComposer2, a cutting-edge vision-language model excelling in free-form text-image composition and comprehension. This model goes beyond conventional vision-language understanding, adeptly crafting interleaved text-image content from diverse inputs like outlines, detailed textual specifications, and reference images, enabling highly customizable content creation. InternLM-XComposer2 proposes a Partial LoRA (PLoRA) approach that applies additional LoRA parameters exclusively to image tokens to preserve the integrity of pre-trained language knowledge, striking a balance between precise vision understanding and text composition with literary talent. Experimental results demonstrate the superiority of InternLM-XComposer2 based on InternLM2-7B in producing high-quality long-text multi-modal content and its exceptional vision-language understanding performance across various benchmarks, where it not only significantly outperforms existing multimodal models but also matches or even surpasses GPT-4V and Gemini Pro in certain assessments. This highlights its remarkable proficiency in the realm of multimodal understanding. The InternLM-XComposer2 model series with 7B parameters are publicly available at https://github.com/InternLM/InternLM-XComposer.

Editorial analysis

A structured set of objections, weighed in public.

A desk editor's note, referee report, simulated authors' rebuttal, and circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper presents InternLM-XComposer2, a 7B-parameter vision-language model built on InternLM2 that introduces Partial LoRA (PLoRA) to apply additional LoRA parameters exclusively to image tokens. This design is claimed to preserve the base model's pre-trained language knowledge while enabling high-quality free-form interleaved text-image generation and comprehension from inputs such as outlines, textual specifications, and reference images. The manuscript reports that the model significantly outperforms prior multimodal systems and matches or exceeds GPT-4V and Gemini Pro on selected vision-language benchmarks, with the model weights publicly released.

Significance. If the central performance claims and the PLoRA preservation hypothesis are substantiated, the work would be significant for providing a lightweight, modular route to extend strong language models into multimodal composition tasks without full fine-tuning. The public release of the 7B model series would further enable reproducible research on controllable text-image generation.

major comments (1)
  1. [§3.2] §3.2 and abstract: The central design claim that PLoRA (LoRA applied only to image tokens) preserves InternLM2-7B's pre-trained language knowledge while adding vision capabilities is asserted without supporting ablation evidence. No results are shown for language-only benchmarks (e.g., MMLU, GSM8K) before versus after PLoRA, nor any direct comparison of PLoRA versus standard LoRA applied to all tokens. Because the attention layers still mix modalities, this isolation assumption is not guaranteed by architecture alone and is load-bearing for the claimed balance between vision understanding and literary text composition.
minor comments (1)
  1. [Abstract] Abstract and experimental section: The superiority claims reference various benchmarks but provide no details on data splits, evaluation protocols, statistical significance, or exact metric definitions, making it difficult to assess the strength of the reported gains over baselines.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment below and describe the changes planned for the revised manuscript.

Point-by-point responses
  1. Referee: [§3.2] §3.2 and abstract: The central design claim that PLoRA (LoRA applied only to image tokens) preserves InternLM2-7B's pre-trained language knowledge while adding vision capabilities is asserted without supporting ablation evidence. No results are shown for language-only benchmarks (e.g., MMLU, GSM8K) before versus after PLoRA, nor any direct comparison of PLoRA versus standard LoRA applied to all tokens. Because the attention layers still mix modalities, this isolation assumption is not guaranteed by architecture alone and is load-bearing for the claimed balance between vision understanding and literary text composition.

    Authors: We appreciate the referee highlighting the need for stronger empirical support. The PLoRA design applies LoRA updates exclusively to image tokens while keeping base InternLM2 weights frozen for text tokens, which is intended to limit interference with pre-trained language abilities. We agree that direct ablations would strengthen the manuscript. In the revision we will add language-only benchmark results (MMLU, GSM8K) comparing the original InternLM2-7B to the PLoRA-adapted model to quantify preservation. We will also include a side-by-side comparison of PLoRA versus standard LoRA applied to all tokens, showing advantages for text composition quality. Regarding modality mixing through attention layers, although cross-modal interactions exist, the position-specific LoRA application ensures that core language parameters and the modeling head for pure text sequences remain unchanged, which is consistent with the observed high-quality long-text generation performance. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external benchmarks

Full rationale

The paper introduces InternLM-XComposer2 with a Partial LoRA (PLoRA) mechanism applied selectively to image tokens on top of InternLM2-7B. Its central claims of superior free-form text-image composition and comprehension are supported by reported experimental results on various benchmarks, including direct comparisons to GPT-4V and Gemini Pro. The provided text contains no equations, self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations that would reduce the reported outcomes to the paper's own inputs by construction. The approach is presented as an architectural proposal with empirical validation against independent external references.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The central claim depends on the empirical effectiveness of Partial LoRA for modality balance, an approach chosen without detailed theoretical derivation in the provided abstract.

free parameters (1)
  • Partial LoRA rank and scaling
    Hyperparameters controlling the added adaptation matrices applied only to image tokens; values chosen to preserve language capabilities (see the formula sketched below).
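
For orientation, this is the standard LoRA parameterization these two hyperparameters control (rank $r$ and scaling $\alpha$, following the original LoRA formulation the paper builds on); the restriction to image-token positions is the PLoRA-specific part, and the piecewise form is a reading of the abstract rather than a quoted equation.

$$
h_i =
\begin{cases}
W_0\,x_i + \dfrac{\alpha}{r}\,B A\,x_i, & i \in \text{image tokens},\\[4pt]
W_0\,x_i, & i \in \text{text tokens},
\end{cases}
\qquad A \in \mathbb{R}^{r \times d},\;\; B \in \mathbb{R}^{d' \times r},\;\; W_0 \text{ frozen}.
$$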

pith-pipeline@v0.9.0 · 5577 in / 1080 out tokens · 51771 ms · 2026-05-17T05:24:56.204955+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

    cs.CL 2023-11 unverdicted novelty 8.0

    MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.

  2. MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    MMR-AD is a new benchmark dataset showing that current generalist MLLMs lag industrial needs for anomaly detection, with Anomaly-R1 delivering better results through reasoning and RL.

  3. CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning

    cs.CV 2026-01 unverdicted novelty 7.0

    CamReasoner uses structured O-T-A reasoning and RL on 56k samples to lift camera movement classification from 73.8% to 78.4% and VQA from 60.9% to 74.5% on Qwen2.5-VL-7B.

  4. We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

    cs.AI 2024-07 accept novelty 7.0

    WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.

  5. MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

    cs.CV 2024-03 conditional novelty 7.0

    MathVerse is a benchmark that tests multi-modal LLMs on visual math by providing each problem in six versions with progressively less diagram and text information to measure true visual understanding.

  6. RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data

    cs.RO 2026-05 unverdicted novelty 6.0

    A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.

  7. Towards Design Compositing

    cs.CV 2026-04 unverdicted novelty 6.0

    GIST is a training-free identity-preserving image compositor that improves visual harmony when integrating disparate elements into design pipelines.

  8. Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

    cs.CV 2025-01 conditional novelty 6.0

    Sa2VA unifies SAM-2 segmentation with MLLM reasoning into a single model for referring segmentation and conversation on images and videos, supported by a new 72k-expression Ref-SAV dataset.

  9. LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

    cs.CV 2024-10 unverdicted novelty 6.0

    LongVU adaptively compresses long video tokens using DINOv2-based frame deduplication, text-guided cross-modal selection, and temporal spatial reduction to improve video-language understanding in MLLMs with minimal de...

  10. BLINK: Multimodal Large Language Models Can See but Not Perceive

    cs.CV 2024-04 accept novelty 6.0

    BLINK benchmark shows multimodal LLMs reach only 45-51 percent accuracy on core visual perception tasks where humans achieve 95 percent, indicating these abilities have not emerged.

  11. Are We on the Right Way for Evaluating Large Vision-Language Models?

    cs.CV 2024-03 conditional novelty 6.0

    Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...

  12. MMBench: Is Your Multi-modal Model an All-around Player?

    cs.CV 2023-07 accept novelty 6.0

    MMBench is a new bilingual benchmark that uses curated questions, CircularEval, and LLM-assisted answer conversion to provide objective, fine-grained evaluation of vision-language models.

  13. Less Detail, Better Answers: Degradation-Driven Prompting for VQA

    cs.CV 2026-04 unverdicted novelty 5.0

    Degradation-Driven Prompting improves VQA by intentionally reducing image detail and using masks, lines, and examples to guide models toward essential structures.

  14. Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

    cs.CL 2025-03 unverdicted novelty 5.0

    Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.

  15. InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

    cs.CV 2024-07 conditional novelty 5.0

    InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.

  16. Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

    cs.CV 2024-03 unverdicted novelty 5.0

    Mini-Gemini enhances VLMs via high-resolution visual refinement, curated reasoning data, and self-guided generation to reach leading zero-shot benchmark results across 2B-34B LLMs.

  17. VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    cs.CV 2024-06 unverdicted novelty 4.0

    VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.

  18. How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    cs.CV 2024-04 unverdicted novelty 4.0

    InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

  19. DeepSeek-VL: Towards Real-World Vision-Language Understanding

    cs.AI 2024-03 unverdicted novelty 4.0

    DeepSeek-VL develops open-source 1.3B and 7B vision-language models that achieve competitive or state-of-the-art results on real-world visual-language benchmarks through diverse data curation, a hybrid vision encoder,...

Reference graph

Works this paper leans on

105 extracted references · 105 canonical work pages · cited by 19 Pith papers · 16 internal anchors
