arxiv: 2309.15112 · v5 · pith:OGXGRTKCnew · submitted 2023-09-26 · 💻 cs.CV

InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition

Pan Zhang, Xiaoyi Dong, Bin Wang, Yuhang Cao, Chao Xu, Linke Ouyang, Zhiyuan Zhao, Haodong Duan, Songyang Zhang, Shuangrui Ding, Wenwei Zhang, Hang Yan, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, Jiaqi Wang

Pith reviewed 2026-05-17 13:42 UTC · model grok-4.3

classification 💻 cs.CV

keywords vision-language model · text-image composition · interleaved generation · multimodal comprehension · multilingual dataset · benchmark performance · GPT-4V evaluation · content generation

The pith

InternLM-XComposer generates articles with automatically inserted context-appropriate images while achieving state-of-the-art results on vision-language benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces InternLM-XComposer as a vision-language large model capable of both comprehending and composing interleaved text and images. It takes a writing instruction and produces a full manuscript, intelligently placing images where they enhance the narrative. Training on a large multilingual multimodal dataset with crafted strategies enables rich visual understanding across languages. The model reports leading performance on benchmarks such as MME, MMBench, and others, with a custom human and GPT-4V evaluation showing competitive composition quality against GPT-4V and GPT-3.5. This combination points toward more immersive and natural vision-language interactions.

Core claim

InternLM-XComposer is a vision-language large model that enables advanced image-text comprehension and composition through interleaved text-image generation. Given a writing instruction, it creates coherent articles by identifying suitable locations for images and inserting the most appropriate visual candidates. This is supported by training on an extensive multi-modal multilingual database using carefully crafted strategies, resulting in deep understanding of visual content. The model achieves state-of-the-art results on mainstream benchmarks including the MME Benchmark, MMBench, MMBench-CN, Seed-Bench, CCBench, QBench, and Tiny LVLM. For text-image composition, a custom evaluation using both human and GPT-4V judges reports scores competitive with GPT-4V and GPT-3.5.

What carries the argument

The interleaved text-image composition capability, which automatically identifies enhancement points in text and inserts fitting images based on the writing prompt.
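
The two decisions described here (where an image helps, and which candidate fits) can be illustrated with a toy sketch. This is not the paper's method: the function names, the word-overlap score, and the `threshold` value are hypothetical stand-ins for whatever learned components the model actually uses.

```python
import re

def tokens(text):
    """Lowercased word set; a crude stand-in for a learned encoder (assumption)."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Overlap between two token sets, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def compose(paragraphs, captions, threshold=0.2):
    """Interleave paragraphs with the best-matching candidate image, if any.

    paragraphs: list of str.
    captions: dict mapping a hypothetical image id to its caption text.
    Returns a list of ("text", paragraph) and ("image", image_id) items.
    """
    article, unused = [], dict(captions)
    for paragraph in paragraphs:
        article.append(("text", paragraph))
        if not unused:
            continue
        p_tokens = tokens(paragraph)
        # Pick the remaining candidate whose caption overlaps the paragraph most.
        best_id = max(unused, key=lambda i: jaccard(p_tokens, tokens(unused[i])))
        # Insert only when the match is strong enough to count as an enhancement point.
        if jaccard(p_tokens, tokens(unused[best_id])) >= threshold:
            article.append(("image", best_id))
            del unused[best_id]  # each image is placed at most once
    return article
```

In this sketch a paragraph with no sufficiently similar caption stays text-only, which mirrors the idea that images are inserted only where they enhance the content.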

If this is right

Simple text prompts can yield complete, visually enriched articles without separate image sourcing.
Multilingual training supports comprehension and composition across different languages and cultural contexts.
Top benchmark scores indicate strong foundational vision-language abilities that underpin the composition feature.
The custom evaluation framework provides a way to assess composition quality where standard metrics are lacking.
Public release of the model series opens opportunities for further development in multimodal content creation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Such interleaved generation could extend to dynamic web content or personalized learning materials where visuals adapt to text.
Combining comprehension and composition in one model may lead to more interactive AI assistants that can both analyze and create visual stories.
Testing the model on real-world creative tasks like journalism or marketing copy could reveal practical utility beyond benchmarks.
The approach of using GPT-4V in evaluation might inspire similar hybrid human-AI assessment methods for other generative tasks.

Load-bearing premise

The custom human-plus-GPT-4V evaluation procedure reliably measures the quality of text-image compositions produced by the model.

What would settle it

A large-scale blind comparison study where independent evaluators rate randomly presented articles from InternLM-XComposer, GPT-4V, and GPT-3.5, finding that InternLM-XComposer scores substantially below the others on coherence and visual relevance.
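
A blind comparison of this kind depends on hiding the source system from raters. The sketch below shows one way a blinded rating sheet might be built. The system labels, the item id format, and the `blind_trials` helper are all hypothetical; nothing here comes from the paper.

```python
import random

def blind_trials(outputs_by_system, seed=0):
    """Build a blinded rating sheet plus the key needed to unblind it later.

    outputs_by_system: dict mapping a system label to its list of articles,
    aligned so that index i is the same writing prompt for every system.
    Returns (sheet, key): sheet is a list of (item_id, article) shown to raters,
    key maps item_id back to the system label and is withheld from raters.
    """
    systems = list(outputs_by_system)
    lengths = {len(outputs_by_system[s]) for s in systems}
    if len(lengths) != 1:
        raise ValueError("every system needs one article per prompt")
    rng = random.Random(seed)
    sheet, key = [], {}
    for prompt_idx in range(lengths.pop()):
        order = systems[:]
        rng.shuffle(order)  # randomize presentation order per prompt
        for slot, system in enumerate(order):
            item_id = f"p{prompt_idx}-{slot}"
            sheet.append((item_id, outputs_by_system[system][prompt_idx]))
            key[item_id] = system
    return sheet, key
```

Raters would score only the `sheet` entries; scores are joined back to systems through `key` after collection.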

Original abstract

We propose InternLM-XComposer, a vision-language large model that enables advanced image-text comprehension and composition. The innovative nature of our model is highlighted by three appealing properties: 1) Interleaved Text-Image Composition: InternLM-XComposer can effortlessly generate coherent and contextual articles that seamlessly integrate images, providing a more engaging and immersive reading experience. Simply provide a writing instruction, and our system will generate the corresponding manuscript. It can intelligently identify the areas in the text where images would enhance the content and automatically insert the most appropriate visual candidates. 2) Comprehension with Rich Multilingual Knowledge: The text-image comprehension is empowered by training on an extensive multi-modal multilingual database with carefully crafted strategies, resulting in a deep understanding of visual content. 3) State-of-the-art Performance: Our model consistently achieves state-of-the-art results across various mainstream benchmarks for vision-language foundational models, including MME Benchmark, MMBench, MMBench-CN, Seed-Bench, CCBench (Chinese Cultural Benchmark), QBench and Tiny LVLM. Owing to the absence of established metrics for quantitatively assessing text-image composition, we have devised a robust evaluation procedure that comprises both human and GPT4-Vision (GPT4-V) to ensure reliability. Notably, our InternLM-XComposer achieves competitive text-image composition scores compared to public solutions, including GPT4-V and GPT3.5. Collectively, InternLM-XComposer seamlessly blends advanced text-image comprehension and composition, revolutionizing vision-language interaction and offering new insights and opportunities. The InternLM-XComposer model series are publicly available at https://github.com/InternLM/InternLM-XComposer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

InternLM-XComposer releases an open VLM that adds practical interleaved text-image generation on top of standard comprehension benchmarks, though the supporting evaluation for the new feature is thin on validation.

InternLM-XComposer is mainly a public model release that lets you prompt for an article and have it write the text while automatically picking and inserting fitting images at the right spots. That interleaved composition is the clearest addition beyond the usual vision-language setup. The paper also reports leading numbers on MME, MMBench, Seed-Bench, and the Chinese-language benchmarks (MMBench-CN, CCBench), and describes training on multilingual multimodal data. The public weights are the part that actually lets others check whether the composition works in practice rather than just reading claims. The benchmark results give a reasonable snapshot of comprehension performance across several suites.

The main soft spot is the composition evaluation. They built a custom human-plus-GPT-4V procedure because no standard metric exists, but the text gives no inter-rater numbers, no explicit rubric, and no check against any external signal like downstream task success. That leaves open the chance that the competitive scores versus GPT-4V mostly track judge preferences instead of measurable quality. Training details are also light on exact data mixtures and scale, so it is hard to separate the claimed strategies from simple data volume.

This paper is for groups working on applied multimodal tools, especially content generation or mixed-media interfaces. Readers who need a ready model to test or extend will find the release and the standard-benchmark numbers useful. The work engages the existing literature through the cited benchmarks and prior VLMs, so it is worth sending to a serious referee who can examine the evaluation setup and any unreported ablations in the full manuscript. I would recommend peer review rather than desk rejection.

Referee Report

2 major / 2 minor

Summary. The paper proposes InternLM-XComposer, a vision-language large model for advanced text-image comprehension and composition. It emphasizes three key properties: the ability to generate coherent interleaved text-image articles from writing instructions by automatically inserting appropriate images, comprehension powered by extensive multilingual multimodal training data with crafted strategies, and state-of-the-art performance on benchmarks including MME, MMBench, MMBench-CN, Seed-Bench, CCBench, QBench, and Tiny LVLM. A custom evaluation procedure involving both human and GPT-4V judges is introduced for assessing text-image composition, where the model achieves competitive scores against GPT-4V and GPT-3.5. The model series is publicly released on GitHub.

Significance. If the empirical results and custom evaluation hold after additional validation, this contributes to multimodal models by demonstrating practical interleaved text-image generation alongside strong comprehension, with the open release enabling community use and extension. The SOTA benchmark claims, if supported by ablations, could inform training strategies for vision-language foundational models.

major comments (2)

[§4 (Evaluation)] The custom human-plus-GPT-4V procedure for text-image composition lacks reported inter-rater agreement, explicit judging rubric, sample size, or correlation with external signals such as downstream task performance. This is load-bearing for the highlighted 'advanced interleaved text-image composition' property, as the SOTA comprehension results on MME/MMBench can be audited independently while composition scores depend on this unvalidated procedure.
[§3 (Training)] The multimodal data mixture weights, exact data sources, training scale, and ablation studies are not detailed beyond high-level descriptions of 'carefully crafted strategies.' This undermines attribution of the reported benchmark gains and the claim of 'deep understanding of visual content' to the proposed approach rather than post-hoc choices.
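
The inter-rater agreement requested in the first major comment is commonly summarized with Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is a generic implementation for two raters. The verdict labels in the usage note are invented for illustration and are not data from the paper.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected if the raters labelled independently.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equally long, non-empty rating lists")
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

For example, a human judge giving `["win", "win", "loss", "tie", "loss", "win"]` and a GPT-4V judge giving `["win", "loss", "loss", "tie", "loss", "win"]` agree on 5 of 6 items, which yields kappa = 17/23, roughly 0.74.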

minor comments (2)

[Abstract] Abstract and §1: The phrase 'robust evaluation procedure' is used without forward reference to the specific controls or metrics; adding a sentence linking to the evaluation section would improve clarity.
[§4] Tables in §4: Benchmark results would benefit from error bars or statistical tests to substantiate 'state-of-the-art' claims across the listed datasets.
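
One simple way to attach the error bars suggested above is a percentile bootstrap over per-question outcomes. The sketch assumes a list of 0/1 correctness values per benchmark item is available, which the paper does not state. The function name and defaults are illustrative.

```python
import random

def bootstrap_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for benchmark accuracy.

    correct: list of 0/1 outcomes, one per benchmark question.
    Returns (accuracy, lower, upper) for a (1 - alpha) interval.
    """
    if not correct:
        raise ValueError("need at least one outcome")
    rng = random.Random(seed)
    n = len(correct)
    # Resample questions with replacement and recompute accuracy each time.
    means = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(correct) / n, lower, upper
```

Two models whose intervals overlap heavily on a given benchmark would be hard to rank with confidence, which is the concern behind the request for statistical support.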

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below and indicate planned revisions to the manuscript.

Referee: [§4 (Evaluation)] The custom human-plus-GPT-4V procedure for text-image composition lacks reported inter-rater agreement, explicit judging rubric, sample size, or correlation with external signals such as downstream task performance. This is load-bearing for the highlighted 'advanced interleaved text-image composition' property, as the SOTA comprehension results on MME/MMBench can be audited independently while composition scores depend on this unvalidated procedure.

Authors: We agree that additional details would strengthen the validation of our custom evaluation procedure. In the revised manuscript we will report inter-rater agreement statistics, provide the explicit judging rubric, state the sample size for both human and GPT-4V evaluations, and include any observed correlations with external signals or downstream performance. These additions will improve transparency and support for the text-image composition claims.
revision: yes

Referee: [§3 (Training)] The multimodal data mixture weights, exact data sources, training scale, and ablation studies are not detailed beyond high-level descriptions of 'carefully crafted strategies.' This undermines attribution of the reported benchmark gains and the claim of 'deep understanding of visual content' to the proposed approach rather than post-hoc choices.

Authors: We recognize that greater detail on training data and procedures would help attribute performance gains. In revision we will expand descriptions of the data curation strategies, training scale, and any ablation studies that were conducted. Exact mixture weights and certain data sources remain constrained by scale and licensing considerations, so we will focus on the reproducible high-level strategies and publicly released model components.
revision: partial

Circularity Check

0 steps flagged

No circularity: empirical model with independent benchmark results

The paper describes training and evaluating a vision-language model on standard benchmarks (MME, MMBench, etc.) plus a custom human/GPT-4V procedure for text-image composition. No mathematical derivation chain, equations, or first-principles claims exist that reduce to fitted parameters or self-citations by construction. The custom evaluation is explicitly motivated by the absence of established metrics and is presented as a practical assessment tool rather than a derived result. All performance claims rest on externally auditable benchmark scores and model outputs, making the work self-contained against independent verification.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work relies on standard assumptions of large-scale multimodal pretraining and fine-tuning; no new physical or mathematical axioms are introduced. Free parameters include model size, learning rate schedules, and data weighting strategies that are typical for such systems but not enumerated in the abstract.

free parameters (1)

multimodal data mixture weights
Carefully crafted strategies for training on extensive multi-modal multilingual database imply hand-tuned or searched weights that affect comprehension performance.
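
To make concrete what a mixture weight controls, the sketch below draws training sources in proportion to assigned weights. The source names and the 3:1 ratio in the usage note are invented for illustration; the paper does not report its actual mixture.

```python
import random

def sample_sources(weights, n_samples, seed=0):
    """Draw data-source names in proportion to their mixture weights.

    weights: dict mapping a (hypothetical) source name to a non-negative weight.
    Returns a list of n_samples source names.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    rng = random.Random(seed)
    names = list(weights)
    # Normalize so the weights can be read as sampling probabilities.
    probs = [weights[name] / total for name in names]
    return rng.choices(names, weights=probs, k=n_samples)
```

With `{"captions_en": 3, "captions_zh": 1}`, about three quarters of the draws come from the first source. Changing that ratio shifts what the model sees during training, which is why unreported weights make benchmark gains hard to attribute.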

axioms (1)

domain assumption: Large-scale multimodal training on curated data yields deep visual understanding
Invoked when stating that training results in a deep understanding of visual content.

pith-pipeline@v0.9.0 · 5673 in / 1334 out tokens · 89293 ms · 2026-05-17T13:42:12.353642+00:00 · methodology

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
cs.CV · 2024-08 · conditional · novelty 8.0

MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.

SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition
cs.AI · 2025-11 · unverdicted · novelty 7.0

SpatialBench creates a five-level framework and 15-task benchmark to measure hierarchical spatial reasoning in MLLMs, finding strong basic perception but weak symbolic reasoning, causal inference, and planning.

UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing
cs.CV · 2026-04 · unverdicted · novelty 6.0

UHR-BAT is a budget-aware framework that uses text-guided multi-scale importance estimation plus region-wise preserve and merge strategies to compress visual tokens in ultra-high-resolution remote sensing vision-langu...

CFMS: A Coarse-to-Fine Multimodal Synthesis Framework for Enhanced Tabular Reasoning
cs.AI · 2026-04 · unverdicted · novelty 6.0

CFMS is a coarse-to-fine framework that uses MLLMs to create a multi-perspective knowledge tuple as a reasoning map for symbolic table operations, yielding competitive accuracy on WikiTQ and TabFact.

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
cs.CV · 2025-01 · conditional · novelty 6.0

Sa2VA unifies SAM-2 segmentation with MLLM reasoning into a single model for referring segmentation and conversation on images and videos, supported by a new 72k-expression Ref-SAV dataset.

SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
cs.CV · 2024-04 · unverdicted · novelty 6.0

SEED-X is a unified multimodal foundation model that handles multi-granularity visual semantics for both comprehension and generation across arbitrary image sizes and ratios.

Are We on the Right Way for Evaluating Large Vision-Language Models?
cs.CV · 2024-03 · conditional · novelty 6.0

Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...

MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
cs.CV · 2024-01 · conditional · novelty 6.0

MoE-LLaVA applies mixture-of-experts sparsity to LVLMs via MoE-Tuning, delivering LLaVA-1.5-7B level visual understanding and better hallucination resistance with only ~3B active parameters.

SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
cs.HC · 2024-01 · unverdicted · novelty 6.0

SeeClick improves visual GUI agents via GUI grounding pre-training on automatically curated data and introduces the ScreenSpot benchmark, with results indicating that stronger grounding boosts downstream task performance.

ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
cs.CV · 2023-11 · conditional · novelty 6.0

A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.

MMBench: Is Your Multi-modal Model an All-around Player?
cs.CV · 2023-07 · accept · novelty 6.0

MMBench is a new bilingual benchmark that uses curated questions, CircularEval, and LLM-assisted answer conversion to provide objective, fine-grained evaluation of vision-language models.

Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
cs.CL · 2025-03 · unverdicted · novelty 5.0

Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
cs.CV · 2024-07 · conditional · novelty 5.0

InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.

Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
cs.CV · 2024-03 · unverdicted · novelty 5.0

Mini-Gemini enhances VLMs via high-resolution visual refinement, curated reasoning data, and self-guided generation to reach leading zero-shot benchmark results across 2B-34B LLMs.

InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model
cs.CV · 2024-01 · unverdicted · novelty 5.0

InternLM-XComposer2 introduces Partial LoRA on InternLM2-7B to enable high-quality free-form text-image composition while matching or exceeding GPT-4V on select vision-language benchmarks.

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
cs.CV · 2023-12 · unverdicted · novelty 5.0

InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
cs.CV · 2025-01 · conditional · novelty 4.0

VITA-1.5 integrates vision and speech into a single LLM through multi-stage training, delivering competitive benchmark results on image, video, and speech tasks with near real-time response speed.

Reference graph

Works this paper leans on

119 extracted references · 119 canonical work pages · cited by 17 Pith papers · 10 internal anchors

[1]

Flamingo: a visual language model for few-shot learning,

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, An- toine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Se- bastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sa- hand Sharifzadeh, Mikolaj Bink...

work page
[2]

Lawrence Zitnick, and Devi Parikh

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In International Conference on Computer Vision (ICCV), 2015. 4

work page 2015
[3]

Openflamingo: An open- source framework for training large autoregressive vision- language models

Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Worts- man, and Ludwig Schmidt. Openflamingo: An open- source framework for training large autoregressive vision- language models. arXiv.org, 2023. 3, 7

work page 2023
[4]

Qwen-vl: A frontier large vision-language model with versatile abilities

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv.org, 2023. 3, 6, 7, 17

work page 2023
[5]

Baichuan 2: Open large-scale language models

Baichuan. Baichuan 2: Open large-scale language models. arXiv.org, 2023. 2, 3

work page 2023
[6]

Improving image generation with better captions

James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, and Aditya Ramesh. Improving image generation with better captions. 2

work page
[7]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neu- ral Information Processing Systems (NeurIPS) , 33:1877– 1901, 2020. 2, 3

work page 1901
[8]

Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568, 2021. 2, 4

work page 2021
[9]

MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechu Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[10]

Shikra: Unleashing multimodal llm’s referential dialogue magic

Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv.org, 2023. 3, 6, 7

work page 2023
[11]

Pali-x: On scaling up a multilingual vision and language model, 2023

Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, Siamak Shak- eri, Mostafa Dehghani, Daniel Salz, Mario Lucic, Michael Tschannen, Arsha Nagrani, Hexiang Hu, Mandar Joshi, Bo Pang, Ceslee Montgomery, Paulina Pietrzyk, Marvin Ritter, AJ Piergiovanni, Matthias ...

work page 2023
[12]

Lawrence Zitnick

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. Microsoft coco captions: Data collection and eval- uation server, 2015. 4

work page 2015
[13]

Pali-3 vision language models: Smaller, faster, stronger, 2023

Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul V oigtlaender, Basil Mustafa, Sebas- tian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, Daniel Salz, Xi Xiong, Daniel Vlasic, Filip Pavetic, Keran Rong, Tianli Yu, Daniel Keysers, Xiaohua Zhai, and Radu Soricut. Pali-3 vision language models: Smaller, faster, stronger, 2023. 3

work page 2023
[14]

Pali: A jointly-scaled multilingual language- image model, 2023

Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergio- vanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Has- san Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, ...

work page 2023
[15]

Gonzalez, Ion Stoica, and Eric P

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhang- hao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. 2, 3, 4

work page 2023
[16]

Palm: Scaling language modeling with pathways

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv.org, 2022. 2, 3

work page 2022
[17]

Class-balanced loss based on effective number of samples

Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. Class-balanced loss based on effective number of samples. In CVPR, Jun 2019. 2

work page 2019
[18]

Instructblip: Towards general- purpose vision-language models with instruction tuning, 9

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general- purpose vision-language models with instruction tuning, 9

work page
[19]

Visual dialog

Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, Jos ´e MF Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In Proceedings of the IEEE con- ference on computer vision and pattern recognition , pages 326–335, 2017. 4

work page 2017
[20]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages 248–255, 2009. 3

work page 2009
[21]

Bert: Pre-training of deep bidirectional trans- formers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional trans- formers for language understanding. arXiv.org, 2018. 3

work page 2018
[22]

Dreamllm: Synergistic multimodal com- prehension and creation

Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, Xiangwen Kong, Xiangyu Zhang, Kaisheng Ma, and Li Yi. Dreamllm: Synergistic multimodal com- prehension and creation. arXiv preprint arXiv:2309.11499,

work page arXiv
[23]

Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duck- worth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. Palm-e: An embodie...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

Glm: General language model pretraining with autoregressive blank infilling

Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers) , pages 320–335, 2022. 2, 3, 6, 7, 17

work page 2022
[25]

Eva: Exploring the limits of masked visual represen- tation learning at scale

Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva: Exploring the limits of masked visual represen- tation learning at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19358–19369, 2023. 3

work page 2023
[26]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jin- rui Yang, Xiawu Zheng, Ke Li, Xing Sun, and Ron- grong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023. 2, 6

work page internal anchor Pith review Pith/arXiv arXiv 2023
[27]

LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, W. Zhang, Pan Lu, Conghui He, Xi- angyu Yue, Hongsheng Li, and Yu Jiao Qiao. Llama- adapter v2: Parameter-efficient visual instruction model. ArXiv, abs/2304.15010, 2023. 6, 7, 17

work page internal anchor Pith review Pith/arXiv arXiv 2023
[28]

Planting a seed of vision in large language model

Yuying Ge, Yixiao Ge, Ziyun Zeng, Xintao Wang, and Ying Shan. Planting a seed of vision in large language model. 3

work page
[29]

Wukong: A 100 million large-scale chinese cross-modal pre-training benchmark

Jiaxi Gu, Xiaojun Meng, Guansong Lu, Lu Hou, Niu Minzhe, Xiaodan Liang, Lewei Yao, Runhui Huang, Wei Zhang, Xin Jiang, et al. Wukong: A 100 million large-scale chinese cross-modal pre-training benchmark. Advances in Neural Information Processing Systems , 35:26418–26431,

work page
[30]

Wan- juan: A comprehensive multimodal dataset for advancing english and chinese large models

Conghui He, Zhenjiang Jin, Chaoxi Xu, Jiantao Qiu, Bin Wang, Wei Li, Hang Yan, Jiaqi Wang, and Da Lin. Wan- juan: A comprehensive multimodal dataset for advancing english and chinese large models. ArXiv, abs/2308.10755,

work page arXiv
[31]

LoRA: Low-rank adaptation of large language mod- els

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language mod- els. In International Conference on Learning Representa- tions, 2022. 5, 14

work page 2022
[32]

W. Hu, Y . Xu, Y . Li, W. Li, Z. Chen, and Z. Tu. Bliva: A simple multimodal llm for better handling of text-rich visual questions. ArXiv, abs/2308.09936, 2023. 6

work page arXiv 2023
[33]

Reveal: Retrieval-augmented visual- language pre-training with multi-source multimodal knowl- edge memory

Ziniu Hu, Ahmet Iscen, Chen Sun, Zirui Wang, Kai-Wei Chang, Yizhou Sun, Cordelia Schmid, David A Ross, and Alireza Fathi. Reveal: Retrieval-augmented visual- language pre-training with multi-source multimodal knowl- edge memory. In Proceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), pages 23369–23379, 2023. 3

work page 2023
[34]

Gqa: A new dataset for real-world visual reasoning and compositional question answering

Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 4

[35]

Scaling up visual and vision-language representation learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the International Conference on Machine learning (ICML), pages 4904–4916. PMLR, 2021. 3

[36]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023. 3

[37]

Grounding language models to images for multimodal inputs and outputs

Jing Yu Koh, Ruslan Salakhutdinov, and Daniel Fried. Grounding language models to images for multimodal inputs and outputs. 2023. 3

[38]

Openassistant conversations – democratizing large language model alignment

Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi-Rui Tam, Keith Stevens, Abdullah Barhoum, Nguyen Minh Duc, Oliver Stanley, Richárd Nagyfi, Shahul ES, Sameer Suri, David Glushkov, Arnav Dantuluri, Andrew Maguire, Christoph Schuhmann, Huu Nguyen, and Alexander Mattick. Openassistant conversations – democratizing large lang...

[39]

Obelics: An open web-scale filtered dataset of interleaved image-text documents

Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, and Victor Sanh. Obelics: An open web-scale filtered dataset of interleaved image-text documents,

[40]

Seed-bench: Benchmarking multimodal llms with generative comprehension

Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension, 2023. 2, 6

[41]

Otter: A multi-modal model with in-context instruction tuning

Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv.org, 2023. 3, 6, 7, 17

[42]

Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. ArXiv, abs/2301.12597, 2023. 2, 3, 4, 17

[43]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the International Conference on Machine learning (ICML), pages 12888–12900. PMLR, 2022. 3, 7

[44]

Empowering vision-language models to follow interleaved vision-language instructions

Juncheng Li, Kaihang Pan, Zhiqi Ge, Minghe Gao, Hanwang Zhang, Wei Ji, Wenqiao Zhang, Tat-Seng Chua, Siliang Tang, and Yueting Zhuang. Empowering vision-language models to follow interleaved vision-language instructions. ArXiv, abs/2308.04152, 2023. 6

[45]

Grounded language-image pre-training

Liunian Harold Li*, Pengchuan Zhang*, Haotian Zhang*, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, and Jianfeng Gao. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 3

[46]

Lmeye: An interactive perception network for large language models

Yunxin Li, Baotian Hu, Xinyu Chen, Lin Ma, Yong Xu, and Min Zhang. Lmeye: An interactive perception network for large language models, 2023. 6

[47]

Visual spatial reasoning

Fangyu Liu, Guy Edward Toh Emerson, and Nigel Collier. Visual spatial reasoning. Transactions of the Association for Computational Linguistics, 2023. 4

[48]

Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023. 4

[49]

Improved Baselines with Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023. 6, 7, 17

[50]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv.org, 2023. 2, 3, 4, 6, 7, 17

[51]

Grounding dino: Marrying dino with grounded pre-training for open-set object detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv.org, 2023. 3

[52]

MMBench: Is Your Multi-modal Model an All-around Player?

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mmbench: Is your multi-modal model an all-around player? arXiv:2307.06281, 2023. 2, 6, 7

[53]

Taisu: A 166m large-scale high-quality dataset for chinese vision-language pre-training

Yulong Liu, Guibo Zhu, Bin Zhu, Qi Song, Guojing Ge, Haoran Chen, GuanHui Qiao, Ru Peng, Lingxiang Wu, and Jinqiao Wang. Taisu: A 166m large-scale high-quality dataset for chinese vision-language pre-training. Advances in Neural Information Processing Systems, 35:16705–16717, 2022. 4

[54]

Ziwei Liu, Zhongqi Miao, Xiaohang Zhan, Jiayun Wang, Boqing Gong, and Stella X. Yu. Large-scale long-tailed recognition in an open world. In CVPR, Jun 2019. 2

[55]

Learn to explain: Multimodal reasoning via thought chains for science question answering

Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022. 4

[56]

Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning

Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning. arXiv preprint arXiv:2110.13214, 2021. 4

[57]

Ok-vqa: A visual question answering benchmark requiring external knowledge

Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3195–3204, 2019. 4

[58]

Ocr-vqa: Visual question answering by reading text in images

Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In ICDAR, 2019. 4

[59]

Power laws, pareto distributions and zipf’s law

MEJ Newman. Power laws, pareto distributions and zipf’s law. Contemporary Physics, page 323–351, Sep 2005. 2

[60]

OpenAI. Chatgpt. https://openai.com/blog/chatgpt, 2022. 2, 3, 8

[61]

Gpt-4 technical report

OpenAI. Gpt-4 technical report, 2023. 2, 8

[62]

Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. Im2text: Describing images using 1 million captioned photographs. In Neural Information Processing Systems (NIPS), 2011. 2, 4

[63]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems (NeurIPS), 35:27730–27744, 2022. 3

[64]

The refinedweb dataset for falcon llm: Outperforming curated corpora with web data, and web data only

Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The refinedweb dataset for falcon llm: Outperforming curated corpora with web data, and web data only, 2023. 3

[65]

Kosmos-2: Grounding multimodal large language models to the world

Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv.org,

[66]

Introducing qwen-7b: Open foundation and human-aligned models (of the state-of-the-arts)

Qwen. Introducing qwen-7b: Open foundation and human-aligned models (of the state-of-the-arts), 2023. 2, 3, 8

[67]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine learning (ICML), pages 8748–8763. PMLR, 2021. 2, 3, 15

[68]

Improving language understanding by generative pre-training

Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018. 3

[69]

Exploring the limits of transfer learning with a unified text-to-text transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research (JMLR), 21(1):5485–5551, 2020. 2, 3

[70]

Hierarchical text-conditional image generation with clip latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. 2

[71]

Zero-shot text-to-image generation

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. ICML, Jul 2021

[72]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, June 2022

[73]

Photorealistic text-to-image diffusion models with deep language understanding

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar, Seyed Ghasemipour, Burcu Karagol, S Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. 2

[74]

Laion-5b: An open large-scale dataset for training next generation image-text models

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022. 4

[75]

LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021. 2, 4

[76]

A-okvqa: A benchmark for visual question answering using world knowledge

Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. In European Conference on Computer Vision, pages 146–162. Springer, 2022. 4

[77]

Tiny lvlm-ehub: Early multimodal experiments with bard

Wenqi Shao, Yutao Hu, Peng Gao, Meng Lei, Kaipeng Zhang, Fanqing Meng, Peng Xu, Siyuan Huang, Hongsheng Li, Yu Qiao, et al. Tiny lvlm-ehub: Early multimodal experiments with bard. arXiv preprint arXiv:2308.03729,

[78]

Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018. 2, 4

[79]

Textcaps: a dataset for image captioning with reading comprehension

Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image captioning with reading comprehension. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 742–758. Springer, 2020. 4

[80]

Towards vqa models that can read

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8317–8326, 2019. 4

