pith. machine review for the scientific record. sign in

arxiv: 2309.15112 · v5 · pith:OGXGRTKCnew · submitted 2023-09-26 · 💻 cs.CV

InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition

Pith reviewed 2026-05-17 13:42 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision-language modeltext-image compositioninterleaved generationmultimodal comprehensionmultilingual datasetbenchmark performanceGPT-4V evaluationcontent generation
0
0 comments X

The pith

InternLM-XComposer generates articles with automatically inserted context-appropriate images while achieving state-of-the-art results on vision-language benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces InternLM-XComposer as a vision-language large model capable of both comprehending and composing interleaved text and images. It takes a writing instruction and produces a full manuscript, intelligently placing images where they enhance the narrative. Training on a large multilingual multimodal dataset with crafted strategies enables rich visual understanding across languages. The model reports leading performance on benchmarks such as MME, MMBench, and others, with a custom human and GPT-4V evaluation showing competitive composition quality against GPT-4V and GPT-3.5. This combination points toward more immersive and natural vision-language interactions.

Core claim

InternLM-XComposer is a vision-language large model that enables advanced image-text comprehension and composition through interleaved text-image generation. Given a writing instruction, it creates coherent articles by identifying suitable locations for images and inserting the most appropriate visual candidates. This is supported by training on an extensive multi-modal multilingual database using carefully crafted strategies, resulting in deep understanding of visual content. The model achieves state-of-the-art results on mainstream benchmarks including the MME Benchmark, MMBench, MMBench-CN, Seed-Bench, CCBench, QBench, and Tiny LVLM. For text-image composition, a custom evaluation using

What carries the argument

The interleaved text-image composition capability, which automatically identifies enhancement points in text and inserts fitting images based on the writing prompt.

If this is right

  • Simple text prompts can yield complete, visually enriched articles without separate image sourcing.
  • Multilingual training supports comprehension and composition across different languages and cultural contexts.
  • Top benchmark scores indicate strong foundational vision-language abilities that underpin the composition feature.
  • The custom evaluation framework provides a way to assess composition quality where standard metrics are lacking.
  • Public release of the model series opens opportunities for further development in multimodal content creation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such interleaved generation could extend to dynamic web content or personalized learning materials where visuals adapt to text.
  • Combining comprehension and composition in one model may lead to more interactive AI assistants that can both analyze and create visual stories.
  • Testing the model on real-world creative tasks like journalism or marketing copy could reveal practical utility beyond benchmarks.
  • The approach of using GPT-4V in evaluation might inspire similar hybrid human-AI assessment methods for other generative tasks.

Load-bearing premise

The custom human-plus-GPT-4V evaluation procedure reliably measures the quality of text-image compositions produced by the model.

What would settle it

A large-scale blind comparison study where independent evaluators rate randomly presented articles from InternLM-XComposer, GPT-4V, and GPT-3.5, finding that InternLM-XComposer scores substantially below the others on coherence and visual relevance.

read the original abstract

We propose InternLM-XComposer, a vision-language large model that enables advanced image-text comprehension and composition. The innovative nature of our model is highlighted by three appealing properties: 1) Interleaved Text-Image Composition: InternLM-XComposer can effortlessly generate coherent and contextual articles that seamlessly integrate images, providing a more engaging and immersive reading experience. Simply provide a writing instruction, and our system will generate the corresponding manuscript. It can intelligently identify the areas in the text where images would enhance the content and automatically insert the most appropriate visual candidates. 2) Comprehension with Rich Multilingual Knowledge: The text-image comprehension is empowered by training on an extensive multi-modal multilingual database with carefully crafted strategies, resulting in a deep understanding of visual content. 3) State-of-the-art Performance: Our model consistently achieves state-of-the-art results across various mainstream benchmarks for vision-language foundational models, including MME Benchmark, MMBench, MMBench-CN, Seed-Bench, CCBench (Chinese Cultural Benchmark), QBench and Tiny LVLM. Owing to the absence of established metrics for quantitatively assessing text-image composition, we have devised a robust evaluation procedure that comprises both human and GPT4-Vision (GPT4-V) to ensure reliability. Notably, our InternLM-XComposer achieves competitive text-image composition scores compared to public solutions, including GPT4-V and GPT3.5. Collectively, InternLM-XComposer seamlessly blends advanced text-image comprehension and composition, revolutionizing vision-language interaction and offering new insights and opportunities. The InternLM-XComposer model series are publicly available at https://github.com/InternLM/InternLM-XComposer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes InternLM-XComposer, a vision-language large model for advanced text-image comprehension and composition. It emphasizes three key properties: the ability to generate coherent interleaved text-image articles from writing instructions by automatically inserting appropriate images, comprehension powered by extensive multilingual multimodal training data with crafted strategies, and state-of-the-art performance on benchmarks including MME, MMBench, MMBench-CN, Seed-Bench, CCBench, QBench, and Tiny LVLM. A custom evaluation procedure involving both human and GPT-4V judges is introduced for assessing text-image composition, where the model achieves competitive scores against GPT-4V and GPT-3.5. The model series is publicly released on GitHub.

Significance. If the empirical results and custom evaluation hold after additional validation, this contributes to multimodal models by demonstrating practical interleaved text-image generation alongside strong comprehension, with the open release enabling community use and extension. The SOTA benchmark claims, if supported by ablations, could inform training strategies for vision-language foundational models.

major comments (2)
  1. [§4 (Evaluation)] §4 (Evaluation): The custom human-plus-GPT-4V procedure for text-image composition lacks reported inter-rater agreement, explicit judging rubric, sample size, or correlation with external signals such as downstream task performance. This is load-bearing for the highlighted 'advanced interleaved text-image composition' property, as the SOTA comprehension results on MME/MMBench can be audited independently while composition scores depend on this unvalidated procedure.
  2. [§3 (Training)] §3 (Training): The multimodal data mixture weights, exact data sources, training scale, and ablation studies are not detailed beyond high-level descriptions of 'carefully crafted strategies.' This undermines attribution of the reported benchmark gains and the claim of 'deep understanding of visual content' to the proposed approach rather than post-hoc choices.
minor comments (2)
  1. [Abstract] Abstract and §1: The phrase 'robust evaluation procedure' is used without forward reference to the specific controls or metrics; adding a sentence linking to the evaluation section would improve clarity.
  2. [§4] Tables in §4: Benchmark results would benefit from error bars or statistical tests to substantiate 'state-of-the-art' claims across the listed datasets.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below and indicate planned revisions to the manuscript.

read point-by-point responses
  1. Referee: [§4 (Evaluation)] §4 (Evaluation): The custom human-plus-GPT-4V procedure for text-image composition lacks reported inter-rater agreement, explicit judging rubric, sample size, or correlation with external signals such as downstream task performance. This is load-bearing for the highlighted 'advanced interleaved text-image composition' property, as the SOTA comprehension results on MME/MMBench can be audited independently while composition scores depend on this unvalidated procedure.

    Authors: We agree that additional details would strengthen the validation of our custom evaluation procedure. In the revised manuscript we will report inter-rater agreement statistics, provide the explicit judging rubric, state the sample size for both human and GPT-4V evaluations, and include any observed correlations with external signals or downstream performance. These additions will improve transparency and support for the text-image composition claims. revision: yes

  2. Referee: [§3 (Training)] §3 (Training): The multimodal data mixture weights, exact data sources, training scale, and ablation studies are not detailed beyond high-level descriptions of 'carefully crafted strategies.' This undermines attribution of the reported benchmark gains and the claim of 'deep understanding of visual content' to the proposed approach rather than post-hoc choices.

    Authors: We recognize that greater detail on training data and procedures would help attribute performance gains. In revision we will expand descriptions of the data curation strategies, training scale, and any ablation studies that were conducted. Exact mixture weights and certain data sources remain constrained by scale and licensing considerations, so we will focus on the reproducible high-level strategies and publicly released model components. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical model with independent benchmark results

full rationale

The paper describes training and evaluating a vision-language model on standard benchmarks (MME, MMBench, etc.) plus a custom human/GPT-4V procedure for text-image composition. No mathematical derivation chain, equations, or first-principles claims exist that reduce to fitted parameters or self-citations by construction. The custom evaluation is explicitly motivated by the absence of established metrics and is presented as a practical assessment tool rather than a derived result. All performance claims rest on externally auditable benchmark scores and model outputs, making the work self-contained against independent verification.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The work relies on standard assumptions of large-scale multimodal pretraining and fine-tuning; no new physical or mathematical axioms are introduced. Free parameters include model size, learning rate schedules, and data weighting strategies that are typical for such systems but not enumerated in the abstract.

free parameters (1)
  • multimodal data mixture weights
    Carefully crafted strategies for training on extensive multi-modal multilingual database imply hand-tuned or searched weights that affect comprehension performance.
axioms (1)
  • domain assumption Large-scale multimodal training on curated data yields deep visual understanding
    Invoked when stating that training results in a deep understanding of visual content.

pith-pipeline@v0.9.0 · 5673 in / 1334 out tokens · 89293 ms · 2026-05-17T13:42:12.353642+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

    cs.CV 2024-08 conditional novelty 8.0

    MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.

  2. SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition

    cs.AI 2025-11 unverdicted novelty 7.0

    SpatialBench creates a five-level framework and 15-task benchmark to measure hierarchical spatial reasoning in MLLMs, finding strong basic perception but weak symbolic reasoning, causal inference, and planning.

  3. UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing

    cs.CV 2026-04 unverdicted novelty 6.0

    UHR-BAT is a budget-aware framework that uses text-guided multi-scale importance estimation plus region-wise preserve and merge strategies to compress visual tokens in ultra-high-resolution remote sensing vision-langu...

  4. CFMS: A Coarse-to-Fine Multimodal Synthesis Framework for Enhanced Tabular Reasoning

    cs.AI 2026-04 unverdicted novelty 6.0

    CFMS is a coarse-to-fine framework that uses MLLMs to create a multi-perspective knowledge tuple as a reasoning map for symbolic table operations, yielding competitive accuracy on WikiTQ and TabFact.

  5. Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

    cs.CV 2025-01 conditional novelty 6.0

    Sa2VA unifies SAM-2 segmentation with MLLM reasoning into a single model for referring segmentation and conversation on images and videos, supported by a new 72k-expression Ref-SAV dataset.

  6. SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

    cs.CV 2024-04 unverdicted novelty 6.0

    SEED-X is a unified multimodal foundation model that handles multi-granularity visual semantics for both comprehension and generation across arbitrary image sizes and ratios.

  7. Are We on the Right Way for Evaluating Large Vision-Language Models?

    cs.CV 2024-03 conditional novelty 6.0

    Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...

  8. MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

    cs.CV 2024-01 conditional novelty 6.0

    MoE-LLaVA applies mixture-of-experts sparsity to LVLMs via MoE-Tuning, delivering LLaVA-1.5-7B level visual understanding and better hallucination resistance with only ~3B active parameters.

  9. SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

    cs.HC 2024-01 unverdicted novelty 6.0

    SeeClick improves visual GUI agents via GUI grounding pre-training on automatically curated data and introduces the ScreenSpot benchmark, with results indicating that stronger grounding boosts downstream task performance.

  10. ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

    cs.CV 2023-11 conditional novelty 6.0

    A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.

  11. MMBench: Is Your Multi-modal Model an All-around Player?

    cs.CV 2023-07 accept novelty 6.0

    MMBench is a new bilingual benchmark that uses curated questions, CircularEval, and LLM-assisted answer conversion to provide objective, fine-grained evaluation of vision-language models.

  12. Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

    cs.CL 2025-03 unverdicted novelty 5.0

    Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.

  13. InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

    cs.CV 2024-07 conditional novelty 5.0

    InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.

  14. Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

    cs.CV 2024-03 unverdicted novelty 5.0

    Mini-Gemini enhances VLMs via high-resolution visual refinement, curated reasoning data, and self-guided generation to reach leading zero-shot benchmark results across 2B-34B LLMs.

  15. InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

    cs.CV 2024-01 unverdicted novelty 5.0

    InternLM-XComposer2 introduces Partial LoRA on InternLM2-7B to enable high-quality free-form text-image composition while matching or exceeding GPT-4V on select vision-language benchmarks.

  16. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    cs.CV 2023-12 unverdicted novelty 5.0

    InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

  17. VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

    cs.CV 2025-01 conditional novelty 4.0

    VITA-1.5 integrates vision and speech into a single LLM through multi-stage training, delivering competitive benchmark results on image, video, and speech tasks with near real-time response speed.

Reference graph

Works this paper leans on

119 extracted references · 119 canonical work pages · cited by 17 Pith papers · 10 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning,

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, An- toine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Se- bastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sa- hand Sharifzadeh, Mikolaj Bink...

  2. [2]

    Lawrence Zitnick, and Devi Parikh

    Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In International Conference on Computer Vision (ICCV), 2015. 4

  3. [3]

    Openflamingo: An open- source framework for training large autoregressive vision- language models

    Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Worts- man, and Ludwig Schmidt. Openflamingo: An open- source framework for training large autoregressive vision- language models. arXiv.org, 2023. 3, 7

  4. [4]

    Qwen-vl: A frontier large vision-language model with versatile abilities

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv.org, 2023. 3, 6, 7, 17

  5. [5]

    Baichuan 2: Open large-scale language models

    Baichuan. Baichuan 2: Open large-scale language models. arXiv.org, 2023. 2, 3

  6. [6]

    Improving image generation with better captions

    James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, and Aditya Ramesh. Improving image generation with better captions. 2

  7. [7]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neu- ral Information Processing Systems (NeurIPS) , 33:1877– 1901, 2020. 2, 3

  8. [8]

    Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

    Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568, 2021. 2, 4

  9. [9]

    MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

    Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechu Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478, 2023. 3

  10. [10]

    Shikra: Unleashing multimodal llm’s referential dialogue magic

    Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv.org, 2023. 3, 6, 7

  11. [11]

    Pali-x: On scaling up a multilingual vision and language model, 2023

    Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, Siamak Shak- eri, Mostafa Dehghani, Daniel Salz, Mario Lucic, Michael Tschannen, Arsha Nagrani, Hexiang Hu, Mandar Joshi, Bo Pang, Ceslee Montgomery, Paulina Pietrzyk, Marvin Ritter, AJ Piergiovanni, Matthias ...

  12. [12]

    Lawrence Zitnick

    Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. Microsoft coco captions: Data collection and eval- uation server, 2015. 4

  13. [13]

    Pali-3 vision language models: Smaller, faster, stronger, 2023

    Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul V oigtlaender, Basil Mustafa, Sebas- tian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, Daniel Salz, Xi Xiong, Daniel Vlasic, Filip Pavetic, Keran Rong, Tianli Yu, Daniel Keysers, Xiaohua Zhai, and Radu Soricut. Pali-3 vision language models: Smaller, faster, stronger, 2023. 3

  14. [14]

    Pali: A jointly-scaled multilingual language- image model, 2023

    Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergio- vanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Has- san Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, ...

  15. [15]

    Gonzalez, Ion Stoica, and Eric P

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhang- hao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. 2, 3, 4

  16. [16]

    Palm: Scaling language modeling with pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv.org, 2022. 2, 3

  17. [17]

    Class-balanced loss based on effective number of samples

    Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. Class-balanced loss based on effective number of samples. In CVPR, Jun 2019. 2

  18. [18]

    Instructblip: Towards general- purpose vision-language models with instruction tuning, 9

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general- purpose vision-language models with instruction tuning, 9

  19. [19]

    Visual dialog

    Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, Jos ´e MF Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In Proceedings of the IEEE con- ference on computer vision and pattern recognition , pages 326–335, 2017. 4

  20. [20]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages 248–255, 2009. 3

  21. [21]

    Bert: Pre-training of deep bidirectional trans- formers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional trans- formers for language understanding. arXiv.org, 2018. 3

  22. [22]

    Dreamllm: Synergistic multimodal com- prehension and creation

    Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, Xiangwen Kong, Xiangyu Zhang, Kaisheng Ma, and Li Yi. Dreamllm: Synergistic multimodal com- prehension and creation. arXiv preprint arXiv:2309.11499,

  23. [23]

    Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duck- worth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. Palm-e: An embodie...

  24. [24]

    Glm: General language model pretraining with autoregressive blank infilling

    Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers) , pages 320–335, 2022. 2, 3, 6, 7, 17

  25. [25]

    Eva: Exploring the limits of masked visual represen- tation learning at scale

    Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva: Exploring the limits of masked visual represen- tation learning at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19358–19369, 2023. 3

  26. [26]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jin- rui Yang, Xiawu Zheng, Ke Li, Xing Sun, and Ron- grong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023. 2, 6

  27. [27]

    LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

    Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, W. Zhang, Pan Lu, Conghui He, Xi- angyu Yue, Hongsheng Li, and Yu Jiao Qiao. Llama- adapter v2: Parameter-efficient visual instruction model. ArXiv, abs/2304.15010, 2023. 6, 7, 17

  28. [28]

    Planting a seed of vision in large language model

    Yuying Ge, Yixiao Ge, Ziyun Zeng, Xintao Wang, and Ying Shan. Planting a seed of vision in large language model. 3

  29. [29]

    Wukong: A 100 million large-scale chinese cross-modal pre-training benchmark

    Jiaxi Gu, Xiaojun Meng, Guansong Lu, Lu Hou, Niu Minzhe, Xiaodan Liang, Lewei Yao, Runhui Huang, Wei Zhang, Xin Jiang, et al. Wukong: A 100 million large-scale chinese cross-modal pre-training benchmark. Advances in Neural Information Processing Systems , 35:26418–26431,

  30. [30]

    Wan- juan: A comprehensive multimodal dataset for advancing english and chinese large models

    Conghui He, Zhenjiang Jin, Chaoxi Xu, Jiantao Qiu, Bin Wang, Wei Li, Hang Yan, Jiaqi Wang, and Da Lin. Wan- juan: A comprehensive multimodal dataset for advancing english and chinese large models. ArXiv, abs/2308.10755,

  31. [31]

    LoRA: Low-rank adaptation of large language mod- els

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language mod- els. In International Conference on Learning Representa- tions, 2022. 5, 14

  32. [32]

    W. Hu, Y . Xu, Y . Li, W. Li, Z. Chen, and Z. Tu. Bliva: A simple multimodal llm for better handling of text-rich visual questions. ArXiv, abs/2308.09936, 2023. 6

  33. [33]

    Reveal: Retrieval-augmented visual- language pre-training with multi-source multimodal knowl- edge memory

    Ziniu Hu, Ahmet Iscen, Chen Sun, Zirui Wang, Kai-Wei Chang, Yizhou Sun, Cordelia Schmid, David A Ross, and Alireza Fathi. Reveal: Retrieval-augmented visual- language pre-training with multi-source multimodal knowl- edge memory. In Proceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), pages 23369–23379, 2023. 3

  34. [34]

    Gqa: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 4

  35. [35]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In Pro- ceedings of the International Conference on Machine learn- ing (ICML), pages 4904–4916. PMLR, 2021. 3

  36. [36]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lam- ple, Lucile Saulnier, L ´elio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timoth ´ee Lacroix, and William El Sayed. Mistral 7b, 2023. 3

  37. [37]

    Grounding language models to images for multimodal in- puts and outputs

    Jing Yu Koh, Ruslan Salakhutdinov, and Daniel Fried. Grounding language models to images for multimodal in- puts and outputs. 2023. 3

  38. [38]

    Openassistant conver- sations – democratizing large language model alignment,

    Andreas K ¨opf, Yannic Kilcher, Dimitri von R ¨utte, Sotiris Anagnostidis, Zhi-Rui Tam, Keith Stevens, Abdullah Barhoum, Nguyen Minh Duc, Oliver Stanley, Rich ´ard Nagyfi, Shahul ES, Sameer Suri, David Glushkov, Arnav Dantuluri, Andrew Maguire, Christoph Schuhmann, Huu Nguyen, and Alexander Mattick. Openassistant conver- sations – democratizing large lang...

  39. [39]

    Rush, Douwe Kiela, Matthieu Cord, and Victor Sanh

    Hugo Laurenc ¸on, Lucile Saulnier, L´eo Tronchon, Stas Bek- man, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, and Victor Sanh. Obelics: An open web- scale filtered dataset of interleaved image-text documents,

  40. [40]

    Seed-bench: Benchmarking multi- modal llms with generative comprehension, 2023

    Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yix- iao Ge, and Ying Shan. Seed-bench: Benchmarking multi- modal llms with generative comprehension, 2023. 2, 6

  41. [41]

    Otter: A multi-modal model with in-context instruction tuning

    Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, 10 Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv.org, 2023. 3, 6, 7, 17

  42. [42]

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. ArXiv, abs/2301.12597, 2023. 2, 3, 4, 17

  43. [43]

    Blip: Bootstrapping language-image pre-training for uni- fied vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for uni- fied vision-language understanding and generation. In Pro- ceedings of the International Conference on Machine learn- ing (ICML), pages 12888–12900. PMLR, 2022. 3, 7

  44. [44]

    Empowering vision- language models to follow interleaved vision-language in- structions

    Juncheng Li, Kaihang Pan, Zhiqi Ge, Minghe Gao, Han- wang Zhang, Wei Ji, Wenqiao Zhang, Tat-Seng Chua, Siliang Tang, and Yueting Zhuang. Empowering vision- language models to follow interleaved vision-language in- structions. ArXiv, abs/2308.04152, 2023. 6

  45. [45]

    Grounded language-image pre-training

    Liunian Harold Li*, Pengchuan Zhang*, Haotian Zhang*, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, and Jianfeng Gao. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 3

  46. [46]

    Lmeye: An interactive perception network for large language models, 2023

    Yunxin Li, Baotian Hu, Xinyu Chen, Lin Ma, Yong Xu, and Min Zhang. Lmeye: An interactive perception network for large language models, 2023. 6

  47. [47]

    Visual spatial reasoning

    Fangyu Liu, Guy Edward Toh Emerson, and Nigel Collier. Visual spatial reasoning. Transactions of the Association for Computational Linguistics, 2023. 4

  48. [48]

    Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

    Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023. 4

  49. [49]

    Improved Baselines with Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023. 6, 7, 17

  50. [50]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv.org, 2023. 2, 3, 4, 6, 7, 17

  51. [51]

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv.org, 2023. 3

  52. [52]

    MMBench: Is Your Multi-modal Model an All-around Player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhnag, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mm- bench: Is your multi-modal model an all-around player? arXiv:2307.06281, 2023. 2, 6, 7

  53. [53]

    Taisu: A 166m large-scale high-quality dataset for chinese vision-language pre-training

    Yulong Liu, Guibo Zhu, Bin Zhu, Qi Song, Guojing Ge, Haoran Chen, GuanHui Qiao, Ru Peng, Lingxiang Wu, and Jinqiao Wang. Taisu: A 166m large-scale high-quality dataset for chinese vision-language pre-training. Advances in Neural Information Processing Systems , 35:16705– 16717, 2022. 4

  54. [54]

    Ziwei Liu, Zhongqi Miao, Xiaohang Zhan, Jiayun Wang, Boqing Gong, and Stella X. Yu. Large-scale long-tailed recognition in an open world. In CVPR, Jun 2019. 2

  55. [55]

    Learn to explain: Multimodal rea- soning via thought chains for science question answer- ing

    Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai- Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal rea- soning via thought chains for science question answer- ing. Advances in Neural Information Processing Systems , 35:2507–2521, 2022. 4

  56. [56]

    Iconqa: A new benchmark for abstract diagram under- standing and visual language reasoning

    Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. Iconqa: A new benchmark for abstract diagram under- standing and visual language reasoning. arXiv preprint arXiv:2110.13214, 2021. 4

  57. [57]

    Ok-vqa: A visual question answering benchmark requiring external knowledge

    Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition, pages 3195–3204, 2019. 4

  58. [58]

    Ocr-vqa: Visual question answering by reading text in images

    Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In ICDAR, 2019. 4

  59. [59]

    Power laws, pareto distributions and zipf’s law

    MEJ Newman. Power laws, pareto distributions and zipf’s law. Contemporary Physics, page 323–351, Sep 2005. 2

  60. [60]

    OpenAI. Chatgpt. https://openai.com/blog/ chatgpt, 2022. 2, 3, 8

  61. [61]

    Gpt-4 technical report, 2023

    OpenAI. Gpt-4 technical report, 2023. 2, 8

  62. [62]

    Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. Im2text: Describing images using 1 million captioned photographs. In Neural Information Processing Systems (NIPS), 2011. 2, 4

  63. [63]

    Training language models to follow instructions with human feed- back

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Car- roll Wainwright, Pamela Mishkin, Chong Zhang, Sand- hini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feed- back. Advances in Neural Information Processing Systems (NeurIPS), 35:27730–27744, 2022. 3

  64. [64]

    The refinedweb dataset for falcon llm: Outperforming curated corpora with web data, and web data only, 2023

    Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobei- dli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Lau- nay. The refinedweb dataset for falcon llm: Outperforming curated corpora with web data, and web data only, 2023. 3

  65. [65]

    Kosmos-2: Grounding multimodal large language models to the world

    Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv.org,

  66. [66]

    Introducing qwen-7b: Open foundation and human- aligned models (of the state-of-the-arts), 2023

    Qwen. Introducing qwen-7b: Open foundation and human- aligned models (of the state-of-the-arts), 2023. 2, 3, 8

  67. [67]

    Learn- ing transferable visual models from natural language super- vision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. In Proceedings of the International Conference on Machine learning (ICML), pages 8748–8763. PMLR, 2021. 2, 3, 15

  68. [68]

    Improving language understanding by gen- erative pre-training

    Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by gen- erative pre-training. 2018. 3

  69. [69]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research (JMLR), 21(1):5485–5551, 2020. 2, 3 11

  70. [70]

    Hierarchical text-conditional image generation with clip latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. 2

  71. [71]

    Zero-shot text-to-image generation

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea V oss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. ICML, Jul 2021

  72. [72]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, June 2022

  73. [73]

    Photorealistic text-to-image diffusion models with deep language understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar, Seyed Ghasemipour, Burcu Karagol, SSara Mahdavi, RaphaGon- tijo Lopes, Tim Salimans, Jonathan Ho, DavidJ Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. 2

  74. [74]

    Laion-5b: An open large-scale dataset for train- ing next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Worts- man, et al. Laion-5b: An open large-scale dataset for train- ing next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022. 4

  75. [75]

    LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

    Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion- 400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021. 2, 4

  76. [76]

    A-okvqa: A benchmark for visual question answering using world knowledge

    Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. In European Conference on Computer Vision , pages 146–162. Springer, 2022. 4

  77. [77]

    Tiny lvlm-ehub: Early multimodal experiments with bard

    Wenqi Shao, Yutao Hu, Peng Gao, Meng Lei, Kaipeng Zhang, Fanqing Meng, Peng Xu, Siyuan Huang, Hong- sheng Li, Yu Qiao, et al. Tiny lvlm-ehub: Early multimodal experiments with bard. arXiv preprint arXiv:2308.03729 ,

  78. [78]

    Conceptual captions: A cleaned, hypernymed, im- age alt-text dataset for automatic image captioning

    Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, im- age alt-text dataset for automatic image captioning. In Pro- ceedings of the 56th Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 2556–2565, 2018. 2, 4

  79. [79]

    Textcaps: a dataset for image caption- ing with reading comprehension

    Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image caption- ing with reading comprehension. In Computer Vision– ECCV 2020: 16th European Conference, Glasgow, UK, Au- gust 23–28, 2020, Proceedings, Part II 16 , pages 742–758. Springer, 2020. 4

  80. [80]

    Towards vqa models that can read

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019. 4

Showing first 80 references.