InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition
Pith reviewed 2026-05-17 13:42 UTC · model grok-4.3
The pith
InternLM-XComposer generates articles with automatically inserted context-appropriate images while achieving state-of-the-art results on vision-language benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
InternLM-XComposer is a vision-language large model that enables advanced image-text comprehension and composition through interleaved text-image generation. Given a writing instruction, it creates coherent articles by identifying suitable locations for images and inserting the most appropriate visual candidates. This is supported by training on an extensive multi-modal multilingual database using carefully crafted strategies, resulting in deep understanding of visual content. The model achieves state-of-the-art results on mainstream benchmarks including the MME Benchmark, MMBench, MMBench-CN, Seed-Bench, CCBench, QBench, and Tiny LVLM. For text-image composition, a custom evaluation using
What carries the argument
The interleaved text-image composition capability, which automatically identifies enhancement points in text and inserts fitting images based on the writing prompt.
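As a rough illustration of what "inserting fitting images" could look like in practice, the sketch below scores each candidate image against each paragraph with cosine similarity and places the best match only when it clears a threshold. The toy embeddings, the `select_images` helper, the file names, and the 0.5 threshold are all hypothetical; this is not the paper's selection mechanism, which this page does not describe.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def select_images(paragraph_vecs, image_vecs, threshold=0.5):
    """Return {paragraph_index: image_name} for paragraphs worth illustrating.

    A paragraph gets an image only if its best candidate scores above
    `threshold` (an assumed cutoff, chosen here for illustration).
    """
    placements = {}
    for i, p in enumerate(paragraph_vecs):
        best_name, best_score = None, threshold
        for name, v in image_vecs.items():
            score = cosine(p, v)
            if score > best_score:
                best_name, best_score = name, score
        if best_name is not None:
            placements[i] = best_name
    return placements

# Toy 2-D embeddings standing in for real text and image features.
paragraphs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.2]]
images = {"harbor.jpg": [0.9, 0.1], "market.jpg": [0.1, 0.9]}
placements = select_images(paragraphs, images)
print(placements)
```

In this toy run the third paragraph receives no image because neither candidate is similar enough, which mirrors the idea of identifying only the points where an image would help.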
If this is right
- Simple text prompts can yield complete, visually enriched articles without separate image sourcing.
- Multilingual training supports comprehension and composition across different languages and cultural contexts.
- Top benchmark scores indicate strong foundational vision-language abilities that underpin the composition feature.
- The custom evaluation framework provides a way to assess composition quality where standard metrics are lacking.
- Public release of the model series opens opportunities for further development in multimodal content creation.
Where Pith is reading between the lines
- Such interleaved generation could extend to dynamic web content or personalized learning materials where visuals adapt to text.
- Combining comprehension and composition in one model may lead to more interactive AI assistants that can both analyze and create visual stories.
- Testing the model on real-world creative tasks like journalism or marketing copy could reveal practical utility beyond benchmarks.
- The approach of using GPT-4V in evaluation might inspire similar hybrid human-AI assessment methods for other generative tasks.
Load-bearing premise
The custom human-plus-GPT-4V evaluation procedure reliably measures the quality of text-image compositions produced by the model.
What would settle it
A large-scale blind comparison in which independent evaluators rate randomly ordered articles from InternLM-XComposer, GPT-4V, and GPT-3.5. If InternLM-XComposer scored substantially below the others on coherence and visual relevance, the claim of competitive composition quality would be refuted.
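One way such a blind comparison could be prepared is sketched below: articles are shuffled and given neutral identifiers, and the mapping back to the source model is kept in a separate key that raters never see. The `blind` helper, the identifier format, and the placeholder article texts are assumptions made for illustration only.

```python
import random

def blind(articles, seed=0):
    """Shuffle (model_name, text) pairs and hide the model names.

    Returns a rating sheet (ids and text only) and a separate key
    mapping each id back to the model that produced the text.
    """
    rng = random.Random(seed)  # fixed seed so the shuffle is reproducible
    order = list(range(len(articles)))
    rng.shuffle(order)
    sheet, key = [], {}
    for blind_id, idx in enumerate(order):
        model, text = articles[idx]
        item_id = f"item-{blind_id:03d}"
        sheet.append({"id": item_id, "text": text})
        key[item_id] = model
    return sheet, key

# Placeholder articles; real studies would use full generated outputs.
articles = [
    ("InternLM-XComposer", "article A"),
    ("GPT-4V", "article B"),
    ("GPT-3.5", "article C"),
]
sheet, key = blind(articles)
```

Raters would receive only `sheet`; scores are joined to `key` after rating is complete.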
Original abstract
We propose InternLM-XComposer, a vision-language large model that enables advanced image-text comprehension and composition. The innovative nature of our model is highlighted by three appealing properties: 1) Interleaved Text-Image Composition: InternLM-XComposer can effortlessly generate coherent and contextual articles that seamlessly integrate images, providing a more engaging and immersive reading experience. Simply provide a writing instruction, and our system will generate the corresponding manuscript. It can intelligently identify the areas in the text where images would enhance the content and automatically insert the most appropriate visual candidates. 2) Comprehension with Rich Multilingual Knowledge: The text-image comprehension is empowered by training on an extensive multi-modal multilingual database with carefully crafted strategies, resulting in a deep understanding of visual content. 3) State-of-the-art Performance: Our model consistently achieves state-of-the-art results across various mainstream benchmarks for vision-language foundational models, including MME Benchmark, MMBench, MMBench-CN, Seed-Bench, CCBench (Chinese Cultural Benchmark), QBench and Tiny LVLM. Owing to the absence of established metrics for quantitatively assessing text-image composition, we have devised a robust evaluation procedure that comprises both human and GPT4-Vision (GPT4-V) to ensure reliability. Notably, our InternLM-XComposer achieves competitive text-image composition scores compared to public solutions, including GPT4-V and GPT3.5. Collectively, InternLM-XComposer seamlessly blends advanced text-image comprehension and composition, revolutionizing vision-language interaction and offering new insights and opportunities. The InternLM-XComposer model series are publicly available at https://github.com/InternLM/InternLM-XComposer.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes InternLM-XComposer, a vision-language large model for advanced text-image comprehension and composition. It emphasizes three key properties: the ability to generate coherent interleaved text-image articles from writing instructions by automatically inserting appropriate images, comprehension powered by extensive multilingual multimodal training data with crafted strategies, and state-of-the-art performance on benchmarks including MME, MMBench, MMBench-CN, Seed-Bench, CCBench, QBench, and Tiny LVLM. A custom evaluation procedure involving both human and GPT-4V judges is introduced for assessing text-image composition, where the model achieves competitive scores against GPT-4V and GPT-3.5. The model series is publicly released on GitHub.
Significance. If the empirical results and custom evaluation hold after additional validation, this contributes to multimodal models by demonstrating practical interleaved text-image generation alongside strong comprehension, with the open release enabling community use and extension. The SOTA benchmark claims, if supported by ablations, could inform training strategies for vision-language foundational models.
major comments (2)
- [§4 (Evaluation)] The custom human-plus-GPT-4V procedure for text-image composition lacks reported inter-rater agreement, explicit judging rubric, sample size, or correlation with external signals such as downstream task performance. This is load-bearing for the highlighted 'advanced interleaved text-image composition' property, as the SOTA comprehension results on MME/MMBench can be audited independently while composition scores depend on this unvalidated procedure.
- [§3 (Training)] The multimodal data mixture weights, exact data sources, training scale, and ablation studies are not detailed beyond high-level descriptions of 'carefully crafted strategies.' This undermines attribution of the reported benchmark gains and the claim of 'deep understanding of visual content' to the proposed approach rather than post-hoc choices.
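The inter-rater agreement requested in the first major comment is commonly reported as Cohen's kappa. The sketch below computes it for two raters over categorical labels; the `cohen_kappa` helper and the toy "good"/"bad" labels are hypothetical and do not come from the paper.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Toy judgements, invented for illustration.
human = ["good", "good", "bad", "good", "bad", "bad"]
gpt4v = ["good", "bad", "bad", "good", "bad", "good"]
kappa = cohen_kappa(human, gpt4v)
print(round(kappa, 3))
```

A kappa near 0 indicates agreement no better than chance, while values near 1 indicate strong agreement; reporting it alongside the sample size would address part of the referee's concern.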
minor comments (2)
- [Abstract] Abstract and §1: The phrase 'robust evaluation procedure' is used without forward reference to the specific controls or metrics; adding a sentence linking to the evaluation section would improve clarity.
- [§4] Tables in §4: Benchmark results would benefit from error bars or statistical tests to substantiate 'state-of-the-art' claims across the listed datasets.
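The error bars suggested in the second minor comment could be obtained with a percentile bootstrap over per-question outcomes. The sketch below is one minimal way to do that; the `bootstrap_ci` helper, the resample count, and the 70-of-100 toy outcomes are assumptions, not figures from the paper.

```python
import random

def bootstrap_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for mean accuracy.

    `correct` is a list of 0/1 outcomes, one per benchmark question.
    Returns (accuracy, lower_bound, upper_bound).
    """
    rng = random.Random(seed)
    n = len(correct)
    means = []
    for _ in range(n_resamples):
        # Resample questions with replacement and record the accuracy.
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(correct) / n, lower, upper

# Toy outcomes: 70 correct answers out of 100 questions.
outcomes = [1] * 70 + [0] * 30
acc, lower, upper = bootstrap_ci(outcomes)
print(acc, lower, upper)
```

If the intervals of two models overlap heavily on a benchmark, a "state-of-the-art" claim on that benchmark would need a paired statistical test to be convincing.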
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We address each major comment below and indicate planned revisions to the manuscript.
Point-by-point responses
- Referee: [§4 (Evaluation)] The custom human-plus-GPT-4V procedure for text-image composition lacks reported inter-rater agreement, explicit judging rubric, sample size, or correlation with external signals such as downstream task performance. This is load-bearing for the highlighted 'advanced interleaved text-image composition' property, as the SOTA comprehension results on MME/MMBench can be audited independently while composition scores depend on this unvalidated procedure.
  Authors: We agree that additional details would strengthen the validation of our custom evaluation procedure. In the revised manuscript we will report inter-rater agreement statistics, provide the explicit judging rubric, state the sample size for both human and GPT-4V evaluations, and include any observed correlations with external signals or downstream performance. These additions will improve transparency and support for the text-image composition claims.
  Revision: yes
- Referee: [§3 (Training)] The multimodal data mixture weights, exact data sources, training scale, and ablation studies are not detailed beyond high-level descriptions of 'carefully crafted strategies.' This undermines attribution of the reported benchmark gains and the claim of 'deep understanding of visual content' to the proposed approach rather than post-hoc choices.
  Authors: We recognize that greater detail on training data and procedures would help attribute performance gains. In revision we will expand descriptions of the data curation strategies, training scale, and any ablation studies that were conducted. Exact mixture weights and certain data sources remain constrained by scale and licensing considerations, so we will focus on the reproducible high-level strategies and publicly released model components.
  Revision: partial
Circularity Check
No circularity: empirical model with independent benchmark results
The paper describes training and evaluating a vision-language model on standard benchmarks (MME, MMBench, etc.) plus a custom human/GPT-4V procedure for text-image composition. No mathematical derivation chain, equations, or first-principles claims exist that reduce to fitted parameters or self-citations by construction. The custom evaluation is explicitly motivated by the absence of established metrics and is presented as a practical assessment tool rather than a derived result. All performance claims rest on externally auditable benchmark scores and model outputs, making the work self-contained against independent verification.
Axiom & Free-Parameter Ledger
free parameters (1)
- multimodal data mixture weights
axioms (1)
- domain assumption: Large-scale multimodal training on curated data yields deep visual understanding
Forward citations
Cited by 17 Pith papers
- MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
  MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.
- SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition
  SpatialBench creates a five-level framework and 15-task benchmark to measure hierarchical spatial reasoning in MLLMs, finding strong basic perception but weak symbolic reasoning, causal inference, and planning.
- UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing
  UHR-BAT is a budget-aware framework that uses text-guided multi-scale importance estimation plus region-wise preserve and merge strategies to compress visual tokens in ultra-high-resolution remote sensing vision-langu...
- CFMS: A Coarse-to-Fine Multimodal Synthesis Framework for Enhanced Tabular Reasoning
  CFMS is a coarse-to-fine framework that uses MLLMs to create a multi-perspective knowledge tuple as a reasoning map for symbolic table operations, yielding competitive accuracy on WikiTQ and TabFact.
- Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
  Sa2VA unifies SAM-2 segmentation with MLLM reasoning into a single model for referring segmentation and conversation on images and videos, supported by a new 72k-expression Ref-SAV dataset.
- SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
  SEED-X is a unified multimodal foundation model that handles multi-granularity visual semantics for both comprehension and generation across arbitrary image sizes and ratios.
- Are We on the Right Way for Evaluating Large Vision-Language Models?
  Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...
- MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
  MoE-LLaVA applies mixture-of-experts sparsity to LVLMs via MoE-Tuning, delivering LLaVA-1.5-7B level visual understanding and better hallucination resistance with only ~3B active parameters.
- SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
  SeeClick improves visual GUI agents via GUI grounding pre-training on automatically curated data and introduces the ScreenSpot benchmark, with results indicating that stronger grounding boosts downstream task performance.
- ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
  A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.
- MMBench: Is Your Multi-modal Model an All-around Player?
  MMBench is a new bilingual benchmark that uses curated questions, CircularEval, and LLM-assisted answer conversion to provide objective, fine-grained evaluation of vision-language models.
- Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
  Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.
- InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
  InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.
- Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
  Mini-Gemini enhances VLMs via high-resolution visual refinement, curated reasoning data, and self-guided generation to reach leading zero-shot benchmark results across 2B-34B LLMs.
- InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model
  InternLM-XComposer2 introduces Partial LoRA on InternLM2-7B to enable high-quality free-form text-image composition while matching or exceeding GPT-4V on select vision-language benchmarks.
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
  InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
- VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
  VITA-1.5 integrates vision and speech into a single LLM through multi-stage training, delivering competitive benchmark results on image, video, and speech tasks with near real-time response speed.
Reference graph
Works this paper leans on
-
[1]
Flamingo: a visual language model for few-shot learning,
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, An- toine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Se- bastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sa- hand Sharifzadeh, Mikolaj Bink...
-
[2]
Lawrence Zitnick, and Devi Parikh
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In International Conference on Computer Vision (ICCV), 2015. 4
work page 2015
-
[3]
Openflamingo: An open- source framework for training large autoregressive vision- language models
Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Worts- man, and Ludwig Schmidt. Openflamingo: An open- source framework for training large autoregressive vision- language models. arXiv.org, 2023. 3, 7
work page 2023
-
[4]
Qwen-vl: A frontier large vision-language model with versatile abilities
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv.org, 2023. 3, 6, 7, 17
work page 2023
-
[5]
Baichuan 2: Open large-scale language models
Baichuan. Baichuan 2: Open large-scale language models. arXiv.org, 2023. 2, 3
work page 2023
-
[6]
Improving image generation with better captions
James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, and Aditya Ramesh. Improving image generation with better captions. 2
-
[7]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neu- ral Information Processing Systems (NeurIPS) , 33:1877– 1901, 2020. 2, 3
work page 1901
-
[8]
Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts
Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568, 2021. 2, 4
work page 2021
-
[9]
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning
Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechu Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478, 2023. 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[10]
Shikra: Unleashing multimodal llm’s referential dialogue magic
Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv.org, 2023. 3, 6, 7
work page 2023
-
[11]
Pali-x: On scaling up a multilingual vision and language model, 2023
Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, Siamak Shak- eri, Mostafa Dehghani, Daniel Salz, Mario Lucic, Michael Tschannen, Arsha Nagrani, Hexiang Hu, Mandar Joshi, Bo Pang, Ceslee Montgomery, Paulina Pietrzyk, Marvin Ritter, AJ Piergiovanni, Matthias ...
work page 2023
-
[12]
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. Microsoft coco captions: Data collection and eval- uation server, 2015. 4
work page 2015
-
[13]
Pali-3 vision language models: Smaller, faster, stronger, 2023
Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul V oigtlaender, Basil Mustafa, Sebas- tian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, Daniel Salz, Xi Xiong, Daniel Vlasic, Filip Pavetic, Keran Rong, Tianli Yu, Daniel Keysers, Xiaohua Zhai, and Radu Soricut. Pali-3 vision language models: Smaller, faster, stronger, 2023. 3
work page 2023
-
[14]
Pali: A jointly-scaled multilingual language- image model, 2023
Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergio- vanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Has- san Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, ...
work page 2023
-
[15]
Gonzalez, Ion Stoica, and Eric P
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhang- hao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. 2, 3, 4
work page 2023
-
[16]
Palm: Scaling language modeling with pathways
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv.org, 2022. 2, 3
work page 2022
-
[17]
Class-balanced loss based on effective number of samples
Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. Class-balanced loss based on effective number of samples. In CVPR, Jun 2019. 2
work page 2019
-
[18]
Instructblip: Towards general- purpose vision-language models with instruction tuning, 9
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general- purpose vision-language models with instruction tuning, 9
-
[19]
Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, Jos ´e MF Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In Proceedings of the IEEE con- ference on computer vision and pattern recognition , pages 326–335, 2017. 4
work page 2017
-
[20]
Imagenet: A large-scale hierarchical image database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages 248–255, 2009. 3
work page 2009
-
[21]
Bert: Pre-training of deep bidirectional trans- formers for language understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional trans- formers for language understanding. arXiv.org, 2018. 3
work page 2018
-
[22]
Dreamllm: Synergistic multimodal com- prehension and creation
Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, Xiangwen Kong, Xiangyu Zhang, Kaisheng Ma, and Li Yi. Dreamllm: Synergistic multimodal com- prehension and creation. arXiv preprint arXiv:2309.11499,
-
[23]
Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duck- worth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. Palm-e: An embodie...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[24]
Glm: General language model pretraining with autoregressive blank infilling
Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers) , pages 320–335, 2022. 2, 3, 6, 7, 17
work page 2022
-
[25]
Eva: Exploring the limits of masked visual represen- tation learning at scale
Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva: Exploring the limits of masked visual represen- tation learning at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19358–19369, 2023. 3
work page 2023
-
[26]
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jin- rui Yang, Xiawu Zheng, Ke Li, Xing Sun, and Ron- grong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023. 2, 6
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[27]
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, W. Zhang, Pan Lu, Conghui He, Xi- angyu Yue, Hongsheng Li, and Yu Jiao Qiao. Llama- adapter v2: Parameter-efficient visual instruction model. ArXiv, abs/2304.15010, 2023. 6, 7, 17
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[28]
Planting a seed of vision in large language model
Yuying Ge, Yixiao Ge, Ziyun Zeng, Xintao Wang, and Ying Shan. Planting a seed of vision in large language model. 3
-
[29]
Wukong: A 100 million large-scale chinese cross-modal pre-training benchmark
Jiaxi Gu, Xiaojun Meng, Guansong Lu, Lu Hou, Niu Minzhe, Xiaodan Liang, Lewei Yao, Runhui Huang, Wei Zhang, Xin Jiang, et al. Wukong: A 100 million large-scale chinese cross-modal pre-training benchmark. Advances in Neural Information Processing Systems , 35:26418–26431,
-
[30]
Wan- juan: A comprehensive multimodal dataset for advancing english and chinese large models
Conghui He, Zhenjiang Jin, Chaoxi Xu, Jiantao Qiu, Bin Wang, Wei Li, Hang Yan, Jiaqi Wang, and Da Lin. Wan- juan: A comprehensive multimodal dataset for advancing english and chinese large models. ArXiv, abs/2308.10755,
-
[31]
LoRA: Low-rank adaptation of large language mod- els
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language mod- els. In International Conference on Learning Representa- tions, 2022. 5, 14
work page 2022
- [32]
-
[33]
Ziniu Hu, Ahmet Iscen, Chen Sun, Zirui Wang, Kai-Wei Chang, Yizhou Sun, Cordelia Schmid, David A Ross, and Alireza Fathi. Reveal: Retrieval-augmented visual- language pre-training with multi-source multimodal knowl- edge memory. In Proceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), pages 23369–23379, 2023. 3
work page 2023
-
[34]
Gqa: A new dataset for real-world visual reasoning and compositional question answering
Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 4
work page 2019
-
[35]
Scaling up visual and vision-language representation learning with noisy text supervision
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In Pro- ceedings of the International Conference on Machine learn- ing (ICML), pages 4904–4916. PMLR, 2021. 3
work page 2021
-
[36]
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lam- ple, Lucile Saulnier, L ´elio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timoth ´ee Lacroix, and William El Sayed. Mistral 7b, 2023. 3
work page 2023
-
[37]
Grounding language models to images for multimodal in- puts and outputs
Jing Yu Koh, Ruslan Salakhutdinov, and Daniel Fried. Grounding language models to images for multimodal in- puts and outputs. 2023. 3
work page 2023
-
[38]
Openassistant conver- sations – democratizing large language model alignment,
Andreas K ¨opf, Yannic Kilcher, Dimitri von R ¨utte, Sotiris Anagnostidis, Zhi-Rui Tam, Keith Stevens, Abdullah Barhoum, Nguyen Minh Duc, Oliver Stanley, Rich ´ard Nagyfi, Shahul ES, Sameer Suri, David Glushkov, Arnav Dantuluri, Andrew Maguire, Christoph Schuhmann, Huu Nguyen, and Alexander Mattick. Openassistant conver- sations – democratizing large lang...
-
[39]
Rush, Douwe Kiela, Matthieu Cord, and Victor Sanh
Hugo Laurenc ¸on, Lucile Saulnier, L´eo Tronchon, Stas Bek- man, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, and Victor Sanh. Obelics: An open web- scale filtered dataset of interleaved image-text documents,
-
[40]
Seed-bench: Benchmarking multi- modal llms with generative comprehension, 2023
Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yix- iao Ge, and Ying Shan. Seed-bench: Benchmarking multi- modal llms with generative comprehension, 2023. 2, 6
work page 2023
-
[41]
Otter: A multi-modal model with in-context instruction tuning
Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, 10 Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv.org, 2023. 3, 6, 7, 17
work page 2023
-
[42]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. ArXiv, abs/2301.12597, 2023. 2, 3, 4, 17
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[43]
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for uni- fied vision-language understanding and generation. In Pro- ceedings of the International Conference on Machine learn- ing (ICML), pages 12888–12900. PMLR, 2022. 3, 7
work page 2022
-
[44]
Empowering vision- language models to follow interleaved vision-language in- structions
Juncheng Li, Kaihang Pan, Zhiqi Ge, Minghe Gao, Han- wang Zhang, Wei Ji, Wenqiao Zhang, Tat-Seng Chua, Siliang Tang, and Yueting Zhuang. Empowering vision- language models to follow interleaved vision-language in- structions. ArXiv, abs/2308.04152, 2023. 6
-
[45]
Grounded language-image pre-training
Liunian Harold Li*, Pengchuan Zhang*, Haotian Zhang*, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, and Jianfeng Gao. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 3
work page 2022
-
[46]
Lmeye: An interactive perception network for large language models, 2023
Yunxin Li, Baotian Hu, Xinyu Chen, Lin Ma, Yong Xu, and Min Zhang. Lmeye: An interactive perception network for large language models, 2023. 6
work page 2023
-
[47]
Fangyu Liu, Guy Edward Toh Emerson, and Nigel Collier. Visual spatial reasoning. Transactions of the Association for Computational Linguistics, 2023. 4
work page 2023
-
[48]
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023. 4
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[49]
Improved Baselines with Visual Instruction Tuning
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023. 6, 7, 17
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[50]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv.org, 2023. 2, 3, 4, 6, 7, 17
work page 2023
-
[51]
Grounding dino: Marrying dino with grounded pre-training for open-set object detection
Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv.org, 2023. 3
work page 2023
-
[52]
MMBench: Is Your Multi-modal Model an All-around Player?
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhnag, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mm- bench: Is your multi-modal model an all-around player? arXiv:2307.06281, 2023. 2, 6, 7
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[53]
Taisu: A 166m large-scale high-quality dataset for chinese vision-language pre-training
Yulong Liu, Guibo Zhu, Bin Zhu, Qi Song, Guojing Ge, Haoran Chen, GuanHui Qiao, Ru Peng, Lingxiang Wu, and Jinqiao Wang. Taisu: A 166m large-scale high-quality dataset for chinese vision-language pre-training. Advances in Neural Information Processing Systems, 35:16705–16717, 2022. 4
-
[54]
Ziwei Liu, Zhongqi Miao, Xiaohang Zhan, Jiayun Wang, Boqing Gong, and Stella X. Yu. Large-scale long-tailed recognition in an open world. In CVPR, Jun 2019. 2
-
[55]
Learn to explain: Multimodal reasoning via thought chains for science question answering
Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022. 4
-
[56]
Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning
Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning. arXiv preprint arXiv:2110.13214, 2021. 4
-
[57]
Ok-vqa: A visual question answering benchmark requiring external knowledge
Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition, pages 3195–3204, 2019. 4
-
[58]
Ocr-vqa: Visual question answering by reading text in images
Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In ICDAR, 2019. 4
-
[59]
Power laws, pareto distributions and zipf’s law
MEJ Newman. Power laws, pareto distributions and zipf’s law. Contemporary Physics, page 323–351, Sep 2005. 2
-
[60]
OpenAI. Chatgpt. https://openai.com/blog/ chatgpt, 2022. 2, 3, 8
- [61]
-
[62]
Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. Im2text: Describing images using 1 million captioned photographs. In Neural Information Processing Systems (NIPS), 2011. 2, 4
-
[63]
Training language models to follow instructions with human feedback
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems (NeurIPS), 35:27730–27744, 2022. 3
-
[64]
Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The refinedweb dataset for falcon llm: Outperforming curated corpora with web data, and web data only, 2023. 3
-
[65]
Kosmos-2: Grounding multimodal large language models to the world
Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv.org,
-
[66]
Introducing qwen-7b: Open foundation and human-aligned models (of the state-of-the-arts), 2023
Qwen. Introducing qwen-7b: Open foundation and human-aligned models (of the state-of-the-arts), 2023. 2, 3, 8
-
[67]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning (ICML), pages 8748–8763. PMLR, 2021. 2, 3, 15
-
[68]
Improving language understanding by generative pre-training
Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018. 3
-
[69]
Exploring the limits of transfer learning with a unified text-to-text transformer
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research (JMLR), 21(1):5485–5551, 2020. 2, 3
-
[70]
Hierarchical text-conditional image generation with clip latents
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. 2
-
[71]
Zero-shot text-to-image generation
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. ICML, Jul 2021
-
[72]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, June 2022
-
[73]
Photorealistic text-to-image diffusion models with deep language understanding
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. 2
-
[74]
Laion-5b: An open large-scale dataset for training next generation image-text models
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022. 4
-
[75]
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021. 2, 4
-
[76]
A-okvqa: A benchmark for visual question answering using world knowledge
Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. In European Conference on Computer Vision, pages 146–162. Springer, 2022. 4
-
[77]
Tiny lvlm-ehub: Early multimodal experiments with bard
Wenqi Shao, Yutao Hu, Peng Gao, Meng Lei, Kaipeng Zhang, Fanqing Meng, Peng Xu, Siyuan Huang, Hongsheng Li, Yu Qiao, et al. Tiny lvlm-ehub: Early multimodal experiments with bard. arXiv preprint arXiv:2308.03729,
-
[78]
Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning
Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018. 2, 4
-
[79]
Textcaps: a dataset for image captioning with reading comprehension
Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image captioning with reading comprehension. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 742–758. Springer, 2020. 4
-
[80]
Towards vqa models that can read
Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019. 4