pith. machine review for the scientific record

arxiv: 2407.03320 · v1 · submitted 2024-07-03 · 💻 cs.CV · cs.CL

Recognition: 2 theorem links · Lean Theorem

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 10:41 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords large vision language model · long context · multimodal · image-text comprehension · video understanding · RoPE extrapolation · open-source multimodal model · text-image composition

The pith

InternLM-XComposer-2.5 reaches GPT-4V level on vision-language tasks with a 7B model and 96K context support.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents InternLM-XComposer-2.5 as a versatile vision-language model built around a 7 billion parameter language model backend. It claims the system matches or approaches the performance of much larger proprietary models like GPT-4V across text-image understanding and generation tasks. Training on 24K interleaved image-text sequences allows seamless extension to 96K contexts through RoPE extrapolation, which supports applications needing long inputs or outputs. Three upgrades focus on ultra-high resolution image comprehension, detailed video analysis, and multi-turn dialogues involving multiple images. Two additional capabilities use extra LoRA adapters for generating webpages and high-quality illustrated articles. The model is tested on 28 benchmarks where it leads open-source alternatives on 16 and matches or exceeds GPT-4V and Gemini Pro on 16 key tasks.

Core claim

InternLM-XComposer-2.5 is a versatile large vision-language model that supports long-contextual input and output. IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V-level capabilities with merely a 7B LLM backend. Trained with 24K interleaved image-text contexts, it can seamlessly extend to 96K long contexts via RoPE extrapolation. This long-context capability allows IXC-2.5 to excel in tasks requiring extensive input and output contexts. Compared to its previous 2.0 version, it features three major upgrades in vision-language comprehension: ultra-high resolution understanding, fine-grained video understanding, and multi-turn multi-image dialogue.

What carries the argument

The RoPE extrapolation from 24K training contexts to 96K inference contexts, applied to a 7B LLM backbone with a vision encoder and optional LoRA adapters for composition tasks.
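The 24K-to-96K stretch can be pictured with a minimal sketch of position interpolation, one standard way to extend RoPE beyond the trained window. The paper does not spell out its exact extrapolation recipe, so the function below and the scale-by-4 choice are illustrative assumptions, not the authors' code.

```python
import math

def rope_angles(dim, positions, base=10000.0, scale=1.0):
    # Standard RoPE: angle[p][i] = (p / scale) / base^(2i/dim).
    # scale > 1 compresses positions so long inputs reuse trained angles.
    inv_freq = [base ** (-2 * i / dim) for i in range(dim // 2)]
    return [[(p / scale) * f for f in inv_freq] for p in positions]

def rotate_pair(x1, x2, theta):
    # RoPE rotates each (even, odd) feature pair by its position's angle.
    c, s = math.cos(theta), math.sin(theta)
    return (x1 * c - x2 * s, x1 * s + x2 * c)

# Serving 96K with a model trained at 24K: scaling positions by 4 maps
# token 96,000 onto the angle that token 24,000 produced during training.
a_train = rope_angles(8, [24_000])[0]
a_long = rope_angles(8, [96_000], scale=4.0)[0]
assert all(abs(t - l) < 1e-9 for t, l in zip(a_train, a_long))
```

Position interpolation trades angular resolution between nearby tokens for range; other recipes (NTK-aware scaling, base-frequency changes) make the same trade differently, which is part of why long-output error analysis matters for the claim.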

If this is right

  • The model outperforms existing open-source state-of-the-art systems on 16 of the 28 evaluated benchmarks.
  • It surpasses or competes closely with GPT-4V and Gemini Pro on 16 key tasks in comprehension and composition.
  • Long-context support enables new uses in tasks that require processing or producing extensive image-text sequences.
  • The three comprehension upgrades plus LoRA-based composition features expand the range of practical applications beyond prior versions.
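The LoRA-based composition features admit a compact sketch: a frozen base weight plus a trainable low-rank update that can be switched on per task. The paper only states that extra LoRA parameters are used, so the dimensions, zero-initialization, and `forward` helper below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 2                          # hidden size and LoRA rank (toy values)
W = rng.normal(size=(d, d))           # frozen base projection
A = rng.normal(size=(r, d)) * 0.01    # trainable down-projection
B = np.zeros((d, r))                  # trainable up-projection, zero-init

def forward(x, use_adapter):
    # Adapted projection: (W + B @ A) @ x, with the low-rank path optional.
    y = W @ x
    if use_adapter:
        y = y + B @ (A @ x)
    return y

x = rng.normal(size=d)
# Zero-initialized B makes the adapter a no-op at first, so composition
# skills can be trained without disturbing shared comprehension weights.
assert np.allclose(forward(x, False), forward(x, True))
```

Only A and B (2·d·r parameters here, versus d² in W) are trained per task, which is how one backbone could host webpage crafting and article composition as detachable skills.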

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the extrapolation technique generalizes, similar small backbones could handle extended multimodal conversations without retraining from scratch.
  • Success in webpage crafting suggests the approach could be adapted for automated document or presentation generation tools.
  • Competing with closed models on selected tasks may accelerate development of accessible alternatives for research and education.

Load-bearing premise

That the 28 benchmarks and 16 key tasks chosen for evaluation accurately reflect real-world multimodal performance, and that context extrapolation introduces no hidden quality loss on long outputs.

What would settle it

A head-to-head test on a long multi-image dialogue or 96K output task where InternLM-XComposer-2.5 scores materially below GPT-4V would falsify the central performance claim.

read the original abstract

We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large-vision language model that supports long-contextual input and output. IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V level capabilities with merely 7B LLM backend. Trained with 24K interleaved image-text contexts, it can seamlessly extend to 96K long contexts via RoPE extrapolation. This long-context capability allows IXC-2.5 to excel in tasks requiring extensive input and output contexts. Compared to its previous 2.0 version, InternLM-XComposer-2.5 features three major upgrades in vision-language comprehension: (1) Ultra-High Resolution Understanding, (2) Fine-Grained Video Understanding, and (3) Multi-Turn Multi-Image Dialogue. In addition to comprehension, IXC-2.5 extends to two compelling applications using extra LoRA parameters for text-image composition: (1) Crafting Webpages and (2) Composing High-Quality Text-Image Articles. IXC-2.5 has been evaluated on 28 benchmarks, outperforming existing open-source state-of-the-art models on 16 benchmarks. It also surpasses or competes closely with GPT-4V and Gemini Pro on 16 key tasks. The InternLM-XComposer-2.5 is publicly available at https://github.com/InternLM/InternLM-XComposer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces InternLM-XComposer-2.5 (IXC-2.5), a 7B-parameter vision-language model supporting long-contextual input and output. Trained on 24K interleaved image-text contexts, it extends to 96K via RoPE extrapolation. Key upgrades include ultra-high resolution understanding, fine-grained video understanding, and multi-turn multi-image dialogue. It adds text-image composition applications (webpage crafting and article composition) via extra LoRA parameters. Evaluations on 28 benchmarks show outperformance over open-source SOTA on 16 benchmarks and competitive or superior results to GPT-4V and Gemini Pro on 16 key tasks.

Significance. If validated, this would be a solid contribution to open-source multimodal modeling by showing competitive GPT-4V-level performance on long-context comprehension and generation tasks using a compact 7B backbone. The public release, emphasis on practical composition applications, and extension of standard RoPE techniques to interleaved multimodal settings are strengths that could influence efficient VLM development.

major comments (2)
  1. The central claim that RoPE extrapolation enables seamless 96K long-context capability for both inputs and outputs (particularly autoregressive generation in composition and multi-turn tasks) lacks supporting analysis. No ablation or error analysis is provided on whether positional errors accumulate in long outputs, which directly underpins the asserted superiority in webpage crafting, article composition, and multi-turn dialogue.
  2. Benchmark evaluation section: performance claims of outperforming open-source models on 16 of 28 benchmarks and competing with GPT-4V on 16 key tasks rest on aggregate scores without reported variance, statistical tests, or full training/data details. This weakens verifiability of the GPT-4V-level and long-context superiority assertions.
minor comments (2)
  1. Abstract: the specific 16 key tasks and the full list of 28 benchmarks are not enumerated, reducing clarity on the scope of the comparisons.
  2. Notation for context lengths should be standardized (e.g., consistently using 'tokens' or 'K' with explicit definition) across sections describing training and inference.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and commit to revisions that strengthen the presentation of our long-context claims and evaluation rigor.

read point-by-point responses
  1. Referee: The central claim that RoPE extrapolation enables seamless 96K long-context capability for both inputs and outputs (particularly autoregressive generation in composition and multi-turn tasks) lacks supporting analysis. No ablation or error analysis is provided on whether positional errors accumulate in long outputs, which directly underpins the asserted superiority in webpage crafting, article composition, and multi-turn dialogue.

    Authors: We appreciate the referee's emphasis on this point. While the empirical results on long-context tasks (multi-turn multi-image dialogue and text-image composition) already show strong performance with 96K contexts, we agree that explicit analysis of positional error accumulation during autoregressive generation would provide more direct support. In the revised manuscript we will add an ablation subsection that measures generation quality degradation and positional error metrics over increasing output lengths up to 96K tokens. revision: yes

  2. Referee: Benchmark evaluation section: performance claims of outperforming open-source models on 16 of 28 benchmarks and competing with GPT-4V on 16 key tasks rest on aggregate scores without reported variance, statistical tests, or full training/data details. This weakens verifiability of the GPT-4V-level and long-context superiority assertions.

    Authors: We acknowledge that reporting variance and statistical tests would increase verifiability. In the revision we will add standard deviations across multiple evaluation runs for the key benchmarks and include pairwise statistical significance tests against the strongest baselines. We will also expand the training and data section with additional hyperparameter and data-mixture details. The complete training code and dataset recipes are already released in the public GitHub repository to support reproducibility. revision: yes
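The significance testing the rebuttal commits to could take the form of a paired bootstrap over per-example scores. Everything below (the function name, the toy score vectors) is a sketch of one standard procedure, not the authors' evaluation code.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    # Resample benchmark items with replacement; report how often system A's
    # mean beats system B's on the same resampled items.
    rng = random.Random(seed)
    n, wins = len(scores_a), 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        da = sum(scores_a[i] for i in idx) / n
        db = sum(scores_b[i] for i in idx) / n
        wins += da > db
    return wins / n_resamples

# Toy per-item correctness for two systems on the same 40 benchmark items.
a = [1] * 28 + [0] * 12   # 70% accuracy
b = [1] * 22 + [0] * 18   # 55% accuracy; here b's errors contain a's
p_a_better = paired_bootstrap(a, b)
```

A win rate near 1.0 would license "surpasses"; values near 0.5 are the "competes closely" regime, which is exactly the distinction the aggregate tables leave unverifiable.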

Circularity Check

0 steps flagged

No circularity in empirical model claims

full rationale

The paper reports empirical benchmark results on 28 external tasks, with performance claims resting on direct comparisons to GPT-4V, Gemini Pro, and open-source baselines rather than any internal derivation. Long-context support is implemented via standard RoPE extrapolation from a 24K training regime, with no equations or self-referential definitions that reduce reported capabilities to quantities fitted inside the same work. Self-citations to the prior 2.0 version exist but are not load-bearing for the new results, which remain independently falsifiable on public benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The work relies on standard transformer training with many unfixed hyperparameters for pretraining, instruction tuning, and LoRA adaptation; RoPE scaling parameters are chosen to reach 96K from 24K training data.

free parameters (1)
  • RoPE extrapolation scaling factors
    Chosen to extend context from 24K training to 96K inference without additional training data.

pith-pipeline@v0.9.0 · 5665 in / 1169 out tokens · 45550 ms · 2026-05-17T10:41:59.051328+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.DimensionForcing eight_tick_forces_D3 · unclear

    Relation between the paper passage and the cited Recognition theorem.

    Trained with 24K interleaved image-text contexts, it can seamlessly extend to 96K long contexts via RoPE extrapolation.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 16 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    cs.CV 2024-09 accept novelty 8.0

    Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

  2. MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

    cs.CL 2024-09 accept novelty 8.0

    MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

  3. EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

    cs.CV 2026-05 unverdicted novelty 7.0

    A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.

  4. Beyond GSD-as-Token: Continuous Scale Conditioning for Remote Sensing VLMs

    cs.CV 2026-05 unverdicted novelty 7.0

    ScaleEarth conditions remote sensing VLMs on continuous GSD via CS-HLoRA and a visual GSD predictor, creating a closed training loop with GeoScale-VQA to achieve SOTA on Earth observation benchmarks.

  5. OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning

    cs.CV 2026-04 unverdicted novelty 7.0

    OASIS organizes streaming video into hierarchical events and retrieves memory on-demand via intent-driven refinement to improve long-horizon accuracy and compositional reasoning with bounded token costs.

  6. Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    RoleJudge is a multidimensional evaluation framework for speech-character alignment in audio LLMs, backed by the RoleChat dataset and multi-stage RL training with standard alignment to reduce reward issues.

  7. GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing

    cs.CV 2026-04 unverdicted novelty 7.0

    GeoMMBench reveals deficiencies in current multimodal LLMs for geoscience tasks while GeoMMAgent demonstrates that tool-integrated agents achieve significantly higher performance.

  8. WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

    cs.CV 2025-02 unverdicted novelty 7.0

    WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.

  9. PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

    cs.CV 2024-10 accept novelty 7.0

    PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.

  10. S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    S2H-DPO generates hierarchical prompt-driven preference pairs to improve multi-image reasoning in VLMs while keeping single-image performance intact.

  11. UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing

    cs.CV 2026-04 unverdicted novelty 6.0

    UHR-BAT is a budget-aware framework that uses text-guided multi-scale importance estimation plus region-wise preserve and merge strategies to compress visual tokens in ultra-high-resolution remote sensing vision-langu...

  12. Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward

    cs.CV 2026-04 unverdicted novelty 6.0

    Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and interpretability.

  13. Visual-RFT: Visual Reinforcement Fine-Tuning

    cs.CV 2025-03 conditional novelty 6.0

    Visual-RFT applies reinforcement learning with verifiable perception rewards to improve large vision-language models on fine-grained classification, few-shot detection, and grounding tasks.

  14. LLaVA-Video: Video Instruction Tuning With Synthetic Data

    cs.CV 2024-10 unverdicted novelty 6.0

    LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.

  15. LLaVA-OneVision: Easy Visual Task Transfer

    cs.CV 2024-08 unverdicted novelty 5.0

    LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.

  16. Show-o2: Improved Native Unified Multimodal Models

    cs.CV 2025-06 unverdicted novelty 4.0

    Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.

Reference graph

Works this paper leans on

183 extracted references · 183 canonical work pages · cited by 16 Pith papers · 33 internal anchors

  1. [1]

    Nocaps: Novel object captioning at scale

    Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. Nocaps: Novel object captioning at scale. In Proceedings of the IEEE/CVF In- ternational Conference on Computer Vision (ICCV), 2019. 8

  2. [2]

    Flamingo: a visual language model for few-shot learning,

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, An- toine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Se- bastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sa- hand Sharifzadeh, Mikolaj Bink...

  3. [3]

    Claude 3 haiku: our fastest model yet,

    Anthropic. Claude 3 haiku: our fastest model yet,

  4. [4]

    Available at: https://www.anthropic.com/ news/claude-3-haiku. 1, 8

  5. [5]

    Lawrence Zitnick, and Devi Parikh

    Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion (ICCV), 2015. 8

  6. [6]

    Openflamingo: An open- source framework for training large autoregressive vision- language models

    Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Worts- man, and Ludwig Schmidt. Openflamingo: An open- source framework for training large autoregressive vision- language models. arXiv.org, 2023. 2

  7. [7]

    Qwen-VL: A frontier large vision-language model with versatile abilities

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv.org, 2023. 2, 9

  8. [8]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022. 6

  9. [9]

    Baichuan 2: Open large-scale language models

    Baichuan. Baichuan 2: Open large-scale language models. arXiv.org, 2023. 2

  10. [10]

    Introducing our multimodal models, 2023

    Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sa˘gnak Tas ¸ırlar. Introducing our multimodal models, 2023. 2

  11. [11]

    pix2code: Generating code from a graph- ical user interface screenshot

    Tony Beltramelli. pix2code: Generating code from a graph- ical user interface screenshot. In Proceedings of the ACM SIGCHI symposium on engineering interactive computing systems, 2018. 6

  12. [12]

    Scene text visual question answering

    Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marc ¸al Rusinol, Ernest Valveny, CV Jawahar, and Dimos- thenis Karatzas. Scene text visual question answering. In ICCV, 2019. 8

  13. [13]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neu- ral Information Processing Systems (NeurIPS) , 33:1877– 1901, 2020. 2

  14. [14]

    InternLM2 Technical Report

    Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting Huang, Tao Jiang, Penglong Jiao, Zhenjiang Jin, Zhikai Lei, Jiax- ing Li, Jingwen Li, Linyang Li,...

  15. [15]

    DualFocus: Integrating macro and micro per- spectives in multi-modal large language models

    Yuhang Cao, Pan Zhang, Xiaoyi Dong, Dahua Lin, and Ji- aqi Wang. DualFocus: Integrating macro and micro per- spectives in multi-modal large language models. arXiv preprint arXiv:2402.14767, 2024. 2

  16. [16]

    ALLaV A harness- ing gpt4v-synthesized data for a lite vision-language model

    Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Juny- ing Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jian- quan Li, Xiang Wan, and Benyou Wang. ALLaV A harness- ing gpt4v-synthesized data for a lite vision-language model. arXiv preprint arXiv:2402.11684, 2024. 8

  17. [17]

    Shikra: Unleashing multimodal llm’s referential dialogue magic

    Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv.org, 2023. 8

  18. [18]

    ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

    Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023. 2, 8

  19. [19]

    Are We on the Right Way for Evaluating Large Vision-Language Models?

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, and Feng Zhao. Are we on the right way for evaluating large vision-language models? arXiv preprint arXiv:2403.20330, 2024. 2, 9

  20. [20]

    ShareGPT4Video: Improving video understanding and generation with better captions

    Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, et al. ShareGPT4Video: Improving video understanding and generation with better captions. arXiv preprint arXiv:2406.04325, 2024. 8

  21. [21]

    TabFact: A large-scale dataset for table-based fact verification

    Wenhu Chen, Hongmin Wang, Jianshu Chen, Yunkai Zhang, Hong Wang, Shiyang Li, Xiyou Zhou, and William Yang Wang. TabFact: A large-scale dataset for table-based fact verification. In Proceedings of the Inter- 11 national Conference on Learning Representations (ICLR) ,

  22. [22]

    Lawrence Zitnick

    Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. Microsoft coco captions: Data collection and eval- uation server, 2015. 8

  23. [23]

    Pali-x: On scaling up a multilingual vision and language model, 2023

    Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, Siamak Shak- eri, Mostafa Dehghani, Daniel Salz, Mario Lucic, Michael Tschannen, Arsha Nagrani, Hexiang Hu, Mandar Joshi, Bo Pang, Ceslee Montgomery, Paulina Pietrzyk, Marvin Ritter, AJ Piergiovanni, Matthias ...

  24. [24]

    Pali-3 vision language models: Smaller, faster, stronger, 2023

    Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul V oigtlaender, Basil Mustafa, Sebas- tian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, Daniel Salz, Xi Xiong, Daniel Vlasic, Filip Pavetic, Keran Rong, Tianli Yu, Daniel Keysers, Xiaohua Zhai, and Radu Soricut. Pali-3 vision language models: Smaller, faster, stronger, 2023

  25. [25]

    Pali: A jointly-scaled multilingual language- image model, 2023

    Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergio- vanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Has- san Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, ...

  26. [26]

    InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023. 2

  27. [27]

    How far are we to gpt- 4v? closing the gap to commercial multimodal models with open-source suites, 2024

    Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhang- wei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Botian Shi, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Min Dou, Lewei Lu, Xizhou Zhu, Tong ...

  28. [28]

    Gonzalez, Ion Stoica, and Eric P

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhang- hao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023. 8

  29. [29]

    Icdar2019 robust read- ing challenge on arbitrary-shaped text-rrc-art

    Chee Kheng Chng, Yuliang Liu, Yipeng Sun, Chun Chet Ng, Canjie Luo, Zihan Ni, ChuanMing Fang, Shuaitao Zhang, Junyu Han, Errui Ding, et al. Icdar2019 robust read- ing challenge on arbitrary-shaped text-rrc-art. In Interna- tional Conference on Document Analysis and Recognition (ICDAR), 2019. 8

  30. [30]

    Palm: Scaling language modeling with pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv.org, 2022. 1, 2

  31. [31]

    Opencompass: A univer- sal evaluation platform for foundation models

    OpenCompass Contributors. Opencompass: A univer- sal evaluation platform for foundation models. https: / / github . com / open - compass / opencompass,

  32. [32]

    Instructblip: Towards general- purpose vision-language models with instruction tuning,

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general- purpose vision-language models with instruction tuning,

  33. [33]

    Moura, Devi Parikh, and Dhruv Batra

    Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, Jos´e M.F. Moura, Devi Parikh, and Dhruv Batra. Visual Dialog. In Proceedings of the IEEE Confer- ence on Computer Vision and Pattern Recognition (CVPR),

  34. [34]

    InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

    Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jing- wen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, and Jiaqi Wang. Internlm-xcomposer2: Mastering free-form text-image composition and compre...

  35. [35]

    Internlm-xcomposer2- 4khd: A pioneering large vision-language model handling resolutions from 336 pixels to 4k hd

    Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Zhe Chen, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Kai Chen, Conghui He, Xingcheng Zhang, Jifeng Dai, Yu Qiao, Dahua Lin, and Jiaqi Wang. Internlm-xcomposer2- 4khd: A pioneering large vision-language mod...

  36. [36]

    Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duck- worth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. Palm-e: An embodie...

  37. [37]

    KTO: Model Alignment as Prospect Theoretic Optimization

    Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. KTO: Model align- ment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306, 2024. 6

  38. [38]

    ActivityNet: A large-scale video 12 benchmark for human activity understanding

    Bernard Ghanem Fabian Caba Heilbron, Victor Escorcia and Juan Carlos Niebles. ActivityNet: A large-scale video 12 benchmark for human activity understanding. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2015. 2, 8

  39. [39]

    MMBench-Video: A long-form multi-shot benchmark for holistic video under- standing

    Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. MMBench-Video: A long-form multi-shot benchmark for holistic video under- standing. arXiv preprint arXiv:2406.14515, 2024. 2, 9

  40. [40]

    Mmbench-video: A long-form multi-shot benchmark for holistic video under- standing, 2024

    Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. Mmbench-video: A long-form multi-shot benchmark for holistic video under- standing, 2024. 2

  41. [41]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jin- rui Yang, Xiawu Zheng, Ke Li, Xing Sun, and Ron- grong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023. 2, 9

  42. [42]

    A challenger to gpt-4v? early explorations of gemini in visual expertise

    Chaoyou Fu, Renrui Zhang, Zihan Wang, Yubo Huang, Zhengye Zhang, Longtian Qiu, Gaoxiang Ye, Yunhang Shen, Mengdan Zhang, Peixian Chen, Sirui Zhao, Shao- hui Lin, Deqiang Jiang, Di Yin, Peng Gao, Ke Li, Hong- sheng Li, and Xing Sun. A challenger to gpt-4v? early explorations of gemini in visual expertise. arXiv preprint arXiv:2312.12436, 2023. 1, 2

  43. [43]

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. arXiv preprint arXiv:2405.21075, 2024. 2, 9

  44. [44]

    ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

    Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, et al. ChatGLM: A family of large language models from GLM-130B to GLM-4 All Tools. arXiv preprint arXiv:2406.12793, 2024. 9

  45. [45]

    HallusionBench: An advanced diagnostic suite for entangled language hallucination & visual illusion in large vision-language models, 2023

    Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. HallusionBench: An advanced diagnostic suite for entangled language hallucination & visual illusion in large vision-language models, 2023. 2, 9

  46. [46]

    MA-LMM: Memory-augmented large multimodal model for long-term video understanding

    Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, and Ser-Nam Lim. MA-LMM: Memory-augmented large multimodal model for long-term video understanding. arXiv preprint arXiv:2404.05726, 2024. 2

  47. [47]

    WanJuan: A comprehensive multimodal dataset for advancing English and Chinese large models

    Conghui He, Zhenjiang Jin, Chaoxi Xu, Jiantao Qiu, Bin Wang, Wei Li, Hang Yan, Jiaqi Wang, and Da Lin. WanJuan: A comprehensive multimodal dataset for advancing English and Chinese large models. ArXiv, abs/2308.10755, 2023.

  48. [48]

    CogAgent: A visual language model for GUI agents

    Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. CogAgent: A visual language model for GUI agents. arXiv preprint arXiv:2312.08914, 2023.

  49. [49]

    CogAgent: A visual language model for GUI agents

    Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. CogAgent: A visual language model for GUI agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 10

  50. [51]

    mPLUG-DocOwl 1.5: Unified structure learning for ocr-free document understanding

    Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, et al. mPLUG-DocOwl 1.5: Unified structure learning for OCR-free document understanding. arXiv preprint arXiv:2403.12895, 2024. 9

  51. [52]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In Proceedings of the International Conference on Learning Representations (ICLR), 2022. 8

  52. [53]

    From image to video, what do we need in multimodal LLMs?

    Suyuan Huang, Haoxin Zhang, Yan Gao, Yao Hu, and Zengchang Qin. From image to video, what do we need in multimodal LLMs? arXiv preprint arXiv:2404.11865, 2024. 2

  53. [54]

    GQA: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 8

  54. [55]

    Video ReCap: Recursive captioning of hour-long videos

    Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Nagarajan, Lorenzo Torresani, and Gedas Bertasius. Video ReCap: Recursive captioning of hour-long videos. arXiv preprint arXiv:2402.13250, 2024. 2

  55. [56]

    Mistral 7B

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B, 2023. 1, 2

  56. [57]

    Mantis: Interleaved multi-image instruction tuning, 2024

    Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, and Wenhu Chen. Mantis: Interleaved multi-image instruction tuning, 2024. 2

  57. [58]

    Chat-UniVi: Unified visual representation empowers large language models with image and video understanding

    Peng Jin, Ryuichi Takanobu, Caiwan Zhang, Xiaochun Cao, and Li Yuan. Chat-UniVi: Unified visual representation empowers large language models with image and video understanding. arXiv preprint arXiv:2311.08046, 2023. 2

  58. [59]

    DVQA: Understanding data visualizations via question answering

    Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. DVQA: Understanding data visualizations via question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

  59. [60]

    Language repository for long video understanding

    Kumara Kahatapitiya, Kanchana Ranasinghe, Jongwoo Park, and Michael S Ryoo. Language repository for long video understanding. arXiv preprint arXiv:2403.14622, 2024.

  60. [61]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

  61. [62]

    A diagram is worth a dozen images

    Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In Proceedings of the European Conference on Computer Vision (ECCV), 2016. 2, 8, 9

  62. [63]

    Are you smarter than a sixth grader? textbook question answering for multimodal machine comprehension

    Aniruddha Kembhavi, Minjoon Seo, Dustin Schwenk, Jonghyun Choi, Ali Farhadi, and Hannaneh Hajishirzi. Are you smarter than a sixth grader? textbook question answering for multimodal machine comprehension. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 8

  63. [64]

    An image grid can be worth a video: Zero-shot video question answering using a VLM, 2024

    Wonkyun Kim, Changin Choi, Wonseok Lee, and Wonjong Rhee. An image grid can be worth a video: Zero-shot video question answering using a VLM, 2024. 6

  64. [65]

    Unlocking the conversion of web screenshots into HTML code with the WebSight dataset

    Hugo Laurençon, Léo Tronchon, and Victor Sanh. Unlocking the conversion of web screenshots into HTML code with the WebSight dataset. arXiv preprint arXiv:2403.09029, 2024.

  65. [66]

    ViQuAE, a dataset for knowledge-based visual question answering about named entities

    Paul Lerner, Olivier Ferret, Camille Guinaudeau, Hervé Le Borgne, Romaric Besançon, José G Moreno, and Jesús Lovón Melgarejo. ViQuAE, a dataset for knowledge-based visual question answering about named entities. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 3108–3120, 2022. 8

  66. [67]

    SEED-Bench: Benchmarking multimodal LLMs with generative comprehension, 2023

    Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. SEED-Bench: Benchmarking multimodal LLMs with generative comprehension, 2023. 2, 9

  67. [68]

    OtterHD: A high-resolution multi-modality model, 2023

    Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, and Ziwei Liu. OtterHD: A high-resolution multi-modality model, 2023. 2

  68. [69]

    Otter: A multi-modal model with in-context instruction tuning

    Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv.org, 2023. 2

  69. [70]

    VideoChat: Chat-Centric Video Understanding

    KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. VideoChat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023. 2

  70. [71]

    MVBench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. MVBench: A comprehensive multi-modal video understanding benchmark. arXiv preprint arXiv:2311.17005, 2023.

  71. [72]

    MVBench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. MVBench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 2, 9

  72. [73]

    Silkie: Preference distillation for large visual language models

    Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, and Lingpeng Kong. Silkie: Preference distillation for large visual language models. arXiv preprint arXiv:2312.10665, 2023. 6

  73. [74]

    LLaMA-VID: An image is worth 2 tokens in large language models

    Yanwei Li, Chengyao Wang, and Jiaya Jia. LLaMA-VID: An image is worth 2 tokens in large language models. arXiv preprint arXiv:2311.17043, 2023. 2

  74. [75]

    Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

    Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. Mini-Gemini: Mining the potential of multi-modality vision language models. arXiv preprint arXiv:2403.18814, 2024.

  75. [76]

    Super-CLEVR: A virtual benchmark to diagnose domain robustness in visual reasoning

    Zhuowan Li, Xingrui Wang, Elias Stengel-Eskin, Adam Kortylewski, Wufei Ma, Benjamin Van Durme, and Alan L Yuille. Super-CLEVR: A virtual benchmark to diagnose domain robustness in visual reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 8

  76. [77]

    Monkey: Image resolution and text label are important things for large multi-modal models

    Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, and Xiang Bai. Monkey: Image resolution and text label are important things for large multi-modal models. arXiv preprint arXiv:2311.06607, 2023. 2

  77. [78]

    Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-LLaVA: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023. 2

  78. [79]

    VILA: On pre-training for visual language models, 2024

    Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, and Song Han. VILA: On pre-training for visual language models, 2024. 2, 9

  79. [80]

    SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models

    Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, et al. SPHINX: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint arXiv:2311.07575, 2023. 2

  80. [81]

    CLEVR-Math: A dataset for compositional language, visual and mathematical reasoning

    Adam Dahlgren Lindström and Savitha Sam Abraham. CLEVR-Math: A dataset for compositional language, visual and mathematical reasoning. arXiv preprint arXiv:2208.05358, 2022. 8
