pith. machine review for the scientific record

arxiv: 2407.03320 · v1 · submitted 2024-07-03 · 💻 cs.CV · cs.CL

Recognition: 2 theorem links · Lean Theorem

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 10:41 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords large vision language model · long context · multimodal · image-text comprehension · video understanding · RoPE extrapolation · open-source multimodal model · text-image composition

The pith

InternLM-XComposer-2.5 reaches GPT-4V level on vision-language tasks with a 7B model and 96K context support.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents InternLM-XComposer-2.5 as a versatile vision-language model built around a 7 billion parameter language model backend. It claims the system matches or approaches the performance of much larger proprietary models like GPT-4V across text-image understanding and generation tasks. Training on 24K interleaved image-text sequences allows seamless extension to 96K contexts through RoPE extrapolation, which supports applications needing long inputs or outputs. Three upgrades focus on ultra-high resolution image comprehension, detailed video analysis, and multi-turn dialogues involving multiple images. Two additional capabilities use extra LoRA adapters for generating webpages and high-quality illustrated articles. The model is tested on 28 benchmarks where it leads open-source alternatives on 16 and matches or exceeds GPT-4V and Gemini Pro on 16 key tasks.

Core claim

InternLM-XComposer-2.5 is a versatile large vision-language model that supports long-contextual input and output. IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V-level capabilities with merely a 7B LLM backend. Trained with 24K interleaved image-text contexts, it can seamlessly extend to 96K long contexts via RoPE extrapolation. This long-context capability allows IXC-2.5 to excel in tasks requiring extensive input and output contexts. Compared to its previous 2.0 version, it features three major upgrades in vision-language comprehension: ultra-high resolution understanding, fine-grained video understanding, and multi-turn multi-image dialogue.

What carries the argument

The RoPE extrapolation from 24K training contexts to 96K inference contexts, applied to a 7B LLM backbone with a vision encoder and optional LoRA adapters for composition tasks.
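The 24K-to-96K stretch can be pictured with a minimal sketch of position interpolation, one standard way to extend RoPE beyond the trained window. The paper does not spell out its exact extrapolation recipe, so the function below and the scale-by-4 choice are illustrative assumptions, not the authors' code.

```python
import math

def rope_angles(dim, positions, base=10000.0, scale=1.0):
    # Standard RoPE: angle[p][i] = (p / scale) / base^(2i/dim).
    # scale > 1 compresses positions so long inputs reuse trained angles.
    inv_freq = [base ** (-2 * i / dim) for i in range(dim // 2)]
    return [[(p / scale) * f for f in inv_freq] for p in positions]

def rotate_pair(x1, x2, theta):
    # RoPE rotates each (even, odd) feature pair by its position's angle.
    c, s = math.cos(theta), math.sin(theta)
    return (x1 * c - x2 * s, x1 * s + x2 * c)

# Serving 96K with a model trained at 24K: scaling positions by 4 maps
# token 96,000 onto the angle that token 24,000 produced during training.
a_train = rope_angles(8, [24_000])[0]
a_long = rope_angles(8, [96_000], scale=4.0)[0]
assert all(abs(t - l) < 1e-9 for t, l in zip(a_train, a_long))
```

Position interpolation trades angular resolution between nearby tokens for range; other recipes (NTK-aware scaling, base-frequency changes) make the same trade differently, which is part of why long-output error analysis matters for the claim.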

If this is right

  • The model outperforms existing open-source state-of-the-art systems on 16 of the 28 evaluated benchmarks.
  • It surpasses or competes closely with GPT-4V and Gemini Pro on 16 key tasks in comprehension and composition.
  • Long-context support enables new uses in tasks that require processing or producing extensive image-text sequences.
  • The three comprehension upgrades plus LoRA-based composition features expand the range of practical applications beyond prior versions.
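The LoRA-based composition features admit a compact sketch: a frozen base weight plus a trainable low-rank update that can be switched on per task. The paper only states that extra LoRA parameters are used, so the dimensions, zero-initialization, and `forward` helper below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 2                          # hidden size and LoRA rank (toy values)
W = rng.normal(size=(d, d))           # frozen base projection
A = rng.normal(size=(r, d)) * 0.01    # trainable down-projection
B = np.zeros((d, r))                  # trainable up-projection, zero-init

def forward(x, use_adapter):
    # Adapted projection: (W + B @ A) @ x, with the low-rank path optional.
    y = W @ x
    if use_adapter:
        y = y + B @ (A @ x)
    return y

x = rng.normal(size=d)
# Zero-initialized B makes the adapter a no-op at first, so composition
# skills can be trained without disturbing shared comprehension weights.
assert np.allclose(forward(x, False), forward(x, True))
```

Only A and B (2·d·r parameters here, versus d² in W) are trained per task, which is how one backbone could host webpage crafting and article composition as detachable skills.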

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the extrapolation technique generalizes, similar small backbones could handle extended multimodal conversations without retraining from scratch.
  • Success in webpage crafting suggests the approach could be adapted for automated document or presentation generation tools.
  • Competing with closed models on selected tasks may accelerate development of accessible alternatives for research and education.

Load-bearing premise

That the 28 benchmarks and 16 key tasks chosen for evaluation accurately reflect real-world multimodal performance, and that context extrapolation introduces no hidden quality loss on long outputs.

What would settle it

A head-to-head test on a long multi-image dialogue or 96K output task where InternLM-XComposer-2.5 scores materially below GPT-4V would falsify the central performance claim.

read the original abstract

We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large-vision language model that supports long-contextual input and output. IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V level capabilities with merely 7B LLM backend. Trained with 24K interleaved image-text contexts, it can seamlessly extend to 96K long contexts via RoPE extrapolation. This long-context capability allows IXC-2.5 to excel in tasks requiring extensive input and output contexts. Compared to its previous 2.0 version, InternLM-XComposer-2.5 features three major upgrades in vision-language comprehension: (1) Ultra-High Resolution Understanding, (2) Fine-Grained Video Understanding, and (3) Multi-Turn Multi-Image Dialogue. In addition to comprehension, IXC-2.5 extends to two compelling applications using extra LoRA parameters for text-image composition: (1) Crafting Webpages and (2) Composing High-Quality Text-Image Articles. IXC-2.5 has been evaluated on 28 benchmarks, outperforming existing open-source state-of-the-art models on 16 benchmarks. It also surpasses or competes closely with GPT-4V and Gemini Pro on 16 key tasks. The InternLM-XComposer-2.5 is publicly available at https://github.com/InternLM/InternLM-XComposer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces InternLM-XComposer-2.5 (IXC-2.5), a 7B-parameter vision-language model supporting long-contextual input and output. Trained on 24K interleaved image-text contexts, it extends to 96K via RoPE extrapolation. Key upgrades include ultra-high resolution understanding, fine-grained video understanding, and multi-turn multi-image dialogue. It adds text-image composition applications (webpage crafting and article composition) via extra LoRA parameters. Evaluations on 28 benchmarks show outperformance over open-source SOTA on 16 benchmarks and competitive or superior results to GPT-4V and Gemini Pro on 16 key tasks.

Significance. If validated, this would be a solid contribution to open-source multimodal modeling by showing competitive GPT-4V-level performance on long-context comprehension and generation tasks using a compact 7B backbone. The public release, emphasis on practical composition applications, and extension of standard RoPE techniques to interleaved multimodal settings are strengths that could influence efficient VLM development.

major comments (2)
  1. The central claim that RoPE extrapolation enables seamless 96K long-context capability for both inputs and outputs (particularly autoregressive generation in composition and multi-turn tasks) lacks supporting analysis. No ablation or error analysis is provided on whether positional errors accumulate in long outputs, which directly underpins the asserted superiority in webpage crafting, article composition, and multi-turn dialogue.
  2. Benchmark evaluation section: performance claims of outperforming open-source models on 16 of 28 benchmarks and competing with GPT-4V on 16 key tasks rest on aggregate scores without reported variance, statistical tests, or full training/data details. This weakens verifiability of the GPT-4V-level and long-context superiority assertions.
minor comments (2)
  1. Abstract: the specific 16 key tasks and the full list of 28 benchmarks are not enumerated, reducing clarity on the scope of the comparisons.
  2. Notation for context lengths should be standardized (e.g., consistently using 'tokens' or 'K' with explicit definition) across sections describing training and inference.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and commit to revisions that strengthen the presentation of our long-context claims and evaluation rigor.

read point-by-point responses
  1. Referee: The central claim that RoPE extrapolation enables seamless 96K long-context capability for both inputs and outputs (particularly autoregressive generation in composition and multi-turn tasks) lacks supporting analysis. No ablation or error analysis is provided on whether positional errors accumulate in long outputs, which directly underpins the asserted superiority in webpage crafting, article composition, and multi-turn dialogue.

    Authors: We appreciate the referee's emphasis on this point. While the empirical results on long-context tasks (multi-turn multi-image dialogue and text-image composition) already show strong performance with 96K contexts, we agree that explicit analysis of positional error accumulation during autoregressive generation would provide more direct support. In the revised manuscript we will add an ablation subsection that measures generation quality degradation and positional error metrics over increasing output lengths up to 96K tokens. revision: yes

  2. Referee: Benchmark evaluation section: performance claims of outperforming open-source models on 16 of 28 benchmarks and competing with GPT-4V on 16 key tasks rest on aggregate scores without reported variance, statistical tests, or full training/data details. This weakens verifiability of the GPT-4V-level and long-context superiority assertions.

    Authors: We acknowledge that reporting variance and statistical tests would increase verifiability. In the revision we will add standard deviations across multiple evaluation runs for the key benchmarks and include pairwise statistical significance tests against the strongest baselines. We will also expand the training and data section with additional hyperparameter and data-mixture details. The complete training code and dataset recipes are already released in the public GitHub repository to support reproducibility. revision: yes
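The significance testing the rebuttal commits to could take the form of a paired bootstrap over per-example scores. Everything below (the function name, the toy score vectors) is a sketch of one standard procedure, not the authors' evaluation code.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    # Resample benchmark items with replacement; report how often system A's
    # mean beats system B's on the same resampled items.
    rng = random.Random(seed)
    n, wins = len(scores_a), 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        da = sum(scores_a[i] for i in idx) / n
        db = sum(scores_b[i] for i in idx) / n
        wins += da > db
    return wins / n_resamples

# Toy per-item correctness for two systems on the same 40 benchmark items.
a = [1] * 28 + [0] * 12   # 70% accuracy
b = [1] * 22 + [0] * 18   # 55% accuracy; here b's errors contain a's
p_a_better = paired_bootstrap(a, b)
```

A win rate near 1.0 would license "surpasses"; values near 0.5 are the "competes closely" regime, which is exactly the distinction the aggregate tables leave unverifiable.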

Circularity Check

0 steps flagged

No circularity in empirical model claims

full rationale

The paper reports empirical benchmark results on 28 external tasks, with performance claims resting on direct comparisons to GPT-4V, Gemini Pro, and open-source baselines rather than any internal derivation. Long-context support is implemented via standard RoPE extrapolation from a 24K training regime, with no equations or self-referential definitions that reduce reported capabilities to quantities fitted inside the same work. Self-citations to the prior 2.0 version exist but are not load-bearing for the new results, which remain independently falsifiable on public benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The work relies on standard transformer training with many unfixed hyperparameters for pretraining, instruction tuning, and LoRA adaptation; RoPE scaling parameters are chosen to reach 96K from 24K training data.

free parameters (1)
  • RoPE extrapolation scaling factors
    Chosen to extend context from 24K training to 96K inference without additional training data.

pith-pipeline@v0.9.0 · 5665 in / 1169 out tokens · 45550 ms · 2026-05-17T10:41:59.051328+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.DimensionForcing eight_tick_forces_D3 · unclear

    Relation between the paper passage and the cited Recognition theorem.

    Trained with 24K interleaved image-text contexts, it can seamlessly extend to 96K long contexts via RoPE extrapolation.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 16 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    cs.CV 2024-09 accept novelty 8.0

    Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

  2. MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

    cs.CL 2024-09 accept novelty 8.0

    MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

  3. EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

    cs.CV 2026-05 unverdicted novelty 7.0

    A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.

  4. Beyond GSD-as-Token: Continuous Scale Conditioning for Remote Sensing VLMs

    cs.CV 2026-05 unverdicted novelty 7.0

    ScaleEarth conditions remote sensing VLMs on continuous GSD via CS-HLoRA and a visual GSD predictor, creating a closed training loop with GeoScale-VQA to achieve SOTA on Earth observation benchmarks.

  5. OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning

    cs.CV 2026-04 unverdicted novelty 7.0

    OASIS organizes streaming video into hierarchical events and retrieves memory on-demand via intent-driven refinement to improve long-horizon accuracy and compositional reasoning with bounded token costs.

  6. Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    RoleJudge is a multidimensional evaluation framework for speech-character alignment in audio LLMs, backed by the RoleChat dataset and multi-stage RL training with standard alignment to reduce reward issues.

  7. GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing

    cs.CV 2026-04 unverdicted novelty 7.0

    GeoMMBench reveals deficiencies in current multimodal LLMs for geoscience tasks while GeoMMAgent demonstrates that tool-integrated agents achieve significantly higher performance.

  8. WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

    cs.CV 2025-02 unverdicted novelty 7.0

    WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.

  9. PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

    cs.CV 2024-10 accept novelty 7.0

    PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.

  10. S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    S2H-DPO generates hierarchical prompt-driven preference pairs to improve multi-image reasoning in VLMs while keeping single-image performance intact.

  11. UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing

    cs.CV 2026-04 unverdicted novelty 6.0

    UHR-BAT is a budget-aware framework that uses text-guided multi-scale importance estimation plus region-wise preserve and merge strategies to compress visual tokens in ultra-high-resolution remote sensing vision-langu...

  12. Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward

    cs.CV 2026-04 unverdicted novelty 6.0

    Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and interpretability.

  13. Visual-RFT: Visual Reinforcement Fine-Tuning

    cs.CV 2025-03 conditional novelty 6.0

    Visual-RFT applies reinforcement learning with verifiable perception rewards to improve large vision-language models on fine-grained classification, few-shot detection, and grounding tasks.

  14. LLaVA-Video: Video Instruction Tuning With Synthetic Data

    cs.CV 2024-10 unverdicted novelty 6.0

    LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.

  15. LLaVA-OneVision: Easy Visual Task Transfer

    cs.CV 2024-08 unverdicted novelty 5.0

    LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.

  16. Show-o2: Improved Native Unified Multimodal Models

    cs.CV 2025-06 unverdicted novelty 4.0

    Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.

Reference graph

Works this paper leans on

183 extracted references · 183 canonical work pages · cited by 16 Pith papers · 33 internal anchors

  1. [1]

    Nocaps: Novel object captioning at scale

    Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. Nocaps: Novel object captioning at scale. In Proceedings of the IEEE/CVF In- ternational Conference on Computer Vision (ICCV), 2019. 8

  2. [2]

    Flamingo: a visual language model for few-shot learning,

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, An- toine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Se- bastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sa- hand Sharifzadeh, Mikolaj Bink...

  3. [3]

    Claude 3 haiku: our fastest model yet,

    Anthropic. Claude 3 haiku: our fastest model yet,

  4. [4]

    Available at: https://www.anthropic.com/ news/claude-3-haiku. 1, 8

  5. [5]

    Lawrence Zitnick, and Devi Parikh

    Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion (ICCV), 2015. 8

  6. [6]

    Openflamingo: An open- source framework for training large autoregressive vision- language models

    Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Worts- man, and Ludwig Schmidt. Openflamingo: An open- source framework for training large autoregressive vision- language models. arXiv.org, 2023. 2

  7. [7]

    Qwen-VL: A frontier large vision-language model with versatile abilities

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv.org, 2023. 2, 9

  8. [8]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022. 6

  9. [9]

    Baichuan 2: Open large-scale language models

    Baichuan. Baichuan 2: Open large-scale language models. arXiv.org, 2023. 2

  10. [10]

    Introducing our multimodal models, 2023

    Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sa˘gnak Tas ¸ırlar. Introducing our multimodal models, 2023. 2

  11. [11]

    pix2code: Generating code from a graph- ical user interface screenshot

    Tony Beltramelli. pix2code: Generating code from a graph- ical user interface screenshot. In Proceedings of the ACM SIGCHI symposium on engineering interactive computing systems, 2018. 6

  12. [12]

    Scene text visual question answering

    Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marc ¸al Rusinol, Ernest Valveny, CV Jawahar, and Dimos- thenis Karatzas. Scene text visual question answering. In ICCV, 2019. 8

  13. [13]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neu- ral Information Processing Systems (NeurIPS) , 33:1877– 1901, 2020. 2

  14. [14]

    InternLM2 Technical Report

    Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting Huang, Tao Jiang, Penglong Jiao, Zhenjiang Jin, Zhikai Lei, Jiax- ing Li, Jingwen Li, Linyang Li,...

  15. [15]

    DualFocus: Integrating macro and micro per- spectives in multi-modal large language models

    Yuhang Cao, Pan Zhang, Xiaoyi Dong, Dahua Lin, and Ji- aqi Wang. DualFocus: Integrating macro and micro per- spectives in multi-modal large language models. arXiv preprint arXiv:2402.14767, 2024. 2

  16. [16]

    ALLaV A harness- ing gpt4v-synthesized data for a lite vision-language model

    Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Juny- ing Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jian- quan Li, Xiang Wan, and Benyou Wang. ALLaV A harness- ing gpt4v-synthesized data for a lite vision-language model. arXiv preprint arXiv:2402.11684, 2024. 8

  17. [17]

    Shikra: Unleashing multimodal llm’s referential dialogue magic

    Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv.org, 2023. 8

  18. [18]

    ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

    Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023. 2, 8

  19. [19]

    Are We on the Right Way for Evaluating Large Vision-Language Models?

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, and Feng Zhao. Are we on the right way for evaluating large vision-language models? arXiv preprint arXiv:2403.20330, 2024. 2, 9

  20. [20]

    ShareGPT4Video: Improving video understanding and generation with better captions

    Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, et al. ShareGPT4Video: Improving video understanding and generation with better captions. arXiv preprint arXiv:2406.04325, 2024. 8

  21. [21]

    TabFact: A large-scale dataset for table-based fact verification

    Wenhu Chen, Hongmin Wang, Jianshu Chen, Yunkai Zhang, Hong Wang, Shiyang Li, Xiyou Zhou, and William Yang Wang. TabFact: A large-scale dataset for table-based fact verification. In Proceedings of the Inter- 11 national Conference on Learning Representations (ICLR) ,

  22. [22]

    Lawrence Zitnick

    Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. Microsoft coco captions: Data collection and eval- uation server, 2015. 8

  23. [23]

    Pali-x: On scaling up a multilingual vision and language model, 2023

    Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, Siamak Shak- eri, Mostafa Dehghani, Daniel Salz, Mario Lucic, Michael Tschannen, Arsha Nagrani, Hexiang Hu, Mandar Joshi, Bo Pang, Ceslee Montgomery, Paulina Pietrzyk, Marvin Ritter, AJ Piergiovanni, Matthias ...

  24. [24]

    Pali-3 vision language models: Smaller, faster, stronger, 2023

    Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul V oigtlaender, Basil Mustafa, Sebas- tian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, Daniel Salz, Xi Xiong, Daniel Vlasic, Filip Pavetic, Keran Rong, Tianli Yu, Daniel Keysers, Xiaohua Zhai, and Radu Soricut. Pali-3 vision language models: Smaller, faster, stronger, 2023

  25. [25]

    Pali: A jointly-scaled multilingual language- image model, 2023

    Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergio- vanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Has- san Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, ...

  26. [26]

    InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023. 2

  27. [27]

    How far are we to gpt- 4v? closing the gap to commercial multimodal models with open-source suites, 2024

    Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhang- wei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Botian Shi, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Min Dou, Lewei Lu, Xizhou Zhu, Tong ...

  28. [28]

    Gonzalez, Ion Stoica, and Eric P

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhang- hao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023. 8

  29. [29]

    Icdar2019 robust read- ing challenge on arbitrary-shaped text-rrc-art

    Chee Kheng Chng, Yuliang Liu, Yipeng Sun, Chun Chet Ng, Canjie Luo, Zihan Ni, ChuanMing Fang, Shuaitao Zhang, Junyu Han, Errui Ding, et al. Icdar2019 robust read- ing challenge on arbitrary-shaped text-rrc-art. In Interna- tional Conference on Document Analysis and Recognition (ICDAR), 2019. 8

  30. [30]

    Palm: Scaling language modeling with pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv.org, 2022. 1, 2

  31. [31]

    Opencompass: A univer- sal evaluation platform for foundation models

    OpenCompass Contributors. Opencompass: A univer- sal evaluation platform for foundation models. https: / / github . com / open - compass / opencompass,

  32. [32]

    Instructblip: Towards general- purpose vision-language models with instruction tuning,

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general- purpose vision-language models with instruction tuning,

  33. [33]

    Moura, Devi Parikh, and Dhruv Batra

    Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, Jos´e M.F. Moura, Devi Parikh, and Dhruv Batra. Visual Dialog. In Proceedings of the IEEE Confer- ence on Computer Vision and Pattern Recognition (CVPR),

  34. [34]

    InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

    Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jing- wen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, and Jiaqi Wang. Internlm-xcomposer2: Mastering free-form text-image composition and compre...

  35. [35]

    Internlm-xcomposer2- 4khd: A pioneering large vision-language model handling resolutions from 336 pixels to 4k hd

    Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Zhe Chen, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Kai Chen, Conghui He, Xingcheng Zhang, Jifeng Dai, Yu Qiao, Dahua Lin, and Jiaqi Wang. Internlm-xcomposer2- 4khd: A pioneering large vision-language mod...

  36. [36]

    Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duck- worth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. Palm-e: An embodie...

  37. [37]

    KTO: Model Alignment as Prospect Theoretic Optimization

    Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. KTO: Model align- ment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306, 2024. 6

  38. [38]

    ActivityNet: A large-scale video 12 benchmark for human activity understanding

    Bernard Ghanem Fabian Caba Heilbron, Victor Escorcia and Juan Carlos Niebles. ActivityNet: A large-scale video 12 benchmark for human activity understanding. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2015. 2, 8

  39. [39]

    MMBench-Video: A long-form multi-shot benchmark for holistic video under- standing

    Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. MMBench-Video: A long-form multi-shot benchmark for holistic video under- standing. arXiv preprint arXiv:2406.14515, 2024. 2, 9

  40. [40]

    Mmbench-video: A long-form multi-shot benchmark for holistic video under- standing, 2024

    Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. Mmbench-video: A long-form multi-shot benchmark for holistic video under- standing, 2024. 2

  41. [41]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jin- rui Yang, Xiawu Zheng, Ke Li, Xing Sun, and Ron- grong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023. 2, 9

  42. [42]

    A challenger to gpt-4v? early explorations of gemini in visual expertise

    Chaoyou Fu, Renrui Zhang, Zihan Wang, Yubo Huang, Zhengye Zhang, Longtian Qiu, Gaoxiang Ye, Yunhang Shen, Mengdan Zhang, Peixian Chen, Sirui Zhao, Shao- hui Lin, Deqiang Jiang, Di Yin, Peng Gao, Ke Li, Hong- sheng Li, and Xing Sun. A challenger to gpt-4v? early explorations of gemini in visual expertise. arXiv preprint arXiv:2312.12436, 2023. 1, 2

  43. [43]

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    Chaoyou Fu, Yuhan Dai, Yondong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. arXiv preprint arXiv:2405.21075, 2024. 2, 9

  44. [44]

    ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

    Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, et al. ChatGLM: A family of large language models from GLM-130B to GLM-4 All Tools. arXiv preprint arXiv:2406.12793, 2024. 9

  45. [45]

    HallusionBench: An advanced diagnostic suite for entangled language hallucination & visual illusion in large vision-language models, 2023

    Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. HallusionBench: An advanced diagnostic suite for entangled language hallucination & visual illusion in large vision-language models, 2023. 2, 9

  46. [46]

    MA-LMM: Memory-augmented large multimodal model for long-term video understanding

    Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, and Ser-Nam Lim. MA-LMM: Memory-augmented large multimodal model for long-term video understanding. arXiv preprint arXiv:2404.05726, 2024. 2

  47. [47]

    WanJuan: A comprehensive multimodal dataset for advancing English and Chinese large models

    Conghui He, Zhenjiang Jin, Chaoxi Xu, Jiantao Qiu, Bin Wang, Wei Li, Hang Yan, Jiaqi Wang, and Da Lin. WanJuan: A comprehensive multimodal dataset for advancing English and Chinese large models. ArXiv, abs/2308.10755, 2023.

  48. [48]

    CogAgent: A visual language model for GUI agents

    Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. CogAgent: A visual language model for GUI agents. arXiv preprint arXiv:2312.08914, 2023.

  49. [49]

    CogAgent: A visual language model for GUI agents

    Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. CogAgent: A visual language model for GUI agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 10

  50. [51]

    mPLUG-DocOwl 1.5: Unified structure learning for ocr-free document understanding

    Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, et al. mPLUG-DocOwl 1.5: Unified structure learning for OCR-free document understanding. arXiv preprint arXiv:2403.12895, 2024. 9

  51. [52]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In Proceedings of the International Conference on Learning Representations (ICLR), 2022. 8

  52. [53]

    From image to video, what do we need in multimodal LLMs?

    Suyuan Huang, Haoxin Zhang, Yan Gao, Yao Hu, and Zengchang Qin. From image to video, what do we need in multimodal LLMs? arXiv preprint arXiv:2404.11865, 2024. 2

  53. [54]

    GQA: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 8

  54. [55]

    Video ReCap: Recursive captioning of hour-long videos

    Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Nagarajan, Lorenzo Torresani, and Gedas Bertasius. Video ReCap: Recursive captioning of hour-long videos. arXiv preprint arXiv:2402.13250, 2024. 2

  55. [56]

    Mistral 7B

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B, 2023. 1, 2

  56. [57]

    Mantis: Interleaved multi-image instruction tuning, 2024

    Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, and Wenhu Chen. Mantis: Interleaved multi-image instruction tuning, 2024. 2

  57. [58]

    Chat-UniVi: Unified visual representation empowers large language models with image and video understanding

    Peng Jin, Ryuichi Takanobu, Caiwan Zhang, Xiaochun Cao, and Li Yuan. Chat-UniVi: Unified visual representation empowers large language models with image and video understanding. arXiv preprint arXiv:2311.08046, 2023. 2

  58. [59]

    DVQA: Understanding data visualizations via question answering

    Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. DVQA: Understanding data visualizations via question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

  59. [60]

    Language repository for long video understanding

    Kumara Kahatapitiya, Kanchana Ranasinghe, Jongwoo Park, and Michael S Ryoo. Language repository for long video understanding. arXiv preprint arXiv:2403.14622, 2024.

  60. [61]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

  61. [62]

    A diagram is worth a dozen images

    Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In Proceedings of the European Conference on Computer Vision (ECCV), 2016. 2, 8, 9

  62. [63]

    Are you smarter than a sixth grader? textbook question answering for multimodal machine comprehension

    Aniruddha Kembhavi, Minjoon Seo, Dustin Schwenk, Jonghyun Choi, Ali Farhadi, and Hannaneh Hajishirzi. Are you smarter than a sixth grader? textbook question answering for multimodal machine comprehension. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 8

  63. [64]

    An image grid can be worth a video: Zero-shot video question answering using a VLM, 2024

    Wonkyun Kim, Changin Choi, Wonseok Lee, and Wonjong Rhee. An image grid can be worth a video: Zero-shot video question answering using a VLM, 2024. 6

  64. [65]

    Unlocking the conversion of web screenshots into HTML code with the WebSight dataset

    Hugo Laurençon, Léo Tronchon, and Victor Sanh. Unlocking the conversion of web screenshots into HTML code with the WebSight dataset. arXiv preprint arXiv:2403.09029, 2024.

  65. [66]

    ViQuAE, a dataset for knowledge-based visual question answering about named entities

    Paul Lerner, Olivier Ferret, Camille Guinaudeau, Hervé Le Borgne, Romaric Besançon, José G Moreno, and Jesús Lovón Melgarejo. ViQuAE, a dataset for knowledge-based visual question answering about named entities. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 3108–3120, 2022. 8

  66. [67]

    SEED-Bench: Benchmarking multimodal LLMs with generative comprehension, 2023

    Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. SEED-Bench: Benchmarking multimodal LLMs with generative comprehension, 2023. 2, 9

  67. [68]

    OtterHD: A high-resolution multi-modality model, 2023

    Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, and Ziwei Liu. OtterHD: A high-resolution multi-modality model, 2023. 2

  68. [69]

    Otter: A multi-modal model with in-context instruction tuning

    Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv.org, 2023. 2

  69. [70]

    VideoChat: Chat-Centric Video Understanding

    KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. VideoChat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023. 2

  70. [71]

    MVBench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. MVBench: A comprehensive multi-modal video understanding benchmark. arXiv preprint arXiv:2311.17005, 2023.

  71. [72]

    MVBench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. MVBench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 2, 9

  72. [73]

    Silkie: Preference distillation for large visual language models

    Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, and Lingpeng Kong. Silkie: Preference distillation for large visual language models. arXiv preprint arXiv:2312.10665, 2023. 6

  73. [74]

    LLaMA-VID: An image is worth 2 tokens in large language models

    Yanwei Li, Chengyao Wang, and Jiaya Jia. LLaMA-VID: An image is worth 2 tokens in large language models. arXiv preprint arXiv:2311.17043, 2023. 2

  74. [75]

    Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

    Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. Mini-Gemini: Mining the potential of multi-modality vision language models. arXiv preprint arXiv:2403.18814, 2024.

  75. [76]

    Super-CLEVR: A virtual benchmark to diagnose domain robustness in visual reasoning

    Zhuowan Li, Xingrui Wang, Elias Stengel-Eskin, Adam Kortylewski, Wufei Ma, Benjamin Van Durme, and Alan L Yuille. Super-CLEVR: A virtual benchmark to diagnose domain robustness in visual reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 8

  76. [77]

    Monkey: Image resolution and text label are important things for large multi-modal models

    Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, and Xiang Bai. Monkey: Image resolution and text label are important things for large multi-modal models. arXiv preprint arXiv:2311.06607, 2023. 2

  77. [78]

    Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-LLaVA: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023. 2

  78. [79]

    VILA: On pre-training for visual language models, 2024

    Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, and Song Han. VILA: On pre-training for visual language models, 2024. 2, 9

  79. [80]

    SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models

    Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, et al. SPHINX: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint arXiv:2311.07575, 2023. 2

  80. [81]

    CLEVR-Math: A dataset for compositional language, visual and mathematical reasoning

    Adam Dahlgren Lindström and Savitha Sam Abraham. CLEVR-Math: A dataset for compositional language, visual and mathematical reasoning. arXiv preprint arXiv:2208.05358, 2022. 8
