pith. machine review for the scientific record.

arxiv: 2406.09411 · v2 · submitted 2024-06-13 · 💻 cs.CV · cs.AI · cs.CL

Recognition: no theorem link

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 01:04 UTC · model grok-4.3

classification 💻 cs.CV cs.AI cs.CL
keywords MuirBench · multi-image understanding · multimodal LLMs · benchmark · scene understanding · temporal relations · multiview relations · AI evaluation

The pith

MuirBench reveals that even leading multimodal LLMs such as GPT-4o reach only 68.0 percent accuracy on multi-image tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MuirBench, a benchmark of 12 multi-image tasks spanning 10 categories of multi-image relations, built from 11,264 images and 2,600 multiple-choice questions. Each standard question is paired with a minimally different unanswerable variant, creating a reliable test of whether models can handle information spread across several images rather than any single one. Evaluations of 20 models show closed-source leaders such as GPT-4o at 68.0 percent and Gemini Pro at 49.3 percent accuracy, while open-source models trained on single images remain below 33.3 percent. These gaps indicate that current systems have not learned to integrate relations such as temporal order or multiview geometry across images. The benchmark is presented as a tool to push development of multimodal models that look beyond isolated images.

Core claim

MuirBench is constructed in a pairwise manner so that each standard multi-image instance has a corresponding unanswerable variant with only minimal semantic differences, allowing a direct measurement of robust multi-image understanding; evaluations across 20 recent multimodal LLMs show that even the strongest models reach at most 68 percent accuracy and that single-image-trained open-source models stay below 33.3 percent.

What carries the argument

MuirBench benchmark, which pairs each standard multi-image question with a minimally altered unanswerable variant to isolate genuine multi-image comprehension from single-image pattern matching.
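One way the paired design could be turned into a stricter score is to credit a model only when it both answers the standard question correctly and rejects its unanswerable twin. The sketch below is illustrative, not the paper's reported metric; the `Pair` fields and the `UNANSWERABLE` label are assumed names.

```python
from dataclasses import dataclass

@dataclass
class Pair:
    """A standard question and its minimally different unanswerable variant."""
    standard_answer: str      # gold option for the answerable version
    model_on_standard: str    # model's choice on the answerable version
    model_on_variant: str     # model's choice on the unanswerable version

UNANSWERABLE = "None of the choices"  # hypothetical label for the reject option

def robust_accuracy(pairs):
    """Credit a pair only when the model answers the standard question
    correctly AND rejects its unanswerable twin."""
    robust = sum(
        p.model_on_standard == p.standard_answer
        and p.model_on_variant == UNANSWERABLE
        for p in pairs
    )
    return robust / len(pairs)

pairs = [
    Pair("B", "B", UNANSWERABLE),  # correct on both halves of the pair
    Pair("C", "C", "C"),           # answers the unanswerable variant anyway
]
print(robust_accuracy(pairs))  # 0.5
```

A pairwise score like this is deliberately harsher than per-question accuracy, since single-image pattern matching can pass one half of a pair but rarely both.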

If this is right

  • Open-source multimodal LLMs trained on single images fail to generalize to questions that require relations across multiple images.
  • Model development should incorporate training signals that explicitly address multi-image relations such as temporal and multiview connections.
  • The pairwise construction provides a concrete method for creating more reliable tests of integrated visual reasoning.
  • Applications that depend on comparing or ordering information from several images will remain limited until multi-image performance improves.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same minimal-difference pairing technique could be adapted to build harder tests for video or 3-D scene understanding.
  • Persistent gaps between closed- and open-source models suggest that scaling multi-image training data alone may be insufficient without new architectural components for cross-image attention.
  • If the benchmark becomes widely adopted, it could shift evaluation standards away from single-image accuracy toward explicit multi-image metrics.
  • Developers could use the unanswerable variants as negative examples during fine-tuning to reduce hallucinated answers when images are added.
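The last extension above can be made concrete: each benchmark-style pair yields one positive and one negative fine-tuning record, with the unanswerable twin supervised toward an explicit reject option. A minimal sketch, assuming a simple prompt/target record format and a hypothetical reject label:

```python
def to_training_records(pairs, reject_option="None of the choices"):
    """Turn paired instances into fine-tuning records: the standard
    question keeps its gold answer, while its unanswerable twin is
    supervised toward the explicit reject option (a negative example
    meant to curb hallucinated answers when images are added)."""
    records = []
    for standard_q, gold_answer, variant_q in pairs:
        records.append({"prompt": standard_q, "target": gold_answer})
        records.append({"prompt": variant_q, "target": reject_option})
    return records

# Hypothetical pair: same question, but the variant's image set no longer
# contains the information needed to answer it.
pairs = [("Q1 over images {A, B}", "B", "Q1 over images {A, C}")]
records = to_training_records(pairs)
print(len(records))  # 2
```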

Load-bearing premise

The assumption that pairing each standard instance with an unanswerable variant differing by only minimal semantic changes reliably isolates multi-image understanding without introducing new biases or artifacts in how the questions are written.

What would settle it

If a model scores substantially higher on MuirBench by detecting patterns specific to the unanswerable variants rather than by reasoning across the actual image sets, the benchmark would fail to measure true multi-image capability.
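That failure mode is testable without any model: if a text-only classifier can separate standard from unanswerable questions, the labels leak through phrasing rather than through the images. A crude in-sample probe, written as an illustrative sketch rather than a method from the paper:

```python
from collections import defaultdict

def leakage_probe(questions, labels):
    """Crude artifact probe: label each question as answerable (0) or
    unanswerable (1) from its words alone, ignoring the images.
    In-sample accuracy far above chance would suggest the unanswerable
    variants are detectable from phrasing, not from cross-image
    reasoning. This scores on the same data it counted, so it is an
    optimistic upper bound, not a held-out estimate."""
    votes = defaultdict(lambda: [0, 0])  # token -> [class-0 count, class-1 count]
    for q, y in zip(questions, labels):
        for tok in set(q.lower().split()):
            votes[tok][y] += 1
    correct = 0
    for q, y in zip(questions, labels):
        score = sum(votes[t][1] - votes[t][0] for t in set(q.lower().split()))
        correct += (1 if score > 0 else 0) == y
    return correct / len(labels)

questions = ["which image comes first",
             "which image comes first if any does"]
labels = [0, 1]  # 1 marks the unanswerable variant
print(leakage_probe(questions, labels))  # 1.0
```

A probe score near chance would support the benchmark's minimal-difference claim; a high score would point to exactly the shortcut described above.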

read the original abstract

We introduce MuirBench, a comprehensive benchmark that focuses on robust multi-image understanding capabilities of multimodal LLMs. MuirBench consists of 12 diverse multi-image tasks (e.g., scene understanding, ordering) that involve 10 categories of multi-image relations (e.g., multiview, temporal relations). Comprising 11,264 images and 2,600 multiple-choice questions, MuirBench is created in a pairwise manner, where each standard instance is paired with an unanswerable variant that has minimal semantic differences, in order for a reliable assessment. Evaluated upon 20 recent multi-modal LLMs, our results reveal that even the best-performing models like GPT-4o and Gemini Pro find it challenging to solve MuirBench, achieving 68.0% and 49.3% in accuracy. Open-source multimodal LLMs trained on single images can hardly generalize to multi-image questions, hovering below 33.3% in accuracy. These results highlight the importance of MuirBench in encouraging the community to develop multimodal LLMs that can look beyond a single image, suggesting potential pathways for future improvements.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces MuirBench, a benchmark for robust multi-image understanding in multimodal LLMs. It comprises 12 tasks involving 10 categories of multi-image relations (e.g., multiview, temporal), built from 11,264 images and 2,600 multiple-choice questions. The benchmark is constructed in a pairwise manner, with each standard instance paired to an unanswerable variant having minimal semantic differences for reliable assessment. Evaluations across 20 recent MLLMs show top closed-source models (GPT-4o at 68.0%, Gemini Pro at 49.3%) struggling, while open-source models trained on single images perform below 33.3%.

Significance. If the pairwise construction holds, MuirBench would be a useful addition to the field by exposing clear generalization gaps from single-image to multi-image settings and motivating targeted improvements in MLLM architectures. The design of pairing standard and unanswerable variants is a reasonable choice for isolating the claimed capabilities without relying on fitted parameters or self-referential derivations.

major comments (1)
  1. [Benchmark construction and question generation (abstract and methods sections)] The central claim that performance gaps (e.g., GPT-4o at 68.0%) reflect deficits in multi-image understanding rather than construction artifacts rests on the assumption that unanswerable variants have only minimal semantic differences from their standard counterparts. The provided text does not include quantitative checks such as human similarity ratings or ablation studies on question phrasing, image selection, or relation encoding across the 12 tasks and 10 relation categories. This validation is load-bearing for interpreting the results as evidence of the benchmark's intended isolation of multi-image capabilities.
minor comments (1)
  1. [Abstract] The abstract provides limited detail on the exact process for generating questions and performing human validation, which would help readers assess potential annotation artifacts.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comment below and have revised the manuscript to strengthen the validation of the benchmark construction.

read point-by-point responses
  1. Referee: [Benchmark construction and question generation (abstract and methods sections)] The central claim that performance gaps (e.g., GPT-4o at 68.0%) reflect deficits in multi-image understanding rather than construction artifacts rests on the assumption that unanswerable variants have only minimal semantic differences from their standard counterparts. The provided text does not include quantitative checks such as human similarity ratings or ablation studies on question phrasing, image selection, or relation encoding across the 12 tasks and 10 relation categories. This validation is load-bearing for interpreting the results as evidence of the benchmark's intended isolation of multi-image capabilities.

    Authors: We agree that quantitative validation of the minimal semantic differences between standard and unanswerable variants is important to support the interpretation of the results. In the revised manuscript, we have added a new subsection (Section 3.3) reporting human similarity ratings collected from 50 annotators on a random sample of 200 pairs across all 10 relation categories and 12 tasks. Annotators rated semantic similarity on a 1-5 scale, yielding an average score of 4.6 with high inter-annotator agreement. We also include ablation studies examining the impact of question phrasing variations and alternative image selections on model performance for representative tasks in multiview, temporal, and spatial relation categories. These results are presented in the updated Methods section and Appendix C. We believe these additions directly address the concern and provide stronger empirical grounding for the benchmark's design. revision: yes
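The rebuttal is simulated and its figures (50 annotators, 200 sampled pairs, an average of 4.6) are illustrative. A minimal sketch of how such a similarity-rating summary could be computed, with a toy within-one-point agreement rate standing in for a formal inter-annotator statistic:

```python
from itertools import combinations
from statistics import mean

def rating_summary(ratings_by_pair):
    """ratings_by_pair: one list of 1-5 similarity ratings per sampled
    question pair. Returns the overall mean rating and a simple
    agreement rate: the fraction of annotator pairs whose ratings on
    the same item differ by at most one point."""
    all_ratings = [r for item in ratings_by_pair for r in item]
    close, total = 0, 0
    for item in ratings_by_pair:
        for a, b in combinations(item, 2):
            close += abs(a - b) <= 1
            total += 1
    return mean(all_ratings), close / total

avg, agreement = rating_summary([[5, 5, 4], [4, 5, 4]])
print(avg, agreement)  # 4.5 1.0
```

A published validation would more likely report a chance-corrected statistic such as Krippendorff's alpha, but the shape of the computation is the same.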

Circularity Check

0 steps flagged

No circularity: direct empirical benchmark construction and evaluation

full rationale

The paper introduces MuirBench through explicit dataset construction (pairwise standard/unanswerable instances across 12 tasks and 10 relation categories) and reports raw accuracy numbers from running existing models on the fixed benchmark. No equations, fitted parameters, predictions, or derivations are claimed; results are direct measurements. The central claim (models struggle with multi-image understanding) rests on these measurements rather than any self-referential reduction. Self-citations, if present, are not load-bearing for the reported numbers. This is a standard empirical benchmark paper with self-contained evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The benchmark rests on standard assumptions from computer vision and NLP about what constitutes distinct multi-image relations and how to construct reliable multiple-choice questions; no new free parameters, axioms beyond domain standards, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5574 in / 1167 out tokens · 46965 ms · 2026-05-17T01:04:18.981811+00:00 · methodology

discussion (0)


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

    cs.CV 2026-04 unverdicted novelty 8.0

    EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

  2. The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space

    cs.CV 2026-05 unverdicted novelty 7.0

    MLLMs scoring 70-83% on Cartesian visual tasks drop to 31-39% on logically equivalent polar versions, exposing reliance on grid discretization shortcuts instead of topology-invariant reasoning.

  3. COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts

    cs.CV 2026-04 unverdicted novelty 7.0

    COHERENCE is a new benchmark for measuring MLLMs' ability to recover fine-grained image-text correspondences in interleaved multimodal contexts.

  4. COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts

    cs.CV 2026-04 unverdicted novelty 7.0

    COHERENCE is a benchmark for MLLMs' fine-grained image-text alignment in interleaved multimodal contexts across four domains, with 6161 questions and six-type error analysis.

  5. CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding

    cs.CV 2026-04 unverdicted novelty 7.0

    CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench an...

  6. ValueGround: Evaluating Culture-Conditioned Visual Value Grounding in MLLMs

    cs.CL 2026-04 unverdicted novelty 7.0

    ValueGround shows MLLMs drop from 72.8% text-only accuracy to 65.8% on visual cultural value grounding across 13 countries, despite 92.8% image-option alignment, with all models prone to prediction reversals.

  7. LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

    cs.CV 2024-07 unverdicted novelty 7.0

    LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving ...

  8. MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction

    cs.CL 2026-04 unverdicted novelty 6.0

    MiniCPM-o 4.5 uses the Omni-Flow streaming framework to deliver real-time full-duplex omni-modal interaction with proactive behavior in a 9B model that approaches Gemini 2.5 Flash performance.

  9. S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    S2H-DPO generates hierarchical prompt-driven preference pairs to improve multi-image reasoning in VLMs while keeping single-image performance intact.

  10. Decoding the Pulse of Reasoning VLMs in Multi-Image Understanding Tasks

    cs.CV 2026-03 unverdicted novelty 6.0

    PulseFocus improves multi-image reasoning in VLMs by interleaving planning and attention-gated focus blocks during chain-of-thought, achieving gains on BLINK and MuirBench.

  11. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    cs.CV 2025-08 unverdicted novelty 6.0

    InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

  12. GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    cs.CV 2025-07 unverdicted novelty 6.0

    GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.

  13. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    cs.CV 2025-04 conditional novelty 6.0

    InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

  14. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  15. Context Unrolling in Omni Models

    cs.CV 2026-04 unverdicted novelty 5.0

    Omni is a multimodal model whose native training on diverse data types enables context unrolling, allowing explicit reasoning across modalities to better approximate shared knowledge and improve downstream performance.

  16. Seed1.8 Model Card: Towards Generalized Real-World Agency

    cs.AI 2026-03 unverdicted novelty 5.0

    Seed1.8 is a new foundation model that adds unified agentic capabilities for search, code execution, and GUI interaction to existing LLM and vision strengths.

  17. LLaVA-OneVision: Easy Visual Task Transfer

    cs.CV 2024-08 unverdicted novelty 5.0

    LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.

  18. LLM Harms: A Taxonomy and Discussion

    cs.CY 2025-12 unverdicted novelty 3.0

    This paper proposes a taxonomy of LLM harms in five categories and suggests mitigation strategies plus a dynamic auditing system for responsible development.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · cited by 17 Pith papers · 12 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022

  2. [2]

    OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

    Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint ar...

  3. [3]

    Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023

  4. [4]

    Visual question answering on image sets

    Ankan Bansal, Yuting Zhang, and Rama Chellappa. Visual question answering on image sets. In European Conference on Computer Vision, pages 51–67, 2020

  5. [5]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  6. [6]

    Training small multimodal models to bridge biomedical competency gap: A case study in radiology imaging

    Juan Manuel Zambrano Chaves, Shih-Cheng Huang, Yanbo Xu, Hanwen Xu, Naoto Usuyama, Sheng Zhang, Fei Wang, Yujia Xie, Mahmoud Khademi, Ziyi Yang, et al. Training small multimodal models to bridge biomedical competency gap: A case study in radiology imaging. arXiv preprint arXiv:2403.08002, 2024

  7. [7]

    Minigpt-v2: large language model as a unified interface for vision-language multi-task learning, 2023

    Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning, 2023

  8. [8]

    MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

    Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478, 2023

  9. [9]

    InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Zhong Muyan, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023

  10. [10]

    A picture is worth more words over time: Multimodality and narrative structure across eight decades of american superhero comics

    Neil Cohn, Ryan Taylor, and Kaitlin Pederson. A picture is worth more words over time: Multimodality and narrative structure across eight decades of american superhero comics. Multimodal Communication, 6(1):19–37, 2017

  11. [11]

    Opencompass: A universal evaluation platform for foundation models

    OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023

  12. [12]

    Xtuner: A toolkit for efficiently fine-tuning llm

    XTuner Contributors. Xtuner: A toolkit for efficiently fine-tuning llm. https://github.com/InternLM/xtuner, 2023

  13. [13]

    Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023

  14. [14]

    Eva: Exploring the limits of masked visual representation learning at scale

    Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva: Exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19358–19369, 2023

  15. [15]

    The geometry of multiple images: the laws that govern the formation of multiple images of a scene and some of their applications

    Olivier Faugeras, Quang-Tuan Luong, and Theo Papadopoulo. The geometry of multiple images: the laws that govern the formation of multiple images of a scene and some of their applications. MIT press, 2001

  16. [16]

    Unsupervised learning for physical interaction through video prediction

    Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. Advances in neural information processing systems, 29, 2016

  17. [17]

    BLINK: Multimodal Large Language Models Can See but Not Perceive

    Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. arXiv preprint arXiv:2404.12390, 2024

  18. [18]

    There’s a time and place for reasoning beyond the image

    Xingyu Fu, Ben Zhou, Ishaan Chandratreya, Carl Vondrick, and Dan Roth. There’s a time and place for reasoning beyond the image. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), May 2022

  19. [19]

    Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017

  20. [20]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022

  21. [21]

    Why is a picture worth a thousand words?

    George L Gropper. Why is a picture worth a thousand words? Audio Visual Communication Review, pages 75–95, 1963

  22. [22]

    HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

    Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. Hallusionbench: An advanced diagnostic suite for entangled language hallucination & visual illusion in large vision-language models. arXiv preprint arXiv:2310.14566, 2023

  23. [23]

    Im2gps: estimating geographic information from a single image

    James Hays and Alexei A Efros. Im2gps: estimating geographic information from a single image. In 2008 ieee conference on computer vision and pattern recognition, pages 1–8. IEEE, 2008

  24. [24]

    A picture is worth a thousand words: Using visual images to improve comprehension for middle school struggling readers

    Anne Nielsen Hibbing and Joan L Rankin-Erickson. A picture is worth a thousand words: Using visual images to improve comprehension for middle school struggling readers. The reading teacher, 56(8):758–770, 2003

  25. [25]

    Mantis: Interleaved multi-image instruction tuning

    Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, and Wenhu Chen. Mantis: Interleaved multi-image instruction tuning. arXiv preprint arXiv:2405.01483, 2024

  26. [26]

    Many-shot in-context learning in multimodal foundation models

    Yixing Jiang, Jeremy Irvin, Ji Hun Wang, Muhammad Ahmed Chaudhry, Jonathan H Chen, and Andrew Y Ng. Many-shot in-context learning in multimodal foundation models. arXiv preprint arXiv:2405.09798, 2024

  27. [27]

    Deep learning for multiple-image super-resolution

    Michal Kawulok, Pawel Benecki, Szymon Piechaczek, Krzysztof Hrynczenko, Daniel Kostrzewa, and Jakub Nalepa. Deep learning for multiple-image super-resolution. IEEE Geoscience and Remote Sensing Letters, 17(6):1062–1066, 2019

  28. [28]

    Obelics: An open web-scale filtered dataset of interleaved image-text documents

    Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander Rush, Douwe Kiela, et al. Obelics: An open web-scale filtered dataset of interleaved image-text documents. Advances in Neural Information Processing Systems, 36, 2024

  29. [29]

    What matters when building vision-language models?, 2024

    Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision-language models?, 2024

  30. [30]

    Seed-bench-2: Benchmarking multimodal large language models

    Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. Seed-bench-2: Benchmarking multimodal large language models. arXiv preprint arXiv:2311.17092, 2023

  31. [31]

    Fine-tuning multimodal llms to follow zero-shot demonstrative instructions

    Juncheng Li, Kaihang Pan, Zhiqi Ge, Minghe Gao, Wei Ji, Wenqiao Zhang, Tat-Seng Chua, Siliang Tang, Hanwang Zhang, and Yueting Zhuang. Fine-tuning multimodal llms to follow zero-shot demonstrative instructions. In The Twelfth International Conference on Learning Representations, 2024

  32. [32]

    Vila: On pre-training for visual language models, 2023

    Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models, 2023

  33. [33]

    Visual spatial reasoning

    Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning. Transactions of the Association for Computational Linguistics, 11:635–651, 2023

  34. [34]

    Improved baselines with visual instruction tuning, 2023

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023

  35. [35]

    Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

  36. [36]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024

  37. [37]

    MMBench: Is Your Multi-modal Model an All-around Player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023

  38. [38]

    On the hidden mystery of ocr in large multimodal models

    Yuliang Liu, Zhang Li, Hongliang Li, Wenwen Yu, Mingxin Huang, Dezhi Peng, Mingyu Liu, Mingrui Chen, Chunyuan Li, Lianwen Jin, et al. On the hidden mystery of ocr in large multimodal models. arXiv preprint arXiv:2305.07895, 2023

  39. [40]

    Unified-io 2: Scaling autoregressive multimodal models with vision, language, audio, and action

    Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, and Aniruddha Kembhavi. Unified-io 2: Scaling autoregressive multimodal models with vision, language, audio, and action. arXiv preprint arXiv:2312.17172, 2023

  40. [41]

    Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. In The Twelfth International Conference on Learning Representations, 2024

  41. [42]

    Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning

    Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning. In The 59th Annual Meeting of the Association for Computational Linguistics (ACL), 2021

  42. [43]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural Information Process- ing Systems (NeurIPS), 2022

  43. [44]

    Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning

    Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021

  44. [45]

    Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023

  45. [46]

    ChartQA: A benchmark for question answering about charts with visual and logical reasoning

    Ahmed Masry, Do Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2263–2279, Dublin, Ireland, May 2022. Association for Computational Linguistics

  47. [48]

    Unsolvable problem detection: Evaluating trustworthiness of vision language models

    Atsuyuki Miyai, Jingkang Yang, Jingyang Zhang, Yifei Ming, Qing Yu, Go Irie, Yixuan Li, Hai Li, Ziwei Liu, and Kiyoharu Aizawa. Unsolvable problem detection: Evaluating trustworthiness of vision language models. arXiv preprint arXiv:2403.20331, 2024

  48. [49]

    The effect of powerpoint presentations on student learning and attitudes

    Hossein Nouri and Abdus Shahid. The effect of powerpoint presentations on student learning and attitudes. Global perspectives on accounting education, 2:53, 2005

  49. [50]

    Action- conditional video prediction using deep networks in atari games.Advances in neural information processing systems, 28, 2015

    Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L Lewis, and Satinder Singh. Action- conditional video prediction using deep networks in atari games.Advances in neural information processing systems, 28, 2015

  50. [51]

    Gpt-4 technical report, 2023

    OpenAI. Gpt-4 technical report, 2023

  51. [52]

    Know what you don’t know: Unanswerable ques- tions for squad

    Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable ques- tions for squad. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, 2018. 12

  52. [53]

    Beyond task performance: evaluating and reducing the flaws of large multimodal models with in-context- learning

    Mustafa Shukor, Alexandre Rame, Corentin Dancette, and Matthieu Cord. Beyond task performance: evaluating and reducing the flaws of large multimodal models with in-context- learning. In The Twelfth International Conference on Learning Representations, 2023

  53. [54]

    A corpus for reasoning about natural language grounded in photographs

    Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. A corpus for reasoning about natural language grounded in photographs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6418–6428, 2019

  54. [55]

    D2s: Document- to-slide generation via query-based text summarization

    Edward Sun, Yufang Hou, Dakuo Wang, Yunfeng Zhang, and Nancy XR Wang. D2s: Document- to-slide generation via query-based text summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1405–1418, 2021

  55. [56]

    Generative multimodal models are in-context learners

    Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, et al. Generative multimodal models are in-context learners. arXiv preprint arXiv:2312.13286, 2023

  56. [57]

    EVA-CLIP: Improved Training Techniques for CLIP at Scale

    Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023

  57. [58]

    Chameleon: Mixed-modal early-fusion foundation models

    Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. 2024

  58. [59]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

  59. [60]

    Internlm: A multilingual language model with progressively enhanced capabilities

    InternLM Team. Internlm: A multilingual language model with progressively enhanced capabilities. https://github.com/InternLM/InternLM, 2023

  60. [61]

    Introducing mpt-7b: A new standard for open-source, commercially usable llms, 2023

    MosaicML NLP Team. Introducing mpt-7b: A new standard for open-source, commercially usable llms, 2023. Accessed: 2023-05-05

  61. [62]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  62. [63]

    Genecis: A benchmark for general conditional image similarity

    Sagar Vaze, Nicolas Carion, and Ishan Misra. Genecis: A benchmark for general conditional image similarity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6862–6872, 2023

  63. [64]

    Cogvlm: Visual expert for pretrained language models, 2023

    Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, and Jie Tang. Cogvlm: Visual expert for pretrained language models, 2023

  64. [65]

    Mementos: A comprehensive bench- mark for multimodal large language model reasoning over image sequences

    Xiyao Wang, Yuhang Zhou, Xiaoyu Liu, Hongjin Lu, Yuancheng Xu, Feihong He, Jaehong Yoon, Taixi Lu, Gedas Bertasius, Mohit Bansal, et al. Mementos: A comprehensive bench- mark for multimodal large language model reasoning over image sequences. arXiv preprint arXiv:2401.10529, 2024

  65. [66]

    Mmt-bench: A comprehensive multimodal bench- mark for evaluating large vision-language models towards multitask agi.arXiv preprint arXiv:2404.16006, 2024

    Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, et al. Mmt-bench: A comprehensive multimodal benchmark for eval- uating large vision-language models towards multitask agi. arXiv preprint arXiv:2404.16006, 2024

  66. [67]

    MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. arXiv preprint arXiv:2311.16502, 2023

  67. [68]

    Llama-adapter: Efficient fine-tuning of language models with zero-init attention

    Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. In International Conference on Learning Representations (ICLR), 2024

  68. [69]

    MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

    Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, and Hongsheng Li. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? arXiv preprint arXiv:2403.14624, 2024. 13

  69. [70]

    Judging llm-as-a-judge with mt-bench and chatbot arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36, 2024

  70. [71]

    Uavm’23: 2023 workshop on uavs in multimedia: Capturing the world from a new perspective

    Zhedong Zheng, Yujiao Shi, Tingyu Wang, Jun Liu, Jianwu Fang, Yunchao Wei, and Tat-seng Chua. Uavm’23: 2023 workshop on uavs in multimedia: Capturing the world from a new perspective. In Proceedings of the 31st ACM International Conference on Multimedia, pages 9715–9717, 2023

  71. [72]

    University-1652: A multi-view multi-source benchmark for drone-based geo-localization

    Zhedong Zheng, Yunchao Wei, and Yi Yang. University-1652: A multi-view multi-source benchmark for drone-based geo-localization. ACM Multimedia, 2020

  72. [73]

    none of the other options

    Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, and Yejin Choi. Multimodal c4: An open, billion-scale corpus of images interleaved with text. Advances in Neural Information Processing Systems, 36, 2024. 14 Appendices Table of Contents A M UIR BENCH Details 15 A.1 Dataset...