pith. machine review for the scientific record.

arxiv: 2406.09411 · v2 · submitted 2024-06-13 · 💻 cs.CV · cs.AI · cs.CL

Recognition: no theorem link

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 01:04 UTC · model grok-4.3

classification 💻 cs.CV cs.AI cs.CL
keywords MuirBench · multi-image understanding · multimodal LLMs · benchmark · scene understanding · temporal relations · multiview relations · AI evaluation

The pith

MuirBench reveals that even leading multimodal LLMs such as GPT-4o reach only 68.0 percent accuracy on multi-image tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MuirBench, a benchmark of 12 multi-image tasks spanning 10 categories of multi-image relations, built from 11,264 images and 2,600 multiple-choice questions. Each standard question is paired with a minimally different unanswerable variant, creating a reliable test of whether models can handle information spread across several images rather than any single one. Evaluations of 20 models show closed-source leaders such as GPT-4o at 68.0 percent and Gemini Pro at 49.3 percent accuracy, while open-source models trained on single images remain below 33.3 percent. These gaps indicate that current systems have not learned to integrate relations such as temporal order or multiview geometry across images. The benchmark is presented as a tool to push development of multimodal models that look beyond isolated images.

Core claim

MuirBench is constructed in a pairwise manner so that each standard multi-image instance has a corresponding unanswerable variant with only minimal semantic differences, allowing a direct measurement of robust multi-image understanding; evaluations across 20 recent multimodal LLMs show that even the strongest models reach at most 68 percent accuracy and that single-image-trained open-source models stay below 33.3 percent.

What carries the argument

MuirBench benchmark, which pairs each standard multi-image question with a minimally altered unanswerable variant to isolate genuine multi-image comprehension from single-image pattern matching.
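One way the paired design could be turned into a stricter score is to credit a model only when it both answers the standard question correctly and rejects its unanswerable twin. The sketch below is illustrative, not the paper's reported metric; the `Pair` fields and the `UNANSWERABLE` label are assumed names.

```python
from dataclasses import dataclass

@dataclass
class Pair:
    """A standard question and its minimally different unanswerable variant."""
    standard_answer: str      # gold option for the answerable version
    model_on_standard: str    # model's choice on the answerable version
    model_on_variant: str     # model's choice on the unanswerable version

UNANSWERABLE = "None of the choices"  # hypothetical label for the reject option

def robust_accuracy(pairs):
    """Credit a pair only when the model answers the standard question
    correctly AND rejects its unanswerable twin."""
    robust = sum(
        p.model_on_standard == p.standard_answer
        and p.model_on_variant == UNANSWERABLE
        for p in pairs
    )
    return robust / len(pairs)

pairs = [
    Pair("B", "B", UNANSWERABLE),  # correct on both halves of the pair
    Pair("C", "C", "C"),           # answers the unanswerable variant anyway
]
print(robust_accuracy(pairs))  # 0.5
```

A pairwise score like this is deliberately harsher than per-question accuracy, since single-image pattern matching can pass one half of a pair but rarely both.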

If this is right

  • Open-source multimodal LLMs trained on single images fail to generalize to questions that require relations across multiple images.
  • Model development should incorporate training signals that explicitly address multi-image relations such as temporal and multiview connections.
  • The pairwise construction provides a concrete method for creating more reliable tests of integrated visual reasoning.
  • Applications that depend on comparing or ordering information from several images will remain limited until multi-image performance improves.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same minimal-difference pairing technique could be adapted to build harder tests for video or 3-D scene understanding.
  • Persistent gaps between closed- and open-source models suggest that scaling multi-image training data alone may be insufficient without new architectural components for cross-image attention.
  • If the benchmark becomes widely adopted, it could shift evaluation standards away from single-image accuracy toward explicit multi-image metrics.
  • Developers could use the unanswerable variants as negative examples during fine-tuning to reduce hallucinated answers when images are added.
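The last extension above can be made concrete: each benchmark-style pair yields one positive and one negative fine-tuning record, with the unanswerable twin supervised toward an explicit reject option. A minimal sketch, assuming a simple prompt/target record format and a hypothetical reject label:

```python
def to_training_records(pairs, reject_option="None of the choices"):
    """Turn paired instances into fine-tuning records: the standard
    question keeps its gold answer, while its unanswerable twin is
    supervised toward the explicit reject option (a negative example
    meant to curb hallucinated answers when images are added)."""
    records = []
    for standard_q, gold_answer, variant_q in pairs:
        records.append({"prompt": standard_q, "target": gold_answer})
        records.append({"prompt": variant_q, "target": reject_option})
    return records

# Hypothetical pair: same question, but the variant's image set no longer
# contains the information needed to answer it.
pairs = [("Q1 over images {A, B}", "B", "Q1 over images {A, C}")]
records = to_training_records(pairs)
print(len(records))  # 2
```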

Load-bearing premise

The assumption that pairing each standard instance with an unanswerable variant differing by only minimal semantic changes reliably isolates multi-image understanding without introducing new biases or artifacts in how the questions are written.

What would settle it

If a model scores substantially higher on MuirBench by detecting patterns specific to the unanswerable variants rather than by reasoning across the actual image sets, the benchmark would fail to measure true multi-image capability.
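That failure mode is testable without any model: if a text-only classifier can separate standard from unanswerable questions, the labels leak through phrasing rather than through the images. A crude in-sample probe, written as an illustrative sketch rather than a method from the paper:

```python
from collections import defaultdict

def leakage_probe(questions, labels):
    """Crude artifact probe: label each question as answerable (0) or
    unanswerable (1) from its words alone, ignoring the images.
    In-sample accuracy far above chance would suggest the unanswerable
    variants are detectable from phrasing, not from cross-image
    reasoning. This scores on the same data it counted, so it is an
    optimistic upper bound, not a held-out estimate."""
    votes = defaultdict(lambda: [0, 0])  # token -> [class-0 count, class-1 count]
    for q, y in zip(questions, labels):
        for tok in set(q.lower().split()):
            votes[tok][y] += 1
    correct = 0
    for q, y in zip(questions, labels):
        score = sum(votes[t][1] - votes[t][0] for t in set(q.lower().split()))
        correct += (1 if score > 0 else 0) == y
    return correct / len(labels)

questions = ["which image comes first",
             "which image comes first if any does"]
labels = [0, 1]  # 1 marks the unanswerable variant
print(leakage_probe(questions, labels))  # 1.0
```

A probe score near chance would support the benchmark's minimal-difference claim; a high score would point to exactly the shortcut described above.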

read the original abstract

We introduce MuirBench, a comprehensive benchmark that focuses on robust multi-image understanding capabilities of multimodal LLMs. MuirBench consists of 12 diverse multi-image tasks (e.g., scene understanding, ordering) that involve 10 categories of multi-image relations (e.g., multiview, temporal relations). Comprising 11,264 images and 2,600 multiple-choice questions, MuirBench is created in a pairwise manner, where each standard instance is paired with an unanswerable variant that has minimal semantic differences, in order for a reliable assessment. Evaluated upon 20 recent multi-modal LLMs, our results reveal that even the best-performing models like GPT-4o and Gemini Pro find it challenging to solve MuirBench, achieving 68.0% and 49.3% in accuracy. Open-source multimodal LLMs trained on single images can hardly generalize to multi-image questions, hovering below 33.3% in accuracy. These results highlight the importance of MuirBench in encouraging the community to develop multimodal LLMs that can look beyond a single image, suggesting potential pathways for future improvements.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces MuirBench, a benchmark for robust multi-image understanding in multimodal LLMs. It comprises 12 tasks involving 10 categories of multi-image relations (e.g., multiview, temporal), built from 11,264 images and 2,600 multiple-choice questions. The benchmark is constructed in a pairwise manner, with each standard instance paired to an unanswerable variant having minimal semantic differences for reliable assessment. Evaluations across 20 recent MLLMs show top closed-source models (GPT-4o at 68.0%, Gemini Pro at 49.3%) struggling, while open-source models trained on single images perform below 33.3%.

Significance. If the pairwise construction holds, MuirBench would be a useful addition to the field by exposing clear generalization gaps from single-image to multi-image settings and motivating targeted improvements in MLLM architectures. The design of pairing standard and unanswerable variants is a reasonable choice for isolating the claimed capabilities without relying on fitted parameters or self-referential derivations.

major comments (1)
  1. [Benchmark construction and question generation (abstract and methods sections)] The central claim that performance gaps (e.g., GPT-4o at 68.0%) reflect deficits in multi-image understanding rather than construction artifacts rests on the assumption that unanswerable variants have only minimal semantic differences from their standard counterparts. The provided text does not include quantitative checks such as human similarity ratings or ablation studies on question phrasing, image selection, or relation encoding across the 12 tasks and 10 relation categories. This validation is load-bearing for interpreting the results as evidence of the benchmark's intended isolation of multi-image capabilities.
minor comments (1)
  1. [Abstract] The abstract provides limited detail on the exact process for generating questions and performing human validation, which would help readers assess potential annotation artifacts.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comment below and have revised the manuscript to strengthen the validation of the benchmark construction.

read point-by-point responses
  1. Referee: [Benchmark construction and question generation (abstract and methods sections)] The central claim that performance gaps (e.g., GPT-4o at 68.0%) reflect deficits in multi-image understanding rather than construction artifacts rests on the assumption that unanswerable variants have only minimal semantic differences from their standard counterparts. The provided text does not include quantitative checks such as human similarity ratings or ablation studies on question phrasing, image selection, or relation encoding across the 12 tasks and 10 relation categories. This validation is load-bearing for interpreting the results as evidence of the benchmark's intended isolation of multi-image capabilities.

    Authors: We agree that quantitative validation of the minimal semantic differences between standard and unanswerable variants is important to support the interpretation of the results. In the revised manuscript, we have added a new subsection (Section 3.3) reporting human similarity ratings collected from 50 annotators on a random sample of 200 pairs across all 10 relation categories and 12 tasks. Annotators rated semantic similarity on a 1-5 scale, yielding an average score of 4.6 with high inter-annotator agreement. We also include ablation studies examining the impact of question phrasing variations and alternative image selections on model performance for representative tasks in multiview, temporal, and spatial relation categories. These results are presented in the updated Methods section and Appendix C. We believe these additions directly address the concern and provide stronger empirical grounding for the benchmark's design. revision: yes
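The rebuttal is simulated and its figures (50 annotators, 200 sampled pairs, an average of 4.6) are illustrative. A minimal sketch of how such a similarity-rating summary could be computed, with a toy within-one-point agreement rate standing in for a formal inter-annotator statistic:

```python
from itertools import combinations
from statistics import mean

def rating_summary(ratings_by_pair):
    """ratings_by_pair: one list of 1-5 similarity ratings per sampled
    question pair. Returns the overall mean rating and a simple
    agreement rate: the fraction of annotator pairs whose ratings on
    the same item differ by at most one point."""
    all_ratings = [r for item in ratings_by_pair for r in item]
    close, total = 0, 0
    for item in ratings_by_pair:
        for a, b in combinations(item, 2):
            close += abs(a - b) <= 1
            total += 1
    return mean(all_ratings), close / total

avg, agreement = rating_summary([[5, 5, 4], [4, 5, 4]])
print(avg, agreement)  # 4.5 1.0
```

A published validation would more likely report a chance-corrected statistic such as Krippendorff's alpha, but the shape of the computation is the same.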

Circularity Check

0 steps flagged

No circularity: direct empirical benchmark construction and evaluation

full rationale

The paper introduces MuirBench through explicit dataset construction (pairwise standard/unanswerable instances across 12 tasks and 10 relation categories) and reports raw accuracy numbers from running existing models on the fixed benchmark. No equations, fitted parameters, predictions, or derivations are claimed; results are direct measurements. The central claim (models struggle with multi-image understanding) rests on these measurements rather than any self-referential reduction. Self-citations, if present, are not load-bearing for the reported numbers. This is a standard empirical benchmark paper with self-contained evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The benchmark rests on standard assumptions from computer vision and NLP about what constitutes distinct multi-image relations and how to construct reliable multiple-choice questions; no new free parameters, axioms beyond domain standards, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5574 in / 1167 out tokens · 46965 ms · 2026-05-17T01:04:18.981811+00:00 · methodology

discussion (0)


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

    cs.CV 2026-04 unverdicted novelty 8.0

    EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

  2. The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space

    cs.CV 2026-05 unverdicted novelty 7.0

    MLLMs scoring 70-83% on Cartesian visual tasks drop to 31-39% on logically equivalent polar versions, exposing reliance on grid discretization shortcuts instead of topology-invariant reasoning.

  3. COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts

    cs.CV 2026-04 unverdicted novelty 7.0

    COHERENCE is a new benchmark for measuring MLLMs' ability to recover fine-grained image-text correspondences in interleaved multimodal contexts.

  4. COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts

    cs.CV 2026-04 unverdicted novelty 7.0

    COHERENCE is a benchmark for MLLMs' fine-grained image-text alignment in interleaved multimodal contexts across four domains, with 6161 questions and six-type error analysis.

  5. CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding

    cs.CV 2026-04 unverdicted novelty 7.0

    CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench an...

  6. ValueGround: Evaluating Culture-Conditioned Visual Value Grounding in MLLMs

    cs.CL 2026-04 unverdicted novelty 7.0

    ValueGround shows MLLMs drop from 72.8% text-only accuracy to 65.8% on visual cultural value grounding across 13 countries, despite 92.8% image-option alignment, with all models prone to prediction reversals.

  7. LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

    cs.CV 2024-07 unverdicted novelty 7.0

    LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving ...

  8. MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction

    cs.CL 2026-04 unverdicted novelty 6.0

    MiniCPM-o 4.5 uses the Omni-Flow streaming framework to deliver real-time full-duplex omni-modal interaction with proactive behavior in a 9B model that approaches Gemini 2.5 Flash performance.

  9. S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    S2H-DPO generates hierarchical prompt-driven preference pairs to improve multi-image reasoning in VLMs while keeping single-image performance intact.

  10. Decoding the Pulse of Reasoning VLMs in Multi-Image Understanding Tasks

    cs.CV 2026-03 unverdicted novelty 6.0

    PulseFocus improves multi-image reasoning in VLMs by interleaving planning and attention-gated focus blocks during chain-of-thought, achieving gains on BLINK and MuirBench.

  11. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    cs.CV 2025-08 unverdicted novelty 6.0

    InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

  12. GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    cs.CV 2025-07 unverdicted novelty 6.0

    GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.

  13. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    cs.CV 2025-04 conditional novelty 6.0

    InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

  14. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  15. Context Unrolling in Omni Models

    cs.CV 2026-04 unverdicted novelty 5.0

    Omni is a multimodal model whose native training on diverse data types enables context unrolling, allowing explicit reasoning across modalities to better approximate shared knowledge and improve downstream performance.

  16. Seed1.8 Model Card: Towards Generalized Real-World Agency

    cs.AI 2026-03 unverdicted novelty 5.0

    Seed1.8 is a new foundation model that adds unified agentic capabilities for search, code execution, and GUI interaction to existing LLM and vision strengths.

  17. LLaVA-OneVision: Easy Visual Task Transfer

    cs.CV 2024-08 unverdicted novelty 5.0

    LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.

  18. LLM Harms: A Taxonomy and Discussion

    cs.CY 2025-12 unverdicted novelty 3.0

    This paper proposes a taxonomy of LLM harms in five categories and suggests mitigation strategies plus a dynamic auditing system for responsible development.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · cited by 17 Pith papers · 12 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022

  2. [2]

    OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

    Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint ar...

  3. [3]

    Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023

  4. [4]

    Visual question answering on image sets

    Ankan Bansal, Yuting Zhang, and Rama Chellappa. Visual question answering on image sets. In European Conference on Computer Vision, pages 51–67, 2020

  5. [5]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  6. [6]

    Training small multimodal models to bridge biomedical competency gap: A case study in radiology imaging

    Juan Manuel Zambrano Chaves, Shih-Cheng Huang, Yanbo Xu, Hanwen Xu, Naoto Usuyama, Sheng Zhang, Fei Wang, Yujia Xie, Mahmoud Khademi, Ziyi Yang, et al. Training small multimodal models to bridge biomedical competency gap: A case study in radiology imaging. arXiv preprint arXiv:2403.08002, 2024

  7. [7]

    Minigpt-v2: large language model as a unified interface for vision-language multi-task learning, 2023

    Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning, 2023

  8. [8]

    MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

    Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478, 2023

  9. [9]

    InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Zhong Muyan, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023

  10. [10]

    A picture is worth more words over time: Multimodality and narrative structure across eight decades of american superhero comics

    Neil Cohn, Ryan Taylor, and Kaitlin Pederson. A picture is worth more words over time: Multimodality and narrative structure across eight decades of american superhero comics. Multimodal Communication, 6(1):19–37, 2017

  11. [11]

    Opencompass: A universal evaluation platform for foundation models

    OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023

  12. [12]

    Xtuner: A toolkit for efficiently fine-tuning llm

    XTuner Contributors. Xtuner: A toolkit for efficiently fine-tuning llm. https://github.com/InternLM/xtuner, 2023

  13. [13]

    Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023

  14. [14]

    Eva: Exploring the limits of masked visual representation learning at scale

    Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva: Exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19358–19369, 2023

  15. [15]

    The geometry of multiple images: the laws that govern the formation of multiple images of a scene and some of their applications

    Olivier Faugeras, Quang-Tuan Luong, and Theo Papadopoulo. The geometry of multiple images: the laws that govern the formation of multiple images of a scene and some of their applications. MIT press, 2001

  16. [16]

    Unsupervised learning for physical interaction through video prediction

    Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. Advances in neural information processing systems, 29, 2016

  17. [17]

    BLINK: Multimodal Large Language Models Can See but Not Perceive

    Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. arXiv preprint arXiv:2404.12390, 2024

  18. [18]

    There’s a time and place for reasoning beyond the image

    Xingyu Fu, Ben Zhou, Ishaan Chandratreya, Carl Vondrick, and Dan Roth. There’s a time and place for reasoning beyond the image. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), May 2022

  19. [19]

    Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017

  20. [20]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022

  21. [21]

    Why is a picture worth a thousand words?

    George L Gropper. Why is a picture worth a thousand words? Audio Visual Communication Review, pages 75–95, 1963

  22. [22]

    HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

    Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. Hallusionbench: An advanced diagnostic suite for entangled language hallucination & visual illusion in large vision-language models. arXiv preprint arXiv:2310.14566, 2023

  23. [23]

    Im2gps: estimating geographic information from a single image

    James Hays and Alexei A Efros. Im2gps: estimating geographic information from a single image. In 2008 ieee conference on computer vision and pattern recognition, pages 1–8. IEEE, 2008

  24. [24]

    A picture is worth a thousand words: Using visual images to improve comprehension for middle school struggling readers

    Anne Nielsen Hibbing and Joan L Rankin-Erickson. A picture is worth a thousand words: Using visual images to improve comprehension for middle school struggling readers. The reading teacher, 56(8):758–770, 2003

  25. [25]

    Mantis: Interleaved multi-image instruction tuning

    Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, and Wenhu Chen. Mantis: Interleaved multi-image instruction tuning. arXiv preprint arXiv:2405.01483, 2024

  26. [26]

    Many-shot in-context learning in multimodal foundation models

    Yixing Jiang, Jeremy Irvin, Ji Hun Wang, Muhammad Ahmed Chaudhry, Jonathan H Chen, and Andrew Y Ng. Many-shot in-context learning in multimodal foundation models. arXiv preprint arXiv:2405.09798, 2024

  27. [27]

    Deep learning for multiple-image super-resolution

    Michal Kawulok, Pawel Benecki, Szymon Piechaczek, Krzysztof Hrynczenko, Daniel Kostrzewa, and Jakub Nalepa. Deep learning for multiple-image super-resolution. IEEE Geoscience and Remote Sensing Letters, 17(6):1062–1066, 2019

  28. [28]

    Obelics: An open web-scale filtered dataset of interleaved image-text documents

    Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander Rush, Douwe Kiela, et al. Obelics: An open web-scale filtered dataset of interleaved image-text documents. Advances in Neural Information Processing Systems, 36, 2024

  29. [29]

    What matters when building vision-language models?, 2024

    Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision-language models?, 2024

  30. [30]

    Seed-bench-2: Benchmarking multimodal large language models

    Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. Seed-bench-2: Benchmarking multimodal large language models. arXiv preprint arXiv:2311.17092, 2023

  31. [31]

    Fine-tuning multimodal llms to follow zero-shot demonstrative instructions

    Juncheng Li, Kaihang Pan, Zhiqi Ge, Minghe Gao, Wei Ji, Wenqiao Zhang, Tat-Seng Chua, Siliang Tang, Hanwang Zhang, and Yueting Zhuang. Fine-tuning multimodal llms to follow zero-shot demonstrative instructions. In The Twelfth International Conference on Learning Representations, 2024

  32. [32]

    Vila: On pre-training for visual language models, 2023

    Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models, 2023

  33. [33]

    Visual spatial reasoning

    Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning. Transactions of the Association for Computational Linguistics, 11:635–651, 2023

  34. [34]

    Improved baselines with visual instruction tuning, 2023

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023

  35. [35]

    Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

  36. [36]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024

  37. [37]

    MMBench: Is Your Multi-modal Model an All-around Player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023

  38. [38]

    On the hidden mystery of ocr in large multimodal models

    Yuliang Liu, Zhang Li, Hongliang Li, Wenwen Yu, Mingxin Huang, Dezhi Peng, Mingyu Liu, Mingrui Chen, Chunyuan Li, Lianwen Jin, et al. On the hidden mystery of ocr in large multimodal models. arXiv preprint arXiv:2305.07895, 2023

  39. [40]

    Unified-io 2: Scaling autoregressive multimodal models with vision, language, audio, and action

    Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, and Aniruddha Kembhavi. Unified-io 2: Scaling autoregressive multimodal models with vision, language, audio, and action. arXiv preprint arXiv:2312.17172, 2023

  40. [41]

    Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. In The Twelfth International Conference on Learning Representations, 2024

  41. [42]

    Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning

    Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning. In The 59th Annual Meeting of the Association for Computational Linguistics (ACL), 2021

  42. [43]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural Information Process- ing Systems (NeurIPS), 2022

  43. [44]

    Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning

    Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021

  44. [45]

    Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023

  45. [46]

    ChartQA: A benchmark for question answering about charts with visual and logical reasoning

    Ahmed Masry, Do Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2263–2279, Dublin, Ireland, May 2022. Association for Computational Linguistics

  47. [48]

    Unsolvable problem detection: Evaluating trustworthiness of vision language models

    Atsuyuki Miyai, Jingkang Yang, Jingyang Zhang, Yifei Ming, Qing Yu, Go Irie, Yixuan Li, Hai Li, Ziwei Liu, and Kiyoharu Aizawa. Unsolvable problem detection: Evaluating trustworthiness of vision language models. arXiv preprint arXiv:2403.20331, 2024

  48. [49]

    The effect of powerpoint presentations on student learning and attitudes

    Hossein Nouri and Abdus Shahid. The effect of powerpoint presentations on student learning and attitudes. Global perspectives on accounting education, 2:53, 2005

  49. [50]

    Action- conditional video prediction using deep networks in atari games.Advances in neural information processing systems, 28, 2015

    Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L Lewis, and Satinder Singh. Action- conditional video prediction using deep networks in atari games.Advances in neural information processing systems, 28, 2015

  50. [51]

    Gpt-4 technical report, 2023

    OpenAI. Gpt-4 technical report, 2023

  51. [52]

    Know what you don’t know: Unanswerable ques- tions for squad

    Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don’t know: Unanswerable ques- tions for squad. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, 2018. 12

  52. [53]

    Beyond task performance: evaluating and reducing the flaws of large multimodal models with in-context- learning

    Mustafa Shukor, Alexandre Rame, Corentin Dancette, and Matthieu Cord. Beyond task performance: evaluating and reducing the flaws of large multimodal models with in-context- learning. In The Twelfth International Conference on Learning Representations, 2023

  53. [54]

    A corpus for reasoning about natural language grounded in photographs

    Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. A corpus for reasoning about natural language grounded in photographs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6418–6428, 2019

  54. [55]

    D2s: Document- to-slide generation via query-based text summarization

    Edward Sun, Yufang Hou, Dakuo Wang, Yunfeng Zhang, and Nancy XR Wang. D2s: Document- to-slide generation via query-based text summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1405–1418, 2021

  55. [56]

    Generative multimodal models are in-context learners

    Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, et al. Generative multimodal models are in-context learners. arXiv preprint arXiv:2312.13286, 2023

  56. [57]

    EVA-CLIP: Improved Training Techniques for CLIP at Scale

    Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023

  57. [58]

    Chameleon: Mixed-modal early-fusion foundation models

    Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. 2024

  58. [59]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

  59. [60]

    Internlm: A multilingual language model with progressively enhanced capabilities

    InternLM Team. Internlm: A multilingual language model with progressively enhanced capabilities. https://github.com/InternLM/InternLM, 2023

  60. [61]

    Introducing mpt-7b: A new standard for open-source, commercially usable llms, 2023

    MosaicML NLP Team. Introducing mpt-7b: A new standard for open-source, commercially usable llms, 2023. Accessed: 2023-05-05

  61. [62]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  62. [63]

    Genecis: A benchmark for general conditional image similarity

    Sagar Vaze, Nicolas Carion, and Ishan Misra. Genecis: A benchmark for general conditional image similarity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6862–6872, 2023

  63. [64]

    Cogvlm: Visual expert for pretrained language models, 2023

    Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, and Jie Tang. Cogvlm: Visual expert for pretrained language models, 2023

  64. [65]

    Mementos: A comprehensive bench- mark for multimodal large language model reasoning over image sequences

    Xiyao Wang, Yuhang Zhou, Xiaoyu Liu, Hongjin Lu, Yuancheng Xu, Feihong He, Jaehong Yoon, Taixi Lu, Gedas Bertasius, Mohit Bansal, et al. Mementos: A comprehensive bench- mark for multimodal large language model reasoning over image sequences. arXiv preprint arXiv:2401.10529, 2024

  65. [66]

    Mmt-bench: A comprehensive multimodal bench- mark for evaluating large vision-language models towards multitask agi.arXiv preprint arXiv:2404.16006, 2024

    Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, et al. Mmt-bench: A comprehensive multimodal benchmark for eval- uating large vision-language models towards multitask agi. arXiv preprint arXiv:2404.16006, 2024

  66. [67]

    MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. arXiv preprint arXiv:2311.16502, 2023

  67. [68]

    Llama-adapter: Efficient fine-tuning of language models with zero-init attention

    Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. In International Conference on Learning Representations (ICLR), 2024

  68. [69]

    MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

    Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, and Hongsheng Li. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? arXiv preprint arXiv:2403.14624, 2024. 13

  69. [70]

    Judging llm-as-a-judge with mt-bench and chatbot arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36, 2024

  70. [71]

    Uavm’23: 2023 workshop on uavs in multimedia: Capturing the world from a new perspective

    Zhedong Zheng, Yujiao Shi, Tingyu Wang, Jun Liu, Jianwu Fang, Yunchao Wei, and Tat-seng Chua. Uavm’23: 2023 workshop on uavs in multimedia: Capturing the world from a new perspective. In Proceedings of the 31st ACM International Conference on Multimedia, pages 9715–9717, 2023

  71. [72]

    University-1652: A multi-view multi-source benchmark for drone-based geo-localization

    Zhedong Zheng, Yunchao Wei, and Yi Yang. University-1652: A multi-view multi-source benchmark for drone-based geo-localization. ACM Multimedia, 2020

  72. [73]

    none of the other options

    Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, and Yejin Choi. Multimodal c4: An open, billion-scale corpus of images interleaved with text. Advances in Neural Information Processing Systems, 36, 2024. 14 Appendices Table of Contents A M UIR BENCH Details 15 A.1 Dataset...