MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
Pith reviewed 2026-05-17 01:04 UTC · model grok-4.3
The pith
MuirBench reveals that even leading multimodal LLMs like GPT-4o reach only 68.0 percent accuracy on its multi-image tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MuirBench is constructed in a pairwise manner, so each standard multi-image instance has a corresponding unanswerable variant with only minimal semantic differences, allowing a direct measurement of robust multi-image understanding. Evaluations across 20 recent multimodal LLMs show that even the strongest models reach at most 68.0 percent accuracy and that single-image-trained open-source models stay below 33.3 percent.
What carries the argument
MuirBench benchmark, which pairs each standard multi-image question with a minimally altered unanswerable variant to isolate genuine multi-image comprehension from single-image pattern matching.
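To make the pairwise design concrete, here is a minimal sketch of how such a paired benchmark could be scored. The excerpt does not include MuirBench's actual harness, so the data layout (`Instance`, `MuirPair`) and the strict paired-credit rule (a model earns pair credit only when it answers both halves correctly) are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Instance:
    images: Sequence[str]    # paths or URLs of the image set
    question: str
    options: Sequence[str]   # multiple-choice options
    answer: str              # gold option

@dataclass
class MuirPair:
    standard: Instance       # answerable multi-image question
    unanswerable: Instance   # minimally edited variant; gold is the escape option

def paired_accuracy(pairs: Sequence[MuirPair],
                    predict: Callable[[Instance], str]) -> dict:
    """Score a model on standard items, unanswerable variants, and full pairs."""
    std_hits = unans_hits = pair_hits = 0
    for p in pairs:
        s_ok = predict(p.standard) == p.standard.answer
        u_ok = predict(p.unanswerable) == p.unanswerable.answer
        std_hits += s_ok
        unans_hits += u_ok
        pair_hits += s_ok and u_ok  # strict credit: both halves must be right
    n = len(pairs)
    return {"standard": std_hits / n,
            "unanswerable": unans_hits / n,
            "paired": pair_hits / n}
```

Under this (assumed) scoring, the interesting quantity is the gap between `standard` and `paired` accuracy: a model that pattern-matches its way through one half of each pair forfeits its credit on the other.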
If this is right
- Open-source multimodal LLMs trained on single images fail to generalize to questions that require relations across multiple images.
- Model development should incorporate training signals that explicitly address multi-image relations such as temporal and multiview connections.
- The pairwise construction provides a concrete method for creating more reliable tests of integrated visual reasoning.
- Applications that depend on comparing or ordering information from several images will remain limited until multi-image performance improves.
Where Pith is reading between the lines
- The same minimal-difference pairing technique could be adapted to build harder tests for video or 3-D scene understanding.
- Persistent gaps between closed- and open-source models suggest that scaling multi-image training data alone may be insufficient without new architectural components for cross-image attention.
- If the benchmark becomes widely adopted, it could shift evaluation standards away from single-image accuracy toward explicit multi-image metrics.
- Developers could use the unanswerable variants as negative examples during fine-tuning to reduce hallucinated answers when images are added (see the sketch after this list).
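As a sketch of that last point, the unanswerable variants could be recast as refusal-targeted fine-tuning records. This is speculative: MuirBench is an evaluation set and the paper does not propose training on it. The record schema, the `format_mcq` helper, and the "None of the above" target string are all assumptions; the `MuirPair` layout is reused from the earlier sketch.

```python
def format_mcq(question: str, options) -> str:
    """Render a multiple-choice prompt in a simple lettered format (assumed)."""
    letters = "ABCDEFGH"
    lines = [question] + [f"({letters[i]}) {o}" for i, o in enumerate(options)]
    return "\n".join(lines)

def to_sft_records(pairs) -> list:
    """Build fine-tuning records: answer the standard item, refuse its twin."""
    records = []
    for p in pairs:  # p: MuirPair from the earlier sketch
        records.append({
            "images": list(p.standard.images),
            "prompt": format_mcq(p.standard.question, p.standard.options),
            "target": p.standard.answer,
        })
        records.append({
            "images": list(p.unanswerable.images),
            "prompt": format_mcq(p.unanswerable.question, p.unanswerable.options),
            # teach the model to take the explicit escape option, not to guess
            "target": "None of the above",
        })
    return records
```

Interleaving the two halves of each pair in the same training mix is the point of the design choice: the surface form of the prompt barely changes, so the gradient signal has to come from the images.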
Load-bearing premise
The assumption that pairing each standard instance with an unanswerable variant differing by only minimal semantic changes reliably isolates multi-image understanding without introducing new biases or artifacts in how the questions are written.
What would settle it
If a model scores substantially higher on MuirBench by detecting patterns specific to the unanswerable variants rather than by reasoning across the actual image sets, the benchmark would fail to measure true multi-image capability.
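One concrete form of that failure test is a blind probe: show a text-only predictor the question and options with the images withheld. If it can flag the unanswerable variants well above the 50 percent chance level, the pairing has leaked a wording artifact. This diagnostic is suggested here, not run in the paper; `predict_text_only` stands in for any text-only inference call and is assumed to return True when it guesses "unanswerable".

```python
import random

def blind_probe(pairs, predict_text_only, trials: int = 1) -> float:
    """Estimate how often a text-only model flags the unanswerable variant.

    Chance level is 50%: each call sees one question, drawn with equal
    probability from the standard or unanswerable half of a pair, and
    must say which half it came from.
    """
    correct = total = 0
    for _ in range(trials):
        for p in pairs:  # p: MuirPair from the earlier sketch
            inst, is_unans = random.choice(
                [(p.standard, False), (p.unanswerable, True)])
            guess = predict_text_only(inst.question, inst.options)
            correct += guess == is_unans
            total += 1
    return correct / total  # well above 0.5 signals a wording artifact
```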
Original abstract
We introduce MuirBench, a comprehensive benchmark that focuses on robust multi-image understanding capabilities of multimodal LLMs. MuirBench consists of 12 diverse multi-image tasks (e.g., scene understanding, ordering) that involve 10 categories of multi-image relations (e.g., multiview, temporal relations). Comprising 11,264 images and 2,600 multiple-choice questions, MuirBench is created in a pairwise manner, where each standard instance is paired with an unanswerable variant that has minimal semantic differences, in order for a reliable assessment. Evaluated upon 20 recent multi-modal LLMs, our results reveal that even the best-performing models like GPT-4o and Gemini Pro find it challenging to solve MuirBench, achieving 68.0% and 49.3% in accuracy. Open-source multimodal LLMs trained on single images can hardly generalize to multi-image questions, hovering below 33.3% in accuracy. These results highlight the importance of MuirBench in encouraging the community to develop multimodal LLMs that can look beyond a single image, suggesting potential pathways for future improvements.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MuirBench, a benchmark for robust multi-image understanding in multimodal LLMs. It comprises 12 tasks involving 10 categories of multi-image relations (e.g., multiview, temporal), built from 11,264 images and 2,600 multiple-choice questions. The benchmark is constructed in a pairwise manner, with each standard instance paired to an unanswerable variant having minimal semantic differences for reliable assessment. Evaluations across 20 recent MLLMs show top closed-source models (GPT-4o at 68.0%, Gemini Pro at 49.3%) struggling, while open-source models trained on single images perform below 33.3%.
Significance. If the pairwise construction holds, MuirBench would be a useful addition to the field by exposing clear generalization gaps from single-image to multi-image settings and motivating targeted improvements in MLLM architectures. The design of pairing standard and unanswerable variants is a reasonable choice for isolating the claimed capabilities without relying on fitted parameters or self-referential derivations.
Major comments (1)
- [Benchmark construction and question generation (abstract and methods sections)] The central claim that performance gaps (e.g., GPT-4o at 68.0%) reflect deficits in multi-image understanding rather than construction artifacts rests on the assumption that unanswerable variants have only minimal semantic differences from their standard counterparts. The provided text does not include quantitative checks such as human similarity ratings or ablation studies on question phrasing, image selection, or relation encoding across the 12 tasks and 10 relation categories. This validation is load-bearing for interpreting the results as evidence of the benchmark's intended isolation of multi-image capabilities.
Minor comments (1)
- [Abstract] The abstract provides limited detail on the exact process for generating questions and performing human validation, which would help readers assess potential annotation artifacts.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address the major comment below and have revised the manuscript to strengthen the validation of the benchmark construction.
Point-by-point responses
- Referee: [Benchmark construction and question generation (abstract and methods sections)] The central claim that performance gaps (e.g., GPT-4o at 68.0%) reflect deficits in multi-image understanding rather than construction artifacts rests on the assumption that unanswerable variants have only minimal semantic differences from their standard counterparts. The provided text does not include quantitative checks such as human similarity ratings or ablation studies on question phrasing, image selection, or relation encoding across the 12 tasks and 10 relation categories. This validation is load-bearing for interpreting the results as evidence of the benchmark's intended isolation of multi-image capabilities.
Authors: We agree that quantitative validation of the minimal semantic differences between standard and unanswerable variants is important to support the interpretation of the results. In the revised manuscript, we have added a new subsection (Section 3.3) reporting human similarity ratings collected from 50 annotators on a random sample of 200 pairs across all 10 relation categories and 12 tasks. Annotators rated semantic similarity on a 1-5 scale, yielding an average score of 4.6 with high inter-annotator agreement. We also include ablation studies examining the impact of question phrasing variations and alternative image selections on model performance for representative tasks in multiview, temporal, and spatial relation categories. These results are presented in the updated Methods section and Appendix C. We believe these additions directly address the concern and provide stronger empirical grounding for the benchmark's design.
Revision: yes
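The aggregation the simulated rebuttal describes is simple enough to sketch. Everything below is illustrative of the rebuttal's stated protocol (50 annotators, 200 sampled pairs, a 1-5 scale) rather than code from any actual revision; agreement is summarized as the share of annotator pairs rating within one point of each other, one of several reasonable ordinal-agreement measures (Krippendorff's alpha would be a stricter choice).

```python
import numpy as np

def summarize_ratings(ratings: np.ndarray) -> dict:
    """ratings: (n_pairs, n_annotators) array of 1-5 semantic-similarity scores."""
    n_pairs, n_ann = ratings.shape
    close = 0
    comparisons = 0
    # pairwise agreement: fraction of annotator pairs within 1 point, per item
    for i in range(n_ann):
        for j in range(i + 1, n_ann):
            close += (np.abs(ratings[:, i] - ratings[:, j]) <= 1).sum()
            comparisons += n_pairs
    return {"mean_similarity": float(ratings.mean()),
            "within_one_point": close / comparisons}

# Illustrative shapes matching the rebuttal: 200 sampled pairs, 50 annotators.
rng = np.random.default_rng(0)
demo = rng.integers(4, 6, size=(200, 50))  # toy data clustered at 4-5
print(summarize_ratings(demo))
```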
Circularity Check
No circularity: direct empirical benchmark construction and evaluation
Full rationale
The paper introduces MuirBench through explicit dataset construction (pairwise standard/unanswerable instances across 12 tasks and 10 relation categories) and reports raw accuracy numbers from running existing models on the fixed benchmark. No equations, fitted parameters, predictions, or derivations are claimed; results are direct measurements. The central claim (models struggle with multi-image understanding) rests on these measurements rather than any self-referential reduction. Self-citations, if present, are not load-bearing for the reported numbers. This is a standard empirical benchmark paper with self-contained evaluation.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 18 Pith papers
- EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations
  EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
- The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space
  MLLMs scoring 70-83% on Cartesian visual tasks drop to 31-39% on logically equivalent polar versions, exposing reliance on grid discretization shortcuts instead of topology-invariant reasoning.
- COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts
  COHERENCE is a new benchmark for measuring MLLMs' ability to recover fine-grained image-text correspondences in interleaved multimodal contexts.
- COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts
  COHERENCE is a benchmark for MLLMs' fine-grained image-text alignment in interleaved multimodal contexts across four domains, with 6161 questions and six-type error analysis.
- CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding
  CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench an...
- ValueGround: Evaluating Culture-Conditioned Visual Value Grounding in MLLMs
  ValueGround shows MLLMs drop from 72.8% text-only accuracy to 65.8% on visual cultural value grounding across 13 countries, despite 92.8% image-option alignment, with all models prone to prediction reversals.
- LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
  LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving ...
- MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction
  MiniCPM-o 4.5 uses the Omni-Flow streaming framework to deliver real-time full-duplex omni-modal interaction with proactive behavior in a 9B model that approaches Gemini 2.5 Flash performance.
- S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models
  S2H-DPO generates hierarchical prompt-driven preference pairs to improve multi-image reasoning in VLMs while keeping single-image performance intact.
- Decoding the Pulse of Reasoning VLMs in Multi-Image Understanding Tasks
  PulseFocus improves multi-image reasoning in VLMs by interleaving planning and attention-gated focus blocks during chain-of-thought, achieving gains on BLINK and MuirBench.
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
  InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
- GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
  GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.
- InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
  InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
  InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
- Context Unrolling in Omni Models
  Omni is a multimodal model whose native training on diverse data types enables context unrolling, allowing explicit reasoning across modalities to better approximate shared knowledge and improve downstream performance.
- Seed1.8 Model Card: Towards Generalized Real-World Agency
  Seed1.8 is a new foundation model that adds unified agentic capabilities for search, code execution, and GUI interaction to existing LLM and vision strengths.
- LLaVA-OneVision: Easy Visual Task Transfer
  LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
- LLM Harms: A Taxonomy and Discussion
  This paper proposes a taxonomy of LLM harms in five categories and suggests mitigation strategies plus a dynamic auditing system for responsible development.
Reference graph
Works this paper leans on
- [1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
- [2] Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. OpenFlamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint, 2023.
- [3] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023.
- [4] Ankan Bansal, Yuting Zhang, and Rama Chellappa. Visual question answering on image sets. In European Conference on Computer Vision, pages 51–67, 2020.
- [5] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
- [6] Juan Manuel Zambrano Chaves, Shih-Cheng Huang, Yanbo Xu, Hanwen Xu, Naoto Usuyama, Sheng Zhang, Fei Wang, Yujia Xie, Mahmoud Khademi, Ziyi Yang, et al. Training small multimodal models to bridge biomedical competency gap: A case study in radiology imaging. arXiv preprint arXiv:2403.08002, 2024.
- [7] Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning, 2023.
- [8] Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478, 2023.
- [9] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Zhong Muyan, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023.
- [10] Neil Cohn, Ryan Taylor, and Kaitlin Pederson. A picture is worth more words over time: Multimodality and narrative structure across eight decades of American superhero comics. Multimodal Communication, 6(1):19–37, 2017.
- [11] OpenCompass Contributors. OpenCompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023.
- [12] XTuner Contributors. XTuner: A toolkit for efficiently fine-tuning LLM. https://github.com/InternLM/xtuner, 2023.
- [13] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning, 2023.
- [14] Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. EVA: Exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19358–19369, 2023.
- [15] Olivier Faugeras, Quang-Tuan Luong, and Theo Papadopoulo. The geometry of multiple images: the laws that govern the formation of multiple images of a scene and some of their applications. MIT Press, 2001.
- [16] Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. Advances in Neural Information Processing Systems, 29, 2016.
- [17] Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. BLINK: Multimodal large language models can see but not perceive. arXiv preprint arXiv:2404.12390, 2024.
- [18] Xingyu Fu, Ben Zhou, Ishaan Chandratreya, Carl Vondrick, and Dan Roth. There's a time and place for reasoning beyond the image. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), May 2022.
- [19] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- [20] Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4D: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022.
- [21] George L Gropper. Why is a picture worth a thousand words? Audio Visual Communication Review, pages 75–95, 1963.
- [22] Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. HallusionBench: An advanced diagnostic suite for entangled language hallucination & visual illusion in large vision-language models. arXiv preprint arXiv:2310.14566, 2023.
- [23] James Hays and Alexei A Efros. IM2GPS: estimating geographic information from a single image. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008.
- [24] Anne Nielsen Hibbing and Joan L Rankin-Erickson. A picture is worth a thousand words: Using visual images to improve comprehension for middle school struggling readers. The Reading Teacher, 56(8):758–770, 2003.
- [25] Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, and Wenhu Chen. Mantis: Interleaved multi-image instruction tuning. arXiv preprint arXiv:2405.01483, 2024.
- [26] Yixing Jiang, Jeremy Irvin, Ji Hun Wang, Muhammad Ahmed Chaudhry, Jonathan H Chen, and Andrew Y Ng. Many-shot in-context learning in multimodal foundation models. arXiv preprint arXiv:2405.09798, 2024.
- [27] Michal Kawulok, Pawel Benecki, Szymon Piechaczek, Krzysztof Hrynczenko, Daniel Kostrzewa, and Jakub Nalepa. Deep learning for multiple-image super-resolution. IEEE Geoscience and Remote Sensing Letters, 17(6):1062–1066, 2019.
- [28] Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander Rush, Douwe Kiela, et al. OBELICS: An open web-scale filtered dataset of interleaved image-text documents. Advances in Neural Information Processing Systems, 36, 2024.
- [29] Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision-language models?, 2024.
- [30] Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. SEED-Bench-2: Benchmarking multimodal large language models. arXiv preprint arXiv:2311.17092, 2023.
- [31] Juncheng Li, Kaihang Pan, Zhiqi Ge, Minghe Gao, Wei Ji, Wenqiao Zhang, Tat-Seng Chua, Siliang Tang, Hanwang Zhang, and Yueting Zhuang. Fine-tuning multimodal LLMs to follow zero-shot demonstrative instructions. In The Twelfth International Conference on Learning Representations, 2024.
- [32] Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, and Song Han. VILA: On pre-training for visual language models, 2023.
- [33] Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning. Transactions of the Association for Computational Linguistics, 11:635–651, 2023.
- [34] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023.
- [35] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, January 2024.
- [36] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.
- [37] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023.
- [38] Yuliang Liu, Zhang Li, Hongliang Li, Wenwen Yu, Mingxin Huang, Dezhi Peng, Mingyu Liu, Mingrui Chen, Chunyuan Li, Lianwen Jin, et al. On the hidden mystery of OCR in large multimodal models. arXiv preprint arXiv:2305.07895, 2023.
- [40] Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, and Aniruddha Kembhavi. Unified-IO 2: Scaling autoregressive multimodal models with vision, language, audio, and action. arXiv preprint arXiv:2312.17172, 2023.
- [41] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. In The Twelfth International Conference on Learning Representations, 2024.
- [42] Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning. In The 59th Annual Meeting of the Association for Computational Linguistics (ACL), 2021.
- [43] Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS), 2022.
- [44] Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. IconQA: A new benchmark for abstract diagram understanding and visual language reasoning. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.
- [45] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-ChatGPT: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023.
- [46] Ahmed Masry, Do Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2263–2279, Dublin, Ireland, May 2022. Association for Computational Linguistics.
- [48] Atsuyuki Miyai, Jingkang Yang, Jingyang Zhang, Yifei Ming, Qing Yu, Go Irie, Yixuan Li, Hai Li, Ziwei Liu, and Kiyoharu Aizawa. Unsolvable problem detection: Evaluating trustworthiness of vision language models. arXiv preprint arXiv:2403.20331, 2024.
- [49] Hossein Nouri and Abdus Shahid. The effect of PowerPoint presentations on student learning and attitudes. Global Perspectives on Accounting Education, 2:53, 2005.
- [50] Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L Lewis, and Satinder Singh. Action-conditional video prediction using deep networks in Atari games. Advances in Neural Information Processing Systems, 28, 2015.
- [51]
- [52] Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, 2018.
- [53] Mustafa Shukor, Alexandre Rame, Corentin Dancette, and Matthieu Cord. Beyond task performance: evaluating and reducing the flaws of large multimodal models with in-context-learning. In The Twelfth International Conference on Learning Representations, 2023.
- [54] Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. A corpus for reasoning about natural language grounded in photographs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6418–6428, 2019.
- [55] Edward Sun, Yufang Hou, Dakuo Wang, Yunfeng Zhang, and Nancy XR Wang. D2S: Document-to-slide generation via query-based text summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1405–1418, 2021.
- [56] Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, et al. Generative multimodal models are in-context learners. arXiv preprint arXiv:2312.13286, 2023.
- [57] Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. EVA-CLIP: Improved training techniques for CLIP at scale. arXiv preprint arXiv:2303.15389, 2023.
- [58] Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. 2024.
- [59] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- [60] InternLM Team. InternLM: A multilingual language model with progressively enhanced capabilities. https://github.com/InternLM/InternLM, 2023.
- [61] MosaicML NLP Team. Introducing MPT-7B: A new standard for open-source, commercially usable LLMs, 2023. Accessed: 2023-05-05.
- [62] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- [63] Sagar Vaze, Nicolas Carion, and Ishan Misra. GeneCIS: A benchmark for general conditional image similarity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6862–6872, 2023.
- [64] Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, and Jie Tang. CogVLM: Visual expert for pretrained language models, 2023.
- [65] Xiyao Wang, Yuhang Zhou, Xiaoyu Liu, Hongjin Lu, Yuancheng Xu, Feihong He, Jaehong Yoon, Taixi Lu, Gedas Bertasius, Mohit Bansal, et al. Mementos: A comprehensive benchmark for multimodal large language model reasoning over image sequences. arXiv preprint arXiv:2401.10529, 2024.
- [66] Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, et al. MMT-Bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask AGI. arXiv preprint arXiv:2404.16006, 2024.
- [67] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. arXiv preprint arXiv:2311.16502, 2023.
- [68] Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention. In International Conference on Learning Representations (ICLR), 2024.
- [69] Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, and Hongsheng Li. MathVerse: Does your multi-modal LLM truly see the diagrams in visual math problems? arXiv preprint arXiv:2403.14624, 2024.
- [70] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36, 2024.
- [71] Zhedong Zheng, Yujiao Shi, Tingyu Wang, Jun Liu, Jianwu Fang, Yunchao Wei, and Tat-seng Chua. UAVM'23: 2023 workshop on UAVs in multimedia: Capturing the world from a new perspective. In Proceedings of the 31st ACM International Conference on Multimedia, pages 9715–9717, 2023.
- [72] Zhedong Zheng, Yunchao Wei, and Yi Yang. University-1652: A multi-view multi-source benchmark for drone-based geo-localization. ACM Multimedia, 2020.
- [73] Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, and Yejin Choi. Multimodal C4: An open, billion-scale corpus of images interleaved with text. Advances in Neural Information Processing Systems, 36, 2024.