pith. machine review for the scientific record.

arxiv: 2409.02813 · v3 · submitted 2024-09-04 · 💻 cs.CL · cs.CV

Recognition: no theorem link

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

Authors on Pith · no claims yet

Pith reviewed 2026-05-14 00:45 UTC · model grok-4.3

classification 💻 cs.CL cs.CV
keywords MMMU-Pro · multimodal benchmark · vision-language models · reasoning capabilities · OCR prompts · chain of thought · text-only filtering

The pith

Multimodal models show 16.8 to 26.9 percent lower accuracy on MMMU-Pro than on MMMU.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MMMU-Pro as a stricter test for multimodal AI by removing questions that text-only models can solve, adding more answer choices, and embedding the questions inside images. This forces models to integrate visual and textual information without relying on text shortcuts. Performance drops substantially across models, indicating current systems often bypass true visual understanding. The work also finds that chain-of-thought prompting helps while simple OCR prompts do not. Such a benchmark matters because real-world tasks require seamless seeing and reading together.

Core claim

MMMU-Pro refines the original MMMU benchmark through a three-step process: filtering out questions solvable by text-only models, augmenting candidate options, and introducing a vision-only setting with questions embedded in images. Model performance drops by 16.8% to 26.9% relative to MMMU, demonstrating that previous evaluations overestimated multimodal capabilities by allowing non-visual shortcuts.

What carries the argument

The MMMU-Pro benchmark, built by filtering out text-answerable questions, expanding multiple choices, and embedding questions in images for vision-only input.
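
A minimal sketch of how such a construction pipeline could look, assuming a simple question-dict format and helper callables (text-only answerers, a distractor generator, a screenshot renderer) that stand in for the authors' actual tooling; the filtering threshold and option target below are illustrative, not the paper's exact settings.

```python
import random

def is_text_only_solvable(question, text_only_models, threshold=0.5):
    """Flag a question if a pool of text-only LLMs answers it correctly.
    The paper filters such questions; the exact criterion is assumed here."""
    correct = sum(
        m.answer(question["text"], question["options"]) == question["answer"]
        for m in text_only_models
    )
    return correct / len(text_only_models) >= threshold

def augment_options(question, gen_distractors, target=10):
    """Expand the candidate options toward a larger set.
    `gen_distractors` is an assumed helper that proposes plausible wrong answers."""
    options = list(question["options"])
    options += gen_distractors(question, target - len(options))
    random.shuffle(options)
    return {**question, "options": options}

def embed_in_image(question, render):
    """Render question text and options into the image itself for the vision-only
    setting; `render` is an assumed screenshot/typesetting helper."""
    screenshot = render(question["text"], question["options"], question.get("image"))
    return {**question, "vision_only_image": screenshot}

def build_mmmu_pro(mmmu_questions, text_only_models, gen_distractors, render):
    """Apply the three steps in order: filter, augment, embed."""
    kept = [q for q in mmmu_questions if not is_text_only_solvable(q, text_only_models)]
    kept = [augment_options(q, gen_distractors) for q in kept]
    return [embed_in_image(q, render) for q in kept]
```

The choice of filtering criterion (how many text-only models must succeed, and at what confidence) is the load-bearing knob in this sketch, which is exactly the reproducibility point raised in the referee report below.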

If this is right

  • Multimodal models rely more on text than visual cues in standard benchmarks.
  • Chain of thought reasoning boosts performance on the harder MMMU-Pro.
  • OCR prompts provide little additional benefit.
  • Future multimodal research should prioritize integrated visual-textual reasoning.
  • The benchmark better mimics real-world scenarios requiring simultaneous seeing and reading.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers might need to redesign models to handle text embedded in images more effectively.
  • Benchmarks should routinely include vision-only modes to prevent text leakage.
  • The performance gap highlights a need for better training data that mixes visual and textual elements inseparably.

Load-bearing premise

Questions solvable by text-only models truly require no visual understanding, and embedding questions in images tests integration without new biases or confounds.

What would settle it

A model achieving accuracy on MMMU-Pro comparable to its MMMU score, or maintaining high performance in the vision-only embedded setting without specific training for it, would indicate the robustness claim does not hold.
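
A small sketch of what that check could look like in practice, assuming per-model accuracy scores are available for MMMU, MMMU-Pro (standard), and MMMU-Pro (vision-only); the score dictionary and tolerance value are placeholders, not reported numbers.

```python
def accuracy_gaps(scores):
    """scores maps model name -> {"mmmu": acc, "mmmu_pro": acc, "mmmu_pro_vision": acc},
    all in accuracy points on a 0-100 scale."""
    return {
        model: {
            "standard_gap": s["mmmu"] - s["mmmu_pro"],
            "vision_only_gap": s["mmmu"] - s["mmmu_pro_vision"],
        }
        for model, s in scores.items()
    }

def robustness_claim_challenged(gaps, tolerance=2.0):
    """The paper's claim would be undermined if some model showed essentially no drop
    (gap within a small tolerance) without being trained for the embedded setting."""
    return [m for m, g in gaps.items()
            if g["standard_gap"] <= tolerance or g["vision_only_gap"] <= tolerance]
```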

read the original abstract

This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark. MMMU-Pro rigorously assesses multimodal models' true understanding and reasoning capabilities through a three-step process based on MMMU: (1) filtering out questions answerable by text-only models, (2) augmenting candidate options, and (3) introducing a vision-only input setting where questions are embedded within images. This setting challenges AI to truly "see" and "read" simultaneously, testing a fundamental human cognitive skill of seamlessly integrating visual and textual information. Results show that model performance is substantially lower on MMMU-Pro than on MMMU, ranging from 16.8% to 26.9% across models. We explore the impact of OCR prompts and Chain of Thought (CoT) reasoning, finding that OCR prompts have minimal effect while CoT generally improves performance. MMMU-Pro provides a more rigorous evaluation tool, closely mimicking real-world scenarios and offering valuable directions for future research in multimodal AI.
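
To make the prompting comparison concrete, here is a hedged sketch of the three conditions the abstract mentions for the vision-only setting: direct answering, an explicit OCR instruction, and Chain-of-Thought. The prompt wording and the `model.generate` interface are illustrative assumptions, not the paper's templates or code.

```python
# Illustrative prompt templates, not the paper's exact wording.
DIRECT = ("Answer the multiple-choice question shown in the image with the letter "
          "of the correct option.")
OCR = ("First transcribe the question and all options you can read in the image, "
       "then answer with the letter of the correct option.")
COT = ("Think step by step about the question shown in the image, explain your "
       "reasoning, then give the letter of the correct option on the last line.")

def query(model, image, condition):
    """Run one vision-only item under one prompting condition.
    `model.generate` stands in for whatever multimodal inference API is in use."""
    prompt = {"direct": DIRECT, "ocr": OCR, "cot": COT}[condition]
    return model.generate(images=[image], prompt=prompt)
```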

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MMMU-Pro, a refined version of the MMMU benchmark for multimodal understanding. It applies a three-step process to the original dataset: (1) filtering questions answerable by text-only models, (2) augmenting candidate options, and (3) embedding questions within images to create a vision-only input setting. Experiments across multiple models report performance drops of 16.8% to 26.9% relative to MMMU, with additional analysis showing minimal impact from OCR prompts and general improvement from Chain-of-Thought reasoning. The work positions MMMU-Pro as a more rigorous test of seamless visual-textual integration.

Significance. If the reported performance drops primarily reflect stricter requirements for multimodal reasoning rather than new perceptual confounds, MMMU-Pro would offer a useful, more challenging benchmark that better approximates real-world scenarios requiring integrated vision and language. The empirical construction is straightforward and the consistent drops across models provide a clear signal of current limitations, though the absence of isolating ablations limits the strength of the causal interpretation.

major comments (2)
  1. [three-step process and results] The vision-only embedding step (described in the three-step process and results) is central to the claim that MMMU-Pro better measures true visual-textual integration. However, no ablation is presented that compares model performance on the filtered and augmented questions when text is provided directly versus when it is embedded in images. This leaves open the possibility that the 16.8–26.9% drop arises from vision-encoder difficulties with text legibility, layout, or rendering rather than from the intended reasoning challenge. (A sketch of one such ablation appears after the minor comments below.)
  2. [experiments] The observation that OCR prompts have minimal effect (reported in the experiments) does not isolate whether the base visual extraction step itself is the bottleneck, as it tests only explicit prompting rather than the underlying perceptual capability on the embedded images.
minor comments (2)
  1. [methods] Exact criteria and thresholds used for filtering questions answerable by text-only models are not fully detailed, which would aid reproducibility of the exact MMMU-Pro dataset.
  2. [results] Consider adding error bars or statistical tests for the performance differences across models to strengthen the quantitative claims.
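
To make major comment 1 concrete, the following sketch outlines the requested ablation: score the same filtered-and-augmented questions once with the question text supplied directly alongside the original figure and once with it embedded in the rendered image, then compare. The `model.answer` interface and data fields are assumed for illustration, not taken from the paper.

```python
def evaluate(model, questions, presentation):
    """Accuracy of one model on one presentation of the same question set."""
    correct = 0
    for q in questions:
        if presentation == "text_provided":
            pred = model.answer(images=[q["image"]], text=q["text"], options=q["options"])
        else:  # "embedded": question and options appear only inside the screenshot
            pred = model.answer(images=[q["vision_only_image"]], text=None, options=None)
        correct += int(pred == q["answer"])
    return correct / len(questions)

def ablation_gap(model, questions):
    """A large gap would point to rendering/legibility effects; a small gap would
    support the paper's reading that integrated reasoning is the bottleneck."""
    return (evaluate(model, questions, "text_provided")
            - evaluate(model, questions, "embedded"))
```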

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and recommendation for minor revision. We address each major comment point by point below.

read point-by-point responses
  1. Referee: The vision-only embedding step (described in the three-step process and results) is central to the claim that MMMU-Pro better measures true visual-textual integration. However, no ablation is presented that compares model performance on the filtered and augmented questions when text is provided directly versus when it is embedded in images. This leaves open the possibility that the 16.8–26.9% drop arises from vision-encoder difficulties with text legibility, layout, or rendering rather than from the intended reasoning challenge.

    Authors: We acknowledge that a direct ablation comparing text-provided versus embedded versions of the filtered and augmented questions would strengthen the causal interpretation. However, our OCR prompt experiments provide relevant evidence: explicitly supplying extracted text from the embedded images yields only minimal gains. This indicates the drop is unlikely to stem primarily from legibility or rendering issues, but rather from the demands of integrated reasoning. We will add an expanded discussion of this evidence and its implications in the revised manuscript. revision: yes

  2. Referee: The observation that OCR prompts have minimal effect (reported in the experiments) does not isolate whether the base visual extraction step itself is the bottleneck, as it tests only explicit prompting rather than the underlying perceptual capability on the embedded images.

    Authors: We agree that OCR prompting does not fully isolate inherent perceptual extraction capabilities independent of prompting. That said, the minimal effect even when text is explicitly provided still highlights that current models struggle with the integrated reasoning task central to MMMU-Pro. This supports the benchmark's utility for evaluating seamless visual-textual integration without requiring further changes to the manuscript. revision: no

Circularity Check

0 steps flagged

No circularity: empirical benchmark construction

full rationale

The paper constructs MMMU-Pro via three explicit steps on the prior MMMU dataset (text-only filtering, option augmentation, and vision-only image embedding), then reports empirical accuracy drops on external models. No equations, parameter fitting, derivations, or self-citations reduce any claim to its own inputs by construction. The performance range (16.8–26.9%) is an observed measurement, not a forced prediction or renamed input. The claims rest on evaluations of external models and contain no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about what counts as true multimodal understanding rather than free parameters or new invented entities.

axioms (2)
  • domain assumption Questions answerable by text-only models do not require visual understanding and can be safely filtered out.
    This premise underpins the first filtering step and is invoked to justify removal of certain questions.
  • domain assumption Embedding questions inside images cleanly tests integrated seeing-and-reading without introducing unrelated biases.
    This underpins the vision-only input setting and the claim that it mimics real-world cognitive skills.

pith-pipeline@v0.9.0 · 5517 in / 1365 out tokens · 58622 ms · 2026-05-14T00:45:48.228603+00:00 · methodology

discussion (0)


Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

    cs.MM 2026-05 unverdicted novelty 7.0

    Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.

  2. Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning

    cs.CV 2026-05 unverdicted novelty 7.0

    RAPO uses an information-theoretic lower bound on visual gain to select high-entropy reflection anchors and optimizes a chain-masked KL surrogate, delivering gains over baselines on reasoning benchmarks across LVLM backbones.

  3. COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts

    cs.CV 2026-04 unverdicted novelty 7.0

    COHERENCE is a new benchmark for measuring MLLMs' ability to recover fine-grained image-text correspondences in interleaved multimodal contexts.

  4. COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts

    cs.CV 2026-04 unverdicted novelty 7.0

    COHERENCE is a benchmark for MLLMs' fine-grained image-text alignment in interleaved multimodal contexts across four domains, with 6161 questions and six-type error analysis.

  5. Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding

    cs.CV 2026-04 unverdicted novelty 7.0

    Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.

  6. GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing

    cs.CV 2026-04 unverdicted novelty 7.0

    GeoMMBench reveals deficiencies in current multimodal LLMs for geoscience tasks while GeoMMAgent demonstrates that tool-integrated agents achieve significantly higher performance.

  7. Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

    cs.CV 2025-01 unverdicted novelty 7.0

    Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.

  8. Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

    cs.MM 2026-05 unverdicted novelty 6.0

    Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.

  9. Co-Evolving Policy Distillation

    cs.LG 2026-04 unverdicted novelty 6.0

    CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...

  10. Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward

    cs.CV 2026-04 unverdicted novelty 6.0

    Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and interpretability.

  11. Qwen3-Omni Technical Report

    cs.CL 2025-09 unverdicted novelty 6.0

    Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-mo...

  12. GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    cs.CV 2025-07 unverdicted novelty 6.0

    GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.

  13. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  14. OmniThoughtVis: A Scalable Distillation Pipeline for Deployable Multimodal Reasoning Models

    cs.CL 2026-05 unverdicted novelty 5.0

    OmniThoughtVis curates 1.8M multimodal CoT samples via teacher distillation, difficulty annotation, and tag-based sampling, yielding consistent gains on nine reasoning benchmarks and allowing 4B models to match or bea...

  15. Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding

    cs.LG 2026-04 unverdicted novelty 5.0

    A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.

  16. Qwen3.5-Omni Technical Report

    cs.CL 2026-04 unverdicted novelty 5.0

    Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding mul...

  17. SALLIE: Safeguarding Against Latent Language & Image Exploits

    cs.CR 2026-04 unverdicted novelty 5.0

    SALLIE detects jailbreaks in text and vision-language models by extracting residual stream activations, scoring maliciousness per layer with k-NN, and ensembling predictions, outperforming baselines on multiple datasets.

  18. Kimi K2.5: Visual Agentic Intelligence

    cs.CL 2026-02 unverdicted novelty 5.0

    Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.

  19. Qwen2.5-Omni Technical Report

    cs.CL 2025-03 conditional novelty 5.0

    Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text perfo...

  20. Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

    cs.CL 2025-03 unverdicted novelty 5.0

    Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.

  21. Measuring AI Reasoning: A Guide for Researchers

    cs.AI 2026-05 unverdicted novelty 4.0

    Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.

  22. Seed1.5-VL Technical Report

    cs.CV 2025-05 unverdicted novelty 4.0

    Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

  23. VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    cs.CV 2025-01 unverdicted novelty 4.0

    VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · cited by 21 Pith papers · 18 internal anchors

  1. [1]

    Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. 2024. https://arxiv.org/abs/2404.14219 Phi-3 technical report: A highly capable language model locally on your phone . ArXiv preprint, abs/2404.14219

  2. [2]

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. 2022. Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems

  3. [3]

    Anthropic. 2024. https://www.anthropic.com/news/claude-3-5-sonnet Claude 3.5 sonnet. https://www.anthropic.com/news/claude-3-5-sonnet

  4. [4]

    Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. https://doi.org/10.1109/ICCV.2015.279 VQA: visual question answering . In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015 , pages 2425--2433. IEEE Computer Society

  5. [5]

    Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. 2023. https://arxiv.org/abs/2308.01390 Openflamingo: An open-source framework for training large autoregressive vision-language models . ArXiv preprint, abs/2308.01390

  6. [6]

    Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. Uniter: Universal image-text representation learning. In European Conference on Computer Vision, pages 104--120

  7. [7]

    Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. 2024. https://arxiv.org/abs/2404.16821 How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites . ArXiv preprint, abs/2404.16821

  8. [8]

    Chenhang Cui, Yiyang Zhou, Xinyu Yang, Shirley Wu, Linjun Zhang, James Zou, and Huaxiu Yao. 2023. https://arxiv.org/abs/2311.03287 Holistic analysis of hallucination in gpt-4v (ision): Bias and interference challenges . ArXiv preprint, abs/2311.03287

  9. [9]

    Wenliang Dai, Junnan Li, DONGXU LI, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. 2023. https://proceedings.neurips.cc/paper_files/paper/2023/file/9a6a435e75419a836fe47ab6793623e6-Paper-Conference.pdf Instructblip: Towards general-purpose vision-language models with instruction tuning . In Advances in Neural Informat...

  10. [10]

    Mengnan Du, Fengxiang He, Na Zou, Dacheng Tao, and Xia Hu. 2023. Shortcut learning of large language models in natural language understanding. Communications of the ACM, 67(1):110--120

  11. [11]

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. https://arxiv.org/abs/2407.21783 The llama 3 herd of models . ArXiv preprint, abs/2407.21783

  12. [12]

    Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. 2024. https://arxiv.org/abs/2404.12390 Blink: Multimodal large language models can see but not perceive . ArXiv preprint, abs/2404.12390

  13. [13]

    Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, et al. 2023. https://arxiv.org/abs/2304.15010 Llama-adapter v2: Parameter-efficient visual instruction model . ArXiv preprint, abs/2304.15010

  14. [14]

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. https://doi.org/10.1109/CVPR.2017.670 Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 6325--6334....

  15. [15]

    gpt-4o. 2024. https://mistral.ai/news/mixtral-8x22b/ Cheaper, better, faster, stronger. https://mistral.ai/news/mixtral-8x22b/

  16. [16]

    Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, and Wenhu Chen. 2024. https://arxiv.org/abs/2405.01483 Mantis: Interleaved multi-image instruction tuning . ArXiv preprint, abs/2405.01483

  17. [17]

    Yizhang Jin, Jian Li, Yexin Liu, Tianjun Gu, Kai Wu, Zhengkai Jiang, Muyang He, Bo Zhao, Xin Tan, Zhenye Gan, et al. 2024. https://arxiv.org/abs/2405.10739 Efficient multimodal large language models: A survey . ArXiv preprint, abs/2405.10739

  18. [18]

    Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Russ Salakhutdinov, and Daniel Fried. 2024. https://aclanthology.org/2024.acl-long.50 VisualWebArena: Evaluating multimodal agents on realistic visual web tasks. In Proceedings of the 62nd Annual Meeting of the Association for Computational Lingu...

  19. [19]

    Hugo Laurençon, Andrés Marafioti, Victor Sanh, and Léo Tronchon. 2024. https://arxiv.org/abs/2408.12637 Building and better understanding vision-language models: insights and future directions. ArXiv preprint, abs/2408.12637

  20. [20]

    Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. 2023. https://arxiv.org/abs/2305.03726 Otter: A multi-modal model with in-context instruction tuning . ArXiv preprint, abs/2305.03726

  21. [21]

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. 2024 a . https://arxiv.org/abs/2408.03326 Llava-onevision: Easy visual task transfer . ArXiv preprint, abs/2408.03326

  22. [22]

    Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, and Ying Shan. 2024 b . Seed-bench: Benchmarking multimodal large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13299--13308

  23. [23]

    Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. 2024 c . https://arxiv.org/abs/2407.07895 Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models . ArXiv preprint, abs/2407.07895

  24. [24]

    Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. 2020. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Computer Vision--ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part XXX 16, pages 121--137. Springer

  25. [25]

    Zekun Li, Xianjun Yang, Kyuri Choi, Wanrong Zhu, Ryan Hsieh, HyeonJung Kim, Jin Hyuk Lim, Sungyoung Ji, Byungju Lee, Xifeng Yan, et al. 2024 d . https://arxiv.org/abs/2407.04903 Mmsci: A multimodal multi-discipline dataset for phd-level scientific comprehension . ArXiv preprint, abs/2407.04903

  26. [26]

    Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. 2024. Vila: On pre-training for visual language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26689--26699

  27. [27]

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In Computer Vision--ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740--755. Springer

  28. [28]

    Fuxiao Liu, Tianrui Guan, Zongxia Li, Lichang Chen, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. 2023 a . https://arxiv.org/abs/2310.14566 Hallusionbench: You see what you think? or you think what you see? an image-context reasoning benchmark challenging for gpt-4v (ision), llava-1.5, and other multi-modality models . ArXiv preprint, abs/2310.14566

  29. [29]

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2023 b . https://openreview.net/forum?id=yx3Hkx5ved Improved baselines with visual instruction tuning . In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following

  30. [30]

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. 2024 a . https://llava-vl.github.io/blog/2024-01-30-llava-next/ Llava-next: Improved reasoning, ocr, and world knowledge

  31. [31]

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023 c . https://proceedings.neurips.cc/paper_files/paper/2023/file/6dcf277ea32ce3288914faf369fe6de0-Paper-Conference.pdf Visual instruction tuning . In Advances in Neural Information Processing Systems, volume 36, pages 34892--34916. Curran Associates, Inc

  32. [32]

    Junpeng Liu, Yifan Song, Bill Yuchen Lin, Wai Lam, Graham Neubig, Yuanzhi Li, and Xiang Yue. 2024 b . Visualwebbench: How far have multimodal llms evolved in web page understanding and grounding? Conference on Language Modeling

  33. [33]

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. 2023 d . https://arxiv.org/abs/2307.06281 Mmbench: Is your multi-modal model an all-around player? ArXiv preprint, abs/2307.06281

  34. [34]

    Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. https://proceedings.neurips.cc/paper/2019/hash/c74d97b01eae257e44aa9d5bade97baf-Abstract.html Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks . In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processi...

  35. [35]

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. 2023 a . https://arxiv.org/abs/2310.02255 Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts . ArXiv preprint, abs/2310.02255

  36. [36]

    Yujie Lu, Xiujun Li, William Yang Wang, and Yejin Choi. 2023 b . https://arxiv.org/abs/2311.17647 Vim: Probing multimodal large language models for visual embedded instruction following . ArXiv preprint, abs/2311.17647

  37. [37]

    Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. 2019. https://doi.org/10.1109/CVPR.2019.00331 OK-VQA: A visual question answering benchmark requiring external knowledge . In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019 , pages 3195--3204. Computer Vision Foundation / IEEE

  38. [38]

    Mistral. 2024. https://mistral.ai/news/pixtral-12b Pixtral-12b. https://mistral.ai/news/pixtral-12b

  39. [39]

    Masoud Monajatipoor, Liunian Harold Li, Mozhdeh Rouhsedaghat, Lin Yang, and Kai-Wei Chang. 2023. https://doi.org/10.18653/v1/2023.acl-short.43 MetaVL: Transferring in-context learning ability from language models to vision-language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Paper...

  40. [40]

    OpenAI. 2023. https://cdn.openai.com/papers/GPTV_System_Card.pdf Gpt-4v(ision) system card

  41. [41]

    OpenAI. 2024 a . https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/ Gpt-4o mini: advancing cost-efficient intelligence. https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/

  42. [42]

    OpenAI. 2024b. https://openai.com/index/hello-gpt-4o/ Hello GPT-4o. https://openai.com/index/hello-gpt-4o/

  43. [43]

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. 2023. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193

  44. [44]

    Qwen. 2024. Qwen2-vl: To see the world more clearly. https://qwenlm.github.io/blog/qwen2-vl/

  45. [45]

    Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. 2024. https://arxiv.org/abs/2403.05530 Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context . ArXiv preprint, abs/2403.05530

  46. [46]

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. 2023. https://arxiv.org/abs/2312.11805 Gemini: a family of highly capable multimodal models . ArXiv preprint, abs/2312.11805

  47. [47]

    Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. 2024 a . https://arxiv.org/abs/2406.16860 Cambrian-1: A fully open, vision-centric exploration of multimodal llms . ArXiv preprint, abs/2406.16860

  48. [48]

    Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. 2024 b . Eyes wide shut? exploring the visual shortcomings of multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9568--9578

  49. [49]

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. 2024. https://arxiv.org/abs/2406.01574 Mmlu-pro: A more robust and challenging multi-task language understanding benchmark . ArXiv preprint, abs/2406.01574

  50. [50]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824--24837

  51. [51]

    Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, and Zaid Harchaoui. 2024. From decoding to meta-generation: Inference-time algorithms for large language models. arXiv preprint arXiv:2406.16838

  52. [52]

    Penghao Wu and Saining Xie. 2024. V*: Guided visual search as a core mechanism in multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13084--13094

  53. [53]

    Peng Xu, Wenqi Shao, Kaipeng Zhang, Peng Gao, Shuo Liu, Meng Lei, Fanqing Meng, Siyuan Huang, Yu Qiao, and Ping Luo. 2023. https://arxiv.org/abs/2306.09265 Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models . ArXiv preprint, abs/2306.09265

  54. [54]

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. 2024. https://arxiv.org/abs/2407.10671 Qwen2 technical report . ArXiv preprint, abs/2407.10671

  55. [55]

    Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. 2024. https://arxiv.org/abs/2408.01800 Minicpm-v: A gpt-4v level mllm on your phone . ArXiv preprint, abs/2408.01800

  56. [56]

    Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. 2023 a . https://arxiv.org/abs/2304.14178 mplug-owl: Modularization empowers large language models with multimodality . ArXiv preprint, abs/2304.14178

  57. [57]

    Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. 2023 b . https://arxiv.org/abs/2311.04257 mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration . ArXiv preprint, abs/2311.04257

  58. [58]

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. 2023 a . https://arxiv.org/abs/2306.13549 A survey on multimodal large language models . ArXiv preprint, abs/2306.13549

  59. [59]

    Zhenfei Yin, Jiong Wang, Jianjian Cao, Zhelun Shi, Dingning Liu, Mukai Li, Xiaoshui Huang, Zhiyong Wang, Lu Sheng, LEI BAI, Jing Shao, and Wanli Ouyang. 2023 b . https://proceedings.neurips.cc/paper_files/paper/2023/file/548a41b9cac6f50dccf7e63e9e1b1b9b-Paper-Datasets_and_Benchmarks.pdf Lamm: Language-assisted multi-modal instruction-tuning dataset, frame...

  60. [60]

    Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, et al. 2024. https://arxiv.org/abs/2403.04652 Yi: Open foundation models by 01.AI. ArXiv preprint, abs/2403.04652

  61. [61]

    Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. 2024. https://proceedings.mlr.press/v235/yu24o.html MM-Vet: Evaluating large multimodal models for integrated capabilities. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Rese...

  62. [62]

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. 2024. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556--9567

  63. [63]

    Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. 2023. When and why vision-language models behave like bags-of-words, and what to do about it? In The Eleventh International Conference on Learning Representations

  64. [64]

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. 2023. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975--11986

  65. [65]

    Pan Zhang, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Rui Qian, Lin Chen, Qipeng Guo, Haodong Duan, Bin Wang, Linke Ouyang, Songyang Zhang, Wenwei Zhang, Yining Li, Yang Gao, Peng Sun, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Hang Yan, Conghui He, Xingcheng Zhang, Kai Chen, Jifeng Dai, Yu Qiao, Dahua Lin, and Jiaqi Wang. 2024 a . https://arxiv.org/abs/24...

  66. [66]

    Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. 2021. https://doi.org/10.1109/CVPR46437.2021.00553 Vinvl: Revisiting visual representations in vision-language models . In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021 , pages 5579--5588. Computer ...

  67. [67]

    Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. 2023. https://arxiv.org/abs/2303.16199 Llama-adapter: Efficient fine-tuning of language models with zero-init attention . ArXiv preprint, abs/2303.16199

  68. [68]

    Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, et al. 2024 b . Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? arXiv preprint arXiv:2403.14624

  69. [69]

    Bo Zhao, Boya Wu, and Tiejun Huang. 2023. https://arxiv.org/abs/2307.04087 Svit: Scaling up visual instruction tuning . ArXiv preprint, abs/2307.04087

  70. [70]

    Haozhe Zhao, Zefan Cai, Shuzheng Si, Xiaojian Ma, Kaikai An, Liang Chen, Zixuan Liu, Sheng Wang, Wenjuan Han, and Baobao Chang. 2024. https://arxiv.org/abs/2309.07915 Mmicl: Empowering vision-language model with multi-modal in-context learning . The Twelfth International Conference on Learning Representations

  71. [71]

    Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. 2024. https://openreview.net/forum?id=piecKJ2DlB Gpt-4v(ision) is a generalist web agent, if grounded . In Forty-first International Conference on Machine Learning

  72. [72]

    Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, and Jianfeng Gao. 2020. https://aaai.org/ojs/index.php/AAAI/article/view/7005 Unified vision-language pre-training for image captioning and VQA . In Proceedings of the AAAI Conference on Artificial Intelligence, 34, pages 13041--13049. AAAI Press

  73. [73]

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 2023. https://arxiv.org/abs/2304.10592 Minigpt-4: Enhancing vision-language understanding with advanced large language models . ArXiv preprint, abs/2304.10592