pith. machine review for the scientific record.

arXiv:2604.20806 · v1 · submitted 2026-04-22 · 💻 cs.CV · cs.AI · cs.CL

Recognition: unknown

OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 00:37 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL
keywords multi-image reasoning · Olympiad benchmark · large vision-language models · multimodal reasoning · benchmark evaluation · scientific Olympiads · LVLM performance

The pith

Current top vision-language models achieve only about 50% accuracy on Olympiad problems that require reasoning across multiple images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OMIBench as a benchmark to test how well large vision-language models handle Olympiad problems where information must be gathered from several images. It covers problems in biology, chemistry, mathematics, and physics, each supplied with human-written rationales for the solutions. Tests reveal that even the best models reach only around 50 percent correct answers under these conditions. The benchmark also includes clear protocols for scoring answers either by exact match or by semantic similarity. This setup helps identify and address weaknesses in combining visual details from multiple sources.

Core claim

The authors create OMIBench, a benchmark of Olympiad-level problems from four scientific fields that require multi-image reasoning, complete with manually annotated rationales and evaluation methods for exact and semantic matching. They report that leading LVLMs, including Gemini-3-Pro, attain only about 50% performance, exposing gaps in current systems' ability to integrate distributed visual evidence.

What carries the argument

OMIBench, a dataset of multi-image Olympiad problems accompanied by annotated rationales and protocols for both exact and semantic answer evaluation.

If this is right

  • Leading LVLMs show clear performance shortfalls on tasks needing evidence from multiple images.
  • Gemini-3-Pro and similar models reach only approximately 50% accuracy on the benchmark.
  • The benchmark provides tools for researchers to measure and improve multi-image reasoning capabilities.
  • Evaluation can use either strict exact matching or more flexible semantic matching of answers.
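The two scoring protocols can be sketched in a few lines. This is an illustrative stand-in only: the whitespace/case normalization and the `SequenceMatcher` similarity threshold are assumptions, not the paper's actual matching rules, which may instead rely on an embedding model or an LLM judge.

```python
from difflib import SequenceMatcher

def exact_match(pred: str, gold: str) -> bool:
    """Strict protocol: normalize whitespace and case, then require equality."""
    norm = lambda s: " ".join(s.lower().split())
    return norm(pred) == norm(gold)

def semantic_match(pred: str, gold: str, threshold: float = 0.8) -> bool:
    """Lenient protocol: accept answers above a similarity threshold.
    String similarity here is a stand-in for whatever semantic scorer
    (embeddings, LLM judge) the benchmark actually uses."""
    return SequenceMatcher(None, pred.lower(), gold.lower()).ratio() >= threshold

# A free-form answer can fail the strict check but pass the lenient one.
print(exact_match("2.5 mol", "2.5 moles"))     # False
print(semantic_match("2.5 mol", "2.5 moles"))  # True
```

Note that the threshold is a free choice: reported accuracy under semantic matching moves with it, which is one reason the protocols need to be documented precisely.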

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future model designs may need explicit mechanisms to link and reason over separate images rather than processing them independently.
  • Existing single-image benchmarks could be overestimating real capabilities for complex, distributed visual tasks.
  • Training data that splits related information across images might help close the observed gaps.
  • Such benchmarks could prove useful in other fields involving multiple visual inputs, like interpreting sets of scientific figures.

Load-bearing premise

The chosen Olympiad problems together with their manually annotated rationales represent the true demands of multi-image reasoning in real Olympiad settings.

What would settle it

Finding that several state-of-the-art LVLMs achieve substantially higher than 50% accuracy on OMIBench, say above 75%, would cast doubt on the extent of the reported reasoning limitations.

Figures

Figures reproduced from arXiv:2604.20806 by Chengyu Luan, Jiajun Wu, Jingqi Tong, Libo Qin, Qiguang Chen, Qiming Yu, Wanxiang Che, Xiachong Feng, Yi Yang, Yizhuo Li.

Figure 1: Experiment setup
Figure 2: …it hallucinated a non-existent theorem based on visual similarity to… (caption truncated)
Figure 4 (caption unavailable)
Figure 1: Cofactor Structures; Figure 2: Reaction Scheme (B1 → B5/B6)
Figure 2: Reaction Scheme (B1 → B5/B6)
Figure 1: Historical Context (Rabbits at Waterhole)
Figure 1: Growth Timeline & Primordia Count
Figure 2: Time Data (When)
Figure 1: General Hydrolysis
Figure 1: Geometric
Original abstract

Large vision-language models (LVLMs) have made substantial advances in reasoning tasks at the Olympiad level. Nevertheless, current Olympiad-level multimodal reasoning benchmarks for these models often emphasize single-image analysis and fail to exploit contextual information across multiple images. We present OMIBench, a benchmark designed to evaluate Olympiad-level reasoning when the required evidence is distributed over multiple images. It contains problems from biology, chemistry, mathematics, and physics Olympiads, together with manually annotated rationales and evaluation protocols for both exact and semantic answer matching. Across extensive experiments on OMIBench, we observe meaningful performance gaps in existing models. Even the strongest LVLMs, such as Gemini-3-Pro, attain only about 50% on the benchmark. These results position OMIBench as a focused resource for studying and improving multi-image reasoning in LVLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces OMIBench, a benchmark for Olympiad-level multi-image reasoning in LVLMs drawn from biology, chemistry, mathematics, and physics problems. It supplies manually annotated rationales and protocols for exact and semantic answer matching. Experiments across multiple LVLMs report performance gaps, with the strongest model (Gemini-3-Pro) reaching only ~50% accuracy, and position the benchmark as a resource for studying distributed visual evidence in complex reasoning.

Significance. If the problems are shown to require cross-image integration, OMIBench would provide a useful diagnostic resource for an under-tested capability in current LVLMs. The manual rationales could support targeted error analysis, and the multi-domain coverage adds breadth. The work does not include machine-checked elements or parameter-free derivations but offers a concrete empirical testbed.

major comments (3)
  1. §3 (Benchmark Construction): No single-image or text-only ablations are reported, nor is there a count of problems solvable from any single image or explicit verification that rationales cite cross-image dependencies. This is load-bearing for interpreting the ~50% Gemini-3-Pro result as evidence of a multi-image reasoning gap rather than general Olympiad hardness.
  2. §3.2 (Annotation and Validation): The manuscript provides no inter-annotator agreement statistics, no protocol for confirming that annotations isolate multi-image requirements, and no details on data selection criteria to ensure problems cannot be solved without all images. These omissions weaken the central claim that OMIBench specifically measures multi-image reasoning.
  3. §4 (Experiments and Results): The reported model accuracies lack per-domain breakdowns, variance estimates, statistical significance tests, or analysis of failure modes tied to image distribution. Without these, the performance gaps cannot be rigorously attributed to the multi-image aspect highlighted in the abstract.
minor comments (2)
  1. The abstract and introduction could more precisely state the total number of problems, the distribution across domains, and the exact evaluation metrics used for semantic matching.
  2. Figure captions and table headers would benefit from explicit definitions of the exact vs. semantic matching protocols to improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We agree that additional analyses are needed to more rigorously establish that OMIBench isolates multi-image reasoning capabilities. We address each major comment below and will incorporate the suggested revisions to strengthen the paper.

Point-by-point responses
  1. Referee: §3 (Benchmark Construction): No single-image or text-only ablations are reported, nor is there a count of problems solvable from any single image or explicit verification that rationales cite cross-image dependencies. This is load-bearing for interpreting the ~50% Gemini-3-Pro result as evidence of a multi-image reasoning gap rather than general Olympiad hardness.

    Authors: We agree that ablations are essential to substantiate the multi-image focus. In the revised manuscript, we will add single-image and text-only baselines on a representative subset of problems. We will also report the number of problems that, per the annotated rationales, require evidence from multiple images and confirm that the rationales explicitly reference cross-image dependencies. These additions will help differentiate multi-image reasoning challenges from general Olympiad difficulty. revision: yes

  2. Referee: §3.2 (Annotation and Validation): The manuscript provides no inter-annotator agreement statistics, no protocol for confirming that annotations isolate multi-image requirements, and no details on data selection criteria to ensure problems cannot be solved without all images. These omissions weaken the central claim that OMIBench specifically measures multi-image reasoning.

    Authors: We acknowledge the need for greater transparency in the annotation process. Although the rationales were created and cross-checked by domain experts, we will include inter-annotator agreement statistics in the revision. We will also document the protocol used to verify that problems require all provided images and detail the selection criteria that excluded problems solvable from any single image or text alone. revision: yes

  3. Referee: §4 (Experiments and Results): The reported model accuracies lack per-domain breakdowns, variance estimates, statistical significance tests, or analysis of failure modes tied to image distribution. Without these, the performance gaps cannot be rigorously attributed to the multi-image aspect highlighted in the abstract.

    Authors: We will expand the experimental results to include per-domain accuracy breakdowns for all evaluated models. Where multiple runs are feasible, we will report variance estimates and apply statistical significance tests. We will further add a failure-mode analysis that categorizes errors according to whether they arise from difficulties in cross-image integration versus other reasoning limitations. revision: yes
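One standard way to supply the variance estimates and significance tests discussed above is a paired bootstrap over problems. The sketch below is a generic recipe with invented 0/1 per-problem scores, not the authors' analysis; the model names and numbers are placeholders.

```python
import random

def bootstrap_diff_ci(correct_a, correct_b, n_boot=10_000, seed=0):
    """Paired bootstrap: resample problems with replacement and record
    the accuracy difference between two models on each resample.
    Returns an approximate 95% confidence interval for the difference."""
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # one resample of problem indices
        acc_a = sum(correct_a[i] for i in idx) / n
        acc_b = sum(correct_b[i] for i in idx) / n
        diffs.append(acc_a - acc_b)
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]

# Toy per-problem 0/1 scores for two hypothetical models.
a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
b = [1, 0, 0, 1, 0, 0, 1, 0, 1, 0]
lo, hi = bootstrap_diff_ci(a, b)
print(f"95% CI for accuracy difference: [{lo:.2f}, {hi:.2f}]")
```

If the interval excludes zero, the gap between the two models is unlikely to be a resampling artifact; the same machinery applies per domain once per-domain scores are reported.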

Circularity Check

0 steps flagged

No circularity: benchmark construction is empirical and self-contained

full rationale

The paper creates OMIBench by curating Olympiad problems from biology, chemistry, mathematics, and physics, supplying manually annotated rationales, and reporting direct empirical accuracy of LVLMs (e.g., Gemini-3-Pro at ~50%). No equations, fitted parameters, predictions, or derivations exist that could reduce to inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked. The central claim is an observed performance gap on the new dataset, which is measured externally rather than derived from prior fitted quantities or self-referential definitions. This is the expected non-circular outcome for a benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that the curated problems test genuine multi-image reasoning and that the evaluation protocols are reliable; no free parameters or invented physical entities are involved.

axioms (1)
  • Domain assumption: Olympiad problems frequently require integrating information distributed across multiple images.
    Invoked in the motivation and design of the benchmark, as stated in the abstract.
invented entities (1)
  • OMIBench (no independent evidence)
    purpose: Dataset and evaluation framework for multi-image Olympiad reasoning
    A newly introduced benchmark; the abstract mentions no external independent validation.

pith-pipeline@v0.9.0 · 5483 in / 1173 out tokens · 32574 ms · 2026-05-10T00:37:27.669492+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

99 extracted references · 37 canonical work pages · 15 internal anchors

  1. [1]

    American Invitational Mathematics Examination (AIME) 2024-I & II, 2024

    AIME. American Invitational Mathematics Examination (AIME) 2024-I & II, 2024. URL https://huggingface.co/datasets/Maxwell-Jia/AIME_2024

  2. [2]

    American Invitational Mathematics Examination (AIME) 2025-I & II, 2025

    AIME. American Invitational Mathematics Examination (AIME) 2025-I & II, 2025. URL https://huggingface.co/datasets/opencompass/AIME2025

  3. [3]

    Probing the limitations of multimodal language models for chemistry and materials research. Nature Computational Science, pages 1–10, 2025

    Nawaf Alampara, Mara Schilling-Wilhelmi, Martiño Ríos-García, Indrajeet Mandal, Pranav Khetarpal, Hargun Singh Grover, NM Anoop Krishnan, and Kevin Maik Jablonka. Probing the limitations of multimodal language models for chemistry and materials research. Nature Computational Science, pages 1–10, 2025

  4. [4]

    American mathematics competitions, 2023

    AMC. American mathematics competitions, 2023. URL https://artofproblemsolving.com/wiki/index.php/AMC_Problems_and_Solutions

  5. [5]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025

  6. [6]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025

  7. [7]

    An augmented benchmark dataset for geometric question answering through dual parallel text encoding

    Jie Cao and Jing Xiao. An augmented benchmark dataset for geometric question answering through dual parallel text encoding. In Nicoletta Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu, Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen Xue, Seokhwan Kim, Younggyun Hahm, Zhong ...

  8. [8]

    Geoqa: A geometric question answering benchmark towards multimodal numerical reasoning

    Jiaqi Chen, Jianheng Tang, Jinghui Qin, Xiaodan Liang, Lingbo Liu, Eric Xing, and Liang Lin. Geoqa: A geometric question answering benchmark towards multimodal numerical reasoning. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 513–523, 2021

  9. [9]

    Unigeo: Unifying geometry logical reasoning via reformulating mathematical expression

    Jiaqi Chen, Tong Li, Jinghui Qin, Pan Lu, Liang Lin, Chongyu Chen, and Xiaodan Liang. Unigeo: Unifying geometry logical reasoning via reformulating mathematical expression. arXiv preprint arXiv:2212.02746, 2022

  10. [10]

    Qiguang Chen, Libo Qin, Jiaqi Wang, Jinxuan Zhou, and Wanxiang Che. Unlocking the capabilities of thought: A reasoning boundary framework to quantify and optimize chain-of-thought. Advances in Neural Information Processing Systems, 37:54872–54904, 2024

  11. [11]

    M 3CoT: A novel benchmark for multi-domain multi-step multi-modal chain-of-thought

    Qiguang Chen, Libo Qin, Jin Zhang, Zhi Chen, Xiao Xu, and Wanxiang Che. M 3CoT: A novel benchmark for multi-domain multi-step multi-modal chain-of-thought. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8199–8221, Bangkok, Th...

  12. [12]

    Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wanxiang Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models. arXiv preprint arXiv:2503.09567, 2025

  13. [13]

    Ai4research: A survey of artificial intelligence for scientific research. arXiv preprint arXiv:2507.01903, 2025

    Qiguang Chen, Mingda Yang, Libo Qin, Jinhao Liu, Zheng Yan, Jiannan Guan, Dengyun Peng, Yiyan Ji, Hanjing Li, Mengkang Hu, et al. Ai4research: A survey of artificial intelligence for scientific research. arXiv preprint arXiv:2507.01903, 2025

  14. [14]

    The molecular structure of thought: Mapping the topology of long chain-of-thought reasoning

    Qiguang Chen, Yantao Du, Ziniu Li, Jinhao Liu, Songyao Duan, Jiarui Guo, Minghao Liu, Jiaheng Liu, Tong Yang, Ge Zhang, et al. The molecular structure of thought: Mapping the topology of long chain-of-thought reasoning. arXiv preprint arXiv:2601.06002, 2026

  15. [15]

    CogFlow: Bridging perception and reasoning through knowledge internalization for visual mathematical problem solving

    Shuhang Chen, Yunqiu Xu, Junjie Xie, Aojun Lu, Tao Feng, Zeying Huang, Ning Zhang, Yi Sun, Yi Yang, and Hangjie Yuan. CogFlow: Bridging perception and reasoning through knowledge internalization for visual mathematical problem solving. In International Conference on Learning Representations (ICLR), 2026

  16. [16]

    Visual thoughts: A unified perspective of understanding multimodal chain-of-thought. arXiv preprint arXiv:2505.15510, 2025

    Zihui Cheng, Qiguang Chen, Xiao Xu, Jiaqi Wang, Weiyun Wang, Hao Fei, Yidong Wang, Alex Jinpeng Wang, Zhi Chen, Wanxiang Che, et al. Visual thoughts: A unified perspective of understanding multimodal chain-of-thought. arXiv preprint arXiv:2505.15510, 2025

  17. [17]

    Comt: A novel benchmark for chain of multi-modal thought on large vision-language models

    Zihui Cheng, Qiguang Chen, Jin Zhang, Hao Fei, Xiaocheng Feng, Wanxiang Che, Min Li, and Libo Qin. Comt: A novel benchmark for chain of multi-modal thought on large vision-language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 23678–23686, 2025

  18. [18]

    Evaluating mllms with multimodal multi-image reasoning benchmark

    Ziming Cheng, Binrui Xu, Lisheng Gong, Zuhe Song, Tianshuo Zhou, Shiqi Zhong, Siyu Ren, Mingxiang Chen, Xiangchao Meng, Yuxin Zhang, et al. Evaluating mllms with multimodal multi-image reasoning benchmark. arXiv preprint arXiv:2506.04280, 2025

  19. [19]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025

  20. [20]

    From easy to hard: The mir benchmark for progressive interleaved multi-image reasoning

    Hang Du, Jiayang Zhang, Guoshun Nan, Wendi Deng, Zhenyan Chen, Chenyang Zhang, Wang Xiao, Shan Huang, Yuqi Pan, Tao Qi, et al. From easy to hard: The mir benchmark for progressive interleaved multi-image reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 859–869, 2025

  21. [21]

    Blink: Multimodal large language models can see but not perceive

    Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive. In European Conference on Computer Vision, pages 148–166. Springer, 2024

  22. [22]

    Gemini 3: Technical report

    Google DeepMind. Gemini 3: Technical report. Technical report, 2025. https://deepmind.google/

  23. [23]

    Can MLLMs reason in multimodality? EMMA: An enhanced multimodal reasoning benchmark

    Yunzhuo Hao, Jiawei Gu, Huichen Will Wang, Linjie Li, Zhengyuan Yang, Lijuan Wang, and Yu Cheng. Can MLLMs reason in multimodality? EMMA: An enhanced multimodal reasoning benchmark. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=v26vwjxOEz

  24. [24]

    Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Paper...

  25. [25]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

  26. [26]

    Chemistry race/chemiklánı: Team-based competition in chemistry. Journal of Chemical Education, 98(12):3878–3883, 2021

    Jan Hrubes, Adam Tywoniak, Martin Balouch, Stanislav Chvíla, and Jan Hrabovsky. Chemistry race/chemiklánı: Team-based competition in chemistry. Journal of Chemical Education, 98(12):3878–3883, 2021

  27. [27]

    Visual sketchpad: Sketching as a visual chain of thought for multimodal language models

    Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  28. [28]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024

  29. [29]

    Mpcc: A novel benchmark for multimodal planning with complex constraints in multimodal large language models

    Yiyan Ji, Haoran Chen, Qiguang Chen, Chengyue Wu, Libo Qin, and Wanxiang Che. Mpcc: A novel benchmark for multimodal planning with complex constraints in multimodal large language models. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 5188–5197, 2025

  30. [30]

    Mantis: Interleaved multi-image instruction tuning. Transactions on Machine Learning Research, 2024

    Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, and Wenhu Chen. Mantis: Interleaved multi-image instruction tuning. Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=skLtdUVaJa

  31. [31]

    MME-cot: Benchmarking chain-of-thought in large multimodal models for reasoning quality, robustness, and efficiency

    Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanwei Li, Yu Qi, Xinyan Chen, Liuhui Wang, Jianhan Jin, Claire Guo, Shen Yan, Bo Zhang, Chaoyou Fu, Peng Gao, and Hongsheng Li. MME-cot: Benchmarking chain-of-thought in large multimodal models for reasoning quality, robustness, and efficiency. In Forty-second International Conference on Machine Learning, 2025. URLh...

  32. [32]

    Remi: A dataset for reasoning with multiple images. Advances in Neural Information Processing Systems, 37:60088–60109, 2024

    Mehran Kazemi, Nishanth Dikkala, Ankit Anand, Petar Devic, Ishita Dasgupta, Fangyu Liu, Bahare Fatemi, Pranjal Awasthi, Sreenivas Gollapudi, Dee Guo, et al. Remi: A dataset for reasoning with multiple images. Advances in Neural Information Processing Systems, 37:60088–60109, 2024

  33. [33]

    From System 1 to System 2: A Survey of Reasoning Large Language Models

    Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, et al. From system 1 to system 2: A survey of reasoning large language models. arXiv preprint arXiv:2502.17419, 2025

  34. [34]

    Mibench: Evaluating multimodal large language models over multiple images. arXiv preprint arXiv:2407.15272, 2024a

    Haowei Liu, Xi Zhang, Haiyang Xu, Yaya Shi, Chaoya Jiang, Ming Yan, Ji Zhang, Fei Huang, Chunfeng Yuan, Bing Li, et al. Mibench: Evaluating multimodal large language models over multiple images. arXiv preprint arXiv:2407.15272, 2024

  35. [35]

    Mathematical language models: A survey. ACM Computing Surveys, 2025

    Wentao Liu, Hanglei Hu, Jie Zhou, Yuyang Ding, Junsong Li, Jiayi Zeng, Mengliang He, Qin Chen, Bo Jiang, Aimin Zhou, et al. Mathematical language models: A survey. ACM Computing Surveys, 2025

  36. [36]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019

  37. [37]

    MMDU: A multi-turn multi-image dialog understanding benchmark and instruction-tuning dataset for LVLMs

    Ziyu Liu, Tao Chu, Yuhang Zang, Xilin Wei, Xiaoyi Dong, Pan Zhang, Zijian Liang, Yuanjun Xiong, Yu Qiao, Dahua Lin, and Jiaqi Wang. MMDU: A multi-turn multi-image dialog understanding benchmark and instruction-tuning dataset for LVLMs. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  38. [38]

    Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning. arXiv preprint arXiv:2105.04165, 2021

    Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning. arXiv preprint arXiv:2105.04165, 2021

  39. [39]

    Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022

    Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022

  40. [40]

    Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=KUNzEQMWU7

  41. [41]

    Deep learning methods for abstract visual reasoning: A survey on Raven's progressive matrices. ACM Computing Surveys, 57(7):1–36, 2025

    Mikołaj Małkiński and Jacek Mańdziuk. Deep learning methods for abstract visual reasoning: A survey on Raven's progressive matrices. ACM Computing Surveys, 57(7):1–36, 2025

  42. [42]

    Mmiu: Multimodal multi-image understanding for evaluating large vision-language models. arXiv preprint arXiv:2408.02718, 2024

    Fanqing Meng, Jin Wang, Chuanhao Li, Quanfeng Lu, Hao Tian, Jiaqi Liao, Xizhou Zhu, Jifeng Dai, Yu Qiao, Ping Luo, et al. Mmiu: Multimodal multi-image understanding for evaluating large vision-language models. arXiv preprint arXiv:2408.02718, 2024

  43. [43]

    Evaluating AI’s ability to perform scientific research tasks

    OpenAI. Evaluating AI’s ability to perform scientific research tasks. OpenAI Blog, 2025. https://openai.com/index/frontierscience/

  44. [44]

    GPT-5 system card

    OpenAI. GPT-5 system card. Technical report, 2025. https://openai.com/

  45. [45]

    OpenAI o4-mini System Card

    OpenAI. OpenAI o4-mini System Card. Technical report, 2025. https://openai.com/

  46. [46]

    What factors affect multi-modal in-context learning? an in-depth exploration. Advances in Neural Information Processing Systems, 37:123207–123236, 2024

    Libo Qin, Qiguang Chen, Hao Fei, Zhi Chen, Min Li, and Wanxiang Che. What factors affect multi-modal in-context learning? an in-depth exploration. Advances in Neural Information Processing Systems, 37:123207–123236, 2024

  47. [47]

    Scifibench: Benchmarking large multimodal models for scientific figure interpretation. Advances in Neural Information Processing Systems, 37:18695–18728, 2024

    Jonathan Roberts, Kai Han, Neil Houlsby, and Samuel Albanie. Scifibench: Benchmarking large multimodal models for scientific figure interpretation. Advances in Neural Information Processing Systems, 37:18695–18728, 2024

  48. [48]

    Semi-off-policy reinforcement learning for vision-language slow-thinking reasoning. arXiv preprint arXiv:2507.16814, 2025

    Junhao Shen, Haiteng Zhao, Yuzhe Gu, Songyang Gao, Kuikun Liu, Haian Huang, Jianfei Gao, Dahua Lin, Wenwei Zhang, and Kai Chen. Semi-off-policy reinforcement learning for vision-language slow-thinking reasoning. arXiv preprint arXiv:2507.16814, 2025

  49. [49]

    Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

    Zhaochen Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, et al. Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers. arXiv preprint arXiv:2506.23918, 2025

  50. [50]

    Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models

    Haoxiang Sun, Yingqian Min, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, Zheng Liu, Zhongyuan Wang, and Ji-Rong Wen. Challenging the boundaries of reasoning: An olympiad-level math benchmark for large language models. arXiv preprint arXiv:2503.21380, 2025

  51. [51]

    Kimi-VL Technical Report

    Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report. arXiv preprint arXiv:2504.07491, 2025

  52. [52]

    Kwai keye-vl technical report. arXiv preprint arXiv:2507.01949, 2025

    Kwai Keye Team, Biao Yang, Bin Wen, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, et al. Kwai keye-vl technical report. arXiv preprint arXiv:2507.01949, 2025

  53. [53]

    Physics big, 2024

    Zaharov Timur, Konstantin Korolev, and Aleksandr Nikolich. Physics big, 2024. URL https://huggingface.co/datasets/Vikhrmodels/physics_big

  54. [54]

    Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

    Jingqi Tong, Yurong Mou, Hangcheng Li, Mingzhe Li, Yongzhuo Yang, Ming Zhang, Qiguang Chen, Tianyi Liang, Xiaomeng Hu, Yining Zheng, et al. Thinking with video: Video generation as a promising multimodal reasoning paradigm. arXiv preprint arXiv:2511.04570, 2025

  55. [55]

    Srpo: Enhancing multimodal llm reasoning via reflection-aware reinforcement learning. arXiv preprint arXiv:2506.01713, 2025

    Zhongwei Wan, Zhihao Dou, Che Liu, Yu Zhang, Dongfei Cui, Qinjian Zhao, Hui Shen, Jing Xiong, Yi Xin, Yifan Jiang, et al. Srpo: Enhancing multimodal llm reasoning via reflection-aware reinforcement learning. arXiv preprint arXiv:2506.01713, 2025

  56. [56]

    Fei Wang, Xingyu Fu, James Y. Huang, Zekun Li, Qin Liu, Xiaogeng Liu, Mingyu Derek Ma, Nan Xu, Wenxuan Zhou, Kai Zhang, Tianyi Lorena Yan, Wenjie Jacky Mo, Hsiang-Hui Liu, Pan Lu, Chunyuan Li, Chaowei Xiao, Kai-Wei Chang, Dan Roth, Sheng Zhang, Hoifung Poon, and Muhao Chen. MuirBench: A comprehensive benchmark for robust multi-image understanding. In The...

  57. [57]

    Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long...

  58. [58]

    Peijie Wang, Zhong-Zhi Li, Fei Yin, Dekang Ran, and Cheng-Lin Liu. MV-Math: Evaluating multimodal math reasoning in multi-visual contexts. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19541–19551, 2025

  59. [59]

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025

  60. [60]

    Xiyao Wang, Yuhang Zhou, Xiaoyu Liu, Hongjin Lu, Yuancheng Xu, Feihong He, Jaehong Yoon, Taixi Lu, Fuxiao Liu, Gedas Bertasius, Mohit Bansal, Huaxiu Yao, and Furong Huang. Mementos: A comprehensive benchmark for multimodal large language model reasoning over image sequences. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics...

  61. [61]

    Yaoting Wang, Shengqiong Wu, Yuecheng Zhang, Shuicheng Yan, Ziwei Liu, Jiebo Luo, and Hao Fei. Multimodal chain-of-thought reasoning: A comprehensive survey. arXiv preprint arXiv:2503.12605, 2025

  62. [62]

    Haoran Wei, Youyang Yin, Yumeng Li, Jia Wang, Liang Zhao, Jianjian Sun, Zheng Ge, Xiangyu Zhang, and Daxin Jiang. Slow perception: Let’s perceive geometric figures step-by-step. arXiv preprint arXiv:2412.20631, 2024

  63. [63]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022

  64. [64]

    Wenshan Wu, Shaoguang Mao, Yadong Zhang, Yan Xia, Li Dong, Lei Cui, and Furu Wei. Mind’s eye of LLMs: Visualization-of-thought elicits spatial reasoning in large language models. Advances in Neural Information Processing Systems, 37:90277–90317, 2024

  65. [65]

    Yunqiu Xu, Linchao Zhu, and Yi Yang. MC-Bench: A benchmark for multi-context visual grounding in the era of MLLMs. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

  66. [66]

    Huanjin Yao, Jiaxing Huang, Yawen Qiu, Michael K Chen, Wenzheng Liu, Wei Zhang, Wenjie Zeng, Xikun Zhang, Jingyi Zhang, Yuxin Song, et al. MMReason: An open-ended multi-modal multi-step reasoning benchmark for MLLMs toward AGI. arXiv preprint arXiv:2506.23563, 2025

  67. [67]

    Fangchen Yu, Haiyuan Wan, Qianjia Cheng, Yuchen Zhang, Jiacheng Chen, Fujun Han, Yulun Wu, Junchi Yao, Ruilizhen Hu, Ning Ding, et al. HiPhO: How far are (M)LLMs from humans in the latest high school physics olympiad benchmark? arXiv preprint arXiv:2509.07894, 2025

  68. [68]

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI...

  69. [69]

    Zihao Yue, Zhenru Lin, Yifan Song, Weikun Wang, Shuhuai Ren, Shuhao Gu, Shicheng Li, Peidian Li, Liang Zhao, Lei Li, et al. MiMo-VL technical report. arXiv preprint arXiv:2506.03569, 2025

  70. [70]

    Yuheng Zha, Kun Zhou, Yujia Wu, Yushu Wang, Jie Feng, Zhi Xu, Shibo Hao, Zhengzhong Liu, Eric P Xing, and Zhiting Hu. Vision-G1: Towards general vision language reasoning with multi-domain data curation. arXiv preprint arXiv:2508.12680, 2025

  71. [71]

    Guanghao Zhang, Tao Zhong, Yan Xia, Mushui Liu, Zhelun Yu, Haoyuan Li, Wanggui He, Fangxun Shu, Dong She, Yi Wang, and Hao Jiang. CMMCoT: Enhancing complex multi-image comprehension via multi-modal chain-of-thought and memory augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, 2026

  72. [72]

    Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Yu Qiao, et al. MathVerse: Does your multi-modal LLM truly see the diagrams in visual math problems? In European Conference on Computer Vision, pages 169–186. Springer, 2024

  73. [73]

    Yuan Zhang, Ming Lu, Junwen Pan, Tao Huang, Kuan Cheng, Qi She, and Shanghang Zhang. ChainV: Atomic visual hints make multimodal reasoning shorter and better. arXiv preprint arXiv:2511.17106, 2025

  74. [74]

    Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models. Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=y1pPWFVfvR

  75. [75]

    Bingchen Zhao, Yongshuo Zong, Letian Zhang, and Timothy Hospedales. Benchmarking multi-image understanding in vision and language models: Perception, knowledge, reasoning, and multi-hop reasoning. arXiv preprint arXiv:2406.12742, 2024

  76. [76]

    Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. AGIEval: A human-centric benchmark for evaluating foundation models. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 2299–2314, 2024

  77. [77]

    Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V Le, and Ed H. Chi. Least-to-most prompting enables complex reasoning in large language models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=WZH7099tgfM

  79. [79]

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025

Appendix A. Data Construction Details

OMIBench was constructed via a rigorous multi-stage pipeli...

Accept with minor edits. Use this option when the rationale is fundamentally correct and complete, but has small issues such as:

    •Minor wording problems (e.g., awkward phrasing, ambiguous pronouns).
    •Slightly unclear transitions between steps.
    •Cosmetic inconsistencies in notation, symbols, or formatting.

In this case, annotators should directly edit the text to correc...
