pith. machine review for the scientific record.

arxiv: 2605.10187 · v2 · submitted 2026-05-11 · 💻 cs.CV

Recognition: unknown

SciVQR: A Multidisciplinary Multimodal Benchmark for Advanced Scientific Reasoning Evaluation

Authors on Pith · no claims yet

Pith reviewed 2026-05-14 21:46 UTC · model grok-4.3

classification 💻 cs.CV
keywords SciVQR · multimodal benchmark · scientific reasoning · multimodal large language models · multi-step inference · visual comprehension · interdisciplinary knowledge · AI evaluation

The pith

SciVQR benchmark reveals leading multimodal models fall short on complex scientific reasoning across 54 subfields.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SciVQR as a multimodal benchmark designed to test how well AI models perform scientific reasoning. It spans 54 subfields in mathematics, physics, chemistry, geography, astronomy, and biology, using visuals such as equations, charts, and diagrams. Tasks range from basic factual recall to multi-step inferences, with expert solutions provided for 46 percent of them to allow evaluation of both final answers and the reasoning steps themselves. Testing on current proprietary and open-source models shows clear limitations in handling these demands. This matters because better benchmarks can guide the development of AI systems capable of genuine scientific work.
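
To make the shape of such a benchmark concrete, here is a minimal sketch of what a single item could look like as a data structure. The field names and example values are illustrative assumptions, not the actual SciVQR schema; the real format ships with the released dataset.

from dataclasses import dataclass
from typing import Optional

# Hypothetical layout of a SciVQR-style item. Field names are assumptions
# for illustration; consult https://github.com/CASIA-IVA-Lab/SciVQR for the
# actual schema.
@dataclass
class SciItem:
    question: str                               # natural-language prompt
    image_path: str                             # domain-specific visual (equation, chart, diagram)
    discipline: str                             # one of the six disciplines, e.g. "chemistry"
    subfield: str                               # one of the 54 subfields
    answer: str                                 # gold final answer
    solution_steps: Optional[list[str]] = None  # expert-authored steps (~46% of items)

example = SciItem(
    question="Which labelled region of the phase diagram corresponds to the liquid phase?",
    image_path="images/phase_diagram.png",      # placeholder path
    discipline="chemistry",
    subfield="physical chemistry",
    answer="Region B",                          # placeholder answer
    solution_steps=None,                        # this item has no expert-authored solution
)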

Core claim

SciVQR is a benchmark that covers 54 subfields across six scientific disciplines and pairs domain-specific visuals with tasks that require visual comprehension plus multi-step inference. It evaluates not only the correctness of answers but also the traceability of the reasoning process, with expert-authored solutions supplied for 46 percent of the items. When applied to leading multimodal large language models, the benchmark exposes significant shortcomings in complex multimodal scientific reasoning.

What carries the argument

The SciVQR benchmark, which supplies domain-specific visuals and tasks that demand both visual understanding and multi-step reasoning across 54 subfields.

If this is right

  • Models will require stronger multi-step reasoning mechanisms to reach high performance on the benchmark.
  • Effective integration of knowledge across different scientific disciplines will become necessary for success.
  • Evaluation methods must track reasoning processes in addition to final answers (see the sketch after this list).
  • Public release of the dataset and code enables direct testing and training improvements.
  • Progress toward scientific intelligence in multimodal models can be measured against this standard.
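
A minimal sketch of what step-aware scoring could look like, reusing the hypothetical SciItem layout above. The substring-based step matching is a deliberately crude stand-in; none of this is the paper's released evaluation code.

def exact_match(pred: str, gold: str) -> bool:
    # Naive final-answer check; a real grader would normalize units,
    # LaTeX markup, and numeric tolerance before comparing.
    return pred.strip().lower() == gold.strip().lower()

def step_coverage(pred_steps: list[str], gold_steps: list[str]) -> float:
    # Crude proxy for reasoning traceability: the fraction of expert-authored
    # steps that appear (as substrings) somewhere in the model's trace.
    # A serious rubric would use semantic matching or human grading instead.
    if not gold_steps:
        return float("nan")  # item has no expert solution to trace against
    hits = sum(any(g.lower() in p.lower() for p in pred_steps) for g in gold_steps)
    return hits / len(gold_steps)

def score_item(item: SciItem, pred_answer: str, pred_steps: list[str]) -> dict:
    # Report both views: was the final answer right, and how much of the
    # reference reasoning is visible in the model's own steps.
    return {
        "answer_correct": exact_match(pred_answer, item.answer),
        "step_coverage": step_coverage(pred_steps, item.solution_steps or []),
    }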

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models that perform well on SciVQR may show improved ability to support real research workflows.
  • The approach could be extended to create comparable benchmarks in applied domains such as engineering.
  • Limitations observed here suggest that simply increasing model size may not resolve the gaps without targeted reasoning enhancements.
  • Baseline human performance data on SciVQR would help quantify how far current models remain from expert level.

Load-bearing premise

The tasks, visuals, and expert solutions chosen for SciVQR accurately reflect the complexity and traceability of real scientific reasoning.

What would settle it

A study in which domain experts solve a sample of SciVQR tasks while documenting their reasoning steps, with those steps and success rates then compared against model outputs on the same items.
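
A minimal sketch of the comparison such a study implies: given per-item correctness for experts and for a model on the same sampled items, estimate the expert-model gap with a paired bootstrap. All names here are illustrative; the paper does not describe this procedure.

import random
from statistics import mean

def success_rate_gap(item_ids: list, expert_correct: dict, model_correct: dict,
                     n_boot: int = 1000, seed: int = 0):
    # item_ids: items solved by both experts and the model
    # expert_correct / model_correct: dicts mapping item id -> bool
    rng = random.Random(seed)
    gap = (mean(expert_correct[i] for i in item_ids)
           - mean(model_correct[i] for i in item_ids))
    # Paired bootstrap over items for a rough 95% interval on the gap.
    draws = []
    for _ in range(n_boot):
        sample = [rng.choice(item_ids) for _ in item_ids]
        draws.append(mean(expert_correct[i] for i in sample)
                     - mean(model_correct[i] for i in sample))
    draws.sort()
    return gap, (draws[int(0.025 * n_boot)], draws[int(0.975 * n_boot)])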

Figures

Figures reproduced from arXiv:2605.10187 by Dongze Hao, Jiatong Ma, Jing Liu, Longteng Guo, Pengkang Huo, Tongtian Yue, Xuanxu Lin, and Yuchen Liu (Institute of Automation, Chinese Academy of Sciences (CASIA); School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS); OPPO AI Center).

Figure 1: Sampled SciVQR examples from each subject. To solve the questions, the models need strong multi…
Figure 2: Domain diversity in SciVQR. Each color corresponds to one subject: mathematics, physics, chemistry, …
Figure 3: Reasoning error case of GPT-4o. GPT-4o: To determine the formula of the mineral, we need to count the number of copper (Cu) and oxygen (O) atoms per unit cell. 1. Oxygen Atoms (White): Corner Atoms: There are 8 corners in the cube, and each corner is shared by 8 adjacent cubes in a 3D lattice. Therefore, each corner atom contributes 1/8 of an atom to the unit cell. Total contribution from corners: 8 × …
Figure 4: Perception error case of GPT-4o.
Figure 5: Perception error case of GPT-4o. Error reason: the model lacks knowledge of standard reduction potentials and thus failed to perform a weighted average during the calculation. Instead, it simply added the standard electrode potentials of the two reactions directly, leading to an incorrect result. Question: What is the standard reduction potential of hydrogen selenate ion to form elemental selenium under ac…
Figure 6: Knowledge gap error case of GPT-4o.
Figure 7: Textual understanding error case of GPT-4o.
Figure 8: An example question from geography.
Figure 9: An example question from chemistry.
Figure 10: An example question from biology.
Figure 11: An example question from astronomy.
Figure 12: An example question from mathematics.
Figure 13: An example question from physics.
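
For readers following the arithmetic quoted in Figure 3: the "1/8 of an atom per corner" step reflects the standard crystallographic counting rule for shared lattice positions (textbook background, not material reproduced from the paper):

N_{\text{cell}} = \frac{N_{\text{corner}}}{8} + \frac{N_{\text{edge}}}{4} + \frac{N_{\text{face}}}{2} + N_{\text{body}},
\qquad \text{so } 8 \times \tfrac{1}{8} = 1 \text{ atom comes from the corners of a cubic cell.}
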
read the original abstract

Scientific reasoning is a key aspect of human intelligence, requiring the integration of multimodal inputs, domain expertise, and multi-step inference across various subjects. Existing benchmarks for multimodal large language models (MLLMs) often fail to capture the complexity and traceability of reasoning processes necessary for rigorous evaluation. To fill this gap, we introduce SciVQR, a multimodal benchmark covering 54 subfields in mathematics, physics, chemistry, geography, astronomy, and biology. SciVQR includes domain-specific visuals, such as equations, charts, and diagrams, and challenges models to combine visual comprehension with reasoning. The tasks range from basic factual recall to complex, multi-step inferences, with 46% including expert-authored solutions. SciVQR not only evaluates final answers but also examines the reasoning process, providing insights into how models reach their conclusions. Our evaluation of leading MLLMs, including both proprietary and open-source models, reveals significant limitations in handling complex multimodal reasoning tasks, underscoring the need for improved multi-step reasoning and better integration of interdisciplinary knowledge in advancing MLLMs toward true scientific intelligence. The dataset and evaluation code are publicly available at https://github.com/CASIA-IVA-Lab/SciVQR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces SciVQR, a multimodal benchmark spanning 54 subfields across mathematics, physics, chemistry, geography, astronomy, and biology. It incorporates domain-specific visuals (equations, charts, diagrams) and tasks ranging from factual recall to multi-step inference, with 46% featuring expert-authored solutions. The benchmark evaluates both final answers and reasoning processes in leading proprietary and open-source MLLMs, concluding that current models exhibit significant limitations in complex multimodal scientific reasoning and calling for advances in multi-step reasoning and interdisciplinary knowledge integration. The dataset and evaluation code are released publicly.

Significance. If the benchmark construction and evaluation protocols are shown to be reliable, SciVQR could provide a useful multidisciplinary testbed that extends beyond existing MLLM benchmarks by emphasizing traceable reasoning processes and visual integration across many domains. The public release of data and code supports reproducibility and community follow-up work.

major comments (3)
  1. [Abstract] The claim that the evaluation 'reveals significant limitations' in MLLMs is presented without any quantitative metrics, error breakdowns, or inter-annotator agreement statistics, making it impossible to assess whether the observed shortcomings are robust or merely artifacts of task selection.
  2. [Benchmark construction] The paper states that tasks 'accurately capture the complexity and traceability of real scientific reasoning processes' across 54 subfields, yet provides no details on expert validation procedures, pilot testing, or agreement scores; this directly affects the load-bearing assumption that the benchmark is a faithful proxy for scientific intelligence.
  3. [Evaluation] Without reported baselines, per-subfield performance tables, or process-tracing rubrics (e.g., how partial credit is assigned to reasoning steps), the assertion that models lack 'multi-step reasoning and better integration of interdisciplinary knowledge' cannot be verified or compared to prior work.
minor comments (2)
  1. [Abstract] Adding the total number of questions or examples would give readers immediate scale context.
  2. [Conclusion] The GitHub link is provided, but the manuscript does not specify the exact license or maintenance plan for the released dataset.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to incorporate additional details on metrics, validation, and evaluation protocols.

read point-by-point responses
  1. Referee: [Abstract] The claim that the evaluation 'reveals significant limitations' in MLLMs is presented without any quantitative metrics, error breakdowns, or inter-annotator agreement statistics, making it impossible to assess whether the observed shortcomings are robust or merely artifacts of task selection.

    Authors: We agree the abstract would be strengthened by quantitative support. The full manuscript reports model accuracies, error patterns, and task-type breakdowns in the evaluation section; we will add a concise summary of key metrics (e.g., average accuracy across models and representative error rates) to the abstract. For inter-annotator agreement, we will include a description of the expert review process used for the 46% expert-authored solutions and any consistency checks performed on the remainder. revision: yes

  2. Referee: [Benchmark construction] The paper states that tasks 'accurately capture the complexity and traceability of real scientific reasoning processes' across 54 subfields, yet provides no details on expert validation procedures, pilot testing, or agreement scores; this directly affects the load-bearing assumption that the benchmark is a faithful proxy for scientific intelligence.

    Authors: We will expand the benchmark construction section with explicit details on expert validation: each subfield task was reviewed by at least one domain specialist for scientific accuracy and reasoning traceability, followed by pilot testing on a small set of models and human subjects to calibrate difficulty. We will also report agreement scores for the expert-authored subset and describe the curation workflow that ensures coverage of real scientific processes. revision: yes

  3. Referee: [Evaluation] Without reported baselines, per-subfield performance tables, or process-tracing rubrics (e.g., how partial credit is assigned to reasoning steps), the assertion that models lack 'multi-step reasoning and better integration of interdisciplinary knowledge' cannot be verified or compared to prior work.

    Authors: We will augment the evaluation section with per-subfield performance tables, explicit baseline comparisons (including random guessing and human expert performance where measured), and a detailed scoring rubric that specifies how partial credit is awarded for intermediate reasoning steps. These additions will make the evidence for limitations in multi-step and interdisciplinary reasoning directly verifiable and comparable to prior benchmarks. revision: yes
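
As a concrete illustration of the kind of partial-credit rubric the response describes (a sketch with assumed weights, not the authors' actual scoring code):

def rubric_score(step_flags: list[bool], answer_correct: bool,
                 step_weight: float = 0.5, answer_weight: float = 0.5) -> float:
    # step_flags holds one grader judgment per expert-authored reference step:
    # True if the model's trace reproduces that step correctly.
    # The 0.5 / 0.5 split between process and final answer is an assumed
    # weighting chosen only for illustration.
    step_credit = sum(step_flags) / len(step_flags) if step_flags else 0.0
    return step_weight * step_credit + answer_weight * float(answer_correct)

# Example: 3 of 4 reference steps matched but the final answer is wrong -> 0.375
print(rubric_score([True, True, True, False], answer_correct=False))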

Circularity Check

0 steps flagged

No significant circularity: benchmark paper with no derivations or fitted predictions

full rationale

The paper introduces SciVQR, a new multimodal benchmark covering 54 subfields with domain visuals and expert solutions. It evaluates existing MLLMs on this benchmark and reports limitations in complex reasoning. No equations, parameters, or predictive derivations are present. The central claims rest on the benchmark construction and direct model assessments, which are self-contained contributions that do not reduce to self-citations, self-definitions, or fitted inputs presented as predictions. Standard benchmark practices apply, with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that existing benchmarks fail to capture reasoning complexity and that the new tasks do so better; no free parameters or invented entities are described.

axioms (1)
  • domain assumption: Scientific reasoning requires integration of multimodal inputs, domain expertise, and multi-step inference across subjects.
    Stated in the first sentence of the abstract as a key aspect of human intelligence that current benchmarks miss.

pith-pipeline@v0.9.0 · 5609 in / 1306 out tokens · 65593 ms · 2026-05-14T21:46:54.331063+00:00 · methodology

discussion (0)

