pith. machine review for the scientific record.

arxiv: 2605.10187 · v2 · submitted 2026-05-11 · 💻 cs.CV

Recognition: unknown

SciVQR: A Multidisciplinary Multimodal Benchmark for Advanced Scientific Reasoning Evaluation

Authors on Pith · no claims yet

Pith reviewed 2026-05-14 21:46 UTC · model grok-4.3

classification 💻 cs.CV
keywords SciVQR · multimodal benchmark · scientific reasoning · multimodal large language models · multi-step inference · visual comprehension · interdisciplinary knowledge · AI evaluation

The pith

SciVQR benchmark reveals leading multimodal models fall short on complex scientific reasoning across 54 subfields.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SciVQR as a multimodal benchmark designed to test how well AI models perform scientific reasoning. It spans 54 subfields in mathematics, physics, chemistry, geography, astronomy, and biology, using visuals such as equations, charts, and diagrams. Tasks range from basic factual recall to multi-step inferences, with expert solutions provided for 46 percent of them to allow evaluation of both final answers and the reasoning steps themselves. Testing on current proprietary and open-source models shows clear limitations in handling these demands. This matters because better benchmarks can guide the development of AI systems capable of genuine scientific work.
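
To make the shape of such a benchmark concrete, here is a minimal sketch of what a single item could look like as a data structure. The field names and example values are illustrative assumptions, not the actual SciVQR schema; the real format ships with the released dataset.

from dataclasses import dataclass
from typing import Optional

# Hypothetical layout of a SciVQR-style item. Field names are assumptions
# for illustration; consult https://github.com/CASIA-IVA-Lab/SciVQR for the
# actual schema.
@dataclass
class SciItem:
    question: str                               # natural-language prompt
    image_path: str                             # domain-specific visual (equation, chart, diagram)
    discipline: str                             # one of the six disciplines, e.g. "chemistry"
    subfield: str                               # one of the 54 subfields
    answer: str                                 # gold final answer
    solution_steps: Optional[list[str]] = None  # expert-authored steps (~46% of items)

example = SciItem(
    question="Which labelled region of the phase diagram corresponds to the liquid phase?",
    image_path="images/phase_diagram.png",      # placeholder path
    discipline="chemistry",
    subfield="physical chemistry",
    answer="Region B",                          # placeholder answer
    solution_steps=None,                        # this item has no expert-authored solution
)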

Core claim

SciVQR is a benchmark that covers 54 subfields across six scientific disciplines and pairs domain-specific visuals with tasks that require visual comprehension plus multi-step inference. It evaluates not only the correctness of answers but also the traceability of the reasoning process, with expert-authored solutions supplied for 46 percent of the items. When applied to leading multimodal large language models, the benchmark exposes significant shortcomings in complex multimodal scientific reasoning.

What carries the argument

The SciVQR benchmark, which supplies domain-specific visuals and tasks that demand both visual understanding and multi-step reasoning across 54 subfields.

If this is right

  • Models will require stronger multi-step reasoning mechanisms to reach high performance on the benchmark.
  • Effective integration of knowledge across different scientific disciplines will become necessary for success.
  • Evaluation methods must track reasoning processes in addition to final answers (see the sketch after this list).
  • Public release of the dataset and code enables direct testing and training improvements.
  • Progress toward scientific intelligence in multimodal models can be measured against this standard.
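
A minimal sketch of what step-aware scoring could look like, reusing the hypothetical SciItem layout above. The substring-based step matching is a deliberately crude stand-in; none of this is the paper's released evaluation code.

def exact_match(pred: str, gold: str) -> bool:
    # Naive final-answer check; a real grader would normalize units,
    # LaTeX markup, and numeric tolerance before comparing.
    return pred.strip().lower() == gold.strip().lower()

def step_coverage(pred_steps: list[str], gold_steps: list[str]) -> float:
    # Crude proxy for reasoning traceability: the fraction of expert-authored
    # steps that appear (as substrings) somewhere in the model's trace.
    # A serious rubric would use semantic matching or human grading instead.
    if not gold_steps:
        return float("nan")  # item has no expert solution to trace against
    hits = sum(any(g.lower() in p.lower() for p in pred_steps) for g in gold_steps)
    return hits / len(gold_steps)

def score_item(item: SciItem, pred_answer: str, pred_steps: list[str]) -> dict:
    # Report both views: was the final answer right, and how much of the
    # reference reasoning is visible in the model's own steps.
    return {
        "answer_correct": exact_match(pred_answer, item.answer),
        "step_coverage": step_coverage(pred_steps, item.solution_steps or []),
    }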

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models that perform well on SciVQR may show improved ability to support real research workflows.
  • The approach could be extended to create comparable benchmarks in applied domains such as engineering.
  • Limitations observed here suggest that simply increasing model size may not resolve the gaps without targeted reasoning enhancements.
  • Baseline human performance data on SciVQR would help quantify how far current models remain from expert level.

Load-bearing premise

The tasks, visuals, and expert solutions chosen for SciVQR accurately reflect the complexity and traceability of real scientific reasoning.

What would settle it

A study in which domain experts solve a sample of SciVQR tasks while documenting their reasoning steps, with those steps and success rates then compared against model outputs on the same items.
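
A minimal sketch of the comparison such a study implies: given per-item correctness for experts and for a model on the same sampled items, estimate the expert-model gap with a paired bootstrap. All names here are illustrative; the paper does not describe this procedure.

import random
from statistics import mean

def success_rate_gap(item_ids: list, expert_correct: dict, model_correct: dict,
                     n_boot: int = 1000, seed: int = 0):
    # item_ids: items solved by both experts and the model
    # expert_correct / model_correct: dicts mapping item id -> bool
    rng = random.Random(seed)
    gap = (mean(expert_correct[i] for i in item_ids)
           - mean(model_correct[i] for i in item_ids))
    # Paired bootstrap over items for a rough 95% interval on the gap.
    draws = []
    for _ in range(n_boot):
        sample = [rng.choice(item_ids) for _ in item_ids]
        draws.append(mean(expert_correct[i] for i in sample)
                     - mean(model_correct[i] for i in sample))
    draws.sort()
    return gap, (draws[int(0.025 * n_boot)], draws[int(0.975 * n_boot)])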

Figures

Figures reproduced from arXiv:2605.10187 by Dongze Hao, Jiatong Ma, Jing Liu, Longteng Guo, Pengkang Huo, Tongtian Yue, Xuanxu Lin, and Yuchen Liu (Institute of Automation, Chinese Academy of Sciences (CASIA); School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS); OPPO AI Center).

Figure 1: Sampled SciVQR examples from each subject. To solve the questions, the models need strong multi…
Figure 2: Domain diversity in SciVQR. Each color corresponds to one subject: mathematics, physics, chemistry, …
Figure 3: Reasoning error case of GPT-4o. GPT-4o: To determine the formula of the mineral, we need to count the number of copper (Cu) and oxygen (O) atoms per unit cell. 1. Oxygen Atoms (White): Corner Atoms: There are 8 corners in the cube, and each corner is shared by 8 adjacent cubes in a 3D lattice. Therefore, each corner atom contributes 1/8 of an atom to the unit cell. Total contribution from corners: 8 × …
Figure 4: Perception error case of GPT-4o.
Figure 5: Perception error case of GPT-4o. Error reason: the model lacks knowledge of standard reduction potentials and thus failed to perform a weighted average during the calculation. Instead, it simply added the standard electrode potentials of the two reactions directly, leading to an incorrect result. Question: What is the standard reduction potential of hydrogen selenate ion to form elemental selenium under ac…
Figure 6: Knowledge gap error case of GPT-4o.
Figure 7: Textual understanding error case of GPT-4o.
Figure 8: An example question from geography.
Figure 9: An example question from chemistry.
Figure 10: An example question from biology.
Figure 11: An example question from astronomy.
Figure 12: An example question from mathematics.
Figure 13: An example question from physics.
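
For readers following the arithmetic quoted in Figure 3: the "1/8 of an atom per corner" step reflects the standard crystallographic counting rule for shared lattice positions (textbook background, not material reproduced from the paper):

N_{\text{cell}} = \frac{N_{\text{corner}}}{8} + \frac{N_{\text{edge}}}{4} + \frac{N_{\text{face}}}{2} + N_{\text{body}},
\qquad \text{so } 8 \times \tfrac{1}{8} = 1 \text{ atom comes from the corners of a cubic cell.}
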
read the original abstract

Scientific reasoning is a key aspect of human intelligence, requiring the integration of multimodal inputs, domain expertise, and multi-step inference across various subjects. Existing benchmarks for multimodal large language models (MLLMs) often fail to capture the complexity and traceability of reasoning processes necessary for rigorous evaluation. To fill this gap, we introduce SciVQR, a multimodal benchmark covering 54 subfields in mathematics, physics, chemistry, geography, astronomy, and biology. SciVQR includes domain-specific visuals, such as equations, charts, and diagrams, and challenges models to combine visual comprehension with reasoning. The tasks range from basic factual recall to complex, multi-step inferences, with 46% including expert-authored solutions. SciVQR not only evaluates final answers but also examines the reasoning process, providing insights into how models reach their conclusions. Our evaluation of leading MLLMs, including both proprietary and open-source models, reveals significant limitations in handling complex multimodal reasoning tasks, underscoring the need for improved multi-step reasoning and better integration of interdisciplinary knowledge in advancing MLLMs toward true scientific intelligence. The dataset and evaluation code are publicly available at https://github.com/CASIA-IVA-Lab/SciVQR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces SciVQR, a multimodal benchmark spanning 54 subfields across mathematics, physics, chemistry, geography, astronomy, and biology. It incorporates domain-specific visuals (equations, charts, diagrams) and tasks ranging from factual recall to multi-step inference, with 46% featuring expert-authored solutions. The benchmark evaluates both final answers and reasoning processes in leading proprietary and open-source MLLMs, concluding that current models exhibit significant limitations in complex multimodal scientific reasoning and calling for advances in multi-step reasoning and interdisciplinary knowledge integration. The dataset and evaluation code are released publicly.

Significance. If the benchmark construction and evaluation protocols are shown to be reliable, SciVQR could provide a useful multidisciplinary testbed that extends beyond existing MLLM benchmarks by emphasizing traceable reasoning processes and visual integration across many domains. The public release of data and code supports reproducibility and community follow-up work.

major comments (3)
  1. [Abstract] The claim that the evaluation 'reveals significant limitations' in MLLMs is presented without any quantitative metrics, error breakdowns, or inter-annotator agreement statistics, making it impossible to assess whether the observed shortcomings are robust or merely artifacts of task selection.
  2. [Benchmark construction] The paper states that tasks 'accurately capture the complexity and traceability of real scientific reasoning processes' across 54 subfields, yet provides no details on expert validation procedures, pilot testing, or agreement scores; this directly affects the load-bearing assumption that the benchmark is a faithful proxy for scientific intelligence.
  3. [Evaluation] Without reported baselines, per-subfield performance tables, or process-tracing rubrics (e.g., how partial credit is assigned to reasoning steps), the assertion that models lack 'multi-step reasoning and better integration of interdisciplinary knowledge' cannot be verified or compared to prior work.
minor comments (2)
  1. [Abstract] Adding the total number of questions or examples would give readers immediate scale context.
  2. [Conclusion] The GitHub link is provided, but the manuscript does not specify the exact license or maintenance plan for the released dataset.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to incorporate additional details on metrics, validation, and evaluation protocols.

read point-by-point responses
  1. Referee: [Abstract] The claim that the evaluation 'reveals significant limitations' in MLLMs is presented without any quantitative metrics, error breakdowns, or inter-annotator agreement statistics, making it impossible to assess whether the observed shortcomings are robust or merely artifacts of task selection.

    Authors: We agree the abstract would be strengthened by quantitative support. The full manuscript reports model accuracies, error patterns, and task-type breakdowns in the evaluation section; we will add a concise summary of key metrics (e.g., average accuracy across models and representative error rates) to the abstract. For inter-annotator agreement, we will include a description of the expert review process used for the 46% expert-authored solutions and any consistency checks performed on the remainder. revision: yes

  2. Referee: [Benchmark construction] The paper states that tasks 'accurately capture the complexity and traceability of real scientific reasoning processes' across 54 subfields, yet provides no details on expert validation procedures, pilot testing, or agreement scores; this directly affects the load-bearing assumption that the benchmark is a faithful proxy for scientific intelligence.

    Authors: We will expand the benchmark construction section with explicit details on expert validation: each subfield task was reviewed by at least one domain specialist for scientific accuracy and reasoning traceability, followed by pilot testing on a small set of models and human subjects to calibrate difficulty. We will also report agreement scores for the expert-authored subset and describe the curation workflow that ensures coverage of real scientific processes. revision: yes

  3. Referee: [Evaluation] Without reported baselines, per-subfield performance tables, or process-tracing rubrics (e.g., how partial credit is assigned to reasoning steps), the assertion that models lack 'multi-step reasoning and better integration of interdisciplinary knowledge' cannot be verified or compared to prior work.

    Authors: We will augment the evaluation section with per-subfield performance tables, explicit baseline comparisons (including random guessing and human expert performance where measured), and a detailed scoring rubric that specifies how partial credit is awarded for intermediate reasoning steps. These additions will make the evidence for limitations in multi-step and interdisciplinary reasoning directly verifiable and comparable to prior benchmarks. revision: yes
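
As a concrete illustration of the kind of partial-credit rubric the response describes (a sketch with assumed weights, not the authors' actual scoring code):

def rubric_score(step_flags: list[bool], answer_correct: bool,
                 step_weight: float = 0.5, answer_weight: float = 0.5) -> float:
    # step_flags holds one grader judgment per expert-authored reference step:
    # True if the model's trace reproduces that step correctly.
    # The 0.5 / 0.5 split between process and final answer is an assumed
    # weighting chosen only for illustration.
    step_credit = sum(step_flags) / len(step_flags) if step_flags else 0.0
    return step_weight * step_credit + answer_weight * float(answer_correct)

# Example: 3 of 4 reference steps matched but the final answer is wrong -> 0.375
print(rubric_score([True, True, True, False], answer_correct=False))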

Circularity Check

0 steps flagged

No significant circularity: benchmark paper with no derivations or fitted predictions

full rationale

The paper introduces SciVQR, a new multimodal benchmark covering 54 subfields with domain visuals and expert solutions. It evaluates existing MLLMs on this benchmark and reports limitations in complex reasoning. No equations, parameters, or predictive derivations are present. The central claims rest on the benchmark construction and direct model assessments, which are self-contained contributions that do not reduce to self-citations, self-definitions, or fitted inputs presented as predictions. Standard benchmark practices apply, with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that existing benchmarks fail to capture reasoning complexity and that the new tasks do so better; no free parameters or invented entities are described.

axioms (1)
  • domain assumption: Scientific reasoning requires integration of multimodal inputs, domain expertise, and multi-step inference across subjects.
    Stated in the first sentence of the abstract as a key aspect of human intelligence that current benchmarks miss.

pith-pipeline@v0.9.0 · 5609 in / 1306 out tokens · 65593 ms · 2026-05-14T21:46:54.331063+00:00 · methodology

discussion (0)

