H-GRPO: Permutation-Invariant Reinforcement Learning for Grounded Visual Reasoning

Basura Fernando; Debaditya Roy; Eric Peh

arxiv: 2606.29915 · v1 · pith:CJHWHTBLnew · submitted 2026-06-29 · 💻 cs.CV

H-GRPO: Permutation-Invariant Reinforcement Learning for Grounded Visual Reasoning

Eric Peh , Debaditya Roy , Basura Fernando This is my paper

Pith reviewed 2026-06-30 06:14 UTC · model grok-4.3

classification 💻 cs.CV

keywords Vision-Language ModelsGrounded Visual ReasoningDe-compositional Evidence GroundingInterpretabilityHallucination ReductionAtomic Sub-questionsBounding Box Grounding

0 comments

The pith

Vision-language models produce final answers as logical consequences of verified visual facts by decomposing queries into atomic sub-questions each paired with a localized bounding box.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces De-compositional Evidence Grounding to address hallucination and shortcut reliance in VLMs. The approach requires the model to break a global query into a sequence of atomic sub-questions, supplying both a sub-answer and an explicit bounding box for each. Grounding each intermediate step in a specific image region creates a chain of verified facts. The final answer then follows directly from those facts rather than from statistical patterns alone. This structure is intended to improve both accuracy and the ability to inspect the reasoning process.

Core claim

By forcing decomposition of a query into atomic sub-questions that each demand an explicit sub-answer and a localized evidence bounding box, the model constructs a structured reasoning path in which the final answer emerges as a logical consequence of visually grounded facts.

What carries the argument

De-compositional Evidence Grounding, a process that decomposes global queries into atomic sub-questions each requiring a sub-answer and a localized bounding box to ground the step in the image.

If this is right

The final answer is derived directly from verified visual facts instead of statistical guesses.
Each intermediate reasoning step can be inspected by checking the supplied bounding box and sub-answer.
Hallucinations are reduced because every claim in the chain must be tied to an explicit image region.
The method applies to tasks requiring multi-step visual deduction such as container-liquid or environmental-context problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The added requirement for bounding-box output may increase the cost of data annotation or model supervision beyond standard VLM training.
The same decomposition strategy could be tested on non-visual reasoning tasks if suitable localization signals can be defined.
If the sub-question sequence is fixed in advance rather than learned, the approach may limit the model's ability to discover novel reasoning orders.

Load-bearing premise

Training or prompting the model to generate accurate atomic sub-answers and correct bounding boxes for every sub-question will not introduce new errors or demand prohibitive extra supervision, and the decomposition will improve rather than hurt final accuracy.

What would settle it

An experiment showing that models trained to output the required sub-answers and bounding boxes achieve equal or lower accuracy on visual reasoning benchmarks than standard VLMs without the decomposition step.

Figures

Figures reproduced from arXiv: 2606.29915 by Basura Fernando, Debaditya Roy, Eric Peh.

**Figure 2.** Figure 2: Qualitative examples of grounded reasoning decomposition. For each image-question pair, [PITH_FULL_IMAGE:figures/full_fig_p010_2.png] view at source ↗

read the original abstract

Vision-Language Models (VLMs) often achieve high performance on benchmarks while remaining "black boxes", yet they remain prone to hallucination or rely on superficial shortcuts. In this work, we propose a framework designed to enhance both performance and interpretability through De-compositional Evidence Grounding. Unlike monolithic inference approaches, our approach forces the model to decompose a global query into a sequence of atomic sub-questions, each requiring an explicit sub-answer and critically a localized evidence bounding box. By grounding intermediate logical steps (e.g. identifying a container, analyzing liquid properties, and assessing environmental context) in specific visual regions, we construct a structured reasoning path that mirrors human-like deduction. This allows the final answer to emerge as a logical consequence of verified visual facts rather than a statistical guess.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Abstract omits the H-GRPO and permutation-invariant RL from the title and supplies zero methods, data, or results to back the grounding claims.

read the letter

The first thing to know is that the abstract never mentions H-GRPO or permutation-invariant RL even though those terms are in the title. The text instead describes a decomposition approach where a query is broken into atomic sub-questions, each paired with a sub-answer and a bounding box. The goal is to make VLM reasoning traceable so the final output follows from verified visual facts rather than guesses.

That decomposition idea is straightforward and addresses a real issue with hallucinations and shortcut learning. Requiring explicit localized evidence for each step could in principle let a reader check the chain. The examples given (container, liquid properties, context) show the intended structure.

The problems start with the complete absence of any supporting material. No equations, no RL objective, no training procedure, no datasets, and no numbers appear. The title promises a specific RL method, yet nothing explains how permutation invariance is enforced or how the model is rewarded for correct boxes and sub-answers. Without that, the central claim that the final answer emerges as a logical consequence cannot be evaluated. Faulty intermediates would simply propagate, and there is no evidence the extra supervision improves rather than hurts end accuracy.

This is an early proposal aimed at people already working on grounded VLM reasoning. It does not yet contain the formal or empirical content needed for a serious referee. The mismatch between title and abstract alone would require a revision before review.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes H-GRPO, described as a permutation-invariant reinforcement learning method for grounded visual reasoning in VLMs. It introduces De-compositional Evidence Grounding, which decomposes a global query into atomic sub-questions each requiring an explicit sub-answer and a localized evidence bounding box, so that the final answer emerges as a logical consequence of verified visual facts rather than a statistical guess.

Significance. If the claimed mechanism can be shown to enforce accurate sub-answers and boxes without prohibitive supervision or error propagation, the approach would offer a concrete route to interpretable, hallucination-resistant reasoning in VLMs.

major comments (3)

[Abstract] Abstract: the title centers on H-GRPO and permutation-invariant RL, yet the abstract contains no reference to reinforcement learning, any loss function, policy, or permutation-invariance mechanism; the central claim therefore rests on an unshown training procedure.
[Abstract] Abstract: no equations, datasets, training details, results, or ablation studies are supplied, so the assertion that grounding intermediate steps produces 'verified visual facts' whose logical combination yields the final answer cannot be checked against any empirical or formal evidence.
[Abstract] Abstract: the decomposition into atomic sub-questions with bounding boxes is presented as improving accuracy and interpretability, but no argument or result addresses whether faulty sub-answers or boxes would propagate errors and degrade end-task performance relative to monolithic inference.

minor comments (1)

The abstract and title are inconsistent in their description of the proposed method.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. The manuscript introduces H-GRPO as the underlying permutation-invariant RL method for training the De-compositional Evidence Grounding framework, but we agree the abstract could more explicitly connect these elements. We respond point-by-point below and will revise the abstract in the next version.

read point-by-point responses

Referee: [Abstract] Abstract: the title centers on H-GRPO and permutation-invariant RL, yet the abstract contains no reference to reinforcement learning, any loss function, policy, or permutation-invariance mechanism; the central claim therefore rests on an unshown training procedure.

Authors: The abstract prioritizes the high-level motivation and the De-compositional Evidence Grounding mechanism. The full manuscript defines H-GRPO, including the permutation-invariant policy gradient objective, the reward formulation that enforces grounding, and the training procedure. We will revise the abstract to include a concise reference to the H-GRPO RL framework and its permutation-invariance property. revision: yes
Referee: [Abstract] Abstract: no equations, datasets, training details, results, or ablation studies are supplied, so the assertion that grounding intermediate steps produces 'verified visual facts' whose logical combination yields the final answer cannot be checked against any empirical or formal evidence.

Authors: We accept that the abstract is too high-level to allow verification of the claims. The full paper contains the H-GRPO loss equations, dataset descriptions, training hyperparameters, main results, and ablations that quantify the contribution of the decomposition and grounding steps. We will expand the abstract to report key quantitative findings and mention the training setup. revision: yes
Referee: [Abstract] Abstract: the decomposition into atomic sub-questions with bounding boxes is presented as improving accuracy and interpretability, but no argument or result addresses whether faulty sub-answers or boxes would propagate errors and degrade end-task performance relative to monolithic inference.

Authors: This observation is correct; the current abstract (and the manuscript) does not explicitly analyze error propagation from incorrect sub-answers or boxes. We will add a dedicated paragraph in the revised manuscript that discusses this risk, presents any available robustness results, and compares end-to-end performance against monolithic baselines under controlled error injection. revision: yes

Circularity Check

0 steps flagged

No circularity: abstract and claims contain no derivations, fits, or self-citation chains

full rationale

The provided abstract and description present a high-level framework for query decomposition into sub-questions with bounding boxes, but contain no equations, parameters, loss functions, or RL formulations. No load-bearing step reduces to a fit, self-definition, or self-citation. The central claim about logical consequence from verified facts is presented as a design goal without any mathematical reduction to inputs. This is the common case of a self-contained descriptive proposal with no detectable circularity from the given text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.1-grok · 5661 in / 1015 out tokens · 33251 ms · 2026-06-30T06:14:28.223954+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

56 extracted references · 12 canonical work pages · 9 internal anchors

[1]

Don’t just assume; look and answer: Overcoming priors for visual question answering

Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. Don’t just assume; look and answer: Overcoming priors for visual question answering. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4971–4980, 2018

2018
[2]

Neural module networks

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 39–48, 2016

2016
[3]

Vqa: Visual question answering

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. InProceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015. 11

2015
[4]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 2...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

SAM 3: Segment Anything with Concepts

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Are we on the right way for evaluating large vision-language models?Advances in Neural Information Processing Systems, 37:27056–27087, 2024

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models?Advances in Neural Information Processing Systems, 37:27056–27087, 2024

2024
[7]

Beyond question-based biases: Assessing multimodal shortcut learning in visual question answering

Corentin Dancette, Remi Cadene, Damien Teney, and Matthieu Cord. Beyond question-based biases: Assessing multimodal shortcut learning in visual question answering. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 1574–1583, 2021

2021
[8]

Gemini 3 flash: High-efficiency agentic multimodal understanding

Google DeepMind. Gemini 3 flash: High-efficiency agentic multimodal understanding. https: //ai.google.dev/gemini-api/docs/gemini-3, 2025. Accessed: May 2026

2025
[9]

Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

2025
[10]

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Xu Tang, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Hudson and Christopher D

Drew A. Hudson and Christopher D. Manning. Compositional attention networks for machine reasoning. InInternational Conference on Learning Representations, 2018

2018
[12]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

Raven progressive matrices

John and Jean Raven. Raven progressive matrices. InHandbook of nonverbal assessment, pages 223–237. Springer, 2003

2003
[14]

Clevr: A diagnostic dataset for compositional language and elementary visual reasoning

Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2901–2910, 2017

2017
[15]

Visual genome: Connecting language and vision using crowdsourced dense image annotations.International journal of computer vision, 123(1):32–73, 2017

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations.International journal of computer vision, 123(1):32–73, 2017

2017
[16]

Imore: Implicit program-guided reasoning for human motion q&a

Chen Li, Chinthani Sugandhika, Yeo Keat Ee, Eric Peh, Hao Zhang, Hong Yang, Deepu Rajan, and Basura Fernando. Imore: Implicit program-guided reasoning for human motion q&a. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12987–12996, 2025. 12

2025
[17]

Vision-sr1: Self-rewarding vision-language model via reasoning decomposition and multi-reward policy optimization

Zongxia Li, Wenhao Yu, Chengsong Huang, Zhenwen Liang, Rui Liu, Fuxiao Liu, Jingxi Chen, Dian Yu, Jordan Lee Boyd-Graber, Haitao Mi, et al. Vision-sr1: Self-rewarding vision-language model via reasoning decomposition and multi-reward policy optimization. InThe F ourteenth International Conference on Learning Representations, 2026

2026
[18]

Visual-rft: Visual reinforcement fine-tuning

Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2034–2044, 2025

2034
[19]

Learn to explain: Multimodal reasoning via thought chains for science question answering.Advances in neural information processing systems, 35:2507–2521, 2022

Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering.Advances in neural information processing systems, 35:2507–2521, 2022

2022
[20]

Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. InInternational Conference on Learning Representations, volume 2024, pages 23439–23554, 2024

2024
[21]

A computational investigation into the human representation and processing of visual information.WH San Francisco: Freeman and Company, San Francisco, 1(1):4, 1982

D Man and A Vision. A computational investigation into the human representation and processing of visual information.WH San Francisco: Freeman and Company, San Francisco, 1(1):4, 1982

1982
[22]

SmolVLM: Redefining small and efficient multimodal models

Andrés Marafioti, Orr Zohar, Miquel Farré, Merve Noyan, Elie Bakouch, Pedro Cuenca, Cyril Zakka, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, Vaibhav Srivastav, Joshua Lochner, Hugo Larcher, Mathieu Morlon, Lewis Tunstall, Leandro von Werra, and Thomas Wolf. Smolvlm: Redefining small and efficient multimodal models.arXiv preprint arXiv:2504.05299, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[23]

Ok-vqa: A visual question answering benchmark requiring external knowledge

Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. InProceedings of the IEEE/cvf conference on computer vision and pattern recognition, pages 3195–3204, 2019

2019
[24]

Pkr-qa: A benchmark for procedural knowledge reasoning with knowledge module learning

Thanh-Son Nguyen, Hong Yang, Tzeh Yuan Neoh, Hao Zhang, Ee Yeo Keat, and Basura Fernando. Pkr-qa: A benchmark for procedural knowledge reasoning with knowledge module learning. AAAI, 2026

2026
[25]

Paritosh Parmar, Eric Peh, Ruirui Chen, Ting En Lam, Yuhan Chen, Elston Tan, and Basura Fernando. Causalchaos! dataset for comprehensive causal action question answering over longer causal chains grounded in dynamic visual scenes.Advances in Neural Information Processing Systems, 37:92769–92802, 2024

2024
[26]

Grounding multimodal large language models to the world

Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Grounding multimodal large language models to the world. 2024

2024
[27]

Dissecting multimodality in videoqa transformer models by impairing modality fusion.arXiv preprint arXiv:2306.08889, 2023

Ishaan Singh Rawal, Alexander Matyasko, Shantanu Jaiswal, Basura Fernando, and Cheston Tan. Dissecting multimodality in videoqa transformer models by impairing modality fusion.arXiv preprint arXiv:2306.08889, 2023

work page arXiv 2023
[28]

Sentence-bert: Sentence embeddings using siamese bert-networks

Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. InProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 3982–3992, 2019

2019
[29]

Adam Santoro, David Raposo, David G. T. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems, volume 30, 2017. 13

2017
[30]

Grounded reinforcement learning for visual reasoning

Gabriel Herbert Sarch, Snigdha Saha, Naitik Khandelwal, Ayush Jain, Michael J Tarr, Aviral Kumar, and Katerina Fragkiadaki. Grounded reinforcement learning for visual reasoning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025
[31]

A-okvqa: A benchmark for visual question answering using world knowledge

Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. InEuropean conference on computer vision, pages 146–162. Springer, 2022

2022
[32]

Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning.Advances in Neural Information Processing Systems, 37:8612–8642, 2024

2024
[33]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[34]

Vlm-r1: A stable and generalizable r1-style large vision-language model.arXiv e-prints, pages arXiv–2504, 2025

Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generalizable r1-style large vision-language model.arXiv e-prints, pages arXiv–2504, 2025

2025
[35]

Language prior is not the only shortcut: A benchmark for shortcut learning in VQA

Qingyi Si, Fandong Meng, Mingyu Zheng, Zheng Lin, Yuanxin Liu, Peng Fu, Yanan Cao, Weiping Wang, and Jie Zhou. Language prior is not the only shortcut: A benchmark for shortcut learning in VQA. InFindings of the Association for Computational Linguistics: EMNLP 2022, 2022

2022
[36]

Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics

Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, and Stan Birchfield. Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15768–15780, 2025

2025
[37]

Aligning large multimodal models with factually augmented rlhf

Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liangyan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented rlhf. InFindings of the Association for Computational Linguistics: ACL 2024, pages 13088–13110, 2024

2024
[38]

Reason-rft: Reinforcement fine-tuning for visual reasoning of vision language models

Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Xiansheng Chen, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-rft: Reinforcement fine-tuning for visual reasoning of vision language models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025
[39]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[40]

Gemini Robotics: Bringing AI into the Physical World

Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montser- rat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

Qwen3.5-Omni Technical Report

Qwen Team. Qwen3.5-omni technical report, 2026. URL https://arxiv.org/abs/2604.15804

work page internal anchor Pith review Pith/arXiv arXiv 2026
[42]

Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning

Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. InThe Thirty- ninth Annual Conference on Neural Information Processing Systems, 2025

2025
[43]

Procedures as a representation for data in a computer program for understanding natural language

Terry Winograd. Procedures as a representation for data in a computer program for understanding natural language. Technical report, 1971. 14

1971
[44]

Learning structural descriptions from examples

Patrick H Winston. Learning structural descriptions from examples. 1970

1970
[45]

Star: A benchmark for situated reasoning in real-world videos.ArXiv, abs/2405.09711, 2024

Bo Wu, Shoubin Yu, Zhenfang Chen, Joshua B Tenenbaum, and Chuang Gan. Star: A benchmark for situated reasoning in real-world videos.arXiv preprint arXiv:2405.09711, 2024

work page arXiv 2024
[46]

Realworldqa: A benchmark for real-world spatial understanding

xAI. Realworldqa: A benchmark for real-world spatial understanding. https://huggingface. co/datasets/visheratin/realworldqa, 2024. Benchmark released with the Grok-1.5 Vision preview. Accessed: 2026-05-22

2024
[47]

Next-qa: Next phase of question- answering to explaining temporal actions

Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question- answering to explaining temporal actions. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9777–9786, 2021

2021
[48]

Neural- symbolic vqa: Disentangling reasoning from vision and language understanding.Advances in neural information processing systems, 31, 2018

Kexin Yi, Jiajun Wu, Chuang Gan, Antonio Torralba, Pushmeet Kohli, and Josh Tenenbaum. Neural- symbolic vqa: Disentangling reasoning from vision and language understanding.Advances in neural information processing systems, 31, 2018

2018
[49]

Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback

Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun, et al. Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13807–13816, 2024

2024
[50]

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9556–9567, 2024

2024
[51]

Raven: A dataset for relational and analogical visual reasoning

Chi Zhang, Feng Gao, Baoxiong Jia, Yixin Zhu, and Song-Chun Zhu. Raven: A dataset for relational and analogical visual reasoning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5317–5327, 2019

2019
[52]

Mitigating easy option bias in multiple-choice question answering.arXiv preprint arXiv:2508.13428, 2025

Hao Zhang, Chen Li, and Basura Fernando. Mitigating easy option bias in multiple-choice question answering.arXiv preprint arXiv:2508.13428, 2025

work page arXiv 2025
[53]

R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization

Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, and Dacheng Tao. R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1859–1869, October 2025

2025
[54]

Physreason: A comprehensive benchmark towards physics-based reasoning

Xinyu Zhang, Yuxuan Dong, Yanrui Wu, Jiaxing Huang, Chengyou Jia, Basura Fernando, Mike Zheng Shou, Lingling Zhang, and Jun Liu. Physreason: A comprehensive benchmark towards physics-based reasoning. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 16593–16615, 2025

2025
[55]

Multimodal chain-of-thought reasoning in language models.Transactions on Machine Learning Research, 2024, 2024

Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models.Transactions on Machine Learning Research, 2024, 2024

2024
[56]

Visual7w: Grounded question answering in images

Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei. Visual7w: Grounded question answering in images. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4995–5004, 2016. 15

2016

[1] [1]

Don’t just assume; look and answer: Overcoming priors for visual question answering

Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. Don’t just assume; look and answer: Overcoming priors for visual question answering. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4971–4980, 2018

2018

[2] [2]

Neural module networks

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 39–48, 2016

2016

[3] [3]

Vqa: Visual question answering

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. InProceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015. 11

2015

[4] [4]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 2...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

SAM 3: Segment Anything with Concepts

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

Are we on the right way for evaluating large vision-language models?Advances in Neural Information Processing Systems, 37:27056–27087, 2024

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models?Advances in Neural Information Processing Systems, 37:27056–27087, 2024

2024

[7] [7]

Beyond question-based biases: Assessing multimodal shortcut learning in visual question answering

Corentin Dancette, Remi Cadene, Damien Teney, and Matthieu Cord. Beyond question-based biases: Assessing multimodal shortcut learning in visual question answering. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 1574–1583, 2021

2021

[8] [8]

Gemini 3 flash: High-efficiency agentic multimodal understanding

Google DeepMind. Gemini 3 flash: High-efficiency agentic multimodal understanding. https: //ai.google.dev/gemini-api/docs/gemini-3, 2025. Accessed: May 2026

2025

[9] [9]

Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

2025

[10] [10]

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Xu Tang, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[11] [11]

Hudson and Christopher D

Drew A. Hudson and Christopher D. Manning. Compositional attention networks for machine reasoning. InInternational Conference on Learning Representations, 2018

2018

[12] [12]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[13] [13]

Raven progressive matrices

John and Jean Raven. Raven progressive matrices. InHandbook of nonverbal assessment, pages 223–237. Springer, 2003

2003

[14] [14]

Clevr: A diagnostic dataset for compositional language and elementary visual reasoning

Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2901–2910, 2017

2017

[15] [15]

Visual genome: Connecting language and vision using crowdsourced dense image annotations.International journal of computer vision, 123(1):32–73, 2017

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations.International journal of computer vision, 123(1):32–73, 2017

2017

[16] [16]

Imore: Implicit program-guided reasoning for human motion q&a

Chen Li, Chinthani Sugandhika, Yeo Keat Ee, Eric Peh, Hao Zhang, Hong Yang, Deepu Rajan, and Basura Fernando. Imore: Implicit program-guided reasoning for human motion q&a. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12987–12996, 2025. 12

2025

[17] [17]

Vision-sr1: Self-rewarding vision-language model via reasoning decomposition and multi-reward policy optimization

Zongxia Li, Wenhao Yu, Chengsong Huang, Zhenwen Liang, Rui Liu, Fuxiao Liu, Jingxi Chen, Dian Yu, Jordan Lee Boyd-Graber, Haitao Mi, et al. Vision-sr1: Self-rewarding vision-language model via reasoning decomposition and multi-reward policy optimization. InThe F ourteenth International Conference on Learning Representations, 2026

2026

[18] [18]

Visual-rft: Visual reinforcement fine-tuning

Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2034–2044, 2025

2034

[19] [19]

Learn to explain: Multimodal reasoning via thought chains for science question answering.Advances in neural information processing systems, 35:2507–2521, 2022

Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering.Advances in neural information processing systems, 35:2507–2521, 2022

2022

[20] [20]

Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. InInternational Conference on Learning Representations, volume 2024, pages 23439–23554, 2024

2024

[21] [21]

A computational investigation into the human representation and processing of visual information.WH San Francisco: Freeman and Company, San Francisco, 1(1):4, 1982

D Man and A Vision. A computational investigation into the human representation and processing of visual information.WH San Francisco: Freeman and Company, San Francisco, 1(1):4, 1982

1982

[22] [22]

SmolVLM: Redefining small and efficient multimodal models

Andrés Marafioti, Orr Zohar, Miquel Farré, Merve Noyan, Elie Bakouch, Pedro Cuenca, Cyril Zakka, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, Vaibhav Srivastav, Joshua Lochner, Hugo Larcher, Mathieu Morlon, Lewis Tunstall, Leandro von Werra, and Thomas Wolf. Smolvlm: Redefining small and efficient multimodal models.arXiv preprint arXiv:2504.05299, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[23] [23]

Ok-vqa: A visual question answering benchmark requiring external knowledge

Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. InProceedings of the IEEE/cvf conference on computer vision and pattern recognition, pages 3195–3204, 2019

2019

[24] [24]

Pkr-qa: A benchmark for procedural knowledge reasoning with knowledge module learning

Thanh-Son Nguyen, Hong Yang, Tzeh Yuan Neoh, Hao Zhang, Ee Yeo Keat, and Basura Fernando. Pkr-qa: A benchmark for procedural knowledge reasoning with knowledge module learning. AAAI, 2026

2026

[25] [25]

Paritosh Parmar, Eric Peh, Ruirui Chen, Ting En Lam, Yuhan Chen, Elston Tan, and Basura Fernando. Causalchaos! dataset for comprehensive causal action question answering over longer causal chains grounded in dynamic visual scenes.Advances in Neural Information Processing Systems, 37:92769–92802, 2024

2024

[26] [26]

Grounding multimodal large language models to the world

Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Grounding multimodal large language models to the world. 2024

2024

[27] [27]

Dissecting multimodality in videoqa transformer models by impairing modality fusion.arXiv preprint arXiv:2306.08889, 2023

Ishaan Singh Rawal, Alexander Matyasko, Shantanu Jaiswal, Basura Fernando, and Cheston Tan. Dissecting multimodality in videoqa transformer models by impairing modality fusion.arXiv preprint arXiv:2306.08889, 2023

work page arXiv 2023

[28] [28]

Sentence-bert: Sentence embeddings using siamese bert-networks

Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. InProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 3982–3992, 2019

2019

[29] [29]

Adam Santoro, David Raposo, David G. T. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems, volume 30, 2017. 13

2017

[30] [30]

Grounded reinforcement learning for visual reasoning

Gabriel Herbert Sarch, Snigdha Saha, Naitik Khandelwal, Ayush Jain, Michael J Tarr, Aviral Kumar, and Katerina Fragkiadaki. Grounded reinforcement learning for visual reasoning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025

[31] [31]

A-okvqa: A benchmark for visual question answering using world knowledge

Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. InEuropean conference on computer vision, pages 146–162. Springer, 2022

2022

[32] [32]

Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning.Advances in Neural Information Processing Systems, 37:8612–8642, 2024

2024

[33] [33]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[34] [34]

Vlm-r1: A stable and generalizable r1-style large vision-language model.arXiv e-prints, pages arXiv–2504, 2025

Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generalizable r1-style large vision-language model.arXiv e-prints, pages arXiv–2504, 2025

2025

[35] [35]

Language prior is not the only shortcut: A benchmark for shortcut learning in VQA

Qingyi Si, Fandong Meng, Mingyu Zheng, Zheng Lin, Yuanxin Liu, Peng Fu, Yanan Cao, Weiping Wang, and Jie Zhou. Language prior is not the only shortcut: A benchmark for shortcut learning in VQA. InFindings of the Association for Computational Linguistics: EMNLP 2022, 2022

2022

[36] [36]

Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics

Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, and Stan Birchfield. Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15768–15780, 2025

2025

[37] [37]

Aligning large multimodal models with factually augmented rlhf

Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liangyan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented rlhf. InFindings of the Association for Computational Linguistics: ACL 2024, pages 13088–13110, 2024

2024

[38] [38]

Reason-rft: Reinforcement fine-tuning for visual reasoning of vision language models

Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Xiansheng Chen, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-rft: Reinforcement fine-tuning for visual reasoning of vision language models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025

[39] [39]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[40] [40]

Gemini Robotics: Bringing AI into the Physical World

Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montser- rat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[41] [41]

Qwen3.5-Omni Technical Report

Qwen Team. Qwen3.5-omni technical report, 2026. URL https://arxiv.org/abs/2604.15804

work page internal anchor Pith review Pith/arXiv arXiv 2026

[42] [42]

Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning

Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. InThe Thirty- ninth Annual Conference on Neural Information Processing Systems, 2025

2025

[43] [43]

Procedures as a representation for data in a computer program for understanding natural language

Terry Winograd. Procedures as a representation for data in a computer program for understanding natural language. Technical report, 1971. 14

1971

[44] [44]

Learning structural descriptions from examples

Patrick H Winston. Learning structural descriptions from examples. 1970

1970

[45] [45]

Star: A benchmark for situated reasoning in real-world videos.ArXiv, abs/2405.09711, 2024

Bo Wu, Shoubin Yu, Zhenfang Chen, Joshua B Tenenbaum, and Chuang Gan. Star: A benchmark for situated reasoning in real-world videos.arXiv preprint arXiv:2405.09711, 2024

work page arXiv 2024

[46] [46]

Realworldqa: A benchmark for real-world spatial understanding

xAI. Realworldqa: A benchmark for real-world spatial understanding. https://huggingface. co/datasets/visheratin/realworldqa, 2024. Benchmark released with the Grok-1.5 Vision preview. Accessed: 2026-05-22

2024

[47] [47]

Next-qa: Next phase of question- answering to explaining temporal actions

Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question- answering to explaining temporal actions. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9777–9786, 2021

2021

[48] [48]

Neural- symbolic vqa: Disentangling reasoning from vision and language understanding.Advances in neural information processing systems, 31, 2018

Kexin Yi, Jiajun Wu, Chuang Gan, Antonio Torralba, Pushmeet Kohli, and Josh Tenenbaum. Neural- symbolic vqa: Disentangling reasoning from vision and language understanding.Advances in neural information processing systems, 31, 2018

2018

[49] [49]

Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback

Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun, et al. Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13807–13816, 2024

2024

[50] [50]

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9556–9567, 2024

2024

[51] [51]

Raven: A dataset for relational and analogical visual reasoning

Chi Zhang, Feng Gao, Baoxiong Jia, Yixin Zhu, and Song-Chun Zhu. Raven: A dataset for relational and analogical visual reasoning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5317–5327, 2019

2019

[52] [52]

Mitigating easy option bias in multiple-choice question answering.arXiv preprint arXiv:2508.13428, 2025

Hao Zhang, Chen Li, and Basura Fernando. Mitigating easy option bias in multiple-choice question answering.arXiv preprint arXiv:2508.13428, 2025

work page arXiv 2025

[53] [53]

R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization

Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, and Dacheng Tao. R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1859–1869, October 2025

2025

[54] [54]

Physreason: A comprehensive benchmark towards physics-based reasoning

Xinyu Zhang, Yuxuan Dong, Yanrui Wu, Jiaxing Huang, Chengyou Jia, Basura Fernando, Mike Zheng Shou, Lingling Zhang, and Jun Liu. Physreason: A comprehensive benchmark towards physics-based reasoning. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 16593–16615, 2025

2025

[55] [55]

Multimodal chain-of-thought reasoning in language models.Transactions on Machine Learning Research, 2024, 2024

Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models.Transactions on Machine Learning Research, 2024, 2024

2024

[56] [56]

Visual7w: Grounded question answering in images

Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei. Visual7w: Grounded question answering in images. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4995–5004, 2016. 15

2016