H-GRPO: Permutation-Invariant Reinforcement Learning for Grounded Visual Reasoning
Pith reviewed 2026-06-30 06:14 UTC · model grok-4.3
The pith
Vision-language models produce final answers as logical consequences of verified visual facts by decomposing queries into atomic sub-questions each paired with a localized bounding box.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By forcing decomposition of a query into atomic sub-questions that each demand an explicit sub-answer and a localized evidence bounding box, the model constructs a structured reasoning path in which the final answer emerges as a logical consequence of visually grounded facts.
What carries the argument
De-compositional Evidence Grounding, a process that decomposes global queries into atomic sub-questions each requiring a sub-answer and a localized bounding box to ground the step in the image.
If this is right
- The final answer is derived directly from verified visual facts instead of statistical guesses.
- Each intermediate reasoning step can be inspected by checking the supplied bounding box and sub-answer.
- Hallucinations are reduced because every claim in the chain must be tied to an explicit image region.
- The method applies to tasks requiring multi-step visual deduction such as container-liquid or environmental-context problems.
Where Pith is reading between the lines
- The added requirement for bounding-box output may increase the cost of data annotation or model supervision beyond standard VLM training.
- The same decomposition strategy could be tested on non-visual reasoning tasks if suitable localization signals can be defined.
- If the sub-question sequence is fixed in advance rather than learned, the approach may limit the model's ability to discover novel reasoning orders.
Load-bearing premise
Training or prompting the model to generate accurate atomic sub-answers and correct bounding boxes for every sub-question will not introduce new errors or demand prohibitive extra supervision, and the decomposition will improve rather than hurt final accuracy.
What would settle it
An experiment showing that models trained to output the required sub-answers and bounding boxes achieve equal or lower accuracy on visual reasoning benchmarks than standard VLMs without the decomposition step.
Figures
read the original abstract
Vision-Language Models (VLMs) often achieve high performance on benchmarks while remaining "black boxes", yet they remain prone to hallucination or rely on superficial shortcuts. In this work, we propose a framework designed to enhance both performance and interpretability through De-compositional Evidence Grounding. Unlike monolithic inference approaches, our approach forces the model to decompose a global query into a sequence of atomic sub-questions, each requiring an explicit sub-answer and critically a localized evidence bounding box. By grounding intermediate logical steps (e.g. identifying a container, analyzing liquid properties, and assessing environmental context) in specific visual regions, we construct a structured reasoning path that mirrors human-like deduction. This allows the final answer to emerge as a logical consequence of verified visual facts rather than a statistical guess.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes H-GRPO, described as a permutation-invariant reinforcement learning method for grounded visual reasoning in VLMs. It introduces De-compositional Evidence Grounding, which decomposes a global query into atomic sub-questions each requiring an explicit sub-answer and a localized evidence bounding box, so that the final answer emerges as a logical consequence of verified visual facts rather than a statistical guess.
Significance. If the claimed mechanism can be shown to enforce accurate sub-answers and boxes without prohibitive supervision or error propagation, the approach would offer a concrete route to interpretable, hallucination-resistant reasoning in VLMs.
major comments (3)
- [Abstract] Abstract: the title centers on H-GRPO and permutation-invariant RL, yet the abstract contains no reference to reinforcement learning, any loss function, policy, or permutation-invariance mechanism; the central claim therefore rests on an unshown training procedure.
- [Abstract] Abstract: no equations, datasets, training details, results, or ablation studies are supplied, so the assertion that grounding intermediate steps produces 'verified visual facts' whose logical combination yields the final answer cannot be checked against any empirical or formal evidence.
- [Abstract] Abstract: the decomposition into atomic sub-questions with bounding boxes is presented as improving accuracy and interpretability, but no argument or result addresses whether faulty sub-answers or boxes would propagate errors and degrade end-task performance relative to monolithic inference.
minor comments (1)
- The abstract and title are inconsistent in their description of the proposed method.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on the abstract. The manuscript introduces H-GRPO as the underlying permutation-invariant RL method for training the De-compositional Evidence Grounding framework, but we agree the abstract could more explicitly connect these elements. We respond point-by-point below and will revise the abstract in the next version.
read point-by-point responses
-
Referee: [Abstract] Abstract: the title centers on H-GRPO and permutation-invariant RL, yet the abstract contains no reference to reinforcement learning, any loss function, policy, or permutation-invariance mechanism; the central claim therefore rests on an unshown training procedure.
Authors: The abstract prioritizes the high-level motivation and the De-compositional Evidence Grounding mechanism. The full manuscript defines H-GRPO, including the permutation-invariant policy gradient objective, the reward formulation that enforces grounding, and the training procedure. We will revise the abstract to include a concise reference to the H-GRPO RL framework and its permutation-invariance property. revision: yes
-
Referee: [Abstract] Abstract: no equations, datasets, training details, results, or ablation studies are supplied, so the assertion that grounding intermediate steps produces 'verified visual facts' whose logical combination yields the final answer cannot be checked against any empirical or formal evidence.
Authors: We accept that the abstract is too high-level to allow verification of the claims. The full paper contains the H-GRPO loss equations, dataset descriptions, training hyperparameters, main results, and ablations that quantify the contribution of the decomposition and grounding steps. We will expand the abstract to report key quantitative findings and mention the training setup. revision: yes
-
Referee: [Abstract] Abstract: the decomposition into atomic sub-questions with bounding boxes is presented as improving accuracy and interpretability, but no argument or result addresses whether faulty sub-answers or boxes would propagate errors and degrade end-task performance relative to monolithic inference.
Authors: This observation is correct; the current abstract (and the manuscript) does not explicitly analyze error propagation from incorrect sub-answers or boxes. We will add a dedicated paragraph in the revised manuscript that discusses this risk, presents any available robustness results, and compares end-to-end performance against monolithic baselines under controlled error injection. revision: yes
Circularity Check
No circularity: abstract and claims contain no derivations, fits, or self-citation chains
full rationale
The provided abstract and description present a high-level framework for query decomposition into sub-questions with bounding boxes, but contain no equations, parameters, loss functions, or RL formulations. No load-bearing step reduces to a fit, self-definition, or self-citation. The central claim about logical consequence from verified facts is presented as a design goal without any mathematical reduction to inputs. This is the common case of a self-contained descriptive proposal with no detectable circularity from the given text.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Don’t just assume; look and answer: Overcoming priors for visual question answering
Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. Don’t just assume; look and answer: Overcoming priors for visual question answering. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4971–4980, 2018
2018
-
[2]
Neural module networks
Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 39–48, 2016
2016
-
[3]
Vqa: Visual question answering
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. InProceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015. 11
2015
-
[4]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 2...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[5]
SAM 3: Segment Anything with Concepts
Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[6]
Are we on the right way for evaluating large vision-language models?Advances in Neural Information Processing Systems, 37:27056–27087, 2024
Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models?Advances in Neural Information Processing Systems, 37:27056–27087, 2024
2024
-
[7]
Beyond question-based biases: Assessing multimodal shortcut learning in visual question answering
Corentin Dancette, Remi Cadene, Damien Teney, and Matthieu Cord. Beyond question-based biases: Assessing multimodal shortcut learning in visual question answering. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 1574–1583, 2021
2021
-
[8]
Gemini 3 flash: High-efficiency agentic multimodal understanding
Google DeepMind. Gemini 3 flash: High-efficiency agentic multimodal understanding. https: //ai.google.dev/gemini-api/docs/gemini-3, 2025. Accessed: May 2026
2025
-
[9]
Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025
2025
-
[10]
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Xu Tang, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[11]
Hudson and Christopher D
Drew A. Hudson and Christopher D. Manning. Compositional attention networks for machine reasoning. InInternational Conference on Learning Representations, 2018
2018
-
[12]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[13]
Raven progressive matrices
John and Jean Raven. Raven progressive matrices. InHandbook of nonverbal assessment, pages 223–237. Springer, 2003
2003
-
[14]
Clevr: A diagnostic dataset for compositional language and elementary visual reasoning
Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2901–2910, 2017
2017
-
[15]
Visual genome: Connecting language and vision using crowdsourced dense image annotations.International journal of computer vision, 123(1):32–73, 2017
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations.International journal of computer vision, 123(1):32–73, 2017
2017
-
[16]
Imore: Implicit program-guided reasoning for human motion q&a
Chen Li, Chinthani Sugandhika, Yeo Keat Ee, Eric Peh, Hao Zhang, Hong Yang, Deepu Rajan, and Basura Fernando. Imore: Implicit program-guided reasoning for human motion q&a. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12987–12996, 2025. 12
2025
-
[17]
Vision-sr1: Self-rewarding vision-language model via reasoning decomposition and multi-reward policy optimization
Zongxia Li, Wenhao Yu, Chengsong Huang, Zhenwen Liang, Rui Liu, Fuxiao Liu, Jingxi Chen, Dian Yu, Jordan Lee Boyd-Graber, Haitao Mi, et al. Vision-sr1: Self-rewarding vision-language model via reasoning decomposition and multi-reward policy optimization. InThe F ourteenth International Conference on Learning Representations, 2026
2026
-
[18]
Visual-rft: Visual reinforcement fine-tuning
Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2034–2044, 2025
2034
-
[19]
Learn to explain: Multimodal reasoning via thought chains for science question answering.Advances in neural information processing systems, 35:2507–2521, 2022
Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering.Advances in neural information processing systems, 35:2507–2521, 2022
2022
-
[20]
Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts
Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. InInternational Conference on Learning Representations, volume 2024, pages 23439–23554, 2024
2024
-
[21]
A computational investigation into the human representation and processing of visual information.WH San Francisco: Freeman and Company, San Francisco, 1(1):4, 1982
D Man and A Vision. A computational investigation into the human representation and processing of visual information.WH San Francisco: Freeman and Company, San Francisco, 1(1):4, 1982
1982
-
[22]
SmolVLM: Redefining small and efficient multimodal models
Andrés Marafioti, Orr Zohar, Miquel Farré, Merve Noyan, Elie Bakouch, Pedro Cuenca, Cyril Zakka, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, Vaibhav Srivastav, Joshua Lochner, Hugo Larcher, Mathieu Morlon, Lewis Tunstall, Leandro von Werra, and Thomas Wolf. Smolvlm: Redefining small and efficient multimodal models.arXiv preprint arXiv:2504.05299, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[23]
Ok-vqa: A visual question answering benchmark requiring external knowledge
Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. InProceedings of the IEEE/cvf conference on computer vision and pattern recognition, pages 3195–3204, 2019
2019
-
[24]
Pkr-qa: A benchmark for procedural knowledge reasoning with knowledge module learning
Thanh-Son Nguyen, Hong Yang, Tzeh Yuan Neoh, Hao Zhang, Ee Yeo Keat, and Basura Fernando. Pkr-qa: A benchmark for procedural knowledge reasoning with knowledge module learning. AAAI, 2026
2026
-
[25]
Paritosh Parmar, Eric Peh, Ruirui Chen, Ting En Lam, Yuhan Chen, Elston Tan, and Basura Fernando. Causalchaos! dataset for comprehensive causal action question answering over longer causal chains grounded in dynamic visual scenes.Advances in Neural Information Processing Systems, 37:92769–92802, 2024
2024
-
[26]
Grounding multimodal large language models to the world
Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Grounding multimodal large language models to the world. 2024
2024
-
[27]
Ishaan Singh Rawal, Alexander Matyasko, Shantanu Jaiswal, Basura Fernando, and Cheston Tan. Dissecting multimodality in videoqa transformer models by impairing modality fusion.arXiv preprint arXiv:2306.08889, 2023
-
[28]
Sentence-bert: Sentence embeddings using siamese bert-networks
Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. InProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 3982–3992, 2019
2019
-
[29]
Adam Santoro, David Raposo, David G. T. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems, volume 30, 2017. 13
2017
-
[30]
Grounded reinforcement learning for visual reasoning
Gabriel Herbert Sarch, Snigdha Saha, Naitik Khandelwal, Ayush Jain, Michael J Tarr, Aviral Kumar, and Katerina Fragkiadaki. Grounded reinforcement learning for visual reasoning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025
2025
-
[31]
A-okvqa: A benchmark for visual question answering using world knowledge
Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. InEuropean conference on computer vision, pages 146–162. Springer, 2022
2022
-
[32]
Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning.Advances in Neural Information Processing Systems, 37:8612–8642, 2024
2024
-
[33]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[34]
Vlm-r1: A stable and generalizable r1-style large vision-language model.arXiv e-prints, pages arXiv–2504, 2025
Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generalizable r1-style large vision-language model.arXiv e-prints, pages arXiv–2504, 2025
2025
-
[35]
Language prior is not the only shortcut: A benchmark for shortcut learning in VQA
Qingyi Si, Fandong Meng, Mingyu Zheng, Zheng Lin, Yuanxin Liu, Peng Fu, Yanan Cao, Weiping Wang, and Jie Zhou. Language prior is not the only shortcut: A benchmark for shortcut learning in VQA. InFindings of the Association for Computational Linguistics: EMNLP 2022, 2022
2022
-
[36]
Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics
Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, and Stan Birchfield. Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15768–15780, 2025
2025
-
[37]
Aligning large multimodal models with factually augmented rlhf
Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liangyan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented rlhf. InFindings of the Association for Computational Linguistics: ACL 2024, pages 13088–13110, 2024
2024
-
[38]
Reason-rft: Reinforcement fine-tuning for visual reasoning of vision language models
Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Xiansheng Chen, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-rft: Reinforcement fine-tuning for visual reasoning of vision language models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025
2025
-
[39]
Gemini: A Family of Highly Capable Multimodal Models
Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[40]
Gemini Robotics: Bringing AI into the Physical World
Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montser- rat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[41]
Qwen Team. Qwen3.5-omni technical report, 2026. URL https://arxiv.org/abs/2604.15804
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[42]
Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning
Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. InThe Thirty- ninth Annual Conference on Neural Information Processing Systems, 2025
2025
-
[43]
Procedures as a representation for data in a computer program for understanding natural language
Terry Winograd. Procedures as a representation for data in a computer program for understanding natural language. Technical report, 1971. 14
1971
-
[44]
Learning structural descriptions from examples
Patrick H Winston. Learning structural descriptions from examples. 1970
1970
-
[45]
Star: A benchmark for situated reasoning in real-world videos.ArXiv, abs/2405.09711, 2024
Bo Wu, Shoubin Yu, Zhenfang Chen, Joshua B Tenenbaum, and Chuang Gan. Star: A benchmark for situated reasoning in real-world videos.arXiv preprint arXiv:2405.09711, 2024
-
[46]
Realworldqa: A benchmark for real-world spatial understanding
xAI. Realworldqa: A benchmark for real-world spatial understanding. https://huggingface. co/datasets/visheratin/realworldqa, 2024. Benchmark released with the Grok-1.5 Vision preview. Accessed: 2026-05-22
2024
-
[47]
Next-qa: Next phase of question- answering to explaining temporal actions
Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question- answering to explaining temporal actions. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9777–9786, 2021
2021
-
[48]
Neural- symbolic vqa: Disentangling reasoning from vision and language understanding.Advances in neural information processing systems, 31, 2018
Kexin Yi, Jiajun Wu, Chuang Gan, Antonio Torralba, Pushmeet Kohli, and Josh Tenenbaum. Neural- symbolic vqa: Disentangling reasoning from vision and language understanding.Advances in neural information processing systems, 31, 2018
2018
-
[49]
Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback
Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun, et al. Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13807–13816, 2024
2024
-
[50]
Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi
Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9556–9567, 2024
2024
-
[51]
Raven: A dataset for relational and analogical visual reasoning
Chi Zhang, Feng Gao, Baoxiong Jia, Yixin Zhu, and Song-Chun Zhu. Raven: A dataset for relational and analogical visual reasoning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5317–5327, 2019
2019
-
[52]
Hao Zhang, Chen Li, and Basura Fernando. Mitigating easy option bias in multiple-choice question answering.arXiv preprint arXiv:2508.13428, 2025
-
[53]
R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization
Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, and Dacheng Tao. R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1859–1869, October 2025
2025
-
[54]
Physreason: A comprehensive benchmark towards physics-based reasoning
Xinyu Zhang, Yuxuan Dong, Yanrui Wu, Jiaxing Huang, Chengyou Jia, Basura Fernando, Mike Zheng Shou, Lingling Zhang, and Jun Liu. Physreason: A comprehensive benchmark towards physics-based reasoning. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 16593–16615, 2025
2025
-
[55]
Multimodal chain-of-thought reasoning in language models.Transactions on Machine Learning Research, 2024, 2024
Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models.Transactions on Machine Learning Research, 2024, 2024
2024
-
[56]
Visual7w: Grounded question answering in images
Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei. Visual7w: Grounded question answering in images. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4995–5004, 2016. 15
2016
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.