
arxiv: 2606.29915 · v1 · pith:CJHWHTBLnew · submitted 2026-06-29 · 💻 cs.CV

H-GRPO: Permutation-Invariant Reinforcement Learning for Grounded Visual Reasoning

Pith reviewed 2026-06-30 06:14 UTC · model grok-4.3

classification 💻 cs.CV
keywords Vision-Language Models · Grounded Visual Reasoning · De-compositional Evidence Grounding · Interpretability · Hallucination Reduction · Atomic Sub-questions · Bounding Box Grounding

The pith

Vision-language models produce final answers as logical consequences of verified visual facts by decomposing queries into atomic sub-questions each paired with a localized bounding box.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces De-compositional Evidence Grounding to address hallucination and shortcut reliance in VLMs. The approach requires the model to break a global query into a sequence of atomic sub-questions, supplying both a sub-answer and an explicit bounding box for each. Grounding each intermediate step in a specific image region creates a chain of verified facts. The final answer then follows directly from those facts rather than from statistical patterns alone. This structure is intended to improve both accuracy and the ability to inspect the reasoning process.

Core claim

By forcing decomposition of a query into atomic sub-questions that each demand an explicit sub-answer and a localized evidence bounding box, the model constructs a structured reasoning path in which the final answer emerges as a logical consequence of visually grounded facts.

What carries the argument

De-compositional Evidence Grounding, a process that decomposes global queries into atomic sub-questions each requiring a sub-answer and a localized bounding box to ground the step in the image.
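The output format described here can be pictured as a small record type. The sketch below is illustrative only: the class and field names, the `(x1, y1, x2, y2)` pixel convention, and the example question are assumptions, not the paper's schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical schema: names and the (x1, y1, x2, y2) pixel convention
# are assumptions for illustration, not taken from the paper.
BBox = Tuple[float, float, float, float]


@dataclass(frozen=True)
class GroundedStep:
    sub_question: str
    sub_answer: str
    bbox: BBox  # evidence region that is supposed to support this step


@dataclass(frozen=True)
class GroundedChain:
    question: str
    steps: List[GroundedStep]
    final_answer: str


def box_is_valid(bbox: BBox, width: int, height: int) -> bool:
    """True if the box has positive area and lies inside the image."""
    x1, y1, x2, y2 = bbox
    return 0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height


def chain_is_well_formed(chain: GroundedChain, width: int, height: int) -> bool:
    """Every step needs a non-empty sub-answer and a valid evidence box."""
    return bool(chain.steps) and all(
        bool(step.sub_answer.strip()) and box_is_valid(step.bbox, width, height)
        for step in chain.steps
    )


# Made-up example loosely following the container/liquid illustration.
chain = GroundedChain(
    question="Is the drink likely to spill?",
    steps=[
        GroundedStep("What is the container?", "a glass", (40, 60, 120, 200)),
        GroundedStep("What is in it?", "a clear liquid", (50, 90, 110, 190)),
    ],
    final_answer="no",
)
```

A structural check of this kind only confirms that each step carries a sub-answer and a plausible box; it says nothing about whether the box actually contains the claimed evidence.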

If this is right

  • The final answer is derived directly from verified visual facts instead of statistical guesses.
  • Each intermediate reasoning step can be inspected by checking the supplied bounding box and sub-answer.
  • Hallucinations are reduced because every claim in the chain must be tied to an explicit image region.
  • The method applies to tasks requiring multi-step visual deduction such as container-liquid or environmental-context problems.
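One common way to inspect a single step's bounding box is to compare it with a reference box using intersection-over-union (IoU). The sketch below assumes reference boxes exist and uses a 0.5 threshold; the paper's actual verification procedure is not described in the available text, so both are assumptions.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed convention


def iou(a: BBox, b: BBox) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def step_is_supported(pred_box: BBox, ref_box: BBox, threshold: float = 0.5) -> bool:
    """Hypothetical check: the predicted box overlaps the reference enough."""
    return iou(pred_box, ref_box) >= threshold
```

The threshold is a design choice: a stricter value rejects loosely placed boxes, at the cost of penalizing boxes that are correct but drawn more generously than the reference.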

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The added requirement for bounding-box output may increase the cost of data annotation or model supervision beyond standard VLM training.
  • The same decomposition strategy could be tested on non-visual reasoning tasks if suitable localization signals can be defined.
  • If the sub-question sequence is fixed in advance rather than learned, the approach may limit the model's ability to discover novel reasoning orders.

Load-bearing premise

Training or prompting the model to generate accurate atomic sub-answers and correct bounding boxes for every sub-question will not introduce new errors or demand prohibitive extra supervision, and the decomposition will improve rather than hurt final accuracy.
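A back-of-the-envelope calculation shows why this premise carries weight. The sketch assumes, purely for illustration, that the final answer is correct only if every step is correct and that step errors are independent; the source makes neither claim. Under those assumptions, three steps at 90% accuracy each give a chain accuracy of about 72.9%.

```python
from typing import Iterable


def chain_success_probability(step_accuracies: Iterable[float]) -> float:
    """Probability that all steps are correct, assuming independent errors.

    Illustrative model only: it assumes the final answer requires every
    step to be right, which the paper does not state.
    """
    prob = 1.0
    for p in step_accuracies:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each step accuracy must lie in [0, 1]")
        prob *= p
    return prob


three_step_chain = chain_success_probability([0.9, 0.9, 0.9])
```

If later steps can recover from earlier mistakes, or if errors are correlated, the real figure could be higher or lower than this product.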

What would settle it

An experiment showing that models trained to output the required sub-answers and bounding boxes achieve equal or lower accuracy on visual reasoning benchmarks than standard VLMs without the decomposition step.

Figures

Figures reproduced from arXiv: 2606.29915 by Basura Fernando, Debaditya Roy, Eric Peh.

Figure 1: Overview of our grounded reasoning. Given an image I and question...
Figure 2: Qualitative examples of grounded reasoning decomposition. For each image-question pair,...

Original abstract

Vision-Language Models (VLMs) often achieve high performance on benchmarks while remaining "black boxes", yet they remain prone to hallucination or rely on superficial shortcuts. In this work, we propose a framework designed to enhance both performance and interpretability through De-compositional Evidence Grounding. Unlike monolithic inference approaches, our approach forces the model to decompose a global query into a sequence of atomic sub-questions, each requiring an explicit sub-answer and critically a localized evidence bounding box. By grounding intermediate logical steps (e.g. identifying a container, analyzing liquid properties, and assessing environmental context) in specific visual regions, we construct a structured reasoning path that mirrors human-like deduction. This allows the final answer to emerge as a logical consequence of verified visual facts rather than a statistical guess.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes H-GRPO, described as a permutation-invariant reinforcement learning method for grounded visual reasoning in VLMs. It introduces De-compositional Evidence Grounding, which decomposes a global query into atomic sub-questions each requiring an explicit sub-answer and a localized evidence bounding box, so that the final answer emerges as a logical consequence of verified visual facts rather than a statistical guess.
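The available text specifies neither the H-GRPO objective nor how permutation invariance is obtained. For orientation only, the sketch below shows (a) the group-relative advantage normalization used in standard GRPO, where each sampled response's reward is centered and scaled by the statistics of its group, and (b) one hypothetical grounding reward that does not depend on the order in which boxes are emitted. The function names, the best-match IoU reward, and the use of the population standard deviation are assumptions, not the authors' method.

```python
import random
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed convention


def iou(a: BBox, b: BBox) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def set_grounding_reward(pred: Sequence[BBox], ref: Sequence[BBox]) -> float:
    """Hypothetical order-independent reward.

    For each reference box, take the best IoU over all predicted boxes,
    then average. Reordering `pred` cannot change the result.
    """
    if not pred or not ref:
        return 0.0
    return mean(max(iou(p, r) for p in pred) for r in ref)


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standard GRPO-style normalization within one group of samples."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this
    return [(r - mu) / (sigma + eps) for r in rewards]


ref_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
pred_boxes = [(21, 20, 30, 30), (0, 0, 10, 9)]
shuffled = list(pred_boxes)
random.Random(0).shuffle(shuffled)

# Same reward regardless of the order of the predicted boxes.
same_reward = set_grounding_reward(pred_boxes, ref_boxes) == set_grounding_reward(
    shuffled, ref_boxes
)
```

A best-match reward of this kind ignores which sub-question a box was attached to, so it is only one of several ways the title's invariance property could be read.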

Significance. If the claimed mechanism can be shown to enforce accurate sub-answers and boxes without prohibitive supervision or error propagation, the approach would offer a concrete route to interpretable, hallucination-resistant reasoning in VLMs.

major comments (3)
  1. [Abstract] The title centers on H-GRPO and permutation-invariant RL, yet the abstract contains no reference to reinforcement learning, any loss function, policy, or permutation-invariance mechanism; the central claim therefore rests on an unshown training procedure.
  2. [Abstract] No equations, datasets, training details, results, or ablation studies are supplied, so the assertion that grounding intermediate steps produces 'verified visual facts' whose logical combination yields the final answer cannot be checked against any empirical or formal evidence.
  3. [Abstract] The decomposition into atomic sub-questions with bounding boxes is presented as improving accuracy and interpretability, but no argument or result addresses whether faulty sub-answers or boxes would propagate errors and degrade end-task performance relative to monolithic inference.
minor comments (1)
  1. The abstract and title are inconsistent in their description of the proposed method.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. The manuscript introduces H-GRPO as the underlying permutation-invariant RL method for training the De-compositional Evidence Grounding framework, but we agree the abstract could more explicitly connect these elements. We respond point-by-point below and will revise the abstract in the next version.

point-by-point responses
  1. Referee: [Abstract] The title centers on H-GRPO and permutation-invariant RL, yet the abstract contains no reference to reinforcement learning, any loss function, policy, or permutation-invariance mechanism; the central claim therefore rests on an unshown training procedure.

    Authors: The abstract prioritizes the high-level motivation and the De-compositional Evidence Grounding mechanism. The full manuscript defines H-GRPO, including the permutation-invariant policy gradient objective, the reward formulation that enforces grounding, and the training procedure. We will revise the abstract to include a concise reference to the H-GRPO RL framework and its permutation-invariance property. revision: yes

  2. Referee: [Abstract] No equations, datasets, training details, results, or ablation studies are supplied, so the assertion that grounding intermediate steps produces 'verified visual facts' whose logical combination yields the final answer cannot be checked against any empirical or formal evidence.

    Authors: We accept that the abstract is too high-level to allow verification of the claims. The full paper contains the H-GRPO loss equations, dataset descriptions, training hyperparameters, main results, and ablations that quantify the contribution of the decomposition and grounding steps. We will expand the abstract to report key quantitative findings and mention the training setup. revision: yes

  3. Referee: [Abstract] The decomposition into atomic sub-questions with bounding boxes is presented as improving accuracy and interpretability, but no argument or result addresses whether faulty sub-answers or boxes would propagate errors and degrade end-task performance relative to monolithic inference.

    Authors: This observation is correct; the current abstract (and the manuscript) does not explicitly analyze error propagation from incorrect sub-answers or boxes. We will add a dedicated paragraph in the revised manuscript that discusses this risk, presents any available robustness results, and compares end-to-end performance against monolithic baselines under controlled error injection. revision: yes
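The "controlled error injection" mentioned in the third response could take many forms. One minimal possibility is to shift each evidence box by a bounded random offset before the final answer is produced, then measure how accuracy changes with the shift size. The function name, the uniform noise model, and the clamping rule below are assumptions for illustration; the simulated rebuttal does not describe a protocol.

```python
import random
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed convention


def jitter_box(
    bbox: BBox, max_shift: float, width: int, height: int, rng: random.Random
) -> BBox:
    """Translate a box by a uniform random offset, keeping its size.

    Hypothetical perturbation: the box is clamped so it stays inside the
    image. Assumes the box is no larger than the image itself.
    """
    x1, y1, x2, y2 = bbox
    w, h = x2 - x1, y2 - y1
    dx = rng.uniform(-max_shift, max_shift)
    dy = rng.uniform(-max_shift, max_shift)
    new_x1 = min(max(x1 + dx, 0.0), width - w)
    new_y1 = min(max(y1 + dy, 0.0), height - h)
    return (new_x1, new_y1, new_x1 + w, new_y1 + h)


rng = random.Random(0)  # fixed seed so a perturbation run is repeatable
noisy = jitter_box((40, 60, 120, 200), max_shift=25, width=640, height=480, rng=rng)
```

Sweeping `max_shift` from zero upward would give a curve of end-task accuracy against grounding noise, which is the kind of evidence the referee's third comment asks for.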

Circularity Check

0 steps flagged

No circularity: abstract and claims contain no derivations, fits, or self-citation chains

full rationale

The provided abstract and description present a high-level framework for query decomposition into sub-questions with bounding boxes, but contain no equations, parameters, loss functions, or RL formulations. No load-bearing step reduces to a fit, self-definition, or self-citation. The central claim about logical consequence from verified facts is presented as a design goal without any mathematical reduction to inputs. This is the common case of a self-contained descriptive proposal with no detectable circularity from the given text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.1-grok · 5661 in / 1015 out tokens · 33251 ms · 2026-06-30T06:14:28.223954+00:00 · methodology

