DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning

Feng Chen; Guanyu Chen; Haichuan Gao; Haoda Jing; Jun Feng; Tianren Zhang; Tianrun Xu; Ye Li; Yuquan Wei

arxiv: 2509.20912 · v4 · pith:MSDIYUTCnew · submitted 2025-09-25 · 💻 cs.AI

Pith reviewed 2026-05-22 13:32 UTC · model grok-4.3

keywords counterfactual reasoning · multimodal language models · evidence consistency · faithful reasoning · visual question answering · reinforcement learning

The pith

Counterfactual image variants train multimodal models to ground answers in correct visual evidence

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multimodal language models often produce correct answers without relying on the right parts of an image. DeFacto addresses this by creating a dataset of counterfactual images that alter only the question-relevant evidence while keeping the question the same. It then trains models using reinforcement learning with rewards for correct answers, structured reasoning, and selecting consistent evidence. The result is better performance on accuracy metrics and a new measure of how well answers match their supporting evidence. This matters because faithful reasoning requires not just right answers but right reasons based on the actual input.

Core claim

DeFacto is a counterfactual reasoning framework that aligns visual evidence with final answers in multimodal language models through three training paradigms: positive, counterfactual, and random-masking. It uses a language-guided pipeline to build the DeFacto-100K dataset of localized regions and variants, trains with GRPO-based reinforcement learning and three rewards for accuracy, reasoning, and consistency, and evaluates on the DeFacto-1.5K benchmark showing improvements in both answer accuracy and evidence-answer consistency.
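The group-relative step that gives GRPO its name can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the equal reward weights, and the 0/1 reward values are assumptions, since the page does not give the exact reward definitions. Only the general idea of normalizing each sampled response's reward against its own group is standard GRPO.

```python
from statistics import mean, pstdev

def total_reward(r_answer, r_format, r_consistency, weights=(1.0, 1.0, 1.0)):
    """Combine the three reward components into one scalar.

    The weights are hypothetical; the source does not state how the
    accuracy, reasoning, and consistency rewards are combined.
    """
    w_a, w_f, w_c = weights
    return w_a * r_answer + w_f * r_format + w_c * r_consistency

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled responses to one (image, question) pair, with assumed 0/1 rewards.
group = [
    total_reward(1.0, 1.0, 1.0),  # right answer, well-formed, consistent evidence
    total_reward(1.0, 1.0, 0.0),  # right answer, but evidence does not match
    total_reward(0.0, 1.0, 0.0),  # wrong answer, well-formed
    total_reward(0.0, 0.0, 0.0),  # wrong answer, malformed
]
advantages = group_relative_advantages(group)
```

Under these assumed values, a response that is correct but cites the wrong evidence receives a lower advantage than one that is correct and consistent, which is the pressure the consistency reward is meant to supply.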

What carries the argument

A language-guided evidence construction pipeline that automatically localizes question-relevant regions and generates counterfactual variants while preserving the semantics of the original question.

If this is right

Higher answer accuracy on diverse multimodal reasoning benchmarks
Improved consistency between chosen visual evidence and final answers
Successful scaling via the automatically generated DeFacto-100K dataset
New systematic evaluation of grounding quality beyond accuracy alone

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar counterfactual construction could reduce hallucinations in other vision-language settings
The method suggests a general route toward more verifiable multimodal decision processes
Extensions might test the same pipeline on video sequences or multi-image inputs

Load-bearing premise

The language-guided evidence construction pipeline automatically localizes question-relevant regions and generates valid counterfactual variants that preserve the original question semantics while changing only the targeted visual evidence.
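To make the premise concrete, the sketch below builds three inputs from one image and one evidence box. It rests on loud assumptions: the image is a toy grid of pixel values, the counterfactual is produced by blanking the evidence box (the rebuttal further down also mentions object replacement and attribute change), and the random-masking variant is taken to hide a same-sized region that does not overlap the evidence. None of these helper names come from the paper.

```python
import random

def mask_box(image, box, fill=0):
    """Return a copy of image (list of rows) with box (x0, y0, x1, y1) filled."""
    x0, y0, x1, y1 = box
    out = [row[:] for row in image]
    for y in range(y0, y1):
        for x in range(x0, x1):
            out[y][x] = fill
    return out

def boxes_overlap(a, b):
    """True if two half-open boxes (x0, y0, x1, y1) share any pixel."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def random_box_avoiding(evidence_box, width, height, rng, max_tries=100):
    """Sample a box of the same size as the evidence box that does not touch it."""
    bw = evidence_box[2] - evidence_box[0]
    bh = evidence_box[3] - evidence_box[1]
    for _ in range(max_tries):
        x0 = rng.randint(0, width - bw)
        y0 = rng.randint(0, height - bh)
        candidate = (x0, y0, x0 + bw, y0 + bh)
        if not boxes_overlap(candidate, evidence_box):
            return candidate
    return None  # no valid placement found

def build_variants(image, evidence_box, seed=0):
    """Build positive, counterfactual, and random-masking inputs (assumed scheme)."""
    height, width = len(image), len(image[0])
    rng = random.Random(seed)
    variants = {
        "positive": [row[:] for row in image],              # full evidence kept
        "counterfactual": mask_box(image, evidence_box),    # evidence removed
    }
    distractor = random_box_avoiding(evidence_box, width, height, rng)
    if distractor is not None:
        variants["random_masking"] = mask_box(image, distractor)  # evidence intact
    return variants

image = [[1] * 8 for _ in range(8)]      # toy 8x8 "image"
variants = build_variants(image, evidence_box=(1, 1, 3, 3))
```

The point of the sketch is the invariant the premise demands: the question is untouched, and each variant differs from the original in exactly one controlled region. If the localization step returns the wrong box, both the counterfactual and the random-masking inputs silently carry the wrong label.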

What would settle it

If models trained without the counterfactual component show the same levels of evidence-answer inconsistency as strong baselines when measured on the human-annotated DeFacto-1.5K benchmark, the value of the added training paradigms would be questioned.
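One way such a comparison could be scored is sketched below. The page does not define the consistency metric, so this is an assumed stand-in: a prediction counts as consistent when the answer is correct and the predicted evidence box overlaps the annotated box with IoU at or above a threshold. The record fields and the 0.5 threshold are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def evaluate(records, iou_threshold=0.5):
    """Report answer accuracy and an assumed evidence-answer consistency rate."""
    n = len(records)
    correct = 0
    consistent = 0
    for r in records:
        answer_ok = r["pred_answer"] == r["gold_answer"]
        evidence_ok = iou(r["pred_box"], r["gold_box"]) >= iou_threshold
        correct += answer_ok
        consistent += answer_ok and evidence_ok
    return {"accuracy": correct / n, "consistency": consistent / n}

records = [
    {"pred_answer": "red", "gold_answer": "red",
     "pred_box": (0, 0, 10, 10), "gold_box": (0, 0, 10, 10)},   # right answer, right evidence
    {"pred_answer": "3", "gold_answer": "3",
     "pred_box": (20, 20, 30, 30), "gold_box": (0, 0, 10, 10)}, # right answer, wrong evidence
    {"pred_answer": "cat", "gold_answer": "dog",
     "pred_box": (0, 0, 10, 10), "gold_box": (0, 0, 10, 10)},   # wrong answer
    {"pred_answer": "yes", "gold_answer": "yes",
     "pred_box": (0, 0, 10, 5), "gold_box": (0, 0, 10, 10)},    # IoU exactly 0.5
]
metrics = evaluate(records)
```

Running the same scoring on a model trained with and without the counterfactual paradigm would expose the gap between accuracy and consistency that the test above is about; in this toy set the two numbers already differ because one correct answer points at the wrong region.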

Figures

Figures reproduced from arXiv: 2509.20912 by Feng Chen, Guanyu Chen, Haichuan Gao, Haoda Jing, Jun Feng, Tianren Zhang, Tianrun Xu, Ye Li, Yuquan Wei.

**Figure 2.** An overview of our counterfactual framework with three inputs: positive (full evidence), …

**Figure 3.** Reward curves during training. Each subplot corresponds to one component of the reward: …

Original abstract

Recent advances in multimodal language models (MLLMs) have made thinking with images a dominant paradigm for multimodal reasoning. However, existing methods still fail to ensure evidence-answer consistency, where correct answers must be supported by correct visual evidence. To address this issue, we propose DeFacto, a counterfactual reasoning framework that explicitly aligns visual evidence with final answers. Our approach integrates three complementary training paradigms: positive, counterfactual, and random-masking. We further develop a language-guided evidence construction pipeline that automatically localizes question-relevant regions and generates counterfactual variants, resulting in DeFacto-100K. Building on this dataset, we train MLLMs with GRPO-based reinforcement learning and design three complementary rewards to promote correct answering, structured reasoning, and consistent evidence selection. Moreover, we introduce DeFacto-1.5K, a human-annotated benchmark for systematically evaluating evidence-grounded consistency beyond answer accuracy. Experiments on diverse benchmarks demonstrate that DeFacto substantially improves both answer accuracy and evidence-answer consistency over strong baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DeFacto adds an automatic language-guided pipeline for counterfactual image variants plus GRPO rewards to push evidence consistency in MLLMs, but the lack of checks on that pipeline is the part that needs watching.

The main thing here is the automatic construction of DeFacto-100K through language-guided localization of relevant image regions followed by generation of counterfactual variants that are supposed to change only the targeted evidence. They layer this with positive examples, random masking, and GRPO-based RL using three rewards that target correct answers, structured reasoning, and evidence-answer consistency. They also release DeFacto-1.5K as a human-annotated test set for measuring consistency separately from raw accuracy. The abstract claims clear gains on both metrics over baselines across several benchmarks.

This setup targets a genuine problem in multimodal models where answers can be right for the wrong visual reasons, and the combination of training modes plus the dedicated benchmark is a concrete step that could be picked up by others working on faithful reasoning. The approach builds on existing counterfactual ideas without simply repeating them.

The central risk is the load-bearing premise itself. The whole training loop depends on the pipeline producing variants that preserve question semantics and isolate only the intended visual change. If localization is imprecise or the variants introduce extra shifts, the positive, counterfactual, and masking data could all carry label noise or spurious correlations, which would make the reported improvements hard to attribute to better evidence grounding. The abstract gives no numbers, no ablation results, no dataset statistics, and no mention of human verification on the 100K set, so that assumption stays untested in what is shown. The methods themselves look standard for this style of RL fine-tuning with no obvious circularity or internal contradictions.

This is the sort of paper that would interest groups working on reliable vision-language systems and on evaluation benchmarks for grounding. The benchmark in particular could see some use even if the main training claims need more support.

I would send it to peer review. The problem is worth community attention and the framework is a reasonable attempt, but referees will have to press on data quality and demand the missing quantitative details.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces DeFacto, a counterfactual reasoning framework for multimodal language models (MLLMs) to enforce evidence-answer consistency. It combines three training paradigms (positive, counterfactual, and random-masking) built on a language-guided evidence construction pipeline that automatically localizes relevant image regions and generates variants, yielding the DeFacto-100K dataset. Models are trained via GRPO-based reinforcement learning with three rewards targeting correct answers, structured reasoning, and evidence consistency. The work also releases the human-annotated DeFacto-1.5K benchmark for evaluating consistency beyond accuracy and reports substantial gains in both accuracy and evidence-answer consistency over baselines on diverse benchmarks.

Significance. If the central results hold, the work would be a meaningful contribution to multimodal reasoning by directly targeting evidence-answer consistency, a recognized weakness in current MLLMs. The combination of counterfactual data generation with multi-reward GRPO training and the release of both a large training set and a dedicated consistency benchmark are positive elements that could support follow-on research. The approach is internally coherent in its design and addresses a load-bearing practical problem rather than an incremental accuracy tweak.

major comments (1)

[Language-guided evidence construction pipeline] (abstract and §3): The central claim that training on DeFacto-100K plus the three GRPO rewards produces genuine evidence-answer consistency rather than artifacts rests on the unverified assumption that the automatic localization and counterfactual generation preserve question semantics while altering only the targeted visual evidence. No human verification, inter-annotator agreement, or quantitative semantic-preservation metrics on any sample of the 100K set are reported; without such checks the positive/counterfactual/random-masking paradigms could introduce label noise or spurious correlations that inflate both accuracy and consistency metrics.

minor comments (2)

[Abstract] The abstract states improvements but does not include any numerical results, dataset statistics, or error bars; moving a concise summary of key metrics (e.g., accuracy and consistency deltas with standard deviations) into the abstract would improve readability.
[Training objective] Notation for the three GRPO rewards and the exact form of the consistency reward could be clarified with a short equation or pseudocode block to make the training objective easier to reproduce.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential significance of DeFacto in addressing evidence-answer consistency in MLLMs. We provide a point-by-point response to the major comment below, along with planned revisions to strengthen the manuscript.

Referee: [Language-guided evidence construction pipeline] (abstract and §3): The central claim that training on DeFacto-100K plus the three GRPO rewards produces genuine evidence-answer consistency rather than artifacts rests on the unverified assumption that the automatic localization and counterfactual generation preserve question semantics while altering only the targeted visual evidence. No human verification, inter-annotator agreement, or quantitative semantic-preservation metrics on any sample of the 100K set are reported; without such checks the positive/counterfactual/random-masking paradigms could introduce label noise or spurious correlations that inflate both accuracy and consistency metrics.

Authors: We agree that explicit verification of semantic preservation is valuable to strengthen confidence in the automatic pipeline and rule out potential artifacts or label noise. The language-guided evidence construction pipeline (detailed in §3) is designed to localize question-relevant regions via the MLLM's own reasoning trace and then generate counterfactual variants by editing only those regions (e.g., object replacement or attribute change) while leaving the question text, non-evidence image areas, and overall scene semantics intact. We already include multiple qualitative examples in Figure 3 and Appendix A that illustrate preserved question semantics across positive, counterfactual, and random-masking cases. In addition, the consistent gains on the human-annotated DeFacto-1.5K benchmark, which directly measures evidence-answer alignment, provide indirect support that the training signals are effective rather than spurious. To directly address the referee's concern, we will add a human verification study: we will randomly sample 500 examples from DeFacto-100K, have two independent annotators rate semantic preservation and whether only the targeted evidence was altered, and report agreement rates plus inter-annotator agreement (Cohen's kappa). These results and a description of the protocol will be added to §3 and a new appendix in the revised manuscript.

Revision: yes
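The agreement statistic named in the response, Cohen's kappa, corrects the raw agreement rate between two annotators for the agreement expected by chance. A minimal sketch follows; the label names and the ten toy ratings are invented for illustration and are not data from the proposed study.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators rating the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the chance agreement implied by each annotator's label
    frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must rate the same non-empty item set")
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = set(count_a) | set(count_b)
    p_e = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Toy ratings: "ok" = only the targeted evidence changed, "bad" = something else changed.
annotator_1 = ["ok"] * 8 + ["bad"] * 2
annotator_2 = ["ok"] * 7 + ["bad"] * 3
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on 9 of 10 items, but because both label most items "ok", chance agreement is already 0.62, so kappa is noticeably lower than the raw 0.9 agreement rate. That gap is why a verification study on a mostly clean dataset should report kappa alongside the agreement rate.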

Circularity Check

0 steps flagged

No circularity: empirical method with external benchmarks and no self-referential derivations

The paper describes a data-generation pipeline, GRPO training, and three rewards to improve evidence-answer consistency, with results measured on external benchmarks plus a new human-annotated DeFacto-1.5K set. No equations, first-principles derivations, or fitted parameters are presented that reduce any claimed improvement to an input defined by the same data. The language-guided evidence construction is an engineering step whose validity is an empirical assumption, not a definitional loop or self-citation that forces the outcome. Central claims rest on reported accuracy and consistency gains rather than any reduction to the pipeline outputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the automatic pipeline produces high-quality counterfactual examples and that the three rewards in GRPO training reliably promote evidence consistency without introducing new biases.

axioms (1)

Domain assumption: The language-guided evidence construction pipeline accurately localizes question-relevant regions and generates valid counterfactual variants.
This premise is required for the training data to enforce genuine evidence-answer alignment rather than artifacts of the generation process.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning
cs.CL · 2026-05 · unverdicted · novelty 6.0
PDCR improves vision-language reasoning by computing separate normalized confidence advantages for perception steps and reasoning steps after unsupervised decomposition.

Semantic-Enriched Latent Visual Reasoning
cs.CV · 2026-05 · unverdicted · novelty 5.0
SLVR enriches latent visual representations with fine-grained attribute semantics via supervised first-stage learning and multi-query alignment via M-GRPO, yielding improved robustness on region-level reasoning tasks.
