DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning
Pith reviewed 2026-05-22 13:32 UTC · model grok-4.3
The pith
Counterfactual image variants train multimodal models to ground answers in correct visual evidence
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DeFacto is a counterfactual reasoning framework that aligns visual evidence with final answers in multimodal language models through three training paradigms: positive, counterfactual, and random-masking. It uses a language-guided pipeline to build the DeFacto-100K dataset of localized regions and variants, trains with GRPO-based reinforcement learning and three rewards for accuracy, reasoning, and consistency, and evaluates on the DeFacto-1.5K benchmark showing improvements in both answer accuracy and evidence-answer consistency.
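The training signal described above can be sketched in a few lines. This is only an illustration under stated assumptions: the binary reward terms, the weights, and the names `composite_reward` and `group_relative_advantages` are hypothetical and not taken from the paper. The only part that follows the standard GRPO recipe is standardizing each reward within the group of responses sampled for one prompt.

```python
from statistics import mean, pstdev

def composite_reward(answer_ok, format_ok, evidence_ok, weights=(1.0, 0.5, 1.0)):
    """Weighted sum of three reward terms: answer, reasoning format, consistency.

    The weights are hypothetical placeholders, not values from the paper.
    """
    w_ans, w_fmt, w_con = weights
    return w_ans * answer_ok + w_fmt * format_ok + w_con * evidence_ok

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled responses to one prompt: (answer_ok, format_ok, evidence_ok).
group = [(1, 1, 1), (1, 1, 0), (0, 1, 0), (0, 0, 0)]
rewards = [composite_reward(*g) for g in group]
advantages = group_relative_advantages(rewards)
```

In this toy group, a response with a correct answer but wrong evidence receives a lower advantage than one that gets both right, which is the behavior a consistency reward is meant to encourage.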
What carries the argument
A language-guided evidence construction pipeline that automatically localizes question-relevant regions and generates counterfactual variants that preserve the original question semantics.
If this is right
- Higher answer accuracy on diverse multimodal reasoning benchmarks
- Improved consistency between chosen visual evidence and final answers
- Successful scaling via the automatically generated DeFacto-100K dataset
- New systematic evaluation of grounding quality beyond accuracy alone
Where Pith is reading between the lines
- Similar counterfactual construction could reduce hallucinations in other vision-language settings
- The method suggests a general route toward more verifiable multimodal decision processes
- Extensions might test the same pipeline on video sequences or multi-image inputs
Load-bearing premise
The language-guided evidence construction pipeline automatically localizes question-relevant regions and generates valid counterfactual variants that preserve the original question semantics while changing only the targeted visual evidence.
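To make the premise concrete, here is a minimal sketch of how the three input variants might be built once an evidence region is known. It assumes that the edit is plain rectangular masking and that the evidence box is already given; the paper's actual localization and editing steps are not specified on this page, and every name below (`mask_box`, `build_variants`, the toy 8x8 image) is hypothetical.

```python
import random

def mask_box(image, box, fill=0):
    """Return a copy of a 2D image (list of rows) with box=(x0, y0, x1, y1) filled."""
    x0, y0, x1, y1 = box
    out = [row[:] for row in image]
    for y in range(y0, y1):
        for x in range(x0, x1):
            out[y][x] = fill
    return out

def boxes_overlap(a, b):
    """True if two half-open boxes (x0, y0, x1, y1) share any pixel."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def random_box_avoiding(width, height, size, avoid, rng, max_tries=1000):
    """Sample a size x size box that does not touch the `avoid` box."""
    for _ in range(max_tries):
        x0 = rng.randrange(0, width - size + 1)
        y0 = rng.randrange(0, height - size + 1)
        box = (x0, y0, x0 + size, y0 + size)
        if not boxes_overlap(box, avoid):
            return box
    raise ValueError("no non-overlapping box found")

def build_variants(image, evidence_box, rng, distractor_size=2):
    """Build positive, counterfactual, and random-masking inputs (assumed scheme)."""
    height, width = len(image), len(image[0])
    distractor = random_box_avoiding(width, height, distractor_size, evidence_box, rng)
    return {
        "positive": [row[:] for row in image],              # image left unchanged
        "counterfactual": mask_box(image, evidence_box),    # evidence removed
        "random_masking": mask_box(image, distractor),      # irrelevant region removed
    }

image = [[1] * 8 for _ in range(8)]   # toy 8x8 "image" of ones
evidence_box = (0, 0, 3, 3)           # hypothetical localized evidence region
variants = build_variants(image, evidence_box, random.Random(0))
```

The premise is load-bearing precisely because each variant is only useful if the box is right: a mislocalized evidence box would turn the counterfactual variant into another random-masking example with the wrong label.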
What would settle it
If models trained without the counterfactual component show the same levels of evidence-answer inconsistency as strong baselines when measured on the human-annotated DeFacto-1.5K benchmark, the value of the added training paradigms would be questioned.
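The comparison above needs a consistency score that is separate from accuracy. The sketch below shows one possible definition, assuming each benchmark record carries a predicted and an annotated answer plus a predicted and an annotated evidence box; the IoU threshold of 0.5, the record fields, and the function names are assumptions for illustration, not the benchmark's actual protocol.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def score(records, iou_threshold=0.5):
    """Accuracy, and the share of examples that are both correct and grounded."""
    n = len(records)
    correct = [r["pred_answer"] == r["gold_answer"] for r in records]
    grounded = [iou(r["pred_box"], r["gold_box"]) >= iou_threshold for r in records]
    return {
        "accuracy": sum(correct) / n,
        "consistency": sum(c and g for c, g in zip(correct, grounded)) / n,
    }

# Toy records with hypothetical field names.
records = [
    {"pred_answer": "cat", "gold_answer": "cat",
     "pred_box": (0, 0, 10, 10), "gold_box": (0, 0, 10, 10)},
    {"pred_answer": "3", "gold_answer": "3",
     "pred_box": (0, 0, 10, 10), "gold_box": (20, 20, 30, 30)},
    {"pred_answer": "red", "gold_answer": "blue",
     "pred_box": (0, 0, 10, 10), "gold_box": (0, 0, 10, 10)},
    {"pred_answer": "yes", "gold_answer": "yes",
     "pred_box": (0, 0, 10, 10), "gold_box": (0, 0, 10, 20)},
]
result = score(records)
```

Running the same scorer on a model trained with and without the counterfactual paradigm would expose the gap the test depends on: the second record is correct but ungrounded, so it counts toward accuracy and not toward consistency.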
Original abstract
Recent advances in multimodal language models (MLLMs) have made thinking with images a dominant paradigm for multimodal reasoning. However, existing methods still fail to ensure evidence-answer consistency, where correct answers must be supported by correct visual evidence. To address this issue, we propose DeFacto, a counterfactual reasoning framework that explicitly aligns visual evidence with final answers. Our approach integrates three complementary training paradigms: positive, counterfactual, and random-masking. We further develop a language-guided evidence construction pipeline that automatically localizes question-relevant regions and generates counterfactual variants, resulting in DeFacto-100K. Building on this dataset, we train MLLMs with GRPO-based reinforcement learning and design three complementary rewards to promote correct answering, structured reasoning, and consistent evidence selection. Moreover, we introduce DeFacto-1.5K, a human-annotated benchmark for systematically evaluating evidence-grounded consistency beyond answer accuracy. Experiments on diverse benchmarks demonstrate that DeFacto substantially improves both answer accuracy and evidence-answer consistency over strong baselines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces DeFacto, a counterfactual reasoning framework for multimodal language models (MLLMs) to enforce evidence-answer consistency. It combines three training paradigms (positive, counterfactual, and random-masking) built on a language-guided evidence construction pipeline that automatically localizes relevant image regions and generates variants, yielding the DeFacto-100K dataset. Models are trained via GRPO-based reinforcement learning with three rewards targeting correct answers, structured reasoning, and evidence consistency. The work also releases the human-annotated DeFacto-1.5K benchmark for evaluating consistency beyond accuracy and reports substantial gains in both accuracy and evidence-answer consistency over baselines on diverse benchmarks.
Significance. If the central results hold, the work would be a meaningful contribution to multimodal reasoning by directly targeting evidence-answer consistency, a recognized weakness in current MLLMs. The combination of counterfactual data generation with multi-reward GRPO training and the release of both a large training set and a dedicated consistency benchmark are positive elements that could support follow-on research. The approach is internally coherent in its design and addresses a load-bearing practical problem rather than an incremental accuracy tweak.
major comments (1)
- [Language-guided evidence construction pipeline] (abstract and §3): The central claim that training on DeFacto-100K with the three GRPO rewards produces genuine evidence-answer consistency, rather than artifacts, rests on the unverified assumption that automatic localization and counterfactual generation preserve question semantics while altering only the targeted visual evidence. No human verification, inter-annotator agreement, or quantitative semantic-preservation metrics are reported for any sample of the 100K set. Without such checks, the positive, counterfactual, and random-masking paradigms could introduce label noise or spurious correlations that inflate both accuracy and consistency metrics.
minor comments (2)
- [Abstract] The abstract states improvements but does not include any numerical results, dataset statistics, or error bars; moving a concise summary of key metrics (e.g., accuracy and consistency deltas with standard deviations) into the abstract would improve readability.
- [Training objective] Notation for the three GRPO rewards and the exact form of the consistency reward could be clarified with a short equation or pseudocode block to make the training objective easier to reproduce.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the potential significance of DeFacto in addressing evidence-answer consistency in MLLMs. We provide a point-by-point response to the major comment below, along with planned revisions to strengthen the manuscript.
- Referee: [Language-guided evidence construction pipeline] (abstract and §3): The central claim that training on DeFacto-100K with the three GRPO rewards produces genuine evidence-answer consistency, rather than artifacts, rests on the unverified assumption that automatic localization and counterfactual generation preserve question semantics while altering only the targeted visual evidence. No human verification, inter-annotator agreement, or quantitative semantic-preservation metrics are reported for any sample of the 100K set. Without such checks, the positive, counterfactual, and random-masking paradigms could introduce label noise or spurious correlations that inflate both accuracy and consistency metrics.
Authors: We agree that explicit verification of semantic preservation would strengthen confidence in the automatic pipeline and help rule out artifacts or label noise. The language-guided evidence construction pipeline (detailed in §3) is designed to localize question-relevant regions via the MLLM's own reasoning trace and then generate counterfactual variants by editing only those regions (e.g., object replacement or attribute change) while leaving the question text, non-evidence image areas, and overall scene semantics intact. We already include multiple qualitative examples in Figure 3 and Appendix A that illustrate preserved question semantics across positive, counterfactual, and random-masking cases. In addition, the consistent gains on the human-annotated DeFacto-1.5K benchmark, which directly measures evidence-answer alignment, provide indirect support that the training signals are effective rather than spurious. To address the referee's concern directly, we will add a human verification study: we will randomly sample 500 examples from DeFacto-100K, have two independent annotators rate semantic preservation and whether only the targeted evidence was altered, and report raw agreement rates together with chance-corrected inter-annotator agreement (Cohen's kappa). These results and a description of the protocol will be added to §3 and a new appendix in the revised manuscript.
Revision: yes
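For reference, the agreement statistic named in the response can be computed as follows. This is a minimal sketch of Cohen's kappa for two annotators with categorical labels; the label values and the ten toy ratings are invented for illustration and do not come from the paper or the proposed study.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled independently at their own rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[k] * freq_b[k] for k in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Toy ratings of "was only the targeted evidence altered?" (invented data).
annotator_1 = ["ok", "ok", "ok", "bad", "ok", "bad", "ok", "ok", "bad", "ok"]
annotator_2 = ["ok", "ok", "bad", "bad", "ok", "bad", "ok", "ok", "ok", "ok"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on 8 of 10 items, but because both say "ok" most of the time, the chance-corrected value is noticeably lower than 0.8. That gap is why reporting kappa alongside raw agreement matters for a sample dominated by one label.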
Circularity Check
No circularity: empirical method with external benchmarks and no self-referential derivations
The paper describes a data-generation pipeline, GRPO training, and three rewards to improve evidence-answer consistency, with results measured on external benchmarks plus a new human-annotated DeFacto-1.5K set. No equations, first-principles derivations, or fitted parameters are presented that reduce any claimed improvement to an input defined by the same data. The language-guided evidence construction is an engineering step whose validity is an empirical assumption, not a definitional loop or self-citation that forces the outcome. Central claims rest on reported accuracy and consistency gains rather than any reduction to the pipeline outputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The language-guided evidence construction pipeline accurately localizes question-relevant regions and generates valid counterfactual variants.
Forward citations
Cited by 2 Pith papers
- PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning. PDCR improves vision-language reasoning by computing separate normalized confidence advantages for perception steps and reasoning steps after unsupervised decomposition.
- Semantic-Enriched Latent Visual Reasoning. SLVR enriches latent visual representations with fine-grained attribute semantics via supervised first-stage learning and multi-query alignment via M-GRPO, yielding improved robustness on region-level reasoning tasks.