arxiv: 2509.20912 · v4 · pith:MSDIYUTCnew · submitted 2025-09-25 · 💻 cs.AI

DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning

Pith reviewed 2026-05-22 13:32 UTC · model grok-4.3

classification 💻 cs.AI
keywords counterfactual reasoning · multimodal language models · evidence consistency · faithful reasoning · visual question answering · reinforcement learning

The pith

Counterfactual image variants train multimodal models to ground answers in correct visual evidence

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multimodal language models often produce correct answers without relying on the right parts of an image. DeFacto addresses this by building a dataset of counterfactual images that alter only the question-relevant evidence while keeping the question the same. It then trains models with reinforcement learning, using rewards for correct answers, structured reasoning, and consistent evidence selection. The reported result is higher answer accuracy and better scores on a new measure of how well answers match their supporting evidence. This matters because faithful reasoning requires not just right answers but right reasons based on the actual input.

Core claim

DeFacto is a counterfactual reasoning framework that aligns visual evidence with final answers in multimodal language models through three training paradigms: positive, counterfactual, and random-masking. It uses a language-guided pipeline to build the DeFacto-100K dataset of localized regions and variants, trains with GRPO-based reinforcement learning and three rewards for accuracy, reasoning, and consistency, and evaluates on the DeFacto-1.5K benchmark showing improvements in both answer accuracy and evidence-answer consistency.
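The training signal described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Rollout` fields, the binary reward terms, and their unweighted sum are assumptions made here. Only the group-relative normalization, where each reward is standardized against the other rollouts sampled for the same prompt, follows the way GRPO is generally described.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List


@dataclass
class Rollout:
    """One sampled response for a prompt (hypothetical fields)."""
    answer_correct: bool       # final answer matches the reference
    well_formatted: bool       # output follows the expected reasoning structure
    evidence_consistent: bool  # selected evidence agrees with the sample type


def total_reward(r: Rollout) -> float:
    # Assumption: three binary terms, summed without weights.
    return float(r.answer_correct) + float(r.well_formatted) + float(r.evidence_consistent)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(x - mu) / (sigma + eps) for x in rewards]


group = [
    Rollout(True, True, True),
    Rollout(False, True, False),
    Rollout(True, True, False),
    Rollout(True, False, True),
]
advantages = group_relative_advantages([total_reward(r) for r in group])
```

Rollouts that score above the group mean receive a positive advantage and those below receive a negative one, so a group in which every rollout earns the same reward contributes no gradient signal.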

What carries the argument

A language-guided evidence construction pipeline that automatically localizes question-relevant regions and generates counterfactual variants while preserving the original question's semantics.
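To make the three input types concrete, here is a toy sketch that assumes the evidence region is already available as a bounding box. Everything in it is an assumption for illustration: the helper names, the nested-list image standing in for a real array, and the use of constant-fill masking as the simplest stand-in for whatever editing the actual pipeline performs.

```python
import random
from typing import Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), half-open pixel coordinates
Image = List[List[int]]          # toy grayscale image: rows of pixel values


def mask_region(image: Image, box: Box, fill: int = 0) -> Image:
    """Return a copy of the image with the pixels inside box set to fill."""
    x0, y0, x1, y1 = box
    out = [row[:] for row in image]
    for y in range(y0, y1):
        for x in range(x0, x1):
            out[y][x] = fill
    return out


def overlaps(a: Box, b: Box) -> bool:
    """True if the two half-open boxes share at least one pixel."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def random_box_outside(evidence: Box, width: int, height: int,
                       rng: random.Random, max_tries: int = 100) -> Optional[Box]:
    """Sample a box of the evidence box's size that does not touch the evidence."""
    w, h = evidence[2] - evidence[0], evidence[3] - evidence[1]
    for _ in range(max_tries):
        x0 = rng.randint(0, width - w)
        y0 = rng.randint(0, height - h)
        candidate = (x0, y0, x0 + w, y0 + h)
        if not overlaps(candidate, evidence):
            return candidate
    return None  # no room for a disjoint box of that size


def build_variants(image: Image, evidence: Box, rng: random.Random) -> Dict[str, Image]:
    height, width = len(image), len(image[0])
    variants = {
        "positive": image,                                # evidence intact
        "counterfactual": mask_region(image, evidence),   # evidence removed
    }
    distractor = random_box_outside(evidence, width, height, rng)
    if distractor is not None:
        variants["random_masking"] = mask_region(image, distractor)  # irrelevant area removed
    return variants
```

Under this reading, the question should remain answerable for the positive and random-masking variants and become unanswerable for the counterfactual one, which is the contrast the consistency reward would need to exploit.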

If this is right

  • Higher answer accuracy on diverse multimodal reasoning benchmarks
  • Improved consistency between chosen visual evidence and final answers
  • Successful scaling via the automatically generated DeFacto-100K dataset
  • New systematic evaluation of grounding quality beyond accuracy alone

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar counterfactual construction could reduce hallucinations in other vision-language settings
  • The method suggests a general route toward more verifiable multimodal decision processes
  • Extensions might test the same pipeline on video sequences or multi-image inputs

Load-bearing premise

The language-guided evidence construction pipeline automatically localizes question-relevant regions and generates valid counterfactual variants that preserve the original question semantics while changing only the targeted visual evidence.

What would settle it

If models trained without the counterfactual component match the full DeFacto model's evidence-answer consistency on the human-annotated DeFacto-1.5K benchmark, the value of the added training paradigms would be in question.
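One way such a comparison could be scored is sketched below. The record fields, the box-overlap criterion, and the 0.5 IoU threshold are all assumptions made for illustration; the benchmark's actual consistency metric is not specified on this page.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def evaluate(records: List[Dict], iou_threshold: float = 0.5) -> Dict[str, float]:
    """Report answer accuracy and the share of answers that are both
    correct and backed by a predicted region that overlaps the annotated one."""
    if not records:
        raise ValueError("records must not be empty")
    n = len(records)
    correct = sum(1 for r in records if r["answer_correct"])
    consistent = sum(
        1 for r in records
        if r["answer_correct"] and iou(r["pred_box"], r["gold_box"]) >= iou_threshold
    )
    return {"accuracy": correct / n, "consistency": consistent / n}
```

Running `evaluate` on the outputs of the full model and of a model trained without the counterfactual inputs would give the two consistency numbers the test above calls for; a gap between accuracy and consistency within one model indicates right answers resting on the wrong region.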

Figures

Figures reproduced from arXiv: 2509.20912 by Feng Chen, Guanyu Chen, Haichuan Gao, Haoda Jing, Jun Feng, Tianren Zhang, Tianrun Xu, Ye Li, Yuquan Wei.

Figure 1: Qualitative examples of failure cases. Left: Mislocalized Failure (park scene). Right:
Figure 2: An overview of our counterfactual framework with three inputs: positive (full evidence),
Figure 3: Reward curves during training. Each subplot corresponds to one component of the reward:
Original abstract

Recent advances in multimodal language models (MLLMs) have made thinking with images a dominant paradigm for multimodal reasoning. However, existing methods still fail to ensure evidence-answer consistency, where correct answers must be supported by correct visual evidence. To address this issue, we propose DeFacto, a counterfactual reasoning framework that explicitly aligns visual evidence with final answers. Our approach integrates three complementary training paradigms: positive, counterfactual, and random-masking. We further develop a language-guided evidence construction pipeline that automatically localizes question-relevant regions and generates counterfactual variants, resulting in DeFacto-100K. Building on this dataset, we train MLLMs with GRPO-based reinforcement learning and design three complementary rewards to promote correct answering, structured reasoning, and consistent evidence selection. Moreover, we introduce DeFacto-1.5K, a human-annotated benchmark for systematically evaluating evidence-grounded consistency beyond answer accuracy. Experiments on diverse benchmarks demonstrate that DeFacto substantially improves both answer accuracy and evidence-answer consistency over strong baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces DeFacto, a counterfactual reasoning framework for multimodal language models (MLLMs) to enforce evidence-answer consistency. It combines three training paradigms (positive, counterfactual, and random-masking) built on a language-guided evidence construction pipeline that automatically localizes relevant image regions and generates variants, yielding the DeFacto-100K dataset. Models are trained via GRPO-based reinforcement learning with three rewards targeting correct answers, structured reasoning, and evidence consistency. The work also releases the human-annotated DeFacto-1.5K benchmark for evaluating consistency beyond accuracy and reports substantial gains in both accuracy and evidence-answer consistency over baselines on diverse benchmarks.

Significance. If the central results hold, the work would be a meaningful contribution to multimodal reasoning by directly targeting evidence-answer consistency, a recognized weakness in current MLLMs. The combination of counterfactual data generation with multi-reward GRPO training and the release of both a large training set and a dedicated consistency benchmark are positive elements that could support follow-on research. The approach is internally coherent in its design and addresses a load-bearing practical problem rather than an incremental accuracy tweak.

major comments (1)
  1. [Language-guided evidence construction pipeline] Language-guided evidence construction pipeline (abstract and §3): the central claim that training on DeFacto-100K plus the three GRPO rewards produces genuine evidence-answer consistency rather than artifacts rests on the unverified assumption that the automatic localization and counterfactual generation preserve question semantics while altering only the targeted visual evidence. No human verification, inter-annotator agreement, or quantitative semantic-preservation metrics on any sample of the 100K set are reported; without such checks the positive/counterfactual/random-masking paradigms could introduce label noise or spurious correlations that inflate both accuracy and consistency metrics.
minor comments (2)
  1. [Abstract] The abstract states improvements but does not include any numerical results, dataset statistics, or error bars; moving a concise summary of key metrics (e.g., accuracy and consistency deltas with standard deviations) into the abstract would improve readability.
  2. [Training objective] Notation for the three GRPO rewards and the exact form of the consistency reward could be clarified with a short equation or pseudocode block to make the training objective easier to reproduce.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential significance of DeFacto in addressing evidence-answer consistency in MLLMs. We provide a point-by-point response to the major comment below, along with planned revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Language-guided evidence construction pipeline] Language-guided evidence construction pipeline (abstract and §3): the central claim that training on DeFacto-100K plus the three GRPO rewards produces genuine evidence-answer consistency rather than artifacts rests on the unverified assumption that the automatic localization and counterfactual generation preserve question semantics while altering only the targeted visual evidence. No human verification, inter-annotator agreement, or quantitative semantic-preservation metrics on any sample of the 100K set are reported; without such checks the positive/counterfactual/random-masking paradigms could introduce label noise or spurious correlations that inflate both accuracy and consistency metrics.

    Authors: We agree that explicit verification of semantic preservation is valuable to strengthen confidence in the automatic pipeline and rule out potential artifacts or label noise. The language-guided evidence construction pipeline (detailed in §3) is designed to localize question-relevant regions via the MLLM's own reasoning trace and then generate counterfactual variants by editing only those regions (e.g., object replacement or attribute change) while leaving the question text, non-evidence image areas, and overall scene semantics intact. We already include multiple qualitative examples in Figure 3 and Appendix A that illustrate preserved question semantics across positive, counterfactual, and random-masking cases. In addition, the consistent gains on the human-annotated DeFacto-1.5K benchmark, which directly measures evidence-answer alignment, provide indirect support that the training signals are effective rather than spurious. To directly address the referee's concern, we will add a human verification study: we will randomly sample 500 examples from DeFacto-100K, have two independent annotators rate semantic preservation and whether only the targeted evidence was altered, and report raw agreement rates plus chance-corrected inter-annotator agreement (Cohen's kappa). These results and a description of the protocol will be added to §3 and a new appendix in the revised manuscript.

    revision: yes
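The agreement statistic named in the response, Cohen's kappa, corrects the raw agreement rate between two annotators for the agreement expected by chance. The sketch below uses the standard definition, kappa = (p_o - p_e) / (1 - p_e); the function name, the label encoding, and the handling of the degenerate case are choices made here, not part of the proposed protocol.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n

    # Expected chance agreement from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)

    if p_e == 1.0:
        # Degenerate case: both annotators used one identical label throughout.
        # Kappa is formally undefined; returning 1.0 is a convention chosen here.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)
```

For the 500-example sample, `labels_a` and `labels_b` would hold each annotator's per-example judgment, for instance whether only the targeted evidence was altered. Reporting kappa alongside the raw agreement rate matters because a heavily skewed label distribution can produce high raw agreement with little agreement beyond chance.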

Circularity Check

0 steps flagged

No circularity: empirical method with external benchmarks and no self-referential derivations

The paper describes a data-generation pipeline, GRPO training, and three rewards to improve evidence-answer consistency, with results measured on external benchmarks plus a new human-annotated DeFacto-1.5K set. No equations, first-principles derivations, or fitted parameters are presented that reduce any claimed improvement to an input defined by the same data. The language-guided evidence construction is an engineering step whose validity is an empirical assumption, not a definitional loop or self-citation that forces the outcome. Central claims rest on reported accuracy and consistency gains rather than any reduction to the pipeline outputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the automatic pipeline produces high-quality counterfactual examples and that the three rewards in GRPO training reliably promote evidence consistency without introducing new biases.

axioms (1)
  • domain assumption: The language-guided evidence construction pipeline accurately localizes question-relevant regions and generates valid counterfactual variants.
    This premise is required for the training data to enforce genuine evidence-answer alignment rather than artifacts of the generation process.

pith-pipeline@v0.9.0 · 5732 in / 1324 out tokens · 31875 ms · 2026-05-22T13:32:03.736879+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning

    cs.CL · 2026-05 · unverdicted · novelty 6.0

    PDCR improves vision-language reasoning by computing separate normalized confidence advantages for perception steps and reasoning steps after unsupervised decomposition.

  2. Semantic-Enriched Latent Visual Reasoning

    cs.CV · 2026-05 · unverdicted · novelty 5.0

    SLVR enriches latent visual representations with fine-grained attribute semantics via supervised first-stage learning and multi-query alignment via M-GRPO, yielding improved robustness on region-level reasoning tasks.
