Distilling Counterfactual Reasoning from Language to Vision: Causal Graph Guided Post-Training for Video Understanding

Jiang Liu; Ruixiang Tang; Xiaodong Lin; Yuefei Chen

arxiv: 2511.19923 · v2 · pith:HMTTI6MDnew · submitted 2025-11-25 · 💻 cs.CV · cs.CL

Distilling Counterfactual Reasoning from Language to Vision: Causal Graph Guided Post-Training for Video Understanding

Yuefei Chen , Jiang Liu , Xiaodong Lin , Ruixiang Tang This is my paper

classification 💻 cs.CV cs.CL

keywords reasoningcounterfactualcapabilitycausallanguagemodelsunderstandingvideo

0 comments

read the original abstract

Vision Language Models (VLMs) have recently shown significant advancements in video understanding, especially in feature alignment, event reasoning, and instruction-following tasks. However, their capability for counterfactual reasoning, inferring alternative outcomes under hypothetical conditions, remains underexplored. This capability is essential for robust video understanding, as it requires identifying underlying causal structures and reasoning about unobserved possibilities, rather than merely recognizing observed patterns. To systematically evaluate this capability, we introduce CounterVQA, a video-based benchmark featuring three progressive difficulty levels that assess different aspects of counterfactual reasoning. Through comprehensive evaluation of both state-of-the-art open-source and closed-source models, we uncover a substantial performance gap: while these models achieve reasonable accuracy on simple counterfactual questions, performance degrades significantly on complex multi-hop causal chains. To address these limitations, we develop a post-training method, CFGPT, that enhances a model's visual counterfactual reasoning ability by distilling its counterfactual reasoning capability from the language modality, yielding consistent improvements across all CounterVQA difficulty levels. Dataset and code will be further released.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Do Vision--Language Models Understand 3D Scenes or Just Catalogue Objects?
cs.CV 2026-05 accept novelty 6.0

VLMs achieve 53-97% on volumetric rearrangement planning but only 6-45% on occlusion and under 7% on reflections in a new 3,034-sample benchmark, with white-box analysis localizing the failure to visual-token merger i...