arxiv: 2605.23216 · v1 · pith:OVF7F5YVnew · submitted 2026-05-22 · 💻 cs.CV

CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering

Pith reviewed 2026-05-25 04:52 UTC · model grok-4.3

classification 💻 cs.CV
keywords: causal reasoning, video question answering, vision-language models, spatio-temporal reasoning, benchmark, causal chains, grounded reasoning, video understanding

The pith

Vision-language models struggle to construct precise causal chains when answering questions about video events.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates CaST-Bench to test whether vision-language models can perform cause-and-effect reasoning in videos by locating chains of multiple spatio-temporal pieces of evidence. It builds a dataset of 2,066 questions across 1,015 videos, each paired with annotations that mark the exact temporal segments and bounding-box tracks forming the causal chain. The authors argue that current models perform poorly on causal questions primarily because they cannot build these grounded chains, and they supply new metrics that score both final answers and the quality of the visual evidence used. This setup matters because accurate causal reasoning would cut down on answers based on misleading surface patterns and would let users see the actual steps a model followed.

Core claim

CaST-Bench supplies 2,066 complex causal questions over 1,015 videos. Each question demands that a model identify and localize a chain of multiple spatio-temporal evidences. The chains are annotated with temporal segments and bounding-box tracks through a human-AI pipeline, and the benchmark includes novel metrics that separately measure answer correctness and the degree of visual-evidence grounding achieved.

What carries the argument

Causal chain annotations that mark the specific temporal segments and bounding-box tracks linking cause to effect in each video.
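
To make the annotation format concrete, here is a minimal Python sketch of how one question and its causal chain might be represented. The class names, field names, and the "cause"/"effect" role labels are hypothetical; this page does not describe the benchmark's actual file schema, and the example values are invented for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class EvidenceTrack:
    """One link in a causal chain (hypothetical schema, not the paper's)."""
    role: str            # assumed labels: "cause" or "effect"
    start_s: float       # temporal segment start, in seconds
    end_s: float         # temporal segment end, in seconds
    description: str     # free-text description of the evidence
    # Bounding-box track: frame index -> (x1, y1, x2, y2) in pixels.
    boxes: dict[int, tuple[float, float, float, float]] = field(default_factory=dict)

    def duration(self) -> float:
        return self.end_s - self.start_s


@dataclass
class CausalQA:
    """A multiple-choice question paired with its annotated chain."""
    video_id: str
    question: str
    options: list[str]
    answer_index: int
    chain: list[EvidenceTrack]

    def validate(self) -> None:
        """Raise ValueError if the record is internally inconsistent."""
        if not 0 <= self.answer_index < len(self.options):
            raise ValueError("answer_index out of range")
        for ev in self.chain:
            if ev.end_s <= ev.start_s:
                raise ValueError("evidence segment must have positive duration")


# Invented example record; none of these values come from the dataset.
example = CausalQA(
    video_id="demo_0001",
    question="Why does the person stand up?",
    options=["A", "B", "C", "D", "E", "F"],
    answer_index=2,
    chain=[
        EvidenceTrack("cause", 0.0, 3.0, "A phone rings on the desk.",
                      boxes={0: (10.0, 20.0, 50.0, 60.0)}),
        EvidenceTrack("effect", 3.0, 7.5, "The person stands and reaches for it."),
    ],
)
example.validate()
```

Keeping the temporal segment and the box track on the same object reflects the idea that each evidence is localized in both time and space; a real loader would follow whatever schema the released data uses.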

If this is right

  • Models able to build explicit causal chains will show higher accuracy on causal video questions.
  • Grounded chain construction will reduce answers driven by spurious correlations.
  • Explicit evidence chains will make model outputs more transparent to users.
  • Future vision-language models will need dedicated mechanisms for assembling spatio-temporal causal sequences.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same annotation style could be applied to test causal reasoning in domains such as robotics or medical video analysis.
  • Metrics that separately score grounding may become standard for judging reliability in multimodal systems.
  • Training procedures that reward chain construction rather than final-answer matching could be developed using this benchmark format.

Load-bearing premise

The human-AI pipeline produces annotations that accurately identify the true causal chains in the videos without major bias or localization mistakes.

What would settle it

A vision-language model that reaches high answer accuracy on the benchmark questions while failing to localize or reference the annotated causal segments and boxes, or a large-scale human review that finds frequent mismatches between the provided annotations and the actual causal events in the videos.
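
The first test above can be sketched in a few lines of Python: flag items that a model answers correctly while its cited segments miss the annotated chain. Temporal IoU with a 0.5 threshold is used here as a stand-in; it is a common grounding measure but an assumption on our part, since this page does not specify the paper's metrics. The record keys (`id`, `correct`, `gold`, `pred`) are hypothetical.

```python
def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def chain_coverage(gold, pred, thresh=0.5):
    """Fraction of annotated segments matched by some predicted segment."""
    if not gold:
        return 0.0
    hits = sum(any(temporal_iou(g, p) >= thresh for p in pred) for g in gold)
    return hits / len(gold)


def correct_but_ungrounded(records, min_coverage=0.5):
    """Ids of items answered correctly with chain coverage below min_coverage."""
    return [
        r["id"]
        for r in records
        if r["correct"] and chain_coverage(r["gold"], r["pred"]) < min_coverage
    ]


# Invented records for illustration only.
records = [
    {"id": "q1", "correct": True, "gold": [(0, 3), (7, 10)], "pred": [(0, 3), (7, 10)]},
    {"id": "q2", "correct": True, "gold": [(0, 3), (7, 10)], "pred": [(20, 25)]},
    {"id": "q3", "correct": False, "gold": [(0, 3)], "pred": [(20, 25)]},
]
flagged = correct_but_ungrounded(records)
```

A large flagged set on the real benchmark would suggest that answer accuracy can be reached without the annotated evidence, which is the outcome this section describes as decisive. A fuller check would also compare bounding-box tracks, not only time spans.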

Figures

Figures reproduced from arXiv: 2605.23216 by Ashutosh Kumar, Caixin Kang, Hsuan-Kung Yang, Jingjing Pan, Mingfang Zhang, Mustafa Erdogan, Quan Kong, Rajat Saini, Yifei Huang, Yoichi Sato.

Figure 1: CaST-Bench Overview and Example Data. Each QA in the benchmark is paired with a novel spatio-temporal (ST) causal chain. Unlike previous benchmarks, CaST-Bench requires models to actively search for both cause (Vc) and effect (Ve) evidences in order to construct a causal chain for question answering. To excel on CaST-Bench, a model must produce an answer that is not only correct, but also faithfully ground…
Figure 3: Human-AI Collaborative Pipeline for Constructing Causal Chain–Grounded Video QA. Vision-language tools and humans collaborate across detection, description, generation, and filtering stages to produce high-quality {QA, Causal Chain} data.
… engaged in distinct activities, ensuring complex scenes that pose a challenge for spatio-temporal reasoning. This process yields 1,015 qualified videos. Spatio-Temporal Fi…
Figure 5: Distribution of Question Types.
Benchmark Statistics:
  • #Videos: 1015
  • #Described instances: 10728
  • #Questions: 2066
  • #Options per QA: 6
  • Avg. #evidences per QA: 2.36
  • Avg. video temporal duration: 13.68s
  • Avg. evidence temporal duration: 5.65s
  • Avg. evidence spatial coverage: 8.3%
  • Avg. #words of questions: 21.03
  • Avg. #words of options: 10.86
  • Avg. #words of evidence: 15.50
Figure 6: Error analysis regarding model vulnerability to different distractor option types on CaST-Bench. Values show how often model selected the distractor option (i.e., the trap rate) for text-based, video-based, and near-miss distractors.
[Plot labels: Logical Consistency, Evidence Coverage, Overall Justification, Answer Correctness; Text-based Distractor, Video-based Distractor, Near…]
Figure 8: Interface of Human Annotation for Instance Description (Sec. 10.3). The left part displays the video, and on the right there is a text box where the annotator can freely edit or modify the content, based on original description that is AI-generated (Sec. 10.2).
Stage 3: Dynamic Description for Capturing Temporal Behavior. With a static, contextual description in hand, the final stage is to capture the insta…
Figure 9: CaST-Bench Data Sample 1. Question Type: Causal Explanation - Why questions (reasons)
Figure 10: CaST-Bench Data Sample 2. Question Type: Causal Explanation - How questions (mechanisms)
Figure 11: CaST-Bench Data Sample 3. Question Type: Counterfactual Reasoning - Physical counterfactual
Figure 12: CaST-Bench Data Sample 4. Question Type: Counterfactual Reasoning - Physical counterfactual
Figure 13: CaST-Bench Data Sample 5. Question Type: Counterfactual Reasoning - Social counterfactual
Figure 14: CaST-Bench Data Sample 6. Question Type: Predictive Anticipation - Behavioral anticipation
Figure 15: Case Studies and Failure Analysis. Successful example. Question: Based on the actions of the person in the light blue-colored top, what is the most likely immediate next action they will perform after 00:10? Gemini-2.5-Pro. Evidence #3: 00:07-00:10 After standing, the person turns around and pushes the white chair back under the table, tidying up their space. Evidence #1: 00:00-00:03 The person is initiall…
Figure 16: Case Studies and Failure Analysis. #1 failure example
Figure 17: Case Studies and Failure Analysis. #2 failure example. Question: If the person in the black puffy jacket had not looked ahead, what would have been the most direct physical consequence? InternVL-3.5. Evidence #1: 00:00-00:05 The person in the black puffy jacket is standing in the path of the silver car. Evidence #2: 00:05-00:10 The red and white barrier arm is descending towards the person. Answer: "D": "T…
Figure 18: Case Studies and Failure Analysis. #3 failure example
Original abstract

Cause-and-effect reasoning in video is a significant challenge for Vision-Language Models (VLMs), as it requires going beyond surface-level perception to a deeper understanding of causal mechanisms. However, existing benchmarks rarely provide the fine-grained, grounded evidence needed to rigorously evaluate this capability. To address this gap, we introduce CaST-Bench, a benchmark for Causal Chain-Grounded Spatio-Temporal Video Reasoning. CaST-Bench presents complex causal questions that require models to identify and localize a chain of multiple spatio-temporal evidences. Through a human-AI collaborative pipeline, we construct a high-quality dataset of 2,066 questions over 1,015 videos, with causal chains annotated by temporal segments and bounding-box tracks. Furthermore, we design a comprehensive evaluation suite with novel metrics that assess not only answer correctness but also the capability for visual evidence grounded reasoning. This grounding is crucial for improving accuracy by mitigating spurious correlations and for enhancing user trust by making models more transparent. Our experiments show that current VLMs struggle with causal questions, largely due to their limited ability to construct precise and grounded causal chains. This highlights an important direction for improving future VLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces CaST-Bench, a new benchmark with 2,066 causal questions over 1,015 videos. Questions require identifying and localizing multi-step causal chains via temporal segments and bounding-box tracks, constructed through a human-AI collaborative pipeline. Novel metrics evaluate both answer correctness and visual-evidence grounding. Experiments conclude that current VLMs struggle with causal questions primarily because of limited ability to construct precise and grounded causal chains.

Significance. If the annotations prove valid, the benchmark and metrics would usefully isolate causal-chain reasoning from surface perception and spurious correlation, providing a concrete testbed for improving VLM transparency and robustness in video. The fine-grained spatio-temporal grounding annotations are a concrete contribution that could support future work on verifiable reasoning.

major comments (3)
  1. §3 (Dataset Construction): The human-AI pipeline is presented as producing high-quality causal-chain annotations, yet no quantitative validation (inter-annotator agreement, localization error rates, or independent verification that chains are minimal and causally sufficient) is reported. This directly undermines the abstract's central attribution of VLM failures to 'limited ability to construct precise and grounded causal chains,' as benchmark noise cannot be ruled out.
  2. §5 (Experiments): The claim that failures stem specifically from causal-chain construction requires evidence that the new grounding metrics isolate this factor from general video comprehension or question difficulty; without ablations or correlation analysis between grounding scores and chain accuracy, the attribution remains unsupported.
  3. Evaluation Metrics section: The novel metrics are introduced to assess 'visual evidence grounded reasoning,' but the manuscript does not define how they penalize or reward partial chain coverage versus full causal sufficiency, leaving open whether low scores reflect reasoning deficits or metric design choices.
minor comments (2)
  1. Abstract and §1: The phrase 'high-quality dataset' is used without supporting statistics; move any available agreement or quality numbers from the appendix into the main text.
  2. Table 1 or equivalent: Clarify whether the 2,066 questions are unique or include multiple questions per video, and report the distribution of chain lengths.
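
Part of the last minor comment can be checked with simple arithmetic on the counts reported in the Figure 5 statistics. The short Python sketch below does that; the variable names are ours, and it assumes every question is attached to exactly one video, which this page does not state explicitly.

```python
import math

# Counts and averages as reported in the Figure 5 benchmark statistics.
n_videos = 1015
n_questions = 2066
n_options = 6
avg_video_s = 13.68
avg_evidence_s = 5.65

# Mean number of questions per video (about 2.04).
questions_per_video = n_questions / n_videos

# Pigeonhole bound: if each question belongs to exactly one video (assumed),
# at least one video must carry this many questions.
min_max_questions = math.ceil(n_questions / n_videos)

# Accuracy of uniform random guessing over the answer options (about 0.167).
chance_accuracy = 1 / n_options

# Ratio of average evidence duration to average video duration (about 0.413).
# This is a ratio of averages, not the average per-question ratio.
evidence_fraction = avg_evidence_s / avg_video_s

print(f"questions per video: {questions_per_video:.2f}")
print(f"some video has at least {min_max_questions} questions")
print(f"chance accuracy: {chance_accuracy:.3f}")
print(f"evidence / video duration: {evidence_fraction:.3f}")
```

Under that one-video-per-question assumption, the counts alone imply that questions are shared across videos, so results should arguably be read with per-video clustering in mind. The distribution of chain lengths, by contrast, cannot be recovered from the reported average of 2.36 evidences per QA.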

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback. We address each major comment below. We agree that additional quantitative details and clarifications will strengthen the manuscript and will incorporate them in the revision.

  1. Referee: §3 (Dataset Construction): The human-AI pipeline is presented as producing high-quality causal-chain annotations, yet no quantitative validation (inter-annotator agreement, localization error rates, or independent verification that chains are minimal and causally sufficient) is reported. This directly undermines the abstract's central attribution of VLM failures to 'limited ability to construct precise and grounded causal chains,' as benchmark noise cannot be ruled out.

    Authors: We acknowledge the absence of reported quantitative validation metrics. The construction involved iterative human review, but agreement and error rates were not quantified in the text. In revision we will add inter-annotator agreement on a sampled subset, localization error statistics, and verification that annotated chains are minimal and causally sufficient. These additions will support the abstract's attribution.
    Revision: yes

  2. Referee: §5 (Experiments): The claim that failures stem specifically from causal-chain construction requires evidence that the new grounding metrics isolate this factor from general video comprehension or question difficulty; without ablations or correlation analysis between grounding scores and chain accuracy, the attribution remains unsupported.

    Authors: The current experiments demonstrate low VLM performance on causal questions using the grounding metrics. To strengthen the isolation claim we will add ablations contrasting causal versus non-causal questions and report correlations between grounding scores and answer accuracy. These analyses will be included in the revised experiments section.
    Revision: yes

  3. Referee: Evaluation Metrics section: The novel metrics are introduced to assess 'visual evidence grounded reasoning,' but the manuscript does not define how they penalize or reward partial chain coverage versus full causal sufficiency, leaving open whether low scores reflect reasoning deficits or metric design choices.

    Authors: We will expand the Evaluation Metrics section to explicitly specify the scoring rules for partial chain coverage, including the penalty structure relative to full causal sufficiency. This clarification will demonstrate that low scores primarily reflect reasoning limitations rather than metric artifacts.
    Revision: yes
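
The correlation analysis promised in the second response could look roughly like the following Python sketch. It computes a plain Pearson coefficient between a per-question grounding score and a 0/1 correctness flag (equivalent to a point-biserial correlation). The function name and the toy numbers are ours; the rebuttal does not say which statistic the authors intend to report.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Invented per-question values, for illustration only.
grounding_scores = [0.1, 0.2, 0.8, 0.9]   # e.g. chain coverage in [0, 1]
answer_correct = [0, 0, 1, 1]             # 1 = correct, 0 = wrong

r = pearson(grounding_scores, answer_correct)
print(f"r = {r:.3f}")
```

A strong positive coefficient would be consistent with the paper's attribution but would not establish it on its own, since question difficulty could drive both quantities; that is why the referee also asks for ablations.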

Circularity Check

0 steps flagged

No circularity: benchmark and metrics constructed independently from new annotations and evaluations

The paper introduces a new dataset (2,066 questions over 1,015 videos) and novel evaluation metrics via a human-AI pipeline and empirical testing on VLMs. No derivation reduces to fitted inputs, self-definitions, or self-citation chains; the central claims rest on the new annotations and observed model performance rather than any equation or prior result that is redefined or refit within the work. This is the standard case of a benchmark paper whose results are externally falsifiable against the released data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution rests on the assumption that the human-AI pipeline yields accurate causal chain annotations and that the new metrics validly measure grounded causal reasoning.

axioms (1)
  • Domain assumption: Human-AI collaborative annotation can produce high-quality causal chain labels using temporal segments and bounding-box tracks.
    Invoked to justify dataset construction quality in the abstract.

pith-pipeline@v0.9.0 · 5770 in / 1103 out tokens · 26109 ms · 2026-05-25T04:52:47.911875+00:00 · methodology

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · 15 internal anchors

  1. [1]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhao- hai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Jun- yang Lin. Qwen2.5-vl technical repor...

  2. [2]

    Cg-bench: Clue-grounded question answering benchmark for long video understanding

    Guo Chen, Yicheng Liu, Yifei Huang, Yuping He, Baoqi Pei, Jilan Xu, Yali Wang, Tong Lu, and Limin Wang. Cg- bench: Clue-grounded question answering benchmark for long video understanding.arXiv preprint arXiv:2412.12075,

  3. [3]

    Mecd: Unlocking multi-event causal discovery in video reasoning.Advances in Neural Informa- tion Processing Systems, 37:92554–92580, 2024

    Tieyuan Chen, Huabin Liu, Tianyao He, Yihang Chen, Chao- fan Gan, Xiao Ma, Cheng Zhong, Yang Zhang, Yingxue Wang, Hui Lin, et al. Mecd: Unlocking multi-event causal discovery in video reasoning.Advances in Neural Informa- tion Processing Systems, 37:92554–92580, 2024. 3

  4. [4]

    Cross-modal causal rela- tion alignment for video question grounding

    Weixing Chen, Yang Liu, Binglin Chen, Jiandong Su, Yongsen Zheng, and Liang Lin. Cross-modal causal rela- tion alignment for video question grounding. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24087–24096, 2025. 3

  5. [5]

    Exploring the effect of reinforcement learning on video understanding: Insights from seed-bench- r1.arXiv preprint arXiv:2503.24376, 2025

    Yi Chen, Yuying Ge, Rui Wang, Yixiao Ge, Lu Qiu, Ying Shan, and Xihui Liu. Exploring the effect of reinforcement learning on video understanding: Insights from seed-bench- r1.arXiv preprint arXiv:2503.24376, 2025. 2

  6. [6]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhang- wei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test- time scaling.arXiv preprint arXiv:2412.05271, 2024. 2, 6, 13

  7. [7]

    Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

    Junhao Cheng, Yuying Ge, Teng Wang, Yixiao Ge, Jing Liao, and Ying Shan. Video-holmes: Can mllm think like holmes for complex video reasoning?arXiv preprint arXiv:2505.21374, 2025. 3

  8. [8]

    V-star: Benchmarking video-llms on video spatio-temporal reasoning, 2025

    Zixu Cheng, Jian Hu, Ziquan Liu, Chenyang Si, Wei Li, and Shaogang Gong. V-star: Benchmarking video-llms on video spatio-temporal reasoning, 2025. 2, 3, 5

  9. [9]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025. 2, 3, 6, 13

  10. [10]

    Video-R1: Reinforcing Video Reasoning in MLLMs

    Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776,

  11. [11]

    Causalvqa: A physically grounded causal reasoning benchmark for video models

    Aaron Foss, Chloe Evans, Sasha Mitts, Koustuv Sinha, Am- mar Rizvi, and Justine T Kao. Causalvqa: A physically grounded causal reasoning benchmark for video models. arXiv preprint arXiv:2506.09943, 2025. 3

  12. [12]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24108–24118, 2025. 2, 3

  13. [13]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 2

  14. [14]

    Egoexobench: A benchmark for first-and third-person view video understanding in mllms

    Yuping He, Yifei Huang, Guo Chen, Baoqi Pei, Jilan Xu, Tong Lu, and Jiangmiao Pang. Egoexobench: A benchmark for first-and third-person view video understanding in mllms. arXiv preprint arXiv:2507.18342, 2025. 3

  15. [15]

    Causal inference,

    Miguel A Hern ´an and James M Robins. Causal inference,

  16. [16]

    Glm-4.1 v-thinking: Towards versatile multi- modal reasoning with scalable reinforcement learning.arXiv e-prints, pages arXiv–2507, 2025

    Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guob- ing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Li- hang Pan, et al. Glm-4.1 v-thinking: Towards versatile multi- modal reasoning with scalable reinforcement learning.arXiv e-prints, pages arXiv–2507, 2025. 6

  17. [17]

    Qwen3-vl: Large multimodal language mod- els by alibaba cloud.https://huggingface.co/ collections/Qwen/qwen3- vl, 2025

    Tongyi Lab. Qwen3-vl: Large multimodal language mod- els by alibaba cloud.https://huggingface.co/ collections/Qwen/qwen3- vl, 2025. Model avail- able at Hugging Face. Accessed: 2025-11-12. 2, 6

  18. [18]

    Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

    Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, and Hengshuang Zhao. Mini-o3: Scaling up reasoning pat- terns and interaction turns for visual search.arXiv preprint arXiv:2509.07969, 2025. 2

  19. [19]

    Tvqa+: Spatio-temporal grounding for video question answering

    Jie Lei, Licheng Yu, Tamara Berg, and Mohit Bansal. Tvqa+: Spatio-temporal grounding for video question answering. In Proceedings of the 58th annual meeting of the association for computational linguistics, pages 8211–8225, 2020. 2, 3

  20. [20]

    From representa- tion to reasoning: Towards both evidence and commonsense reasoning for video question-answering

    Jiangtong Li, Li Niu, and Liqing Zhang. From representa- tion to reasoning: Towards both evidence and commonsense reasoning for video question-answering. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21273–21282, 2022. 3

  21. [21]

    VideoChat: Chat-Centric Video Understanding

    KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding.arXiv preprint arXiv:2305.06355, 2023. 2

  22. [22]

    Mvbench: A comprehensive multi-modal video understand- ing benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understand- ing benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195– 22206, 2024. 2, 3

  23. [23]

    VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

    Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning.arXiv preprint arXiv:2504.06958, 2025. 2

  24. [24]

    Llama-vid: An image is worth 2 tokens in large language models

    Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. In European Conference on Computer Vision, pages 323–340. Springer, 2024. 2

  25. [25]

    Video-chatgpt: Towards detailed video un- derstanding via large vision and language models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video un- derstanding via large vision and language models. InPro- ceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12585–12602, 2024. 2

  26. [26]

    Open-o3 video: Grounded video reasoning with explicit spatio-temporal evidence.arXiv preprint arXiv:2510.20579, 2025

    Jiahao Meng, Xiangtai Li, Haochen Wang, Yue Tan, Tao Zhang, Lingdong Kong, Yunhai Tong, Anran Wang, Zhiyang Teng, Yujing Wang, et al. Open-o3 video: Grounded video reasoning with explicit spatio-temporal evidence.arXiv preprint arXiv:2510.20579, 2025. 3

  27. [27]

    Gpt-5.https://openai.com, 2025

    OpenAI. Gpt-5.https://openai.com, 2025. Large language model. 6

  28. [28]

    Causal inference in statistics: An overview

    Judea Pearl. Causal inference in statistics: An overview

  29. [29]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junt- ing Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao- Yuan Wu, Ross Girshick, Piotr Doll´ar, and Christoph Feicht- enhofer. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:...

  30. [30]

    En- hancing video-llm reasoning via agent-of-thoughts distilla- tion.arXiv preprint arXiv:2412.01694, 2024

    Yudi Shi, Shangzhe Di, Qirui Chen, and Weidi Xie. En- hancing video-llm reasoning via agent-of-thoughts distilla- tion.arXiv preprint arXiv:2412.01694, 2024. 3

  31. [31]

    Moviechat: From dense token to sparse memory for long video understanding

    Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18221–18232, 2024. 2

  32. [32]

    Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning

    Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel reasoner: Incentivizing pixel-space rea- soning with curiosity-driven reinforcement learning.arXiv preprint arXiv:2505.15966, 2025. 3

  33. [33]

    Traceable evidence enhanced visual grounded reasoning: Evaluation and methodology.arXiv preprint arXiv:2507.07999, 2025

    Haochen Wang, Xiangtai Li, Zilong Huang, Anran Wang, Jiacong Wang, Tao Zhang, Jiani Zheng, Sule Bai, Zijian Kang, Jiashi Feng, et al. Traceable evidence enhanced visual grounded reasoning: Evaluation and methodology.arXiv preprint arXiv:2507.07999, 2025. 2

  34. [34]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Sheng- long Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025. 6, 13

  35. [35]

    Videocot: A video chain-of- thought dataset with active annotation tool.arXiv preprint arXiv:2407.05355, 2024

    Yan Wang, Yawen Zeng, Jingsheng Zheng, Xiaofen Xing, Jin Xu, and Xiangmin Xu. Videocot: A video chain-of- thought dataset with active annotation tool.arXiv preprint arXiv:2407.05355, 2024. 3

  36. [36]

    Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding

    Ye Wang, Ziheng Wang, Boshen Xu, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, et al. Time-r1: Post-training large vision lan- guage model for temporal video grounding.arXiv preprint arXiv:2503.13377, 2025. 2

  37. [37]

    Next-qa: Next phase of question-answering to explaining temporal actions

    Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 9777–9786, 2021. 3

  38. [38]

    Can i trust your answer? visually grounded video question answering

    Junbin Xiao, Angela Yao, Yicong Li, and Tat-Seng Chua. Can i trust your answer? visually grounded video question answering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13204– 13214, 2024. 2, 3

  39. [39]

    Mimo-vl technical report, 2025

    LLM-Core-Team Xiaomi. Mimo-vl technical report, 2025. 6

  40. [40]

    Video question answer- ing via gradually refined attention over appearance and mo- tion

    Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. Video question answer- ing via gradually refined attention over appearance and mo- tion. InProceedings of the 25th ACM international confer- ence on Multimedia, pages 1645–1653, 2017. 3

  41. [41]

    Visual planning: Let’s think only with images.arXiv preprint arXiv:2505.11409, 2025

    Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen, and Ivan Vuli´c. Visual planning: Let’s think only with images.arXiv preprint arXiv:2505.11409, 2025. 2

  42. [42]

    Vrbench: A benchmark for multi-step reasoning in long nar- rative videos.arXiv preprint arXiv:2506.10857, 2025

    Jiashuo Yu, Yue Wu, Meng Chu, Zhifei Ren, Zizheng Huang, Pei Chu, Ruijie Zhang, Yinan He, Qirui Li, Songze Li, et al. Vrbench: A benchmark for multi-step reasoning in long nar- rative videos.arXiv preprint arXiv:2506.10857, 2025. 3

  43. [43]

    Discovering the real association: Multimodal causal rea- soning in video question answering

    Chuanqi Zang, Hanqing Wang, Mingtao Pei, and Wei Liang. Discovering the real association: Multimodal causal rea- soning in video question answering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19027–19036, 2023. 3

  44. [44]

    Thinking with videos: Multimodal tool-augmented reinforcement learning for long video reasoning.arXiv preprint arXiv:2508.04416, 2025

    Haoji Zhang, Xin Gu, Jiawen Li, Chixiang Ma, Sule Bai, Chubin Zhang, Bowen Zhang, Zhichao Zhou, Dongliang He, and Yansong Tang. Thinking with videos: Multimodal tool- augmented reinforcement learning for long video reasoning. arXiv preprint arXiv:2508.04416, 2025. 3

  45. [45]

    Video-cot: A comprehensive dataset for spatiotemporal understanding of videos based on chain-of- thought

    Shuyi Zhang, Xiaoshuai Hao, Yingbo Tang, Lingfeng Zhang, Pengwei Wang, Zhongyuan Wang, Hongxuan Ma, and Shanghang Zhang. Video-cot: A comprehensive dataset for spatiotemporal understanding of videos based on chain-of- thought. InProceedings of the 33rd ACM International Con- ference on Multimedia, pages 12745–12752, 2025. 3

  46. [46]

    Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs

    Xintong Zhang, Zhi Gao, Bofei Zhang, Pengxiang Li, Xi- aowen Zhang, Yang Liu, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, et al. Chain-of-focus: Adaptive visual search and zooming for multimodal reasoning via rl.arXiv preprint arXiv:2505.15436, 2025. 2

  47. [47]

    Tinyllava-video-r1: Towards smaller lmms for video reasoning

    Xingjian Zhang, Siwei Wen, Wenjun Wu, and Lei Huang. Tinyllava-video-r1: Towards smaller lmms for video reasoning. arXiv preprint arXiv:2504.09641, 2025.

  48. [48]

    Llava-next: A strong zero-shot video understanding model

    Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Llava-next: A strong zero-shot video understanding model, 2024.

  49. [49]

    Where does it exist: Spatio-temporal video grounding for multi-form sentences

    Zhu Zhang, Zhou Zhao, Yang Zhao, Qi Wang, Huasheng Liu, and Lianli Gao. Where does it exist: Spatio-temporal video grounding for multi-form sentences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10668–10677, 2020.

  50. [50]

    Llamafactory: Unified efficient fine-tuning of 100+ language models

    Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. Llamafactory: Unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand, 2024. Association for Computational Lin...

  51. [51]

    DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

    Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing "thinking with images" via reinforcement learning. arXiv preprint arXiv:2505.14362, 2025.

    CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering (Supplementary Material)

  52. [52]

    9: CaST-Bench Samples from Each Category Sec

    Appendix Overview. The organization of the appendix is as follows:
    • Sec. 9: CaST-Bench Samples from Each Category
    • Sec. 10: Details of Data Annotation Pipeline
    • Sec. 11: Details of Experiment Setup
    • Sec. 13: Case Studies and Failure Analysis
    • Sec. 14: Social Impact, License, and Access

  53. [53]

    • Causal Explanation: Questions that explain the reasons (why) or mechanisms (how) behind actions or events

    CaST-Bench Samples from Each Category. Due to page limitations in the main manuscript, we show further examples of every question type below.
    • Causal Explanation: Questions that explain the reasons (why) or mechanisms (how) behind actions or events.
      – Why questions (reasons), shown in Fig. 9
      – How questions (mechanisms), shown in Fig. 10
    • Counterfactua...

  54. [54]

    Video Selection. Our benchmark targets causal reasoning in realistic, cluttered scenes where multiple actors interact over time

    Details of Data Annotation Pipeline. 10.1. Video Selection. Our benchmark targets causal reasoning in realistic, cluttered scenes where multiple actors interact over time. As discussed in Sec. 4, carefully curating the raw video pool is essential: studio footage or single-actor clips often lack the competing causal cues and spatial ambiguity needed to str...

  55. [55]

    2) A version of the original image where the background is blurred to isolate the target instance

    1) The original image where the target instance is marked with a green outline (the outline is an overlay, not part of the object). 2) A version of the original image where the background is blurred to isolate the target instance. Objective: Write exactly one English description that refers only to the target instance and its scene/context in the original im...

  56. [56]

    **Text Description**: A sentence identifying the target object and its surrounding scene and context

  57. [57]

    [start–end]: action

    **Video Clip**: A silent video focused on the target object. The object is highlighted with a green border for tracking purposes. **Task**: Generate a time-stamped log detailing the specific dynamics of the specified object shown in the video. **Rules**: - Source of Truth: The video clip is the source of truth. The text input is for context only. If a fra...

  58. [58]

    Evaluation Prompt. All evaluated VLMs shared a single unified prompt for video QA

    Details of Experiment Setup. 11.1. Evaluation Prompt. All evaluated VLMs shared a single unified prompt for video QA. For the multiple-choice setting, the exact prompt is provided in Prompt 7. 11.2. VLM Hyperparameter Configuration. We configure all VLMs with max_new_tokens set to 2048, limiting each sample to at most 2,048 generated tokens (excluding the in...

  59. [59]

    instances

    Evaluation Suite. 12.1. Grounded Causal Chain Evaluation. Evaluating the correctness of a predicted causal chain is fundamentally harder than checking a single grounding target. A model must recover every actor that participates in the causal process, align their evidence across different time ranges, and ensure the supporting boxes stay faithful to the ...

  60. [60]

    A causal question about a video event {question}

  61. [61]

    The ground-truth conclusion answer {gt_answer}

  62. [62]

    The ground-truth causal reasoning process {gt_reasoning}

  63. [63]

    A test-taker model’s generated conclusion answer {pred_answer}

  64. [64]

    Assign a separate score (0–10) for each dimension according to the standards below

    A test-taker model’s generated causal reasoning process {pred_reasoning} Task: Evaluate the test-taker model’s reasoning across four dimensions. Assign a separate score (0–10) for each dimension according to the standards below. Evaluation Dimensions:

  65. [65]

    * Judge semantic equivalence, polarity, entity/attribute correctness, and numeric/unit consistency; penalize contradictions, material vagueness, or hedging that alters commitment

    Answer Conclusion Correctness * Compare only the model’s conclusion answer to the ground-truth conclusion answer; ignore any reasoning content. * Judge semantic equivalence, polarity, entity/attribute correctness, and numeric/unit consistency; penalize contradictions, material vagueness, or hedging that alters commitment

  66. [66]

    * Verify that causes precede effects and that the causal sequence is minimal yet sufficient to explain the answer

    Causal Chain Logical Consistency * Evaluate only the generated answer and reasoning; do not reference ground-truth answer or ground-truth reasoning. * Verify that causes precede effects and that the causal sequence is minimal yet sufficient to explain the answer. * Identify any logical leaps, post-hoc reasoning, or teleological claims lacking justificatio...

  67. [67]

    Check that the generated causal reasoning includes and aligns with its key entities, events, moments, and causal steps

    Evidence Coverage & Completeness * Use the ground-truth causal reasoning as the reference. Check that the generated causal reasoning includes and aligns with its key entities, events, moments, and causal steps. * Evaluate recall of essential causal steps and contextual conditions relative to the ground truth. * Penalize missing core components, contradict...

  68. [68]

    answer_conclusion_correctness

    Evidence–Conclusion Overall Justification * Consider the generated answer and the generated reasoning together: does the provided reasoning justify the stated answer, and do both align with the ground truth overall? * Assess the logical coherence from evidence to conclusion, calibration of confidence, and global plausibility. Scoring Standards for each di...

  69. [69]

    How do customers know where to line up?

    Case Studies and Failure Analysis. Beyond the quantitative ablation studies and error analysis presented in the main paper, we present here a qualitative analysis of the case studies and failure patterns exhibited by the evaluated models. Vulnerability to Spurious Visual Confounders. A core design principle of CaST-Bench is the inclusion of distractors tha...

  70. [70]

    A": "Because the person in the long dark coat told the child to stop moving

    Social Impact, License, and Access. 14.1. Broader Impact. CaST-Bench advances the field of VLMs by shifting the focus from surface-level perception to deep, grounded causal reasoning, a capability essential for sophisticated video analysis and anticipation tasks. By mandating that models validate their answers with explicit spatio-temporal evidence, t...
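
The judge rubric excerpted in items 60 to 68 takes five templated inputs and asks for a separate 0–10 score on each of four dimensions. The following is a minimal sketch of how such a judge call could be prepared and its reply checked, not the authors' implementation. The five placeholder names and the key `answer_conclusion_correctness` appear in the excerpt; the other three score keys, the JSON reply format, and the function names `build_judge_inputs` and `parse_scores` are assumptions made for illustration.

```python
import json

# Placeholder names taken from the prompt excerpt.
PROMPT_FIELDS = ("question", "gt_answer", "gt_reasoning",
                 "pred_answer", "pred_reasoning")

# Only the first key is visible in the excerpt; the rest are hypothetical
# names derived from the dimension titles.
SCORE_KEYS = (
    "answer_conclusion_correctness",
    "causal_chain_logical_consistency",           # assumed key
    "evidence_coverage_completeness",             # assumed key
    "evidence_conclusion_overall_justification",  # assumed key
)


def build_judge_inputs(sample: dict) -> str:
    """Render the five judge inputs, failing loudly if one is missing."""
    missing = [name for name in PROMPT_FIELDS if name not in sample]
    if missing:
        raise KeyError(f"missing prompt fields: {missing}")
    return "\n".join(f"{name}: {sample[name]}" for name in PROMPT_FIELDS)


def parse_scores(judge_reply: str) -> dict:
    """Parse a JSON reply (assumed format) and check each score is in 0..10."""
    raw = json.loads(judge_reply)
    scores = {}
    for key in SCORE_KEYS:
        value = raw[key]  # raises KeyError if a dimension is absent
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 0 <= value <= 10:
            raise ValueError(f"{key} must be a number in [0, 10], got {value!r}")
        scores[key] = value
    return scores
```

Rejecting out-of-range or missing scores up front keeps a malformed judge reply from silently skewing per-dimension averages.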