CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering

Ashutosh Kumar; Caixin Kang; Hsuan-Kung Yang; Jingjing Pan; Mingfang Zhang; Mustafa Erdogan; Quan Kong; Rajat Saini; Yifei Huang; Yoichi Sato

arxiv: 2605.23216 · v1 · pith:OVF7F5YVnew · submitted 2026-05-22 · 💻 cs.CV


Pith reviewed 2026-05-25 04:52 UTC · model grok-4.3

classification 💻 cs.CV

keywords causal reasoning · video question answering · vision-language models · spatio-temporal reasoning · benchmark · causal chains · grounded reasoning · video understanding


The pith

Vision-language models struggle to construct precise causal chains when answering questions about video events.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates CaST-Bench to test whether vision-language models can perform cause-and-effect reasoning in videos by locating chains of multiple spatio-temporal pieces of evidence. It builds a dataset of 2,066 questions across 1,015 videos, each paired with annotations that mark the exact temporal segments and bounding-box tracks forming the causal chain. The authors argue that current models perform poorly on causal questions primarily because they cannot build these grounded chains, and they supply new metrics that score both final answers and the quality of the visual evidence used. This setup matters because accurate causal reasoning would cut down on answers based on misleading surface patterns and would let users see the actual steps a model followed.

Core claim

CaST-Bench supplies 2,066 complex causal questions over 1,015 videos. Each question requires a model to identify and localize a chain of multiple pieces of spatio-temporal evidence. The chains are annotated with temporal segments and bounding-box tracks through a human-AI pipeline, and the benchmark includes novel metrics that separately measure answer correctness and the degree of visual-evidence grounding.

What carries the argument

Causal chain annotations that mark the specific temporal segments and bounding-box tracks linking cause to effect in each video.
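
To make the annotation format concrete, here is a minimal sketch of how one question and its causal chain might be represented. The class and field names (`Evidence`, `CausalChainQA`, `start_s`, `end_s`, `track`) are hypothetical and are not the benchmark's released schema; only the ingredients (cause and effect evidence, temporal segments, bounding-box tracks, six answer options per question) come from the description on this page. The sample question is made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical schema for illustration; the released CaST-Bench format may differ.
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class Evidence:
    role: str        # "cause" or "effect"
    start_s: float   # temporal segment start, in seconds
    end_s: float     # temporal segment end, in seconds
    track: Dict[int, Box] = field(default_factory=dict)  # frame index -> box


@dataclass
class CausalChainQA:
    question: str
    options: List[str]
    answer_index: int
    chain: List[Evidence]

    def validate(self) -> None:
        """Raise ValueError if the record is structurally inconsistent."""
        if not 0 <= self.answer_index < len(self.options):
            raise ValueError("answer_index is out of range")
        if not self.chain:
            raise ValueError("a causal chain needs at least one evidence item")
        for ev in self.chain:
            if ev.role not in ("cause", "effect"):
                raise ValueError(f"unknown role: {ev.role}")
            if ev.end_s <= ev.start_s:
                raise ValueError("segment must have positive duration")
            for x1, y1, x2, y2 in ev.track.values():
                if x2 <= x1 or y2 <= y1:
                    raise ValueError("box must have positive area")


# Made-up record with one cause and one effect.
sample = CausalChainQA(
    question="Why does the person stand up?",
    options=["A", "B", "C", "D", "E", "F"],
    answer_index=2,
    chain=[
        Evidence("cause", 0.0, 3.0, {0: (10, 20, 110, 220)}),
        Evidence("effect", 3.0, 7.5, {90: (15, 18, 120, 230)}),
    ],
)
sample.validate()
```

Keeping the track as a frame-indexed mapping lets sparse annotations (boxes on only some frames) share the same structure as dense ones; that choice is a convenience of this sketch, not something the paper specifies.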

If this is right

Models able to build explicit causal chains will show higher accuracy on causal video questions.
Grounded chain construction will reduce answers driven by spurious correlations.
Explicit evidence chains will make model outputs more transparent to users.
Future vision-language models will need dedicated mechanisms for assembling spatio-temporal causal sequences.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same annotation style could be applied to test causal reasoning in domains such as robotics or medical video analysis.
Metrics that separately score grounding may become standard for judging reliability in multimodal systems.
Training procedures that reward chain construction rather than final-answer matching could be developed using this benchmark format.

Load-bearing premise

The human-AI pipeline produces annotations that accurately identify the true causal chains in the videos without major bias or localization mistakes.

What would settle it

A vision-language model that reaches high answer accuracy on the benchmark questions while failing to localize or reference the annotated causal segments and boxes, or a large-scale human review that finds frequent mismatches between the provided annotations and the actual causal events in the videos.

Figures

Figures reproduced from arXiv: 2605.23216 by Ashutosh Kumar, Caixin Kang, Hsuan-Kung Yang, Jingjing Pan, Mingfang Zhang, Mustafa Erdogan, Quan Kong, Rajat Saini, Yifei Huang, Yoichi Sato.

**Figure 1.** CaST-Bench Overview and Example Data. Each QA in the benchmark is paired with a novel spatio-temporal (ST) causal chain. Unlike previous benchmarks, CaST-Bench requires models to actively search for both cause (Vc) and effect (Ve) evidences in order to construct a causal chain for question answering. To excel on CaST-Bench, a model must produce an answer that is not only correct, but also faithfully ground…

**Figure 3.** Human-AI Collaborative Pipeline for Constructing Causal Chain–Grounded Video QA. Vision-language tools and humans collaborate across detection, description, generation, and filtering stages to produce high-quality {QA, Causal Chain} data.

**Figure 5.** Distribution of Question Types. [PITH_FULL_IMAGE:figures/full_fig_p005_5.png]

Benchmark Statistics

| Statistic | Value |
| --- | --- |
| #Videos | 1015 |
| #Described instances | 10728 |
| #Questions | 2066 |
| #Options per QA | 6 |
| Avg. #evidences per QA | 2.36 |
| Avg. video temporal duration | 13.68s |
| Avg. evidence temporal duration | 5.65s |
| Avg. evidence spatial coverage | 8.3% |
| Avg. #words of questions | 21.03 |
| Avg. #words of options | 10.86 |
| Avg. #words of evidence | 15.50 |

**Figure 6.** Error analysis regarding model vulnerability to different distractor option types on CaST-Bench. Values show how often a model selected the distractor option (i.e., the trap rate) for text-based, video-based, and near-miss distractors. (Labels recovered from the figure: Logical Consistency, Evidence Coverage, Overall Justification, Answer Correctness; Text-based Distractor, Video-based Distractor, Near…)

**Figure 8.** Interface of Human Annotation for Instance Description (Sec. 10.3). The left part displays the video, and on the right there is a text box where the annotator can freely edit or modify the content, based on the original AI-generated description (Sec. 10.2).

**Figure 9.** CaST-Bench Data Sample 1. Question Type: Causal Explanation - Why questions (reasons) [PITH_FULL_IMAGE:figures/full_fig_p025_9.png]

**Figure 10.** CaST-Bench Data Sample 2. Question Type: Causal Explanation - How questions (mechanisms) [PITH_FULL_IMAGE:figures/full_fig_p026_10.png]

**Figure 11.** CaST-Bench Data Sample 3. Question Type: Counterfactual Reasoning - Physical counterfactual [PITH_FULL_IMAGE:figures/full_fig_p027_11.png]

**Figure 12.** CaST-Bench Data Sample 4. Question Type: Counterfactual Reasoning - Physical counterfactual [PITH_FULL_IMAGE:figures/full_fig_p028_12.png]

**Figure 13.** CaST-Bench Data Sample 5. Question Type: Counterfactual Reasoning - Social counterfactual [PITH_FULL_IMAGE:figures/full_fig_p029_13.png]

**Figure 14.** CaST-Bench Data Sample 6. Question Type: Predictive Anticipation - Behavioral anticipation [PITH_FULL_IMAGE:figures/full_fig_p030_14.png]

**Figure 15.** Case Studies and Failure Analysis. Successful example. Question: Based on the actions of the person in the light blue-colored top, what is the most likely immediate next action they will perform after 00:10? Model: Gemini-2.5-Pro. Evidence #3 (00:07-00:10): After standing, the person turns around and pushes the white chair back under the table, tidying up their space. Evidence #1 (00:00-00:03): The person is initiall…

**Figure 16.** Case Studies and Failure Analysis. #1 failure example [PITH_FULL_IMAGE:figures/full_fig_p031_16.png]

**Figure 17.** Case Studies and Failure Analysis. #2 failure example. Question: If the person in the black puffy jacket had not looked ahead, what would have been the most direct physical consequence? Model: InternVL-3.5. Evidence #1 (00:00-00:05): The person in the black puffy jacket is standing in the path of the silver car. Evidence #2 (00:05-00:10): The red and white barrier arm is descending towards the person. Answer: "D": "T…

**Figure 18.** Case Studies and Failure Analysis. #3 failure example [PITH_FULL_IMAGE:figures/full_fig_p032_18.png]
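
The Figure 6 caption defines the trap rate as how often a model selects a distractor of a given type. A minimal sketch of that tally follows. The record layout (`chosen`, `distractor_types`) and the denominator (all questions that offer a distractor of that type) are assumptions, since this page does not give the paper's exact normalization; the records are made up.

```python
from collections import Counter
from typing import Dict, List


def trap_rates(records: List[dict]) -> Dict[str, float]:
    """Fraction of questions where the model picked a distractor of each type.

    Each record is assumed to hold the model's chosen option index and a
    mapping from distractor option index to distractor type.
    """
    picked = Counter()   # times a distractor of this type was chosen
    present = Counter()  # questions that offered this distractor type
    for rec in records:
        types = rec["distractor_types"]
        for t in set(types.values()):
            present[t] += 1
        # None when the chosen option is not a typed distractor (e.g. the correct answer).
        chosen_type = types.get(rec["chosen"])
        if chosen_type is not None:
            picked[chosen_type] += 1
    return {t: picked[t] / present[t] for t in present}


# Made-up model outputs, for illustration only.
records = [
    {"chosen": 1, "distractor_types": {1: "text-based", 2: "video-based"}},
    {"chosen": 0, "distractor_types": {1: "text-based", 3: "near-miss"}},
    {"chosen": 3, "distractor_types": {2: "video-based", 3: "near-miss"}},
    {"chosen": 2, "distractor_types": {1: "text-based", 2: "video-based"}},
]
rates = trap_rates(records)
```

Normalizing by the questions in which a distractor type actually appears avoids penalizing a type that is simply rarer in the option sets.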

Original abstract

Cause-and-effect reasoning in video is a significant challenge for Vision-Language Models (VLMs), as it requires going beyond surface-level perception to a deeper understanding of causal mechanisms. However, existing benchmarks rarely provide the fine-grained, grounded evidence needed to rigorously evaluate this capability. To address this gap, we introduce CaST-Bench, a benchmark for Causal Chain-Grounded Spatio-Temporal Video Reasoning. CaST-Bench presents complex causal questions that require models to identify and localize a chain of multiple spatio-temporal evidences. Through a human-AI collaborative pipeline, we construct a high-quality dataset of 2,066 questions over 1,015 videos, with causal chains annotated by temporal segments and bounding-box tracks. Furthermore, we design a comprehensive evaluation suite with novel metrics that assess not only answer correctness but also the capability for visual evidence grounded reasoning. This grounding is crucial for improving accuracy by mitigating spurious correlations and for enhancing user trust by making models more transparent. Our experiments show that current VLMs struggle with causal questions, largely due to their limited ability to construct precise and grounded causal chains. This highlights an important direction for improving future VLMs.
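
The abstract does not define the grounding metrics, so the following is only a sketch of two standard overlap measures that evaluations of temporal segments and bounding boxes commonly build on: temporal IoU and box IoU. The function names and tuple layouts are assumptions for illustration, and nothing here should be read as the paper's actual scoring rule.

```python
from typing import Tuple

Segment = Tuple[float, float]            # (start_s, end_s)
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# A predicted segment 0-4 s against an annotated segment 2-6 s overlaps for 2 s
# out of a 6 s union.
t = temporal_iou((0.0, 4.0), (2.0, 6.0))
# Two 2x2 boxes offset by one pixel share a 1x1 region out of a union of 7.
s = box_iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
```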

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CaST-Bench adds a new video QA dataset with causal chain annotations and grounded metrics, but the human-AI pipeline lacks validation to support the main claim about VLM reasoning deficits.

The key takeaway: CaST-Bench contributes a new dataset and metrics for evaluating causal chain-grounded spatio-temporal reasoning in video, but its claims about why VLMs fail rest on annotations whose quality is not clearly verified.

The paper does something useful by building 2,066 questions over 1,015 videos in which causal chains are marked with temporal segments and bounding-box tracks. A human-AI pipeline creates these annotations, and the authors add metrics that check for grounded reasoning beyond answer correctness. This addresses a gap in existing VQA benchmarks, which do not focus on multi-step causal evidence. Experiments indicate that current VLMs struggle here, which fits broader observations about their limitations in deeper video understanding.

The soft spot is annotation validity. The stress-test concern holds because there is no mention of inter-annotator agreement or of independent checks on whether the chains are accurate and minimal. Without that, benchmark noise may affect the results more than the models' actual causal reasoning ability.

This paper is for researchers in computer vision and multimodal AI who work on video question answering and want better ways to test causal capabilities; readers looking for new evaluation tools in this area will find it relevant. It should go to peer review because the core idea targets a recognized need, though the methods section will need close scrutiny of the data creation process.

Referee Report

3 major / 2 minor

Summary. The paper introduces CaST-Bench, a new benchmark with 2,066 causal questions over 1,015 videos. Questions require identifying and localizing multi-step causal chains via temporal segments and bounding-box tracks, constructed through a human-AI collaborative pipeline. Novel metrics evaluate both answer correctness and visual-evidence grounding. Experiments conclude that current VLMs struggle with causal questions primarily because of limited ability to construct precise and grounded causal chains.

Significance. If the annotations prove valid, the benchmark and metrics would usefully isolate causal-chain reasoning from surface perception and spurious correlation, providing a concrete testbed for improving VLM transparency and robustness in video. The fine-grained spatio-temporal grounding annotations are a concrete contribution that could support future work on verifiable reasoning.

major comments (3)

[§3] Dataset Construction: The human-AI pipeline is presented as producing high-quality causal-chain annotations, yet no quantitative validation (inter-annotator agreement, localization error rates, or independent verification that chains are minimal and causally sufficient) is reported. This directly undermines the abstract's central attribution of VLM failures to 'limited ability to construct precise and grounded causal chains,' as benchmark noise cannot be ruled out.
[§5] Experiments: The claim that failures stem specifically from causal-chain construction requires evidence that the new grounding metrics isolate this factor from general video comprehension or question difficulty; without ablations or correlation analysis between grounding scores and chain accuracy, the attribution remains unsupported.
[Evaluation Metrics] The novel metrics are introduced to assess 'visual evidence grounded reasoning,' but the manuscript does not define how they penalize or reward partial chain coverage versus full causal sufficiency, leaving open whether low scores reflect reasoning deficits or metric design choices.
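
The correlation analysis requested in the §5 comment could be sketched as follows: a Pearson correlation between a per-question grounding score and binary answer correctness (with a binary variable, this is the point-biserial correlation). The per-question values are made up, and the choice of Pearson correlation is an assumption, since neither the referee nor the authors name a specific statistic.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(vx * vy)


# Made-up per-question values, for illustration only.
grounding = [0.9, 0.8, 0.2, 0.1, 0.7, 0.3]  # hypothetical grounding scores in [0, 1]
correct = [1, 1, 0, 0, 1, 0]                # 1 if the final answer was right
r = pearson(grounding, correct)
```

A high correlation alone would not isolate causal-chain construction from general question difficulty, which is why the referee also asks for ablations.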

minor comments (2)

[Abstract, §1] The phrase 'high-quality dataset' is used without supporting statistics; move any available agreement or quality numbers from the appendix into the main text.
[Table 1 or equivalent] Clarify whether the 2,066 questions are unique or include multiple questions per video, and report the distribution of chain lengths.
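
The statistics requested in the second minor comment are simple to compute once the annotations are loaded. For context, 2,066 questions over 1,015 videos works out to about 2.04 questions per video on average, and the Figure 5 statistics report an average of 2.36 evidences per question. The sketch below shows one way to produce the two distributions; the record keys (`video_id`, `chain`) are hypothetical and the toy records are made up.

```python
from collections import Counter
from typing import Dict, List, Tuple


def summarize(questions: List[dict]) -> Tuple[Dict[int, int], Dict[int, int], float]:
    """Return (chain-length histogram, questions-per-video histogram, mean chain length)."""
    if not questions:
        raise ValueError("no questions to summarize")
    # chain length -> number of questions with that many evidence items
    chain_lengths = Counter(len(q["chain"]) for q in questions)
    # video id -> number of questions asked about that video
    per_video = Counter(q["video_id"] for q in questions)
    # questions per video -> number of videos with that count
    questions_per_video = Counter(per_video.values())
    mean_len = sum(len(q["chain"]) for q in questions) / len(questions)
    return dict(chain_lengths), dict(questions_per_video), mean_len


# Toy records, for illustration only.
questions = [
    {"video_id": "v1", "chain": ["e1", "e2"]},
    {"video_id": "v1", "chain": ["e1", "e2", "e3"]},
    {"video_id": "v2", "chain": ["e1", "e2"]},
]
length_hist, qpv_hist, mean_len = summarize(questions)
```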

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback. We address each major comment below. We agree that additional quantitative details and clarifications will strengthen the manuscript and will incorporate them in the revision.

Referee: [§3] Dataset Construction: The human-AI pipeline is presented as producing high-quality causal-chain annotations, yet no quantitative validation (inter-annotator agreement, localization error rates, or independent verification that chains are minimal and causally sufficient) is reported. This directly undermines the abstract's central attribution of VLM failures to 'limited ability to construct precise and grounded causal chains,' as benchmark noise cannot be ruled out.

Authors: We acknowledge the absence of reported quantitative validation metrics. The construction involved iterative human review, but agreement and error rates were not quantified in the text. In revision we will add inter-annotator agreement on a sampled subset, localization error statistics, and verification that annotated chains are minimal and causally sufficient. These additions will support the abstract's attribution.

revision: yes

Referee: [§5] Experiments: The claim that failures stem specifically from causal-chain construction requires evidence that the new grounding metrics isolate this factor from general video comprehension or question difficulty; without ablations or correlation analysis between grounding scores and chain accuracy, the attribution remains unsupported.

Authors: The current experiments demonstrate low VLM performance on causal questions using the grounding metrics. To strengthen the isolation claim we will add ablations contrasting causal versus non-causal questions and report correlations between grounding scores and answer accuracy. These analyses will be included in the revised experiments section.

revision: yes

Referee: [Evaluation Metrics] The novel metrics are introduced to assess 'visual evidence grounded reasoning,' but the manuscript does not define how they penalize or reward partial chain coverage versus full causal sufficiency, leaving open whether low scores reflect reasoning deficits or metric design choices.

Authors: We will expand the Evaluation Metrics section to explicitly specify the scoring rules for partial chain coverage, including the penalty structure relative to full causal sufficiency. This clarification will demonstrate that low scores primarily reflect reasoning limitations rather than metric artifacts.

revision: yes
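
The first response promises inter-annotator agreement on a sampled subset. One common statistic for two annotators assigning categorical labels is Cohen's kappa, sketched below. This is an assumption about how such a check could be run: the rebuttal does not name an agreement statistic, and the yes/no "is this chain causally sufficient" labels are made up for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two annotators' label sequences."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if each annotator labeled independently
    # according to their own label frequencies.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up judgments from two hypothetical reviewers on eight sampled chains.
reviewer_1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes"]
reviewer_2 = ["yes", "no", "no", "yes", "no", "yes", "yes", "yes"]
kappa = cohens_kappa(reviewer_1, reviewer_2)
```

Kappa covers only the categorical part of the validation. Agreement on segment boundaries and box tracks would need an overlap-based measure, which the rebuttal groups under localization error statistics.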

Circularity Check

0 steps flagged

No circularity: benchmark and metrics constructed independently from new annotations and evaluations


The paper introduces a new dataset (2,066 questions over 1,015 videos) and novel evaluation metrics via a human-AI pipeline and empirical testing on VLMs. No derivation reduces to fitted inputs, self-definitions, or self-citation chains; the central claims rest on the new annotations and observed model performance rather than any equation or prior result that is redefined or refit within the work. This is the standard case of a benchmark paper whose results are externally falsifiable against the released data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution rests on the assumption that the human-AI pipeline yields accurate causal chain annotations and that the new metrics validly measure grounded causal reasoning.

axioms (1)

Domain assumption: Human-AI collaborative annotation can produce high-quality causal chain labels using temporal segments and bounding-box tracks.
Invoked to justify dataset construction quality in the abstract.


Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · 15 internal anchors

[1]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhao- hai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Jun- yang Lin. Qwen2.5-vl technical repor...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Cg-bench: Clue-grounded question answering benchmark for long video understanding

Guo Chen, Yicheng Liu, Yifei Huang, Yuping He, Baoqi Pei, Jilan Xu, Yali Wang, Tong Lu, and Limin Wang. Cg- bench: Clue-grounded question answering benchmark for long video understanding.arXiv preprint arXiv:2412.12075,

work page arXiv
[3]

Mecd: Unlocking multi-event causal discovery in video reasoning.Advances in Neural Informa- tion Processing Systems, 37:92554–92580, 2024

Tieyuan Chen, Huabin Liu, Tianyao He, Yihang Chen, Chao- fan Gan, Xiao Ma, Cheng Zhong, Yang Zhang, Yingxue Wang, Hui Lin, et al. Mecd: Unlocking multi-event causal discovery in video reasoning.Advances in Neural Informa- tion Processing Systems, 37:92554–92580, 2024. 3

work page 2024
[4]

Cross-modal causal rela- tion alignment for video question grounding

Weixing Chen, Yang Liu, Binglin Chen, Jiandong Su, Yongsen Zheng, and Liang Lin. Cross-modal causal rela- tion alignment for video question grounding. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24087–24096, 2025. 3

work page 2025
[5]

Exploring the effect of reinforcement learning on video understanding: Insights from seed-bench- r1.arXiv preprint arXiv:2503.24376, 2025

Yi Chen, Yuying Ge, Rui Wang, Yixiao Ge, Lu Qiu, Ying Shan, and Xihui Liu. Exploring the effect of reinforcement learning on video understanding: Insights from seed-bench- r1.arXiv preprint arXiv:2503.24376, 2025. 2

work page arXiv 2025
[6]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhang- wei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test- time scaling.arXiv preprint arXiv:2412.05271, 2024. 2, 6, 13

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

Junhao Cheng, Yuying Ge, Teng Wang, Yixiao Ge, Jing Liao, and Ying Shan. Video-holmes: Can mllm think like holmes for complex video reasoning?arXiv preprint arXiv:2505.21374, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

V-star: Benchmarking video-llms on video spatio-temporal reasoning, 2025

Zixu Cheng, Jian Hu, Ziquan Liu, Chenyang Si, Wei Li, and Shaogang Gong. V-star: Benchmarking video-llms on video spatio-temporal reasoning, 2025. 2, 3, 5

work page 2025
[9]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025. 2, 3, 6, 13

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Video-R1: Reinforcing Video Reasoning in MLLMs

Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Causalvqa: A physically grounded causal reasoning benchmark for video models

Aaron Foss, Chloe Evans, Sasha Mitts, Koustuv Sinha, Am- mar Rizvi, and Justine T Kao. Causalvqa: A physically grounded causal reasoning benchmark for video models. arXiv preprint arXiv:2506.09943, 2025. 3

work page arXiv 2025
[12]

Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24108–24118, 2025. 2, 3

work page 2025
[13]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Egoexobench: A benchmark for first-and third-person view video understanding in mllms

Yuping He, Yifei Huang, Guo Chen, Baoqi Pei, Jilan Xu, Tong Lu, and Jiangmiao Pang. Egoexobench: A benchmark for first-and third-person view video understanding in mllms. arXiv preprint arXiv:2507.18342, 2025. 3

work page arXiv 2025
[15]

Causal inference,

Miguel A Hern ´an and James M Robins. Causal inference,

work page
[16]

Glm-4.1 v-thinking: Towards versatile multi- modal reasoning with scalable reinforcement learning.arXiv e-prints, pages arXiv–2507, 2025

Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guob- ing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Li- hang Pan, et al. Glm-4.1 v-thinking: Towards versatile multi- modal reasoning with scalable reinforcement learning.arXiv e-prints, pages arXiv–2507, 2025. 6

work page 2025
[17]

Qwen3-vl: Large multimodal language mod- els by alibaba cloud.https://huggingface.co/ collections/Qwen/qwen3- vl, 2025

Tongyi Lab. Qwen3-vl: Large multimodal language mod- els by alibaba cloud.https://huggingface.co/ collections/Qwen/qwen3- vl, 2025. Model avail- able at Hugging Face. Accessed: 2025-11-12. 2, 6

work page 2025
[18]

Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, and Hengshuang Zhao. Mini-o3: Scaling up reasoning pat- terns and interaction turns for visual search.arXiv preprint arXiv:2509.07969, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Tvqa+: Spatio-temporal grounding for video question answering

Jie Lei, Licheng Yu, Tamara Berg, and Mohit Bansal. Tvqa+: Spatio-temporal grounding for video question answering. In Proceedings of the 58th annual meeting of the association for computational linguistics, pages 8211–8225, 2020. 2, 3

work page 2020
[20]

From representa- tion to reasoning: Towards both evidence and commonsense reasoning for video question-answering

Jiangtong Li, Li Niu, and Liqing Zhang. From representa- tion to reasoning: Towards both evidence and commonsense reasoning for video question-answering. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21273–21282, 2022. 3

work page 2022
[21]

VideoChat: Chat-Centric Video Understanding

KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding.arXiv preprint arXiv:2305.06355, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[22]

Mvbench: A comprehensive multi-modal video understand- ing benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understand- ing benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195– 22206, 2024. 2, 3

work page 2024
[23]

VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning.arXiv preprint arXiv:2504.06958, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

Llama-vid: An image is worth 2 tokens in large language models

Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. In European Conference on Computer Vision, pages 323–340. Springer, 2024. 2

work page 2024
[25]

Video-chatgpt: Towards detailed video un- derstanding via large vision and language models

Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video un- derstanding via large vision and language models. InPro- ceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12585–12602, 2024. 2

work page 2024
[26]

Open-o3 video: Grounded video reasoning with explicit spatio-temporal evidence.arXiv preprint arXiv:2510.20579, 2025

Jiahao Meng, Xiangtai Li, Haochen Wang, Yue Tan, Tao Zhang, Lingdong Kong, Yunhai Tong, Anran Wang, Zhiyang Teng, Yujing Wang, et al. Open-o3 video: Grounded video reasoning with explicit spatio-temporal evidence.arXiv preprint arXiv:2510.20579, 2025. 3

work page arXiv 2025
[27]

Gpt-5.https://openai.com, 2025

OpenAI. Gpt-5.https://openai.com, 2025. Large language model. 6

work page 2025
[28]

Causal inference in statistics: An overview

Judea Pearl. Causal inference in statistics: An overview

work page
[29]

SAM 2: Segment Anything in Images and Videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junt- ing Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao- Yuan Wu, Ross Girshick, Piotr Doll´ar, and Christoph Feicht- enhofer. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[30]

En- hancing video-llm reasoning via agent-of-thoughts distilla- tion.arXiv preprint arXiv:2412.01694, 2024

Yudi Shi, Shangzhe Di, Qirui Chen, and Weidi Xie. En- hancing video-llm reasoning via agent-of-thoughts distilla- tion.arXiv preprint arXiv:2412.01694, 2024. 3

work page arXiv 2024
[31]

Moviechat: From dense token to sparse memory for long video understanding

Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18221–18232, 2024. 2

work page 2024
[32]

Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning

Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel reasoner: Incentivizing pixel-space rea- soning with curiosity-driven reinforcement learning.arXiv preprint arXiv:2505.15966, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

Traceable evidence enhanced visual grounded reasoning: Evaluation and methodology.arXiv preprint arXiv:2507.07999, 2025

Haochen Wang, Xiangtai Li, Zilong Huang, Anran Wang, Jiacong Wang, Tao Zhang, Jiani Zheng, Sule Bai, Zijian Kang, Jiashi Feng, et al. Traceable evidence enhanced visual grounded reasoning: Evaluation and methodology.arXiv preprint arXiv:2507.07999, 2025. 2

work page arXiv 2025
[34]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Sheng- long Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025. 6, 13

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

Videocot: A video chain-of- thought dataset with active annotation tool.arXiv preprint arXiv:2407.05355, 2024

Yan Wang, Yawen Zeng, Jingsheng Zheng, Xiaofen Xing, Jin Xu, and Xiangmin Xu. Videocot: A video chain-of- thought dataset with active annotation tool.arXiv preprint arXiv:2407.05355, 2024. 3

work page arXiv 2024
[36]

Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding

Ye Wang, Ziheng Wang, Boshen Xu, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, et al. Time-r1: Post-training large vision lan- guage model for temporal video grounding.arXiv preprint arXiv:2503.13377, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

Next-qa: Next phase of question-answering to explaining temporal actions

Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 9777–9786, 2021. 3

work page 2021
[38]

Can i trust your answer? visually grounded video question answering

Junbin Xiao, Angela Yao, Yicong Li, and Tat-Seng Chua. Can i trust your answer? visually grounded video question answering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13204– 13214, 2024. 2, 3

work page 2024
[39]

Mimo-vl technical report, 2025

LLM-Core-Team Xiaomi. Mimo-vl technical report, 2025. 6

work page 2025
[40]

Video question answering via gradually refined attention over appearance and motion

Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. Video question answering via gradually refined attention over appearance and motion. In Proceedings of the 25th ACM international conference on Multimedia, pages 1645–1653, 2017.
[41]

Visual planning: Let’s think only with images

Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen, and Ivan Vulić. Visual planning: Let’s think only with images. arXiv preprint arXiv:2505.11409, 2025.
[42]

Vrbench: A benchmark for multi-step reasoning in long narrative videos

Jiashuo Yu, Yue Wu, Meng Chu, Zhifei Ren, Zizheng Huang, Pei Chu, Ruijie Zhang, Yinan He, Qirui Li, Songze Li, et al. Vrbench: A benchmark for multi-step reasoning in long narrative videos. arXiv preprint arXiv:2506.10857, 2025.
[43]

Discovering the real association: Multimodal causal reasoning in video question answering

Chuanqi Zang, Hanqing Wang, Mingtao Pei, and Wei Liang. Discovering the real association: Multimodal causal reasoning in video question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19027–19036, 2023.
[44]

Thinking with videos: Multimodal tool-augmented reinforcement learning for long video reasoning

Haoji Zhang, Xin Gu, Jiawen Li, Chixiang Ma, Sule Bai, Chubin Zhang, Bowen Zhang, Zhichao Zhou, Dongliang He, and Yansong Tang. Thinking with videos: Multimodal tool-augmented reinforcement learning for long video reasoning. arXiv preprint arXiv:2508.04416, 2025.
[45]

Video-cot: A comprehensive dataset for spatiotemporal understanding of videos based on chain-of-thought

Shuyi Zhang, Xiaoshuai Hao, Yingbo Tang, Lingfeng Zhang, Pengwei Wang, Zhongyuan Wang, Hongxuan Ma, and Shanghang Zhang. Video-cot: A comprehensive dataset for spatiotemporal understanding of videos based on chain-of-thought. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 12745–12752, 2025.
[46]

Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs

Xintong Zhang, Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaowen Zhang, Yang Liu, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, et al. Chain-of-focus: Adaptive visual search and zooming for multimodal reasoning via rl. arXiv preprint arXiv:2505.15436, 2025.
[47]

Tinyllava-video-r1: Towards smaller lmms for video reasoning

Xingjian Zhang, Siwei Wen, Wenjun Wu, and Lei Huang. Tinyllava-video-r1: Towards smaller lmms for video reasoning. arXiv preprint arXiv:2504.09641, 2025.
[48]

Llava-next: A strong zero-shot video understanding model

Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Llava-next: A strong zero-shot video understanding model, 2024.
[49]

Where does it exist: Spatio-temporal video grounding for multi-form sentences

Zhu Zhang, Zhou Zhao, Yang Zhao, Qi Wang, Huasheng Liu, and Lianli Gao. Where does it exist: Spatio-temporal video grounding for multi-form sentences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10668–10677, 2020.
[50]

Llamafactory: Unified efficient fine-tuning of 100+ language models

Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. Llamafactory: Unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand, 2024. Association for Computational Lin...
[51]

DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing "thinking with images" via reinforcement learning. arXiv preprint arXiv:2505.14362, 2025.

CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering
Supplementary Material
Appendix Overview

The organization of the appendix is as follows:
Sec. 9: CaST-Bench Samples from Each Category
Sec. 10: Details of Data Annotation Pipeline
Sec. 11: Details of Experiment Setup
Sec. 13: Case Studies and Failure Analysis
Sec. 14: Social Impact, License, and Access
9. CaST-Bench Samples from Each Category

Due to page limitations in the main manuscript, we show additional examples covering all question types, as follows.

• Causal Explanation: Questions that explain the reasons (why) or mechanisms (how) behind actions or events.
  – Why questions (reasons), shown in Fig. 9
  – How questions (mechanisms), shown in Fig. 10
• Counterfactua...
10. Details of Data Annotation Pipeline

10.1. Video Selection

Our benchmark targets causal reasoning in realistic, cluttered scenes where multiple actors interact over time. As discussed in Sec. 4, carefully curating the raw video pool is essential: studio footage or single-actor clips often lack the competing causal cues and spatial ambiguity needed to str...
1) The original image where the target instance is marked with a green outline (the outline is an overlay, not part of the object).
2) A version of the original image where the background is blurred to isolate the target instance.

Objective: Write exactly one English description that refers only to the target instance and its scene/context in the original im...
**Text Description**: A sentence identifying the target object and its surrounding scene and context

**Video Clip**: A silent video focused on the target object. The object is highlighted with a green border for tracking purposes.

**Task**: Generate a time-stamped log detailing the specific dynamics of the specified object shown in the video.

**Rules**:
- Source of Truth: The video clip is the source of truth. The text input is for context only. If a fra...

[start–end]: action
11. Details of Experiment Setup

11.1. Evaluation Prompt

All evaluated VLMs shared a single unified prompt for video QA. For the multiple-choice setting, the exact prompt is provided in Prompt 7.

11.2. VLM Hyperparameter Configuration

We configure all VLMs with max_new_tokens set to 2048, limiting each sample to at most 2,048 generated tokens (excluding the in...
12. Evaluation Suite

12.1. Grounded Causal Chain Evaluation

Evaluating the correctness of a predicted causal chain is fundamentally harder than checking a single grounding target. A model must recover every actor that participates in the causal process, align their evidence across different time ranges, and ensure the supporting boxes stay faithful to the ...
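The metric definitions for this section are cut off in the text above. As a rough illustration only, the sketch below shows the two overlap measures commonly used when comparing a predicted temporal segment or bounding box against an annotated one. It is not the paper's metric: the function names `temporal_iou` and `box_iou`, the `(start, end)` segment convention, and the `(x1, y1, x2, y2)` box convention are all assumptions made for this example.

```python
from typing import Tuple

# Assumed conventions (not taken from the paper):
Segment = Tuple[float, float]            # (start_sec, end_sec)
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x2 >= x1 and y2 >= y1


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(pred: Box, gt: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    inter_h = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = inter_w * inter_h
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0
```

A chain-level score would additionally need to match predicted actors to annotated ones and aggregate over every link in the chain; that part depends on the definitions elided above and is deliberately left out here.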
A causal question about a video event {question}
The ground-truth conclusion answer {gt_answer}
The ground-truth causal reasoning process {gt_reasoning}
A test-taker model’s generated conclusion answer {pred_answer}
A test-taker model’s generated causal reasoning process {pred_reasoning}

Task: Evaluate the test-taker model’s reasoning across four dimensions. Assign a separate score (0–10) for each dimension according to the standards below.

Evaluation Dimensions:

Answer Conclusion Correctness
* Compare only the model’s conclusion answer to the ground-truth conclusion answer; ignore any reasoning content.
* Judge semantic equivalence, polarity, entity/attribute correctness, and numeric/unit consistency; penalize contradictions, material vagueness, or hedging that alters commitment

Causal Chain Logical Consistency
* Evaluate only the generated answer and reasoning; do not reference ground-truth answer or ground-truth reasoning.
* Verify that causes precede effects and that the causal sequence is minimal yet sufficient to explain the answer.
* Identify any logical leaps, post-hoc reasoning, or teleological claims lacking justificatio...

Evidence Coverage & Completeness
* Use the ground-truth causal reasoning as the reference. Check that the generated causal reasoning includes and aligns with its key entities, events, moments, and causal steps.
* Evaluate recall of essential causal steps and contextual conditions relative to the ground truth.
* Penalize missing core components, contradict...

Evidence–Conclusion Overall Justification
* Consider the generated answer and the generated reasoning together: does the provided reasoning justify the stated answer, and do both align with the ground truth overall?
* Assess the logical coherence from evidence to conclusion, calibration of confidence, and global plausibility.

Scoring Standards for each di...

answer_conclusion_correctness
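The judge prompt above asks for four separate 0–10 scores, but the required output format is cut off. The sketch below shows one way such scores could be validated, assuming the judge returns a flat JSON object. Only the key `answer_conclusion_correctness` appears in the source; the other three key names, the JSON format itself, and the helpers `parse_judge_scores` and `mean_score` are hypothetical and introduced purely for illustration.

```python
import json
from typing import Dict

# Only the first key is attested in the source; the remaining three are
# hypothetical names chosen to mirror the four dimension titles.
DIMENSIONS = [
    "answer_conclusion_correctness",
    "causal_chain_logical_consistency",   # hypothetical key name
    "evidence_coverage_completeness",     # hypothetical key name
    "overall_justification",              # hypothetical key name
]


def parse_judge_scores(raw: str) -> Dict[str, float]:
    """Parse a judge reply (assumed to be JSON) and check each score is in [0, 10]."""
    data = json.loads(raw)
    scores: Dict[str, float] = {}
    for name in DIMENSIONS:
        if name not in data:
            raise ValueError(f"missing dimension: {name}")
        value = data[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"non-numeric score for {name}: {value!r}")
        if not 0 <= value <= 10:
            raise ValueError(f"score out of range for {name}: {value}")
        scores[name] = float(value)
    return scores


def mean_score(scores: Dict[str, float]) -> float:
    """Unweighted mean over the four dimensions (an assumed aggregation)."""
    return sum(scores[name] for name in DIMENSIONS) / len(DIMENSIONS)
```

Rejecting out-of-range or missing scores, rather than clipping them, keeps malformed judge replies visible instead of silently folding them into an average.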
13. Case Studies and Failure Analysis

Beyond the quantitative ablation studies and error analysis presented in the main paper, we present here a qualitative analysis of case studies and failure patterns exhibited by the evaluated models.

Vulnerability to Spurious Visual Confounders. A core design principle of CaST-Bench is the inclusion of distractors tha...

How do customers know where to line up?
14. Social Impact, License, and Access

14.1. Broader Impact

CaST-Bench advances the field of VLMs by shifting the focus from surface-level perception to deep, grounded causal reasoning, a capability essential for sophisticated video analysis and anticipation tasks. By mandating that models validate their answers with explicit spatio-temporal evidence, t...

A": "Because the person in the long dark coat told the child to stop moving

[1]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical repor...

[2]

Cg-bench: Clue-grounded question answering benchmark for long video understanding

Guo Chen, Yicheng Liu, Yifei Huang, Yuping He, Baoqi Pei, Jilan Xu, Yali Wang, Tong Lu, and Limin Wang. Cg-bench: Clue-grounded question answering benchmark for long video understanding. arXiv preprint arXiv:2412.12075,

[3]

Mecd: Unlocking multi-event causal discovery in video reasoning

Tieyuan Chen, Huabin Liu, Tianyao He, Yihang Chen, Chaofan Gan, Xiao Ma, Cheng Zhong, Yang Zhang, Yingxue Wang, Hui Lin, et al. Mecd: Unlocking multi-event causal discovery in video reasoning. Advances in Neural Information Processing Systems, 37:92554–92580, 2024.

[4]

Cross-modal causal relation alignment for video question grounding

Weixing Chen, Yang Liu, Binglin Chen, Jiandong Su, Yongsen Zheng, and Liang Lin. Cross-modal causal relation alignment for video question grounding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24087–24096, 2025.

[5]

Exploring the effect of reinforcement learning on video understanding: Insights from seed-bench-r1

Yi Chen, Yuying Ge, Rui Wang, Yixiao Ge, Lu Qiu, Ying Shan, and Xihui Liu. Exploring the effect of reinforcement learning on video understanding: Insights from seed-bench-r1. arXiv preprint arXiv:2503.24376, 2025.

[6]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024.

[7]

Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

Junhao Cheng, Yuying Ge, Teng Wang, Yixiao Ge, Jing Liao, and Ying Shan. Video-holmes: Can mllm think like holmes for complex video reasoning? arXiv preprint arXiv:2505.21374, 2025.

[8]

V-star: Benchmarking video-llms on video spatio-temporal reasoning

Zixu Cheng, Jian Hu, Ziquan Liu, Chenyang Si, Wei Li, and Shaogang Gong. V-star: Benchmarking video-llms on video spatio-temporal reasoning, 2025.

[9]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.

[10]

Video-R1: Reinforcing Video Reasoning in MLLMs

Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms. arXiv preprint arXiv:2503.21776,

[11]

Causalvqa: A physically grounded causal reasoning benchmark for video models

Aaron Foss, Chloe Evans, Sasha Mitts, Koustuv Sinha, Ammar Rizvi, and Justine T Kao. Causalvqa: A physically grounded causal reasoning benchmark for video models. arXiv preprint arXiv:2506.09943, 2025.

[12]

Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24108–24118, 2025.

[13]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

[14]

Egoexobench: A benchmark for first-and third-person view video understanding in mllms

Yuping He, Yifei Huang, Guo Chen, Baoqi Pei, Jilan Xu, Tong Lu, and Jiangmiao Pang. Egoexobench: A benchmark for first-and third-person view video understanding in mllms. arXiv preprint arXiv:2507.18342, 2025.

[15]

Causal inference

Miguel A Hernán and James M Robins. Causal inference,

[16]

Glm-4.1v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning

Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. Glm-4.1v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv e-prints, pages arXiv–2507, 2025.

[17]

Qwen3-vl: Large multimodal language models by alibaba cloud

Tongyi Lab. Qwen3-vl: Large multimodal language models by alibaba cloud. https://huggingface.co/collections/Qwen/qwen3-vl, 2025. Model available at Hugging Face. Accessed: 2025-11-12.

[18]

Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, and Hengshuang Zhao. Mini-o3: Scaling up reasoning patterns and interaction turns for visual search. arXiv preprint arXiv:2509.07969, 2025.

[19]

Tvqa+: Spatio-temporal grounding for video question answering

Jie Lei, Licheng Yu, Tamara Berg, and Mohit Bansal. Tvqa+: Spatio-temporal grounding for video question answering. In Proceedings of the 58th annual meeting of the association for computational linguistics, pages 8211–8225, 2020.

[20]

From representation to reasoning: Towards both evidence and commonsense reasoning for video question-answering

Jiangtong Li, Li Niu, and Liqing Zhang. From representation to reasoning: Towards both evidence and commonsense reasoning for video question-answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21273–21282, 2022.

[21]

VideoChat: Chat-Centric Video Understanding

KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023.

[22]

Mvbench: A comprehensive multi-modal video understanding benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024.

[23]

VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning. arXiv preprint arXiv:2504.06958, 2025.

[24]

Llama-vid: An image is worth 2 tokens in large language models

Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. In European Conference on Computer Vision, pages 323–340. Springer, 2024.

[25]

Video-chatgpt: Towards detailed video understanding via large vision and language models

Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12585–12602, 2024.

[26]

Open-o3 video: Grounded video reasoning with explicit spatio-temporal evidence

Jiahao Meng, Xiangtai Li, Haochen Wang, Yue Tan, Tao Zhang, Lingdong Kong, Yunhai Tong, Anran Wang, Zhiyang Teng, Yujing Wang, et al. Open-o3 video: Grounded video reasoning with explicit spatio-temporal evidence. arXiv preprint arXiv:2510.20579, 2025.

[27]

Gpt-5

OpenAI. Gpt-5. https://openai.com, 2025. Large language model.

[28]

Causal inference in statistics: An overview

Judea Pearl. Causal inference in statistics: An overview

[29]

SAM 2: Segment Anything in Images and Videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:...

[30]

Enhancing video-llm reasoning via agent-of-thoughts distillation

Yudi Shi, Shangzhe Di, Qirui Chen, and Weidi Xie. Enhancing video-llm reasoning via agent-of-thoughts distillation. arXiv preprint arXiv:2412.01694, 2024.

[31]

Moviechat: From dense token to sparse memory for long video understanding

Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18221–18232, 2024.

[32]

Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning

Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning. arXiv preprint arXiv:2505.15966, 2025.

[33]

Traceable evidence enhanced visual grounded reasoning: Evaluation and methodology

Haochen Wang, Xiangtai Li, Zilong Huang, Anran Wang, Jiacong Wang, Tao Zhang, Jiani Zheng, Sule Bai, Zijian Kang, Jiashi Feng, et al. Traceable evidence enhanced visual grounded reasoning: Evaluation and methodology. arXiv preprint arXiv:2507.07999, 2025.

[34] [34]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Sheng- long Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025. 6, 13

work page internal anchor Pith review Pith/arXiv arXiv 2025

[35] [35]

Videocot: A video chain-of- thought dataset with active annotation tool.arXiv preprint arXiv:2407.05355, 2024

Yan Wang, Yawen Zeng, Jingsheng Zheng, Xiaofen Xing, Jin Xu, and Xiangmin Xu. Videocot: A video chain-of- thought dataset with active annotation tool.arXiv preprint arXiv:2407.05355, 2024. 3

work page arXiv 2024

[36] [36]

Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding

Ye Wang, Ziheng Wang, Boshen Xu, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, et al. Time-r1: Post-training large vision lan- guage model for temporal video grounding.arXiv preprint arXiv:2503.13377, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025

[37] [37]

Next-qa: Next phase of question-answering to explaining temporal actions

Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 9777–9786, 2021. 3

work page 2021

[38] [38]

Can i trust your answer? visually grounded video question answering

Junbin Xiao, Angela Yao, Yicong Li, and Tat-Seng Chua. Can i trust your answer? visually grounded video question answering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13204– 13214, 2024. 2, 3

work page 2024

[39] [39]

Mimo-vl technical report, 2025

LLM-Core-Team Xiaomi. Mimo-vl technical report, 2025. 6

work page 2025

[40] [40]

Video question answer- ing via gradually refined attention over appearance and mo- tion

Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. Video question answer- ing via gradually refined attention over appearance and mo- tion. InProceedings of the 25th ACM international confer- ence on Multimedia, pages 1645–1653, 2017. 3

work page 2017

[41] [41]

Visual planning: Let’s think only with images.arXiv preprint arXiv:2505.11409, 2025

Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang, Anna Korhonen, and Ivan Vuli´c. Visual planning: Let’s think only with images.arXiv preprint arXiv:2505.11409, 2025. 2

work page arXiv 2025

[42] [42]

Vrbench: A benchmark for multi-step reasoning in long nar- rative videos.arXiv preprint arXiv:2506.10857, 2025

Jiashuo Yu, Yue Wu, Meng Chu, Zhifei Ren, Zizheng Huang, Pei Chu, Ruijie Zhang, Yinan He, Qirui Li, Songze Li, et al. Vrbench: A benchmark for multi-step reasoning in long nar- rative videos.arXiv preprint arXiv:2506.10857, 2025. 3

work page arXiv 2025

[43] [43]

Discovering the real association: Multimodal causal rea- soning in video question answering

Chuanqi Zang, Hanqing Wang, Mingtao Pei, and Wei Liang. Discovering the real association: Multimodal causal rea- soning in video question answering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19027–19036, 2023. 3

work page 2023

[44] [44]

Thinking with videos: Multimodal tool-augmented reinforcement learning for long video reasoning.arXiv preprint arXiv:2508.04416, 2025

Haoji Zhang, Xin Gu, Jiawen Li, Chixiang Ma, Sule Bai, Chubin Zhang, Bowen Zhang, Zhichao Zhou, Dongliang He, and Yansong Tang. Thinking with videos: Multimodal tool- augmented reinforcement learning for long video reasoning. arXiv preprint arXiv:2508.04416, 2025. 3

work page arXiv 2025

[45] [45]

Video-cot: A comprehensive dataset for spatiotemporal understanding of videos based on chain-of- thought

Shuyi Zhang, Xiaoshuai Hao, Yingbo Tang, Lingfeng Zhang, Pengwei Wang, Zhongyuan Wang, Hongxuan Ma, and Shanghang Zhang. Video-cot: A comprehensive dataset for spatiotemporal understanding of videos based on chain-of- thought. InProceedings of the 33rd ACM International Con- ference on Multimedia, pages 12745–12752, 2025. 3

work page 2025

[46] [46]

Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs

Xintong Zhang, Zhi Gao, Bofei Zhang, Pengxiang Li, Xi- aowen Zhang, Yang Liu, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, et al. Chain-of-focus: Adaptive visual search and zooming for multimodal reasoning via rl.arXiv preprint arXiv:2505.15436, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025

[47] [47]

Tinyllava-video-r1: Towards smaller lmms for video reason- ing.arXiv preprint arXiv:2504.09641, 2025

Xingjian Zhang, Siwei Wen, Wenjun Wu, and Lei Huang. Tinyllava-video-r1: Towards smaller lmms for video reason- ing.arXiv preprint arXiv:2504.09641, 2025. 2

work page arXiv 2025

[48] [48]

Llava- next: A strong zero-shot video understanding model, 2024

Yuanhan Zhang, Bo Li, haotian Liu, Yong jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Llava- next: A strong zero-shot video understanding model, 2024. 2, 6

work page 2024

[49] [49]

Where does it exist: Spatio-temporal video grounding for multi-form sentences

Zhu Zhang, Zhou Zhao, Yang Zhao, Qi Wang, Huasheng Liu, and Lianli Gao. Where does it exist: Spatio-temporal video grounding for multi-form sentences. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, pages 10668–10677, 2020. 2, 5

work page 2020

[50] [50]

Llamafac- tory: Unified efficient fine-tuning of 100+ language mod- els

Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. Llamafac- tory: Unified efficient fine-tuning of 100+ language mod- els. InProceedings of the 62nd Annual Meeting of the As- sociation for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand, 2024. Association for Computational Lin...

work page 2024

[51] [51]

DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deep- eyes: Incentivizing” thinking with images” via reinforce- ment learning.arXiv preprint arXiv:2505.14362, 2025. 2 CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering Supplementary Material

work page internal anchor Pith review Pith/arXiv arXiv 2025

[52] [52]

9: CaST-Bench Samples from Each Category Sec

Appendix Overview The organization of the appendix is as follows: Sec. 9: CaST-Bench Samples from Each Category Sec. 10: Details of Data Annotation Pipeline Sec. 11: Details of Experiment Setup Sec. 13: Case Studies and Failure Analysis Sec. 14: Social Impact, License, and Access

work page

[53] [53]

•Causal ExplanationQuestions that explain the reasons (why) or mechanisms (how) behind actions or events

CaST-Bench Samples from Each Category Due to page limitation in the main manuscript, we show more examples regarding all question types, as follows. •Causal ExplanationQuestions that explain the reasons (why) or mechanisms (how) behind actions or events. –Why questions (reasons), shown in Fig. 9 –How questions (mechanisms), shown in Fig. 10 •Counterfactua...

work page

[54] [54]

Video Selection Our benchmark targets causal reasoning in realistic, clut- tered scenes where multiple actors interact over time

Details of Data Annotation Pipeline 10.1. Video Selection Our benchmark targets causal reasoning in realistic, clut- tered scenes where multiple actors interact over time. As discussed in Sec. 4, carefully curating the raw video pool is essential: studio footage or single-actor clips often lack the competing causal cues and spatial ambiguity needed to str...

work page

[55] [55]

2) A version of the original image where the background is blurred to isolate the target instance

The original image where the target instance is marked with a green outline (the outline is an overlay, not part of the object). 2) A version of the original image where the background is blurred to isolate the target instance. Objective: Write exactly one English description that refers only to the target instance and its scene/context in the original im...

work page

[56] [56]

**Text Description**: A sentence identifying the target object and its surrounding scene and context

work page

[57] [57]

[start–end]: action

**Video Clip**: A silent video focused on the target object. The object is highlighted with a green border for tracking purposes. **Task**: Generate a time-stamped log detailing the specific dynamics of the specified object shown in the video. **Rules**: - Source of Truth: The video clip is the source of truth. The text input is for context only. If a fra...

work page

[58] [58]

Evaluation Prompt All evaluated VLMs shared a single unified prompt for video QA

Details of Experiment Setup 11.1. Evaluation Prompt All evaluated VLMs shared a single unified prompt for video QA. For the multiple-choice setting, the exact prompt is provided in Prompt 7. 11.2. VLM Hyperparameter Configuration We configure all VLMs withmax new tokensset to 2048, limiting each sample to at most 2,048 generated to- kens (excluding the in...

[59]

instances

Evaluation Suite
12.1. Grounded Causal Chain Evaluation
Evaluating the correctness of a predicted causal chain is fundamentally harder than checking a single grounding target. A model must recover every actor that participates in the causal process, align their evidences across different time ranges, and ensure the supporting boxes stay faithful to the ...

[60]

A causal question about a video event {question}

[61]

The ground-truth conclusion answer {gt_answer}

[62]

The ground-truth causal reasoning process {gt_reasoning}

[63]

A test-taker model’s generated conclusion answer {pred_answer}

[64]

A test-taker model’s generated causal reasoning process {pred_reasoning}
Task: Evaluate the test-taker model’s reasoning across four dimensions. Assign a separate score (0–10) for each dimension according to the standards below.
Evaluation Dimensions:

[65]

Answer Conclusion Correctness * Compare only the model’s conclusion answer to the ground-truth conclusion answer; ignore any reasoning content. * Judge semantic equivalence, polarity, entity/attribute correctness, and numeric/unit consistency; penalize contradictions, material vagueness, or hedging that alters commitment

[66]

Causal Chain Logical Consistency * Evaluate only the generated answer and reasoning; do not reference ground-truth answer or ground-truth reasoning. * Verify that causes precede effects and that the causal sequence is minimal yet sufficient to explain the answer. * Identify any logical leaps, post-hoc reasoning, or teleological claims lacking justificatio...

[67]

Evidence Coverage & Completeness * Use the ground-truth causal reasoning as the reference. Check that the generated causal reasoning includes and aligns with its key entities, events, moments, and causal steps. * Evaluate recall of essential causal steps and contextual conditions relative to the ground truth. * Penalize missing core components, contradict...

[68]

answer_conclusion_correctness

Evidence–Conclusion Overall Justification
* Consider the generated answer and the generated reasoning together: does the provided reasoning justify the stated answer, and do both align with the ground truth overall?
* Assess the logical coherence from evidence to conclusion, calibration of confidence, and global plausibility.
Scoring Standards for each di...
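The judge prompt above takes five inputs and asks for one 0–10 score per dimension. The sketch below shows one way to fill the placeholders and validate a judge response. It assumes the judge replies with a JSON object keyed by dimension; the output format is not visible in the excerpt, so this is an assumption. Only the key `answer_conclusion_correctness` and the five placeholder names appear in the source; the other three keys and the function names are hypothetical.

```python
import json

# Placeholder names taken from the prompt text.
FIELDS = ("question", "gt_answer", "gt_reasoning", "pred_answer", "pred_reasoning")

DIMENSIONS = (
    "answer_conclusion_correctness",             # key appears in the source
    "causal_chain_logical_consistency",          # assumed key name
    "evidence_coverage_completeness",            # assumed key name
    "evidence_conclusion_overall_justification", # assumed key name
)


def fill_prompt(template: str, **values: str) -> str:
    """Substitute the five prompt placeholders, failing loudly if one is missing."""
    missing = [name for name in FIELDS if name not in values]
    if missing:
        raise ValueError(f"missing prompt fields: {missing}")
    return template.format(**{name: values[name] for name in FIELDS})


def parse_judge_scores(raw: str) -> dict:
    """Parse a JSON judge response and check each score lies in [0, 10]."""
    data = json.loads(raw)
    scores = {}
    for key in DIMENSIONS:
        if key not in data:
            raise ValueError(f"missing dimension: {key}")
        value = data[key]
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"non-numeric score for {key}: {value!r}")
        if not 0 <= value <= 10:
            raise ValueError(f"score out of range for {key}: {value}")
        scores[key] = float(value)
    return scores
```

Rejecting out-of-range or missing scores, instead of clamping or defaulting them, keeps a malformed judge response from silently entering an aggregate.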

[69]

How do customers know where to line up?

Case Studies and Failure Analysis
Beyond the quantitative ablation studies and error analysis presented in the main paper, we here perform a qualitative analysis of the case studies and failure patterns exhibited by the evaluated models.
Vulnerability to Spurious Visual Confounders. A core design principle of CaST-Bench is the inclusion of distractors tha...

[70]

A": "Because the person in the long dark coat told the child to stop moving

Social Impact, License, and Access
14.1. Broader Impact
CaST-Bench advances the field of VLMs by shifting the focus from surface-level perception to deep, grounded causal reasoning, a capability essential for sophisticated video analysis and anticipation tasks. By mandating that models validate their answers with explicit spatio-temporal evidence, t...
