CaST-Bench creates a benchmark with causal-chain annotations and novel metrics showing that current VLMs struggle to construct precise grounded causal chains in video QA.
Videocot: A video chain-of- thought dataset with active annotation tool
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields
cs.CV 6verdicts
UNVERDICTED 6roles
background 2polarities
background 2representative citing papers
Minerva-Ego is a new benchmark for egocentric visual reasoning with dense human-annotated traces and masks, showing that spatiotemporal hints substantially improve frontier model performance.
Act2See trains VLMs via supervised fine-tuning on verified reasoning traces to interleave active frame calls within text CoTs, yielding SOTA results on video reasoning benchmarks.
VTI-CoT proposes a visual-textual interleaved chain-of-thought method for video reasoning, built via automated annotation and OCR compression, claiming SOTA performance and better training efficiency on same-scale models.
This is a survey that frames video MLLM research via a human-view formulation of perceptual representations, memory states, reasoning traces, and predictions, then reviews methods, datasets, benchmarks, and open problems.
The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.
citing papers explorer
-
CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering
CaST-Bench creates a benchmark with causal-chain annotations and novel metrics showing that current VLMs struggle to construct precise grounded causal chains in video QA.
-
Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding
Minerva-Ego is a new benchmark for egocentric visual reasoning with dense human-annotated traces and masks, showing that spatiotemporal hints substantially improve frontier model performance.
-
Act2See: Emergent Active Visual Perception for Video Reasoning
Act2See trains VLMs via supervised fine-tuning on verified reasoning traces to interleave active frame calls within text CoTs, yielding SOTA results on video reasoning benchmarks.
-
VTI-CoT: Visual-Textual Interleaved Chain of Thought for Video Reasoning
VTI-CoT proposes a visual-textual interleaved chain-of-thought method for video reasoning, built via automated annotation and OCR compression, claiming SOTA performance and better training efficiency on same-scale models.
-
Watch, Remember, Reason: Human-View Video Understanding with MLLMs
This is a survey that frames video MLLM research via a human-view formulation of perceptual representations, memory states, reasoning traces, and predictions, then reviews methods, datasets, benchmarks, and open problems.
-
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.