Introduces CaST-Bench, a dataset of 2,066 causal questions on 1,015 videos with annotated causal chains and metrics to evaluate VLMs on spatio-temporal causal reasoning.
Llava- next: A strong zero-shot video understanding model, 2024
3 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.CV 3verdicts
UNVERDICTED 3representative citing papers
VidHal is a new benchmark that evaluates VLLM temporal hallucinations through a caption ordering task on videos with varying hallucination levels.
OneClip-RAG enables MLLMs to handle long videos via one-shot clip retrieval and unified chunking-retrieval, delivering performance gains like matching GPT-5 level on MLVU with high efficiency on standard GPUs.
citing papers explorer
-
CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering
Introduces CaST-Bench, a dataset of 2,066 causal questions on 1,015 videos with annotated causal chains and metrics to evaluate VLMs on spatio-temporal causal reasoning.
-
VidHal: Benchmarking Temporal Hallucinations in Vision LLMs
VidHal is a new benchmark that evaluates VLLM temporal hallucinations through a caption ordering task on videos with varying hallucination levels.
-
Towards Effective Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval
OneClip-RAG enables MLLMs to handle long videos via one-shot clip retrieval and unified chunking-retrieval, delivering performance gains like matching GPT-5 level on MLVU with high efficiency on standard GPUs.