LLaVA-NeXT: A strong zero-shot video understanding model
3 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.CV 3
verdicts
UNVERDICTED 3
representative citing papers
VidHal is a new benchmark that evaluates VLLM temporal hallucinations through a caption ordering task on videos with varying hallucination levels.
OneClip-RAG enables MLLMs to handle long videos via one-shot clip retrieval and unified chunking-retrieval, delivering performance gains like matching GPT-5 level on MLVU with high efficiency on standard GPUs.
citing papers explorer
- CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering
  CaST-Bench creates a benchmark with causal-chain annotations and novel metrics showing that current VLMs struggle to construct precise grounded causal chains in video QA.
- VidHal: Benchmarking Temporal Hallucinations in Vision LLMs
  VidHal is a new benchmark that evaluates VLLM temporal hallucinations through a caption ordering task on videos with varying hallucination levels.
- Towards Effective Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval
  OneClip-RAG enables MLLMs to handle long videos via one-shot clip retrieval and unified chunking-retrieval, delivering performance gains like matching GPT-5 level on MLVU with high efficiency on standard GPUs.