MLLMs drop from over 85% accuracy on action presence to under 50% on matched action-denial videos, exposing a causal verification gap that causal graph prompts partially close.
Something Something
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 5roles
background 1polarities
background 1representative citing papers
TimeProVe proposes a propose-then-verify framework using lightweight action-based candidate evidence generation followed by targeted VLM verification for efficient long video temporal reasoning, achieving 7.3% improvement on OTB with 75% fewer VLM calls.
Introduces CausalPhys benchmark with causal graphs and CRFT fine-tuning to improve VLMs' causal physical reasoning accuracy and interpretability.
MTCL learns multi-scale temporal correlations in videos via contrastive learning to produce more informative representations that improve sample efficiency and performance in downstream RL tasks.
The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.
citing papers explorer
-
Learning to Deny: Action Denial in Multimodal Large Language Models
MLLMs drop from over 85% accuracy on action presence to under 50% on matched action-denial videos, exposing a causal verification gap that causal graph prompts partially close.
-
TimeProVe: Propose, then Verify for Efficient Long Video Temporal Reasoning in Activities of Daily Living
TimeProVe proposes a propose-then-verify framework using lightweight action-based candidate evidence generation followed by targeted VLM verification for efficient long video temporal reasoning, achieving 7.3% improvement on OTB with 75% fewer VLM calls.