Proceedings of the IEEE/CVF International Conference on Computer Vision
3 Pith papers cite this work.
fields
cs.CV 3
verdicts
UNVERDICTED 3
representative citing papers
citing papers explorer
- EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding
EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.
- LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
LanguageBind aligns video, infrared, depth, and audio to a frozen language encoder via contrastive learning on the new VIDAL-10M dataset, extending video-language pretraining to N modalities.
- Hybrid Congestion Classification Framework Using Flow-Guided Attention and Empirical Mode Decomposition
FLO-EMD integrates flow-guided attention and EMD on aggregated motion traces to classify light, medium, and heavy congestion at 97.5% accuracy on 1,050 surveillance clips.