EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.
Proceedings of the IEEE/CVF international conference on computer vision , pages=
6 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 6representative citing papers
EyeCue detects driver cognitive distraction by modeling gaze-visual context interactions in egocentric videos and achieves 74.38% accuracy on the new CogDrive dataset, outperforming 11 baselines.
LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.
LanguageBind aligns video, infrared, depth, and audio to a frozen language encoder via contrastive learning on the new VIDAL-10M dataset, extending video-language pretraining to N modalities.
MER-DG applies modality-entropy regularization to reduce fusion overfitting in multimodal domain generalization, reporting average gains of 5% over standard fusion and 2% over prior methods on EPIC-Kitchens and HAC benchmarks.
FLO-EMD integrates flow-guided attention and EMD on aggregated motion traces to classify light, medium, and heavy congestion at 97.5% accuracy on 1,050 surveillance clips.
citing papers explorer
-
EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding
EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.
-
EyeCue: Driver Cognitive Distraction Detection via Gaze-Empowered Egocentric Video Understanding
EyeCue detects driver cognitive distraction by modeling gaze-visual context interactions in egocentric videos and achieves 74.38% accuracy on the new CogDrive dataset, outperforming 11 baselines.
-
LLaVA-Video: Video Instruction Tuning With Synthetic Data
LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.
-
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
LanguageBind aligns video, infrared, depth, and audio to a frozen language encoder via contrastive learning on the new VIDAL-10M dataset, extending video-language pretraining to N modalities.
-
MER-DG: Modality-Entropy Regularization for Multimodal Domain Generalization
MER-DG applies modality-entropy regularization to reduce fusion overfitting in multimodal domain generalization, reporting average gains of 5% over standard fusion and 2% over prior methods on EPIC-Kitchens and HAC benchmarks.
-
Hybrid Congestion Classification Framework Using Flow-Guided Attention and Empirical Mode Decomposition
FLO-EMD integrates flow-guided attention and EMD on aggregated motion traces to classify light, medium, and heavy congestion at 97.5% accuracy on 1,050 surveillance clips.