Proceedings of the IEEE/CVF international conference on computer vision , pages=

Vivit: A video vision transformer , author=

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

browse 3 citing papers

representative citing papers

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.

LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

cs.CV · 2023-10-03 · unverdicted · novelty 6.0

LanguageBind aligns video, infrared, depth, and audio to a frozen language encoder via contrastive learning on the new VIDAL-10M dataset, extending video-language pretraining to N modalities.

Hybrid Congestion Classification Framework Using Flow-Guided Attention and Empirical Mode Decomposition

cs.CV · 2026-05-06 · unverdicted · novelty 3.0

FLO-EMD integrates flow-guided attention and EMD on aggregated motion traces to classify light, medium, and heavy congestion at 97.5% accuracy on 1,050 surveillance clips.

citing papers explorer

Showing 3 of 3 citing papers.

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding cs.CV · 2026-05-11 · unverdicted · none · ref 57
EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment cs.CV · 2023-10-03 · unverdicted · none · ref 139
LanguageBind aligns video, infrared, depth, and audio to a frozen language encoder via contrastive learning on the new VIDAL-10M dataset, extending video-language pretraining to N modalities.
Hybrid Congestion Classification Framework Using Flow-Guided Attention and Empirical Mode Decomposition cs.CV · 2026-05-06 · unverdicted · none · ref 94
FLO-EMD integrates flow-guided attention and EMD on aggregated motion traces to classify light, medium, and heavy congestion at 97.5% accuracy on 1,050 surveillance clips.

Proceedings of the IEEE/CVF international conference on computer vision , pages=

fields

years

verdicts

representative citing papers

citing papers explorer