Movqa: A benchmark of versatile question-answering for long-form movie understanding

Hongjie Zhang, Yi Liu, Lu Dong, Yifei Huang, Zhen-Hua Ling, Yali Wang, Limin Wang, Yu Qiao · 2023 · arXiv 2312.04817

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

representative citing papers

Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark

cs.CV · 2026-03-28 · unverdicted · novelty 7.0

SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.

TrajTok: Learning Trajectory Tokens enables better Video Understanding

cs.CV · 2026-02-26 · unverdicted · novelty 7.0

TrajTok learns adaptive trajectory tokens for videos through a unified end-to-end segmenter, improving understanding performance and efficiency over patch-based or external-pipeline tokenizers.

LVBench: An Extreme Long Video Understanding Benchmark

cs.CV · 2024-06-12 · accept · novelty 7.0

LVBench is a new benchmark for extreme long video understanding that evaluates multimodal large language models on hour-scale videos using tasks designed to probe extended memory and comprehension.

VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling

cs.CV · 2024-12-31 · unverdicted · novelty 6.0

VideoChat-Flash applies hierarchical video token compression to achieve ~50x reduction in context length for long videos while maintaining near-original performance on long-context benchmarks.

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

cs.CV · 2023-11-28 · accept · novelty 6.0

MVBench is a benchmark of 20 temporal video understanding tasks built by transforming static tasks into dynamic ones, with VideoChat2 outperforming prior MLLMs by over 15%.

InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling

cs.CV · 2025-01-21 · unverdicted · novelty 5.0

InternVideo2.5 improves video MLLMs by incorporating dense vision task annotations via direct preference optimization and compact spatiotemporal representations via adaptive hierarchical token compression, yielding better benchmark performance, 6x longer video memory, and new capabilities likeobject

citing papers explorer

Showing 6 of 6 citing papers.

Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark cs.CV · 2026-03-28 · unverdicted · none · ref 65
SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.
TrajTok: Learning Trajectory Tokens enables better Video Understanding cs.CV · 2026-02-26 · unverdicted · none · ref 83
TrajTok learns adaptive trajectory tokens for videos through a unified end-to-end segmenter, improving understanding performance and efficiency over patch-based or external-pipeline tokenizers.
LVBench: An Extreme Long Video Understanding Benchmark cs.CV · 2024-06-12 · accept · none · ref 49
LVBench is a new benchmark for extreme long video understanding that evaluates multimodal large language models on hour-scale videos using tasks designed to probe extended memory and comprehension.
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling cs.CV · 2024-12-31 · unverdicted · none · ref 67
VideoChat-Flash applies hierarchical video token compression to achieve ~50x reduction in context length for long videos while maintaining near-original performance on long-context benchmarks.
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark cs.CV · 2023-11-28 · accept · none · ref 102
MVBench is a benchmark of 20 temporal video understanding tasks built by transforming static tasks into dynamic ones, with VideoChat2 outperforming prior MLLMs by over 15%.
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling cs.CV · 2025-01-21 · unverdicted · none · ref 36
InternVideo2.5 improves video MLLMs by incorporating dense vision task annotations via direct preference optimization and compact spatiotemporal representations via adaptive hierarchical token compression, yielding better benchmark performance, 6x longer video memory, and new capabilities likeobject

Movqa: A benchmark of versatile question-answering for long-form movie understanding

fields

years

verdicts

representative citing papers

citing papers explorer